Target search in the CRISPR/Cas9 system: Facilitated diffusion with target cues

We study how Cas9, a central component of the CRISPR/Cas9 system, searches for a target sequence on the DNA. We propose a model that includes as key ingredients 3D diffusion, 1D sliding along the DNA, and the effect of short binding sequences preceding the target (protospacer adjacent sequences -- PAMs). This latter aspect constitutes the main difference with traditional facilitated diffusion of transcription factors. We solve our model, obtaining an expression for the average search time of Cas9 for its target. We find that experimentally measured kinetic parameters are close to the values yielding an optimal search time. Our results rationalize the role of PAMs in guiding the search process, and show that Cas9 searches for its targets in a nearly optimal way.


I. INTRODUCTION
The CRISPR/Cas9 is an important system found in many bacterial species [1].It is commonly considered to be the bacterial immune system [1,2], although this view has been recently questioned [3].The CRISPR/Cas9 became popular due to its potential for gene editing.Cas9 is a central protein of the CRISPR/Cas9 system.It is able to target a 20 base pair (bp) DNA target site that is complementary to a guide RNA (gRNA) that is loaded on it.After finding a matching DNA sequence, a hybrid forms between one strand of the target sequence and the gRNA, usually followed by cleavage of the target DNA sequence.Target recognition by Cas9 is necessarily triggered by a protospacer adjacent motif (PAM), that is a short DNA sequence preceding the target.For spCas9, a common variant of Cas9, the PAM sequence is "NGG", in which "N" can be any base.Such short motif is very abundant on genomes.For example, the Escherichia coli genome contains about half a million PAM sequences [4].Cas9 binds on PAM sites in a tighter way than on other sites [5].Moreover, experiments indicate that Cas9 can locate PAM sequences by sliding on the DNA [5].
Our focus is to theoretically understand how Cas9 searches for its target sequence.Previous models have focused on particular aspects of this dynamics.Farasat and Salis studied the process from PAM recognition to cleavage [6].Shvets and Kolomeisky proposed a model for the search dynamics of the target sequence, including binding/unbinding on PAM sequences but not sliding [7].Our previous work analyzed binding on PAM sequences by sliding, but without considering the target sequence [8].
A key aspect of the dynamics of Cas9 is the alternance of 3D diffusion and 1D sliding on the DNA.Such alternance has been widely studied in the biophysics literature, and it is commonly referred to as "facilitated diffusion".Facilitated diffusion was originally proposed [9-11]    * simone.pigolotti@oist.jp to explain how some transcription factors (TFs) are able to locate their target on the DNA faster than predicted by 3D diffusion alone [12].Indeed, the combination of 3D diffusion and sliding on the DNA can significantly speed up the search process [9-11, 13, 14].Energetics plays an important role in facilitated diffusion.Proper recognition requires a DNA-binding protein to bind its target site tightly, i.e., a large binding energy differences between the target and other binding sequences.However, such large energy differences result in a rough binding energy landscape, leading to a much less effective sliding [15].There is therefore a trade-off between speed and stability, which can be alleviated by different possible mechanisms [13,14,16,17].
It is instructive to compare facilitated diffusion process of TFs and Cas9, see Fig. 1.Transcription factors usually have one or a few target sequence on the genome.The target of Cas9 is usually a single complementary sequence, but its recognition is triggered by one of the many PAM sequences on the genome.TFs can have a relatively long sliding length, usually estimated on the order of 100 bp [11] or even more [13].In contrast, the sliding length of Cas9 is on the order of ten base pairs [5] or less [8].
One could expect that PAM sequences could alleviate the limitation of a short sliding length, by providing some kind of guidance in finding the right target.This mechanism could be associated with a speed-stability trade-off, as in the case of TFs.Binding too strongly to each PAM sequence would slow the searching task, but skimming too quickly over PAMs might cause a risk of missing the target sequence.

Diffusion
In this work, we theoretically study the average time it takes for Cas9 to find its target sequence on the DNA.Our model includes sliding, the role of PAMs, and realistic sliding energy landscapes.We analytically solve our model, obtaining an expression for the average search time.We use this solution, combined with numerical simulations, to reveal the role of PAMs and sliding energetics on the efficiency of search.Our model reveals that, indeed, search in the presence of PAMs is characterized by a speed-stability tradeoff.We find that the optimal tradeoff is achieved when the PAM binding energy is of a few k B T , in striking agreement with direct estimate of this binding energy.The length of the PAM sequence (2 bp) is also optimal for search.The optimal value of the detachment rate also well matches experimental measurements.Our results thus support that kinetic parameters characterizing Cas9 are poised close to the optimal performance for the search process.

II. MODEL
We model the DNA as a one-dimensional lattice, in which each site represents a base pair (bp), see Fig 2 .A Cas9 molecule can bind to one of the sites at a certain time.We distinguish two types of DNA sites: PAM sites are those correspond to the starting bp of the PAM sequence, while other bp, including the remaining bp of a PAM sequence, are considered as normal sites, see Fig 2 .We call E n the binding energy of site n.Energies are expressed in units of k B T , where k B is the Boltzmann constant and T the temperature.
The Cas9 can detach or diffuse to either one of the neighbouring sites with rates respectively.Here, k and D are rate constants.We con-sider a DNA of length 2L + 1 bp, such that n ∈ [−L; L], with a unique target sequence complementary to the gRNA.We assume that the PAM next to the unique target is located at the origin n = 0.The boundary conditions are D ±(L+1),±L = 0.The binding energy landscape E n , of Cas9 on DNA is determined by the DNA sequence.For simplicity, we consider random DNA sequences, where each base pair occurs with probability 1/4, except for the PAM located at the origin.We assume that Cas9 can open a DNA bubble and form a hybrid between gRNA and a target sequence once bound to the central PAM.We call f T the rate at which this event occurs, where T stands for target (red arrow in Fig. 2).Once the target is reached, the search process is accomplished.
If Cas9 detaches before finding the unique target, it performs a 3D diffusion round, with duration given by t 3D .This time is, in principle, a random quantity.After a 3D diffusion round, Cas9 can land on any DNA site with uniform probability and start its 1D search again.A search process always starts from a 1D diffusion with a uniformly distributed initial condition.
We determine the model parameters according to experimental evidence.In particular, we estimate f T from experimental results in [18].These results show that a hybrid forms approximately 100 ms after Cas9 binding, so we take f T = 10 s −1 .
We consider two kinds of energy landscape: • Golf-course energy landscape, In the golfcourse energy landscape, non-PAM sites are characterized by a flat energy landscape, E n = 0.A PAM site has energy E n = β, where β is a parameter of the model.We estimated the rate constants k = 1.94 s −1 , D = 52 bp 2 s −1 , and the PAM energy β = −3.34from single-molecule experiments [5], see [8].
• Bp-dependent energy landscape.In this case, all possible two bp sequences have different energy.This means that E n can take 16 possible values with equal probability.These values can be found in table I.For this energy landscape, the rate constants are k = 6.57s −1 and D = 160 bp 2 s −1 , see [8].These rate constants are larger than in the golf course case because the baseline, i.e., the average energy of non-PAM sites, is different.Also in this case, we denote by β the PAM binding energy.
In both cases, we take for t 3D a constant value of 1.39s, since this value yields a roughly equal time of 1D and 3D diffusion for both kinds of landscape.This is theoretically considered to be the most efficient in facilitated diffusion [13,16,20,21].

III. AVERAGE TARGET SEARCH TIME
In this Section, we estimate the average target search time in our model.We denote the (random) target search  The first element of the matrix represents the binding energy β of a canonical PAM.Details on how the energy values are derived from experimental evidences [19] can be found in [8].
time by T tot and by T tot its average.We use the notation . . .for an average over different realizations of the energy landscape, over trajectories, and over the initial position.We here present a relatively simple derivation of T tot that holds for large L. A more formal derivation can be found in Supplementary Information.
A. Average time to reach the central PAM Our first step is to derive the average time T 0 to reach the central PAM, without necessarily transitioning into recognition mode.This expression is an important ingredient to finally compute the average target search time.Our calculation makes use of a recent approach to facilitated diffusion based on conditional probabilities [21], that we extend to a random energy landscape.
We start by taking a continuous limit, and replace the discrete coordinate n with a continuous coordinate x.We call q(x) the probability to reach the origin within a 1D round before detaching, starting at coordinate x.Its expression is given by as derived in [21].In our model, this expression is independent of the distribution of E(x).This fact is due to our choice of the diffusion and detachment rates in Eq. ( 1), such that disorder affects the timescales of diffusion and detachment, but not the probability of detaching or diffusing at a given site.
The average time to reach the central PAM is expressed by In this expression, the averages involving q(x) are interpreted as averages over x, e.g., q(x) = 1 2L L −L q(x)dx.The quantities t 1D (x) and t f 1D (x) are the durations of a 1D round that reached the origin or that failed reaching it before detaching, respectively.In facilitated diffusion, the average search time is often estimated as see for example [17].Here, Γ is the average number of 1D and 3D diffusion rounds before reaching the target.Comparing Eq. ( 3) with Eq. ( 4), 1/ q(x) plays the role of Γ; the first two terms in the right hand side represent the average time spent sliding; whereas the last term represents the average time spent performing 3D diffusion.Evaluating Eq. ( 3), we find that the average time T 0 is expressed by in which the overline represents an average over binding sequences, weighted with their frequencies on the DNA.Eq. ( 5) is formally derived in Appendix A. The first term in the right hand side of Eq. ( 5) represents the contribution from sliding and the last term the one from 3D diffusion.The factor exp(−E(x)) is the main difference between Eq. ( 5) and the main result of Ref. [21].This factor appears because, by Eq. ( 1), the characteristic dwell time on a position with energy E(x) is proportional to exp(−E(x)).Therefore, the total time spent sliding is proportional to exp(−E(x)), on average.

B. Calculation of the total search time
We now estimate the total search time T tot .We make the approximation that, after Cas9 reaches the central PAM, it either finds the target, or diffuses to one of its neighbors.In other words, we neglect the possibility of directly detaching from the central PAM, since ke β ≪ 2De β f T .As a consequence, a Cas9 at the central PAM finds the target, rather than diffusing away, with probability We then define the average time to return to the origin T 1 , starting from one of the two nearest neighbouring sites of the origin.This can occur via a single 1D round, or after detachment and 3D diffusion rounds.
We now express the average total search time as a series: To compute T 1 , we note that there are only two possibilities for a trajectory starting at n = ±1: reaching the origin before detaching, or detaching before reach the origin.We find that the probability of the former is e − √ k D and that of the latter is (1 − e − √ k D ), see Appendix B. If Cas9 detaches before revisiting the origin, it would undergo a 3D diffusion round and spend again an average time T 0 before reaching the origin.Therefore, we have Here, t is the average time before revisiting the origin in the former case, and t ′ is the the average time until detaching in the latter case.We expect T 0 to grow with L, whereas t and t ′ are on the order of the detachment time k −1 .Therefore, for large L, T 0 is much larger than t , t ′ , and t 3D .On the other hand, e − √ k D and 1 − e − √ k D are both of order 1.Therefore, we conclude that Substituting Eq. ( 9) into Eq.( 7), we finally obtain where for convenience we returned to the discrete notation E n .
The more formal calculation presented in Supplementary Information leads to an expression for T tot that reduces to Eq. (10) in the large L limit.We found that, already for L = 2500, the two solutions differ by less than 0.2% for experimentally realistic parameters.

IV. OPTIMALITY OF CAS9
We now study how the mean total search time varies at changing parameters, and how the parameters leading to an optimal (i.e., minimum) total search time compare with the experimentally measured ones.
We start our analysis by varying the PAM binding energy β.We find that the average search time, as expressed by Eq. ( 10), presents a minimum as a function of β, see Fig. 3.This result supports our hypothesis of a speed-stability trade-off, i.e., the necessity of compromising between a sufficiently high binding energy (providing enough time to permit a transition to the recognition mode) and not spending too much time bound on the many PAM sequences.Our model shows that this tradeoff is very significant for large genomes.Since bacteriophage genome ranges from 5 × 10 3 to 5 × 10 6 bp [22], a typical invading genome can be even larger than those considered in Fig. 3.In these cases, our model predicts that the average total search time T tot becomes very sensitive to the PAM energy.

FIG. 3. Optimal PAM energy minimizes the total search time.
Blue color represents the golf-course landscape and yellow the bp-dependent-kind.In the x axis, we plotted the difference between the PAM binding energy β and the average binding energy En nonPAM of non-PAM sequences.The optimal PAM energy by theory is shown by stars and vertical dashed lines.The left two vertical lines are our estimation of the real PAM energy for both kinds of energy landscapes, see Ref. [8].For L = 2500, we compare the analytical result of Eq. ( 10) with numerical simulations, finding an excellent agreement.
By taking the derivative of Eq. ( 10) with respect to β, we find the value β * of the PAM energy that minimizes the total search time: in which α is the fraction of PAM sequences in the genome (e.g., α = 1/16 for a two-bp PAM in a random DNA sequence).We note that the quantity (exp(−E n ) − αe β ) can be expressed as the average of exp(−E n ) over non-PAM sequences, and therefore does not depend on β.Since Eq. ( 10) is linear in L, β * is independent of L, at least in the large L limit.The value of β * predicted by Eq. ( 11) is very close to the experimentally determined value of β, see Fig. 3.This supports that Cas9 search for its targets in a near-optimal way.We now analyze the dependence of β * on model parameters.Varying one parameter on the right-hand side of Eq. ( 11) at one time, while keeping the others constant, we find the following relations.Firstly, β * increases with α.This is because the more abundant the PAMs are, the more time is wasted on PAMs that are not followed by the target.Increasing β can compensate for this and hence leads to a smaller total search time.Secondly, β * decreases at increasing k and t 3D .At larger k, the cost of missing the target is higher: the risk of detaching and doing another 3D and 1D round is larger.Decreasing β alleviates this effect, therefore leading to a smaller T tot .The reason is similar for t 3D : the cost of missing the target grows with the 3D diffusion time.Finally, since D affects both speed and stability, the dependence of β * on D is not monotonic.
We now study what would happen if the PAM sequence was made up of a number of base pairs other than two.For simplicity, we perform this analysis for the golf-course energy landscape only.When varying the number of base pairs in a PAM, we assume that β is proportional to the number of base pairs, whereas other sites always have energy E n = 0.Both simulations and analytical result from Eq. (10) show that a two-bp PAM leads to the smallest total search time, further supporting that the Cas9 binding mechanism is optimized to speed up search.If there were no PAM sequences at all, but Cas9 could still find recognize the target when on it, the average search time would be twice as large compared to this optimal case.Finally, we investigate the optimality of Cas9 with respect to the values of the unbinding rate k and the 1D diffusion rate D. Our theory predicts a minimum in T tot at a value of k that is very close to our estimation from experiments for both landscape kinds, see Fig. 5.
The total search time decreases monotonically with D. In the large D limit, the total search time rapidly tends to an asymptotic value Fig. 6.This rapid convergence could explain why the sliding length of Cas9 is not very large when compared to that of TFs: increasing the sliding length by increasing D might be evolutionary costly, without increasing performance very much.Vertical lines mark the values of D estimated from experiments, see [8].

V. CONCLUSION
In this work, we theoretically studied how Cas9 searches for its target on the DNA.Our model quantifies the role of PAM sequencing in guiding the search process, and reveals trade-offs leading to optimal parameter values that are very close to experimentally measured ones.In particular, we have found that two base pair PAM sequences lead to the optimal search time.Experimentally inferred detachment rate and PAM binding energies are also remarkably close to the theoretically predicted optimal values.
An in-vivo experiment [4] found that an individual Cas9 in E.coli requires about 6 hours to find one out of 36 target sequences on the E. coli DNA.Our Eq. ( 10) would predict 63.7 hours in the golf-course landscape, about one order of magnitude longer.In this estimate, we considered that the E. coli genome is 4.6 million bp long and contains about 0.5 million PAMs [4].We have accounted for the 36 targets; for the fact that DNA is doublestranded (effectively doubling the genome length); and have used other parameter values given in Section II.
There are several possible explanations for this discrepancy.Our assumption that the total time spent by 1D diffusion is roughly the same as 3D diffusion may not hold.In fact the former is often found to be much larger than the latter, see, e.g., [14].Other mechanisms not considered in our model, such as hopping [23,24], could also speed up the search process in vivo.Finally, our estimates of f T , k and D are based on in vitro experiment, and these rate constants might substantially differ values in vivo.This is consistent with the observation that the inferred estimate of the PAM residence time in vivo from Ref. [4] differs substantially from in vitro measurements.
All of these causes could lead to an overestimation of the total search time, with the latter being probably the most important.It will be important in future work to clarify how the kinetic parameters of Cas9 differ between in vivo and vitro, and how the trade-offs revealed in our work affect the optimal parameters in these two cases.
target before detachment, normalized by the number of all trajectories: where N f is the number of trajectories failed to reach the origin before detachment, and t f 1D,i (x) is the time spent by the i-th trajectory among them.By their definitions, N s + N f = N .The discrete counterparts are similarly defined.
Both T n and T f n are time-independent functions.The time T n obeys the recursion relation Using the expression for k n and D m,n , Eq. ( 1), this simplifies to The continuous equivalent of Eq. (A9) is Imposing E(x) = 0 except at the origin, Eq. (A10) becomes Given q(x) as in (A5), and the boundary conditions T (0) = 0, T (∞) = 0, Eq. (A11) is solved by For general non-constant E(x), we cannot directly solve Eq. (A10).We can however compute T (x) as follows.The spatial average over x and ensemble average commute, so we can first take the ensemble average of (A10) to get T (x).We note that the ensemble average and derivatives commute, and since q(x) = q(x), exp(−E(x))q(x) = exp(−E(x))q(x) we have Since exp(−E(x)) is a constant independent of x due our homogeneity assumption, Eq. (A13) admits the same solution as Eq.(A11), but with the average time multiplied by a factor exp(−E(x)): We now turn to the mean failed search time T (x). is the mean time of 1D diffusion contributed by those that failed to reach the target before detachment.The equation for T f (x) is the same as that of T (x) but with q(x) replaced by 1 − q(x).The counterpart of (A13) is and its solution reads (A16)

Expression for T0
We denote by x i the starting position of the i-th 1D diffusion round.We call t 1D (x i ) the time spent in the successful i-th 1D search.We call t f 1D (x i ) the time spent in a failed i-th 1D search.Finally, we call t i 3D the time spent in the 3D search after the i-th failed 1D search.These three quantities are in principle stochastic.

FIG. 1 .
FIG. 1.A comparison of TF and Cas9.Curved arrows indicate typical sliding distances in the two cases.

FIG. 2 .
FIG. 2. A sketch of the model.Circles represent states of the Cas9.Blue and yellow circles represent base pairs on the DNA, and yellow circles represent the starting base pair of a PAM.Light blue circles represent the 20 bp target sequence.

FIG. 4 .
FIG. 4. Average total search time for hypothetical PAMs with different number of base pairs.Here, a two bp PAM has binding energy β = −3.34.The x axis represents the number of bps in a PAM.The first point (zero bp) represents the case without PAM sequences guiding the search process.Points with error bars represent simulation results.Stars represent analytical predictions from Eq. (10).

FIG. 5 . 3 .
FIG.5.Dependence of the average total search time on the unbinding rate.We fixed L = 2500 since a larger L simply scales the curve as in Fig.3.Color code is the same as in Fig.3.Stars mark the theoretical minimum.Vertical lines denote our estimate of the rate k from experiments.

FIG. 6 .
FIG.6.Dependence of the total search time on the sliding rate D. The genome length is 2L + 1, with L = 2500.Color code is the same as in 3. Horizontal lines mark the theoretical Ttot values in the large D limit (95.6 s for the golf-course, and 107.3 s for the bp-dependent landscape, respectively).Vertical lines mark the values of D estimated from experiments, see[8].

TABLE I .
Cas9 binding energies ∆ǫi.Rows represent the first nucleotide and columns for the nucleotide next to the "N".