Solving the subset sum problem with a nonideal biological computer

We consider the solution of the subset sum problem based on a parallel computer consisting of self-propelled biological agents moving in a nanostructured network that encodes the NP-complete task in its geometry. We develop an approximate analytical method to analyze the effects of small errors in the nonideal junctions composing the computing network by using a Gaussian confidence interval approximation of the multinomial distribution. We concretely evaluate the probability distribution for error-induced paths and determine the minimal number of agents required to obtain a proper solution. We finally validate our theoretical results with exact numerical simulations of the subset sum problem for different set sizes and error probabilities.


Introduction
Solving complex problems requires high-performance computing methods. Since the number of computing steps typically increases exponentially with the size of a problem, multi-processor parallel algorithms are better suited to solve such tasks than single-processor serial algorithms [1,2]. In addition to traditional electronic computers, massively parallel computers have recently been realized using biological systems, such as DNA computing [3][4][5][6][7][8][9] and network-based biocomputing [10,11], which use small DNA molecules or cytoskeletal filaments as processors. These biological agents can be mass-produced at low cost and added to the computation in amounts matching the problem size. As a result, such biological architectures are readily scalable to a high degree of parallelism.
We here consider a nondeterministic-polynomial-time complete (NP-complete) problem, known as the subset sum problem [12,13]: given a set of positive integers and an integer T (the target sum), the general question is to determine whether there is a subset whose sum is exactly T . A proof-of-principle solution of this problem for the set {2, 5, 9} has been provided using molecular-motor-propelled agents moving in a nanostructured network that encodes the combinatorial problem into its geometry [11]. The network consists of a grid of channels which can be traversed by cytoskeletal filaments from top (entry point) to bottom (exit points) (figure 1). The grid is made of two types of junctions: pass junctions that allow an agent to continue on its previous path and split junctions that allow an agent to switch lanes, or not, with equal probability. Travelling vertically down the grid corresponds to adding 0 to the final total number, while moving diagonally corresponds to adding 1. By properly positioning the split junctions, the target sums can be determined from the position of the agents at the bottom of the grid. The network is explored stochastically by individual agents. Therefore, depending on the desired confidence level, a sufficient total number of agents must be supplied in order to obtain a significant result.
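The encoding can be illustrated with a short simulation sketch in Python (the function name and agent model are our illustration, not taken from the experiment): each set element corresponds to one split junction, where an ideal agent either continues vertically (adding 0) or switches to the diagonal lane (adding that element), so that its exit index equals the sum of one randomly chosen subset.

```python
import random

def traverse_ideal_grid(numbers, rng):
    """One agent in an ideal grid: at the split junction of each set
    element, the agent either keeps moving vertically (adds 0) or turns
    onto the diagonal lane (adds the element), each with probability 1/2."""
    return sum(n for n in numbers if rng.random() < 0.5)

# The exits populated by many agents are exactly the achievable subset sums.
rng = random.Random(1)
exits = {traverse_ideal_grid([2, 5, 9], rng) for _ in range(10000)}
print(sorted(exits))  # all subset sums of {2, 5, 9}: 0, 2, 5, 7, 9, 11, 14, 16
```

With enough agents, every reachable exit is populated, which is precisely why a minimal agent number must be estimated.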
In the following, we develop a general approach based on a Gaussian confidence interval approximation of the multinomial distribution [14] that allows us to analytically analyze the effects of small errors in nonideal junctions. We evaluate, in particular, the probability distribution for error-induced paths, taking into account errors in both pass and split junctions, and introduce a procedure with which the minimal number of agents required to obtain a correct solution of the problem may be determined. Both issues are essential for the successful experimental implementation of the biological computation strategy [15]. We finally compare our theoretical results with exact numerical simulations of the subset sum problem for different set sizes and error probabilities.

Necessary number of agents for ideal junctions
The protein filaments traverse the grid semi-randomly and independently of each other. While it is clear that one has to use at least N_sol = 2^s agents for a set of s integers, {N_1, N_2, ...}, using exactly this number is quite unlikely to result in a complete solution of the problem; in fact, the probability that it does decreases exponentially with the problem size. We first estimate the smallest number of agents needed to solve the combinatorial problem with a sufficiently strong confidence, by assuming ideal split junctions (with a split ratio of 50%) and ideal pass junctions (agents are guaranteed to keep moving along their vertical/diagonal path). Under these conditions, all the paths resulting in one of the N_sol valid solution numbers are equally likely. Without loss of generality, we assume that all solutions are unique. There are hence N_sol paths that one agent may choose from.
This multinomial problem may be solved by finding the formally exact statistics using combinatorial methods. However, the exact statistics may be formulated only indirectly (confidence intervals for the exits have to be chosen such that at least one agent exits through each of them) and solved numerically. But as all possibilities have to be computed, this becomes rapidly intractable for large s. In addition, exploring all possible solutions is equivalent to solving the problem itself and, thus, makes the biological computer obsolete. We here employ another approach that approximates the problem using normal distributions [14]. The advantage of this method is that it can be applied regardless of the size s of the set, and be improved, if higher accuracy is needed, by going beyond the Gaussian approximation [14]. We denote by x_i the number of agents in outcome i, by p_i the corresponding probability (with Σ_i p_i = 1) and by N = Σ_{i=1}^{N_sol} x_i the total number of agents (p_i = 1/N_sol for ideal junctions). The multinomial distribution for k different outcomes and N independent trials is then given by [16]

P(x_1, ..., x_k) = [N!/(x_1! ··· x_k!)] p_1^{x_1} ··· p_k^{x_k}.  (1)

This distribution has mean x̄_i = N p_i and standard deviation σ_i = √(N p_i q_i) with q_i = 1 − p_i [16]. Since agents traverse the grid randomly, one can only state a confidence interval (which quantifies the margin of error) in which a given number of agents will use any of the possible paths. There are a variety of confidence interval approximations for the multinomial distribution [14]; the simplest is the normal approximation interval, which approximates the distribution by a Gaussian. The symmetric confidence interval is given in this case by [14]

x̄_i ± ℓ σ_i = N p_i ± ℓ √(N p_i q_i).  (2)

The ℓ = 3 confidence interval for the normal distribution guarantees that the number of agents traversing the respective paths lies in this interval with a probability of p_single-success ≈ 99% for individual outcomes.
The success probability of a full computation is accordingly given by p_success = (p_single-success)^{N_sol}. If we now demand at least n_i ≥ 1 agents per path (in order to find a proper solution to the subset sum problem), then the lower boundary of the confidence interval gives a determining equation for the necessary total number of agents N_min, namely N_min p_i − ℓ √(N_min p_i q_i) = n_i. Solving this equation for N_min results in

N_min = ⌈ [ℓ √(p_i q_i) + √(ℓ² p_i q_i + 4 n_i p_i)]² / (4 p_i²) ⌉.  (3)

Using (3) with n_i = 1 guarantees with 99% certainty that there is at least one agent per correct path. The corresponding success probability of the full computation is then p_success = 0.92, 0.85, 0.72, 0.53 for the set sizes s = 3, 4, 5, 6. The success probability may be further improved by taking a value ℓ > 3. For concreteness, we will consider the case ℓ = 3 in the remainder. Figure 2 shows the exits for the various numbers from 0 to the sum of all numbers for the set {2, 3, 7}; green triangles indicate the number of simulated agents ending at the respective exits. According to (3), the minimal number of agents is N_min = 79 with n_i = 1 in this case. A number is a solution of the subset sum problem if the arriving number of agents lies within the (blue) 3σ confidence interval of (2), shown for each possible exit. Since the setup is assumed to be ideal, only valid paths contribute at the exits; the error confidence intervals (red) are therefore centered around 0 and have vanishing length. Each green triangle then lies either in a blue (correct) or in a red (incorrect) confidence interval, allowing the two outcomes to be effectively distinguished; any correct exit will have agents within its blue confidence interval.
The corresponding minimal numbers of agents for s = 4 and s = 5 are N_min = 166 and N_min = 340, respectively (for n_i = 1).
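The closed form (3) is straightforward to evaluate numerically. The following sketch (our illustration, assuming ideal junctions with p_i = 1/2^s, ℓ = 3 and n_i = 1) reproduces the minimal agent numbers quoted above:

```python
import math

def minimal_agents(s, n_i=1, ell=3):
    """Smallest N with N*p - ell*sqrt(N*p*q) >= n_i for each of the 2**s
    equiprobable correct exits, cf. equation (3)."""
    p = 2.0 ** (-s)
    q = 1.0 - p
    # positive root of p*u**2 - ell*sqrt(p*q)*u - n_i = 0, with u = sqrt(N)
    u = (ell * math.sqrt(p * q) + math.sqrt(ell**2 * p * q + 4 * n_i * p)) / (2 * p)
    return math.ceil(u * u)

print([minimal_agents(s) for s in (3, 4, 5)])  # [79, 166, 340]
```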

Effects of errors for nonideal junctions
The assumption of an ideal grid does not necessarily apply to an actual experiment. The pass junctions may indeed not perfectly keep agents on their path, and the split junctions may not be perfectly symmetric. These errors may result in incorrect or missing solutions of the subset sum problem. We next analyze the effects of small errors, both in the pass junctions and in the split junctions, on the minimal number of agents N_min.
Errors in pass junctions mean that agents take new paths that are not actual solutions, since faulty pass junctions then act as split junctions. Faulty split junctions, on the other hand, cause an imbalance between valid paths, as the split probability deviates from 50%. Errors generated by nonideal split junctions may simply be accounted for by using the smallest probability, that of the unlikeliest path, in the number estimation of equation (3). This gives a good estimate as long as the probability is not too low (N p_i ≳ 5), which guarantees the validity of the Gaussian approximation [17].
By contrast, errors induced by pass junctions are more complex to treat and result in the creation of new paths. The total number of split junctions is given by the cardinality of the considered set, N_SJ = s. The total number of pass junctions is equal to the sum of all numbers in the set, N_tot = Σ_i N_i, minus the number of split junctions, N_PJ = N_tot − N_SJ. Assuming that a pass junction has an error probability of p_PJ, the probability of an agent passing through the grid without any pass error may then be evaluated as

p^c_PJ = (1 − p_PJ)^{N_PJ}.

Errors therefore depend exponentially on the size of the problem. Even the simplest grid of just one number can be highly incorrect if the chosen number is too large in comparison to the error probability. Qualitatively speaking, pass junction errors have to be as small as possible to allow the biological computer to work for larger problems.
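The exponential dependence is easy to check numerically; this sketch (helper name ours) evaluates p^c_PJ for values used later in Sect. 4:

```python
def p_correct(numbers, p_pj):
    """Probability of traversing the grid without a single pass error:
    p_c = (1 - p_PJ)**N_PJ with N_PJ = N_tot - N_SJ."""
    n_pj = sum(numbers) - len(numbers)  # pass junctions = N_tot - N_SJ
    return (1.0 - p_pj) ** n_pj

# The set {5, 6, 7} has N_PJ = 18 - 3 = 15 pass junctions.
print(round(p_correct([5, 6, 7], 0.02), 2))  # 0.74
print(round(p_correct([5, 6, 7], 0.05), 2))  # 0.46
```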
Quantitatively, the nonideal system can be modeled as comprising two parts: the correct part, which has to be carried out by at least N_min agents as given by equation (3), and the incorrect part, which results in a number N_FP of (noisy) agents wrongly added at the various exits. This model may be described with the help of a binomial distribution: agents either take a correct path, with probability p^c_PJ, or a wrong one. Thus, the minimal number of agents N^non_min that have to be used in a nonideal grid, such that N_min agents traverse it correctly, can be calculated similarly as done before with the multinomial distribution. Using the same 3σ confidence interval, the lower boundary condition N^non_min p^c_PJ − ℓ √(N^non_min p^c_PJ q^c_PJ) = N_min, with q^c_PJ = 1 − p^c_PJ, yields

N^non_min = ⌈ [ℓ √(p^c_PJ q^c_PJ) + √(ℓ² p^c_PJ q^c_PJ + 4 N_min p^c_PJ)]² / (4 (p^c_PJ)²) ⌉.  (4)

The above number guarantees that there are enough agents to fulfill the computation condition (3) of the ideal case. However, it does not yet incorporate the effect of agents taking faulty paths, whose number is N_FP = N^non_min − N_min. These agents create noise at the various exits that needs to be accounted for. In particular, agents originating from correct and incorrect paths ought to be distinguished in an effective manner.
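A numerical sketch of this binomial estimate (our illustration, mirroring the structure of (3) with p_i replaced by the traversal probability p^c_PJ and n_i by N_min; the example values below are assumptions):

```python
import math

def minimal_agents_nonideal(n_min, p_c, ell=3):
    """Smallest total N such that the lower 3-sigma boundary of the binomial
    count of correctly traversing agents, N*p_c - ell*sqrt(N*p_c*q_c),
    still reaches the ideal-grid requirement n_min."""
    q_c = 1.0 - p_c
    u = (ell * math.sqrt(p_c * q_c)
         + math.sqrt(ell**2 * p_c * q_c + 4 * n_min * p_c)) / (2 * p_c)
    return math.ceil(u * u)

# Assumed example: ideal minimum of 79 agents, traversal probability 0.74.
n_non = minimal_agents_nonideal(79, 0.74)
print(n_non, n_non - 79)  # total agents and number of faulty agents N_FP
```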

Probability distribution for error-induced paths
While outcomes were considered to be equally probable in the case of an ideal grid, this is no longer true for nonideal, noisy grids. Figure 1 shows that agents can move through a large number of possible paths before reaching a given exit point. In particular, the central outcomes may be reached by more trajectories than the outer ones. This suggests that the effect of errors will generally be nonhomogeneous. We shall in this Section evaluate the distribution of error-induced paths in the limit of small error probabilities. For larger error probabilities, the probability of successful propagation through the grid becomes negligibly small: for a subset sum problem that contains 18 pass junctions, an error probability per pass junction of p_PJ = 0.2 will, for example, lead to a successful propagation probability of only p^c_PJ = 0.018. In such cases, where the correct agents arriving at each exit are merely a small perturbation of the wrong agents, either a more involved error estimation has to be performed, or incorrect agents have to be registered and discarded from the counting in order to still be able to carry out the stochastic computation correctly.
Pass and split junctions are not independent of each other in general. For wrong paths, a split junction indeed acts as a particularly bad pass junction, creating a stronger inhomogeneity in the noise outcomes. Faulty split junctions also complicate the treatment significantly. However, their effect is expected to be weak for small errors. We concretely make the following assumptions: (1) the error probability is equal for all pass junctions, (2) the error probability is related to a change of the initial direction of an agent, and (3) split junctions, with the exception of the first one, do not contribute strongly to the shape of the (weak) noise and may be incorporated in an approximate manner. We will assess the range of validity of these hypotheses by comparison with an exact numerical evaluation in Sect. 4 below. We will first consider a simplistic grid consisting only of faulty pass junctions (and no split junctions), before accounting for the average effect of split junctions at the end of this Section. Based on the above assumptions, it is possible to find a recursive formula for the number of paths A^m_{i,N_tot} for each potential solution i as a function of the number of turns (defined as error-induced changes of path) m. Using the compact notation Z = N_tot, the path number decomposes as A^m_{i,Z} = vA^m_{i,Z} + dA^m_{i,Z}, where the quantities vA^m_{i,Z} and dA^m_{i,Z} respectively describe contributions from vertical and diagonal paths; the recursion involves the function g(m) = (m − m mod 2)/2 − 1, defined with the modulo function, which is needed because of an alternating pattern in the recursive calculation, as will be explained in the following. To understand this recursive formula, it is instructive to start with the case m = 1. There is no case m = 0, since initially an agent has to choose between going down (and potentially ending up at exit 0 by keeping this direction) or moving right (and potentially arriving at exit Z by again keeping this direction), which corresponds to m = 1.
For m = 2, an agent has to make one detour while being on either the (vertical) 0-path or the (diagonal) Z-path. It therefore cannot reach these two exits, but any other exit may be reached via a detour from both of them. As a result, there are two possibilities for any potential solution that is neither 0 nor Z; this is the origin of the distinction between the vertical (v) and diagonal (d) contributions. The probability distribution of error-induced outcomes, equation (8), follows accordingly by weighting the path numbers A^m_{i,Z} with the probability of making m turns and dividing by the normalization constant, equation (9). It is important to note that the probability distribution (8) is only defined for i ≠ 0, Z, since these exits correspond to actual solutions of the subset sum problem and cannot be reached via wrong trajectories. We will take a symmetric initial split junction, p_SJ = 1/2, in the following. We have so far considered pass junction errors for a setup comprised only of such junctions. However, split junctions are also encountered by agents traversing the grid incorrectly. Considering split junctions as particularly bad pass junctions, p_PJ → p_SJ, one has to incorporate their effect into the effective error distribution. A simple approach is to use an effective error probability for the pure pass junction errors in (8) and (9), obtained by distributing the split junction probability evenly over all pass junctions, equation (10). Equation (10) indicates that the existence of the split junctions enhances the effective error probability, even for small pass junction error probabilities (see Sect. 4 below). It is important to emphasize that (10) only modifies the shape of the error distribution, not the number of agents that propagate through the grid erroneously.
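These assumptions can be probed with a direct Monte Carlo sketch of the simplistic grid consisting only of pass junctions (our own illustration; the turn model is a simplification): an agent first chooses the vertical or diagonal lane at the initial split junction and then changes direction with probability p_PJ at every subsequent junction, the exit index being the number of diagonal steps taken.

```python
import random

def simulate_exits(z, p_pj, n_agents, rng):
    """Histogram of exit indices for agents traversing a grid of length z
    made only of pass junctions with turn (error) probability p_pj."""
    counts = [0] * (z + 1)
    for _ in range(n_agents):
        diagonal = rng.random() < 0.5   # initial split junction, p_SJ = 1/2
        exit_index = 0
        for _ in range(z):
            exit_index += int(diagonal)  # a diagonal step adds 1 to the exit
            if rng.random() < p_pj:      # pass junction error: change direction
                diagonal = not diagonal
        counts[exit_index] += 1
    return counts

rng = random.Random(2)
counts = simulate_exits(18, 0.15, 20000, rng)
# For p_PJ = 0.15 the error-induced exits pile up near the central exit 9;
# for p_PJ = 0 all agents exit at 0 or 18.
```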

Mean and standard deviation from the error probability distribution
Having determined the probability distribution (8) of an individual faulty outcome i, we may now evaluate mean and standard deviation for N FP faulty trials using the multinomial distribution as done in Sect. 2 for the ideal grid.
Since we are interested in the minimal necessary number of agents, the most relevant probability is the one with the largest possible deviations. This is the case for the central outcome, which corresponds to i_max = Z/2 or, more precisely, to i_max = g(Z) + 1, since this exit involves the largest number of possible paths. For N_FP faulty agents, the mean and standard deviation of the wrong-agent count at this exit read

x̄ = N_FP p^non_{i_max},  σ = √(N_FP p^non_{i_max} (1 − p^non_{i_max})).  (11)

According to this formula, it is possible to distinguish right from wrong paths if the number of agents n_i per outcome i obeys (taking again a 3σ confidence interval)

n_i ≥ N_FP p^non_{i_max} + 3 √(N_FP p^non_{i_max} (1 − p^non_{i_max})).  (12)

A solution to the subset sum problem can be found if the total number of agents N_min is such that equation (12) is satisfied. In this case, the measured signal will either lie within the confidence interval of the error estimate or above it, due to the additional agents coming from the correct paths. As a consequence, confidence intervals for purely wrong outcomes and those for the sum of correct and wrong outcomes (corresponding to the stochastically independent sum of their respective means and variances) will not overlap. Expression (12) generalizes the condition n_i ≥ 1 of the ideal grid. For nonideal grids, n_i will in general grow with the error probability p_PJ.
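The threshold (12) is easily evaluated once N_FP and the worst-case noise probability are known. In this sketch (our illustration, with assumed example values), the signal at an exit must exceed the returned bound to be counted as a solution:

```python
import math

def noise_threshold(n_fp, p_noise_max, ell=3):
    """Upper ell-sigma boundary of the faulty-agent count at the central
    (worst-case) exit: mean + ell * standard deviation, cf. equation (12)."""
    mean = n_fp * p_noise_max
    sigma = math.sqrt(n_fp * p_noise_max * (1.0 - p_noise_max))
    return math.ceil(mean + ell * sigma)

# Assumed example: 48 faulty agents, central exit collecting 10% of the noise.
print(noise_threshold(48, 0.10))  # 12
```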

Comparison with numerical simulations
In this Section, we compare the above approximate theoretical results with exact numerical simulations of the problem for different set sizes and error probabilities in order to test their validity. Simulations are performed by reproducing the stochastic motion of biological agents through the grid using random numbers: at each step, agents randomly choose either to propagate along the same direction or to change direction, with a given probability. We begin by comparing in figure 3 the theoretical probability distribution of error-induced outcomes p^non_i, (8), (orange squares) for purely faulty pass junctions in a simplistic grid of length 18, consisting only of pass junctions, and the corresponding numerical simulations (green triangles), for (a) p_PJ = 0.01 and (b) p_PJ = 0.15. We observe a flat distribution for small error probabilities, indicating that outcomes are equiprobable. For larger error probabilities, the distribution is peaked at the position of the central exit point, and outcomes at the edges of the grid are less likely. The distribution is symmetric around the central exit point 9.
We find excellent agreement between theory and simulations in both cases. Figure 4 displays the analytical probability distribution of error-induced outcomes p^non_i, (8), for a grid consisting of pass and split junctions, neglecting the effect of split junctions (red circles), using the approximate averaging of the split junction effect described by (10) (orange squares), and the true error probability obtained from the simulations by discarding the correct paths (blue diamonds), for (a),(c) p_PJ = 0.01 and (b),(d) p_PJ = 0.15, for the sets {5, 6, 7} (top) and {2, 3, 5, 7} (bottom). The incorporation of split junctions leads to distributions peaked at the center for all error probabilities and set sizes. The approximate treatment of the interplay between erroneous pass and split junctions via the effective probability (10) exhibits very good agreement with the exact simulations for larger error probabilities and larger set sizes. The reason is that a higher error probability per pass junction, as well as a larger number of pass junctions, increases the number of wrong turns an agent makes while traversing the grid, so that the split junctions are better captured by the averaged treatment of (10). However, the simulations exhibit additional inhomogeneities not fully captured by the approximations. While for small error probabilities the theoretical description deviates more strongly from the true distribution, the number of wrong agents is then also smaller and constitutes merely a perturbation of the correct agents. For larger error probabilities, the theoretical description is closer to the true distribution, but the errors also become more dominant, so that a higher accuracy is needed.
Figure 5 shows the corresponding confidence intervals. In contrast to figure 2, the (red) 3σ confidence intervals for wrong paths are now nonzero, and the correct (blue) intervals contain the stochastically independent sum of the confidence intervals of the correct and incorrect agents (as given by (2) and (11)). A larger number of pass junctions increases the errors, and thus the size of the confidence intervals, as well as their curvature for a fixed error probability p_PJ, while a larger number of split junctions leads to a stronger curvature of the overall shape of the confidence intervals, caused by the correspondingly larger effective probability (10). In all four cases, agents originating from correct and incorrect paths can be clearly distinguished. The agents exceeding the upper bound of the confidence intervals in (c) and (d) are due to exits that can be reached by more than one combination of numbers and are still valid.
We finally analyze in figure 6 the effect of an increasing pass junction error probability, (a) p_PJ = 0.02, (b) p_PJ = 0.05, (c) p_PJ = 0.08 and (d) p_PJ = 0.1, for the set {5, 6, 7}. The probability of an agent correctly traversing the nonideal grid decreases significantly as the error probability increases, from (a) p^c_PJ = 0.74 and (b) p^c_PJ = 0.46 to (c) p^c_PJ = 0.29 and (d) p^c_PJ = 0.2. The relative number of faulty agents therefore grows markedly compared to the number of correct agents. We also see that a higher pass junction error probability causes a stronger curvature of the confidence intervals. In addition, the minimal number of agents quickly increases with the error probability compared to the ideal grid: (a) N^non_min/N_min = 3.6 (n_i = 10), (b) N^non_min/N_min = 11.6 (n_i = 28), (c) N^non_min/N_min = 36.3 (n_i = 64), (d) N^non_min/N_min = 72.97 (n_i = 104). We furthermore note that effectively distinguishing wrong and correct exits becomes increasingly difficult for larger p_PJ (see, for example, exits 6 and 8 in (d)), as the triangles move toward the edges of the respective confidence intervals. One might therefore use p^c_PJ > 0.5 as a limiting value, below which the approximations used here become insufficient.

Conclusions
We have performed a detailed analytical study of the influence of imperfect pass and split junctions on the ability of a parallel biological computer to solve the subset sum problem. In the ideal case, an error-free grid may in principle be used to correctly solve the problem for any size, as long as the number of agents is larger than the minimum number given by (3). For a nonideal grid, our findings indicate that the subset sum problem can still be properly solved by increasing the number of agents, provided the error probability of the different junctions is small enough that the number of erroneous agents remains small compared to the number of correct agents. As a general rule, errors in pass junctions should be as small as possible, as their distribution gets skewed by the presence of split junctions, resulting in nonhomogeneous noise patterns. Errors in split junctions alone are, however, less problematic, because they simply cause some outcomes to become less likely and thus only necessitate a larger number of agents. On the other hand, larger error probabilities make the estimation of the confidence intervals and the distinction between correct and incorrect paths increasingly difficult, requiring either more elaborate error estimations or the discarding of wrong agents from the computation.