
Efficient decision-making by volume-conserving physical object


Published 11 August 2015 © 2015 IOP Publishing Ltd and Deutsche Physikalische Gesellschaft
Citation: Song-Ju Kim et al 2015 New J. Phys. 17 083023. DOI: 10.1088/1367-2630/17/8/083023


Abstract

Decision-making is one of the most important intellectual abilities of humans and other biological organisms, aiding their survival. This ability, however, may not be limited to biological systems: it can also be exhibited by physical systems. Here we demonstrate that any physical object, as long as its volume is conserved when coupled with suitable operations, provides a sophisticated decision-making capability. We consider the multi-armed bandit problem (MBP), the problem of finding, as accurately and quickly as possible, the most profitable option from a set of options that give stochastic rewards. Efficient MBP solvers are useful for many practical applications, because the MBP abstracts a variety of real-world decision-making problems in which efficient trial and error is required. In our scheme, decisions are dictated by a physical object that is moved in a manner similar to the fluctuations of a rigid body in a tug-of-war (TOW) game. This method, called 'TOW dynamics', exhibits higher efficiency than conventional reinforcement-learning algorithms. We present analytical calculations that clarify the statistical reasons why TOW dynamics achieves this high performance despite its simplicity. These results imply that various physical systems in which some conservation law holds can be used to implement an efficient 'decision-making object'. The proposed scheme provides a new perspective for opening up a physics-based analog-computing paradigm and for understanding the information-processing principles by which biological systems exploit their underlying physics.


Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

The computing principles of modern digital paradigms have been designed to be dissociated from the underlying physics of natural phenomena [1]. In the construction of CMOS devices, wide-band-gap materials are employed so that physical fluctuations such as thermal noise, which often disrupt logically valid behavior, can be neglected [2]. Since electron dynamics constrained by physical laws cannot be controlled when only parameters with the same degrees of freedom as the logical input–output responses are modulated, considerably complicated circuits are required to implement even relatively simple logic gates such as NAND and NOR [3]. However, these efforts to keep computation dissociated from the underlying physics are costly in terms of energy consumption and manufacturing resources. When we look at the natural world, on the other hand, information processing in biological systems is elegantly coupled with their underlying physical laws and fluctuations [4–6]. This suggests the potential for establishing a new physics-based analog-computing paradigm. In this paper, we show that a physical constraint, the conservation law for the volume of a rigid body, allows decision-making problems to be solved efficiently when the body is subjected to suitable operations involving fluctuations.

Suppose there are M slot machines, each of which returns a reward (for example, a coin) with a certain probability that is unknown to the player. Let us consider the minimal case: two machines, A and B, give rewards with individual probabilities PA and PB, respectively. At each trial the player decides which machine to play, trying to maximize the total reward obtained over many trials. The multi-armed bandit problem (MBP) is the problem of determining the optimal strategy for finding the machine with the highest reward probability as accurately and quickly as possible by referring to past experience.

The MBP is formulated as a mathematical problem without loss of generality and is therefore related to various stochastic phenomena. In fact, many application problems in diverse fields, such as communications (cognitive networks [7, 8]), commerce (advertising on the web [9]), and entertainment (Monte Carlo tree search, which is used for computer games [10, 11]), can be reduced to MBPs. In particular, the 'upper confidence bound 1 (UCB1) algorithm' for solving MBPs is used worldwide in many practical applications [17].

In the context of reinforcement learning, the MBP was originally described by Robbins [12], though the essence of the problem had been studied earlier by Thompson [13]. The optimal strategy, called the 'Gittins index', is known only for a limited class of problems in which the reward distributions are assumed to be known to the players [14, 15]. Even in this limited class, computing the Gittins index is intractable in many practical cases. In the algorithms proposed by Agrawal and by Auer et al, another index is expressed as a simple function of the reward sums obtained from the machines [16, 17].

2. Methods

2.1. Tug-of-war (TOW) dynamics

Kim et al proposed an MBP solver based on a dynamical system, called 'TOW dynamics'; the algorithm was inspired by the spatiotemporal dynamics of a single-celled amoeboid organism (the true slime mold P. polycephalum) [18–23], which maintains a constant intracellular-resource volume while collecting environmental information by concurrently expanding and shrinking its pseudopod-like terminal parts. In this nature-inspired algorithm, the decision-making function is derived from the underlying physics, which resembles that of a TOW game. The physical constraint in TOW dynamics, the conservation law for the volume of the amoeboid body, entails a non-local correlation among the terminal parts: a volume increment in one part is immediately compensated by volume decrement(s) in the other part(s). In our previous studies [18–23], we showed that, owing to this non-local correlation derived from the volume-conservation law, TOW dynamics exhibits higher performance than other well-known algorithms such as the modified epsilon-greedy and modified softmax algorithms, and is comparable to the UCB1-tuned algorithm, which is regarded as the best choice among parameter-free algorithms [17]. These observations suggest that efficient decision-making devices could be implemented with any physical object, as long as it holds some common physical attribute such as a conservation law. In fact, Kim et al demonstrated that optical energy-transfer dynamics between quantum dots, in which energy is conserved, can be exploited to implement TOW dynamics [24, 25].

In this paper, we extract the most essential ingredients of TOW dynamics and formulate a simplified version, the 'TOW principle', in order to explain the mechanism of its high performance through analytical calculations of the 'regret', a standard measure of the performance of multi-armed bandit algorithms. Consider a volume-conserving physical object, for example a rigid body such as an iron bar (the slot machine's handle), as shown in figure 1. The variable Xk represents the displacement of terminal k from an initial position, where $k\in \{A,B\}$. If Xk is the larger displacement, we assume that the body decides to play machine k. In TOW dynamics, the MBP is represented in its inverse form: instead of 'rewarding' the player when machine k produces a coin with probability Pk, we 'punish' the player when the machine gives no coin, which occurs with probability $1-{P}_{k}$. Under this convention, the displacement XA ($=-{X}_{B}$) is determined by the following equations:

Equation (1): ${X}_{A}(t)={Q}_{A}(t)-{Q}_{B}(t)+\delta (t)$

Equation (2): ${Q}_{k}(t)=\{{N}_{k}(t)-{L}_{k}(t)\}-\omega {L}_{k}(t)={N}_{k}(t)-(1+\omega ){L}_{k}(t)$

Here, ${Q}_{k}(t)$ ($k\in \{A,B\}$) is an 'estimate' of the information on past experiences accumulated from the initial time 1 to the current time t, ${N}_{k}(t)$ counts the number of times machine k has been played up to time t, ${L}_{k}(t)$ counts the number of punishments incurred when playing machine k up to time t, $\delta (t)$ is an arbitrary fluctuation to which the body is subjected, and ω is a weighting parameter described in detail later in this paper. Equation (2) is called the 'learning rule'. Consequently, TOW dynamics evolves according to a particularly simple rule: in addition to the fluctuation, if machine k is played at time t, +1 or $-\omega $ is added to ${X}_{k}(t)$ when rewarded or non-rewarded, respectively (figure 1).


Figure 1. TOW dynamics. If machine k ($k\in \{A,B\}$) is played at each time t, +1 and $-\omega $ are added to ${X}_{k}(t)$ for rewarding (a) and non-rewarding cases (b), respectively.

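As a concrete illustration of this update rule, the following minimal Python sketch simulates TOW dynamics for two machines. The fluctuation $\delta (t)$ is modelled here as a small uniform random term, and the function and variable names, the noise amplitude, and the example reward probabilities are our own illustrative choices rather than specifications from the text.

    import random

    def tow_bandit(p, omega, steps, noise=0.1, seed=0):
        """Minimal sketch of TOW dynamics for a two-armed bandit.

        p     : (P_A, P_B) reward probabilities
        omega : weighting parameter, e.g. omega_0 = gamma / (2 - gamma) with gamma = P_A + P_B
        noise : amplitude of the fluctuation delta(t)
        """
        rng = random.Random(seed)
        Q = [0.0, 0.0]                        # learning-rule estimates Q_A, Q_B (equation (2))
        total_reward = 0
        for _ in range(steps):
            delta = rng.uniform(-noise, noise)
            X_A = Q[0] - Q[1] + delta         # displacement X_A (= -X_B), equation (1)
            k = 0 if X_A > 0 else 1           # play the machine with the larger displacement
            if rng.random() < p[k]:
                Q[k] += 1.0                   # rewarded: +1 is added
                total_reward += 1
            else:
                Q[k] -= omega                 # non-rewarded: -omega is added
        return total_reward

    if __name__ == "__main__":
        P_A, P_B = 0.7, 0.3
        gamma = P_A + P_B
        print(tow_bandit((P_A, P_B), omega=gamma / (2.0 - gamma), steps=1000))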

3. Results

3.1. Solvability of the MBP

To explore the MBP solvability of TOW dynamics, let us consider a random-walk model. As shown in figure 2(a), the parameters are α (the rightward flight when rewarded) and β (the leftward flight when non-rewarded). We assume PA > PB for simplicity. After time step t, the displacement ${R}_{k}(t)$ ($k\in \{A,B\}$) can be described by

Equation (3): ${R}_{k}(t)=\alpha \{{N}_{k}(t)-{L}_{k}(t)\}-\beta {L}_{k}(t)$

The expected value of Rk can be obtained from the following equation:

Equation (4): $E({R}_{k})=\{\alpha {P}_{k}-\beta (1-{P}_{k})\}{N}_{k}$
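As a quick numerical sanity check of this expectation (a sketch with our own names and arbitrary parameter values), one can average the random walk over many runs and compare the result with $\{\alpha {P}_{k}-\beta (1-{P}_{k})\}{N}_{k}$:

    import random

    def mean_displacement(p_k, alpha, beta, n_k, runs=20000, seed=1):
        """Simulated mean of R_k versus the analytical value {alpha*P_k - beta*(1 - P_k)}*N_k."""
        rng = random.Random(seed)
        total = 0.0
        for _ in range(runs):
            r = 0.0
            for _ in range(n_k):
                r += alpha if rng.random() < p_k else -beta   # right flight alpha, left flight beta
            total += r
        return total / runs, (alpha * p_k - beta * (1.0 - p_k)) * n_k

    print(mean_displacement(p_k=0.7, alpha=1.0, beta=1.0, n_k=100))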


Figure 2. (a) Random walk: flight α when rewarded with Pk or flight $-\beta $ when non-rewarded with $1-{P}_{k}$. (b) Probability distributions of two random walks.


In the overlapping area of the two distributions shown in figure 2(b), we cannot accurately judge which machine is better. To avoid incorrect judgments, the overlapping area should decrease as Nk increases. This requirement can be expressed in the following forms:

Equation (5): $E({R}_{A})=\{\alpha {P}_{A}-\beta (1-{P}_{A})\}{N}_{A}>0$

Equation (6): $E({R}_{B})=\{\alpha {P}_{B}-\beta (1-{P}_{B})\}{N}_{B}<0$

These expressions can be rearranged into the form

Equation (7): ${P}_{B}<\frac{\beta }{\alpha +\beta }<{P}_{A}$

In other words, the parameters α and β must satisfy the above conditions so that the random walks correctly identify the machine with the higher reward probability.

We can easily confirm that the following form satisfies the above conditions:

Equation (8): $\frac{\beta }{\alpha +\beta }=\frac{{P}_{A}+{P}_{B}}{2}$

From ${R}_{k}(t)/\alpha $ = ${Q}_{k}(t)$, we obtain $\omega =\frac{\beta }{\alpha }$. From this and equation (8), we obtain

Equation (9): ${\omega }_{0}=\frac{\gamma }{2-\gamma }$

Equation (10): $\gamma ={P}_{A}+{P}_{B}$

Here, we have set the parameter ω to ${\omega }_{0}$. Therefore, we can conclude that the algorithm using the learning rule Qk with the parameter ${\omega }_{0}$ can solve the MBP correctly.
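As a simple illustration with arbitrary values: if ${P}_{A}=0.7$ and ${P}_{B}=0.3$, then $\gamma =1$, condition (8) requires $\beta /(\alpha +\beta )=0.5$, i.e. $\alpha =\beta $, and hence $\omega ={\omega }_{0}=\gamma /(2-\gamma )=1$.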

3.2. TOW principle: origin of the high performance

In many popular algorithms, such as the epsilon-greedy algorithm, an estimate of the reward probability is updated at each time t only for the machine that was played. On the other hand, in an imaginary circumstance in which the sum of the reward probabilities γ = PA + PB is known to the player, both estimates can be updated simultaneously, even though only one of the machines was played.

The top and bottom rows of table 1 provide estimates based on the knowledge that machine A was played NA times and that machine B was played NB times, respectively. Note that we can also update the estimate of the machine that was not played, owing to the given γ.

Table 1.  Estimates for each reward probability based on the knowledge that machine A was played NA times and that machine B was played NB times—on the assumption that the sum of the reward probabilities γ = PA + PB is known.

Based on the NA plays of machine A:  A: $\displaystyle \frac{{N}_{A}-{L}_{A}}{{N}_{A}}$   B: $\gamma -\displaystyle \frac{{N}_{A}-{L}_{A}}{{N}_{A}}$
Based on the NB plays of machine B:  A: $\gamma -\displaystyle \frac{{N}_{B}-{L}_{B}}{{N}_{B}}$   B: $\displaystyle \frac{{N}_{B}-{L}_{B}}{{N}_{B}}$

From the above estimates, each expected reward ${Q}_{k}^{\prime }$ ($k\in \{A,B\}$) is given as follows:

Equation (11): ${Q}_{A}^{\prime }=({N}_{A}-{L}_{A})+\{\gamma {N}_{B}-({N}_{B}-{L}_{B})\}$

Equation (12): ${Q}_{B}^{\prime }=({N}_{B}-{L}_{B})+\{\gamma {N}_{A}-({N}_{A}-{L}_{A})\}$

These expected rewards, the ${Q}_{j}^{\prime }$ s, are not the same as those given by the learning rule of TOW dynamics, the Qjs in equation (2). However, what TOW dynamics actually uses is the difference

Equation (13): ${Q}_{A}-{Q}_{B}=({N}_{A}-{N}_{B})-(1+\omega )({L}_{A}-{L}_{B})$

When we transform the expected rewards ${Q}_{j}^{\prime }{\rm{s}}$ into ${Q}_{j}^{\prime\prime }={Q}_{j}^{\prime }/(2-\gamma )$, we can obtain the difference

Equation (14): ${Q}_{A}^{\prime\prime }-{Q}_{B}^{\prime\prime }=({N}_{A}-{N}_{B})-\frac{2}{2-\gamma }({L}_{A}-{L}_{B})$

Comparing the coefficients of equations (13) and (14), the two differences always coincide when $\omega ={\omega }_{0}$ (equation (9)) is satisfied. In this way, the nearly optimal weighting parameter ${\omega }_{0}$ is obtained in terms of γ.

This derivation implies that the learning rule of TOW dynamics is equivalent to that of the imaginary system in which both estimates are updated simultaneously. In other words, TOW dynamics imitates an imaginary system that determines its next move at time $t+1$ by referring to the estimates of both machines, even though one of them was not actually played at time t. This unique feature of the learning rule, derived from the fact that the sum of the reward probabilities is given in advance, may be one of the origins of the high performance of TOW dynamics.
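The equivalence can also be checked symbolically. The short sketch below uses sympy; the explicit forms of ${Q}_{A}^{\prime }$ and ${Q}_{B}^{\prime }$ follow our reading of table 1, with each estimate weighted by the corresponding number of plays, and it confirms that the difference in equation (13) with $\omega ={\omega }_{0}$ coincides with the rescaled difference in equation (14).

    import sympy as sp

    N_A, N_B, L_A, L_B = sp.symbols('N_A N_B L_A L_B')
    gamma = sp.Symbol('gamma', positive=True)
    omega_0 = gamma / (2 - gamma)

    # Difference actually used by TOW dynamics (learning rule, equation (2)):
    tow_diff = (N_A - N_B) - (1 + omega_0) * (L_A - L_B)

    # Imaginary-system estimates (table 1 entries weighted by the number of plays),
    # rescaled by 1/(2 - gamma) as in the text:
    Qp_A = (N_A - L_A) + (gamma * N_B - (N_B - L_B))
    Qp_B = (N_B - L_B) + (gamma * N_A - (N_A - L_A))
    imaginary_diff = (Qp_A - Qp_B) / (2 - gamma)

    print(sp.simplify(tow_diff - imaginary_diff))   # prints 0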

Monte Carlo simulations were performed, and they verified that TOW dynamics with ${\omega }_{0}$ exhibits exceptionally high performance, comparable to its peak performance achieved with the optimal parameter ${\omega }_{\mathrm{opt}}$. To derive the optimal value ${\omega }_{\mathrm{opt}}$ accurately, the fluctuation and other dynamics of the terminals must be taken into account [21].

In addition, the essence of the process described here can be generalized to K-machine cases. To separate the distributions of the top Mth and the top $(M+1)$ th machines, as shown in figure 2(b), all we need is the following ${\omega }_{0}$:

Equation (15): ${\omega }_{0}=\frac{{\gamma }^{\prime }}{2-{\gamma }^{\prime }}$

Equation (16): ${\gamma }^{\prime }={P}_{(M)}+{P}_{(M+1)}$

Here, ${P}_{(M)}$ denotes the Mth-highest reward probability, and M is any integer from 1 to $K-1$. The MBP is the special case M = 1. In fact, for K-machine and M-player cases, we have designed a physical system that can determine the overall optimal state, called the 'social maximum', quickly and accurately [26].
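For concreteness, this weighting parameter can be computed from a list of reward probabilities with a small helper of our own, following equations (15) and (16):

    def omega_0_top_m(probabilities, m):
        """omega_0 separating the top-m and top-(m+1) machines (m is 1-based)."""
        p = sorted(probabilities, reverse=True)
        gamma_prime = p[m - 1] + p[m]            # P_(M) + P_(M+1)
        return gamma_prime / (2.0 - gamma_prime)

    print(omega_0_top_m([0.1, 0.4, 0.8, 0.3], m=1))   # the MBP case M = 1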

3.3. Performance characteristics

To characterize the high performance of TOW dynamics, let us consider an imaginary model for solving the MBP, called the 'cheater algorithm'. The cheater algorithm selects a machine to play according to the following estimates Sk ($k\in \{A,B\}$):

Equation (17): ${S}_{A}=\displaystyle \sum _{i=1}^{N}{X}_{A,i}$

Equation (18): ${S}_{B}=\displaystyle \sum _{i=1}^{N}{X}_{B,i}$

Here, ${X}_{k,i}$ is a random variable that takes the value 1 (rewarded) or 0 (non-rewarded). If SA > SB at time t = N, machine A is played at time $t=N+1$; if SB > SA, machine B is played; if SA = SB, a machine is chosen at random. Note that the algorithm refers to the results of both machines at time t without regard to which machine was played at time $t-1$. In other words, the algorithm 'cheats': it plays both machines and collects both results, while being counted as playing only one machine at a time.
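A minimal sketch of the cheater algorithm described above (names, seed, and reward probabilities are our own illustrative choices):

    import random

    def cheater(p, steps, seed=2):
        """Imaginary 'cheater' baseline: the outcomes of BOTH machines are observed at every
        step, while only the machine with the larger running sum S_k counts as being played."""
        rng = random.Random(seed)
        S = [0.0, 0.0]          # running sums S_A, S_B (equations (17) and (18))
        plays = [0, 0]
        for _ in range(steps):
            if S[0] > S[1]:
                k = 0
            elif S[1] > S[0]:
                k = 1
            else:
                k = rng.randrange(2)   # tie: play a machine at random
            plays[k] += 1
            for j in (0, 1):           # 'cheat': record the result of both machines
                S[j] += 1.0 if rng.random() < p[j] else 0.0
        return plays

    print(cheater((0.7, 0.3), steps=1000))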

The expected value and the variance of Xk are defined as $E({X}_{k})={\mu }_{k}$ and $V({X}_{k})={\sigma }_{k}^{2}$; here, ${\mu }_{k}$ is the same as the Pk defined earlier. By the central limit theorem, Sk has a Gaussian distribution with $E({S}_{k})={\mu }_{k}N$ and $V({S}_{k})={\sigma }_{k}^{2}N$. If we define the new variable $S={S}_{A}-{S}_{B}$, then S also has a Gaussian distribution, characterized by the following values:

Equation (19): $E(S)=({\mu }_{A}-{\mu }_{B})N$

Equation (20): $V(S)=({\sigma }_{A}^{2}+{\sigma }_{B}^{2})N$

Equation (21): $\sigma (S)=\sqrt{({\sigma }_{A}^{2}+{\sigma }_{B}^{2})N}$

From figure 3, the probability of playing machine B, which has the lower reward probability, can be described as ${\bf{Q}}(\frac{E(S)}{\sigma (S)})$, where ${\bf{Q}}(x)$ is the Q-function, the tail probability of the standard normal distribution. We obtain

Equation (22): ${\bf{Q}}\left(\frac{E(S)}{\sigma (S)}\right)={\bf{Q}}({\mu }^{\prime }\sqrt{N})$

Here,

Equation (23): ${\mu }^{\prime }=\frac{{\mu }_{A}-{\mu }_{B}}{\sqrt{{\sigma }_{A}^{2}+{\sigma }_{B}^{2}}}$


Figure 3.  ${\bf{Q}}\left(\displaystyle \frac{E(S)}{\sigma (S)}\right)$: probability of selecting the lower-reward machine using the cheater algorithm.


Using the Chernoff bound ${\bf{Q}}(x)\leqslant \frac{1}{2}\mathrm{exp}(-\frac{{x}^{2}}{2})$, we can calculate the upper bound of a measure, called the 'regret', which quantifies the accumulated losses of the cheater algorithm.

Equation (24): $\mathrm{regret}=({\mu }_{A}-{\mu }_{B})E({N}_{B})=({\mu }_{A}-{\mu }_{B})\displaystyle \sum _{N}{\bf{Q}}({\mu }^{\prime }\sqrt{N})$

Equation (25): $\leqslant \frac{{\mu }_{A}-{\mu }_{B}}{2}\displaystyle \sum _{N=1}^{\infty }\mathrm{exp}\left(-\frac{{\mu }^{\prime 2}N}{2}\right)$

Equation (26): $=\frac{{\mu }_{A}-{\mu }_{B}}{2}\;\frac{1}{\mathrm{exp}({\mu }^{\prime 2}/2)-1}$

Note that the regret becomes constant as N increases.
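The constant bound can be made concrete numerically: summing the Chernoff bound $\frac{1}{2}\mathrm{exp}(-{\mu }^{\prime 2}N/2)$ over N gives a geometric series with the finite value $1/[2(\mathrm{exp}({\mu }^{\prime 2}/2)-1)]$, independent of the horizon. The short sketch below (with illustrative values of ${\mu }_{A}$, ${\mu }_{B}$, and σ chosen by us) compares the directly summed selection-error probabilities with this closed-form bound; multiplying either quantity by $({\mu }_{A}-{\mu }_{B})$ gives the corresponding regret.

    import math

    def q_function(x):
        """Tail of the standard normal distribution: Q(x) = 0.5 * erfc(x / sqrt(2))."""
        return 0.5 * math.erfc(x / math.sqrt(2.0))

    mu_A, mu_B, sigma = 0.7, 0.3, 0.5                  # illustrative values
    mu_prime = (mu_A - mu_B) / math.sqrt(2.0 * sigma ** 2)

    direct = sum(q_function(mu_prime * math.sqrt(N)) for N in range(1, 10001))
    bound = 1.0 / (2.0 * (math.exp(mu_prime ** 2 / 2.0) - 1.0))
    print(direct, bound)   # the partial sum stays below the horizon-independent bound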

Using the 'cheated' results, we can also calculate the regret of TOW dynamics in the same way. In this case,

Equation (27): ${S}_{A}=\displaystyle \sum _{i=1}^{{N}_{A}}{X}_{A,i}-\omega {L}_{A}$

Equation (28): ${S}_{B}=\displaystyle \sum _{i=1}^{{N}_{B}}{X}_{B,i}-\omega {L}_{B}$

${X}_{k,i}$ is also a random variable that takes either 1 (rewarded) or 0 (non-rewarded). Here, we use ${L}_{k}=(1-{\mu }_{k}){N}_{k}$. Then, we obtain $E({S}_{k})=\{{\mu }_{k}-(1-{\mu }_{k})\omega \}{N}_{k}$ and $V({S}_{k})={\sigma }_{k}^{2}{N}_{k}$.

Using the new variables $S={S}_{A}-{S}_{B}$, $N={N}_{A}+{N}_{B}$, and $D={N}_{A}-{N}_{B}$, we also obtain

Equation (29): $E(S)=\frac{(1+\omega )({\mu }_{A}-{\mu }_{B})}{2}N+\frac{\gamma -(2-\gamma )\omega }{2}D$

Equation (30): $V(S)=\frac{{\sigma }_{A}^{2}+{\sigma }_{B}^{2}}{2}N+\frac{{\sigma }_{A}^{2}-{\sigma }_{B}^{2}}{2}D$

If the conditions $\omega ={\omega }_{0}$ and ${\sigma }_{A}={\sigma }_{B}$ $\equiv \sigma $ are satisfied, we then obtain

Equation (31): $E(S)=\frac{(1+{\omega }_{0})({\mu }_{A}-{\mu }_{B})}{2}N$

Equation (32): $V(S)={\sigma }^{2}N$

and

Equation (33): ${\bf{Q}}\left(\frac{E(S)}{\sigma (S)}\right)={\bf{Q}}({\mu }^{\prime\prime }\sqrt{N})$

Here,

Equation (34): ${\mu }^{\prime\prime }=\frac{(1+{\omega }_{0})({\mu }_{A}-{\mu }_{B})}{2\sigma }$

We can then calculate the upper bound of the regret for TOW dynamics

Equation (35): $\mathrm{regret}\leqslant \frac{{\mu }_{A}-{\mu }_{B}}{2}\displaystyle \sum _{N=1}^{\infty }\mathrm{exp}\left(-\frac{{\mu }^{\prime\prime 2}N}{2}\right)$

Equation (36): $=\frac{{\mu }_{A}-{\mu }_{B}}{2}\;\frac{1}{\mathrm{exp}({\mu }^{\prime\prime 2}/2)-1}$

Note that the regret for TOW dynamics also becomes constant as N increases.

It is known that optimal algorithms for the MBP, as defined by Auer et al, have a regret proportional to $\mathrm{log}(N)$ [17]. That regret has no finite upper bound as N increases, because such algorithms must keep playing the lower-reward machine so that the probability of an incorrect judgment goes to zero. The constant regret of TOW dynamics means that its probability of incorrect judgment remains non-zero, although this probability is nearly equal to zero. In actual decision-making situations, however, the reward probabilities would appear to change frequently, and long-term behavior is not crucial for many practical purposes. For this reason, TOW dynamics would be well suited to real-world applications.

4. Conclusion and discussion

In this paper, we proposed TOW dynamics for solving the MBP and analytically validated that its high efficiency in making a series of decisions that maximize the total stochastically obtained reward is embedded in any volume-conserving physical object subjected to suitable operations involving fluctuations. In conventional decision-making algorithms for solving the MBP, a parameter adjusting the 'exploration time' must be optimized, and this exploration parameter often reflects the difference between the rewarded experiences, i.e., $| {P}_{A}-{P}_{B}| $. In contrast, TOW dynamics demonstrates that higher performance can be achieved by introducing a weighting parameter ${\omega }_{0}$ that refers to the sum of the rewarded experiences, i.e., PA + PB. Owing to this property, the high performance of TOW dynamics can be reproduced when the dynamics are implemented with various volume-conserving physical objects. Thus, the proposed physics-based analog-computing paradigm would be useful for a variety of real-world applications and for understanding the information-processing principles by which biological systems exploit their underlying physics.

Acknowledgments

This work was partially undertaken when the authors belonged to the RIKEN Advanced Science Institute, which was reorganized and integrated into RIKEN as of the end of March 2013. We thank Professor Masahiko Hara for valuable discussions.
