
Efficient decision-making by volume-conserving physical object


Published 11 August 2015 © 2015 IOP Publishing Ltd and Deutsche Physikalische Gesellschaft
Citation: Song-Ju Kim et al 2015 New J. Phys. 17 083023. DOI: 10.1088/1367-2630/17/8/083023


Abstract

Decision-making is one of the most important intellectual abilities of humans and other biological organisms, aiding their survival. This ability, however, may not be limited to biological systems: it can also be exhibited by physical systems. Here we demonstrate that any physical object, as long as its volume is conserved when coupled with suitable operations, provides a sophisticated decision-making capability. We consider the multi-armed bandit problem (MBP), the problem of finding, as accurately and quickly as possible, the most profitable option from a set of options that give stochastic rewards. Efficient MBP solvers are useful for many practical applications, because the MBP abstracts a variety of real-world decision-making problems in which efficient trial and error is required. In our scheme, decisions are dictated by a physical object that is moved in a manner similar to the fluctuations of a rigid body in a tug-of-war (TOW) game. This method, called 'TOW dynamics', exhibits higher efficiency than conventional reinforcement-learning algorithms. We present analytical calculations that clarify the statistical reasons why TOW dynamics achieves this high performance despite its simplicity. These results imply that various physical systems in which some conservation law holds can be used to implement an efficient 'decision-making object'. The proposed scheme provides a new perspective for opening up a physics-based analog-computing paradigm and for understanding the information-processing principles by which biological systems exploit their underlying physics.


Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

The computing principles of modern digital paradigms have been designed to be dissociated from the underlying physics of natural phenomena [1]. In the construction of CMOS devices, wide-band-gap materials are employed so that physical fluctuations such as thermal noise, which often disrupt logically valid behavior, can be neglected [2]. Since electron dynamics constrained by physical laws cannot be controlled when only parameters with the same degrees of freedom as the logical input–output responses are modulated, considerably complicated circuits are required to implement even relatively simple logic gates such as NAND and NOR [3]. However, these efforts to keep computation dissociated from the underlying physics are costly in terms of energy consumption and manufacturing resources. When we look at the natural world, on the other hand, information processing in biological systems is elegantly coupled with their underlying physical laws and fluctuations [4–6]. This suggests the potential for establishing a new physics-based analog-computing paradigm. In this paper, we show that a physical constraint, the conservation law for the volume of a rigid body, allows decision-making problems to be solved efficiently when the body is subjected to suitable operations involving fluctuations.

Suppose there are M slot machines, each of which returns a reward (for example, a coin) with a certain probability that is unknown to the player. Let us consider the minimal case: two machines, A and B, give rewards with individual probabilities PA and PB, respectively. At each trial the player decides which machine to play, trying to maximize the total reward obtained over many trials. The multi-armed bandit problem (MBP) is the problem of determining the optimal strategy for finding the machine with the highest reward probability as accurately and quickly as possible by referring to past experience.

The MBP is formulated as a mathematical problem without loss of generality and is therefore related to various stochastic phenomena. In fact, many application problems in diverse fields, such as communications (cognitive networks [7, 8]), commerce (advertising on the web [9]), and entertainment (Monte Carlo tree search, which is used for computer games [10, 11]), can be reduced to MBPs. In particular, the 'upper confidence bound 1 (UCB1) algorithm' for solving MBPs is used worldwide in many practical applications [17].

In the context of reinforcement learning, the MBP was originally described by Robbins [12], though the essence of the problem had been studied earlier by Thompson [13]. The optimal strategy, called the 'Gittins index', is known only for a limited class of problems in which the reward distributions are assumed to be known to the players [14, 15]. Even in this limited class, computing the Gittins index is intractable in many practical cases. In the algorithms proposed by Agrawal and by Auer et al, another index is expressed as a simple function of the reward sums obtained from the machines [16, 17].

2. Methods

2.1. Tug-of-war (TOW) dynamics

Kim et al proposed an MBP solver based on a dynamical system, called 'TOW dynamics'; the algorithm was inspired by the spatiotemporal dynamics of a single-celled amoeboid organism (the true slime mold P. polycephalum) [18–23], which maintains a constant intracellular-resource volume while collecting environmental information by concurrently expanding and shrinking its pseudopod-like terminal parts. In this nature-inspired algorithm, the decision-making function is derived from the underlying physics, which resembles that of a TOW game. The physical constraint in TOW dynamics, the conservation law for the volume of the amoeboid body, entails a non-local correlation among the terminal parts: a volume increment in one part is immediately compensated by volume decrement(s) in the other part(s). In our previous studies [18–23], we showed that, owing to this non-local correlation derived from the volume-conservation law, TOW dynamics exhibits higher performance than other well-known algorithms such as the modified epsilon-greedy and modified softmax algorithms, and is comparable to the UCB1-tuned algorithm, which is regarded as the best choice among parameter-free algorithms [17]. These observations suggest that efficient decision-making devices could be implemented with any physical object, as long as it holds some common physical attribute such as a conservation law. In fact, Kim et al demonstrated that optical energy-transfer dynamics between quantum dots, in which energy is conserved, can be exploited to implement TOW dynamics [24, 25].

In this paper, we extract the most essential ingredients of TOW dynamics and formulate a simplified version, the 'TOW principle', in order to explain the mechanism of its high performance through analytical calculations of the 'regret', a standard measure of the performance of multi-armed bandit algorithms. Consider a volume-conserving physical object, for example a rigid body such as an iron bar (the slot machine's handle), as shown in figure 1. The variable Xk represents the displacement of terminal k from an initial position, where $k\in \{A,B\}$. If Xk is the larger displacement, we assume that the body decides to play machine k. In TOW dynamics, the MBP is represented in its inverse form: instead of 'rewarding' the player when machine k produces a coin with probability Pk, we 'punish' the player when the machine gives no coin, which occurs with probability $1-{P}_{k}$. Under this convention, the displacement XA ($=-{X}_{B}$) is determined by the following equations:

Equation (1): ${X}_{A}(t)={Q}_{A}(t)-{Q}_{B}(t)+\delta (t)$

Equation (2): ${Q}_{k}(t)=\{{N}_{k}(t)-{L}_{k}(t)\}-\omega {L}_{k}(t)={N}_{k}(t)-(1+\omega ){L}_{k}(t)$

Here, ${Q}_{k}(t)$ ($k\in \{A,B\}$) is an 'estimate' of the information on past experiences accumulated from the initial time 1 to the current time t, ${N}_{k}(t)$ counts the number of times machine k has been played up to time t, ${L}_{k}(t)$ counts the number of punishments incurred when playing machine k up to time t, $\delta (t)$ is an arbitrary fluctuation to which the body is subjected, and ω is a weighting parameter described in detail later in this paper. Equation (2) is called the 'learning rule'. Consequently, TOW dynamics evolves according to a particularly simple rule: in addition to the fluctuation, if machine k is played at time t, +1 or $-\omega $ is added to ${X}_{k}(t)$ when rewarded or non-rewarded, respectively (figure 1).


Figure 1. TOW dynamics. If machine k ($k\in \{A,B\}$) is played at each time t, +1 and $-\omega $ are added to ${X}_{k}(t)$ for rewarding (a) and non-rewarding cases (b), respectively.

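As a concrete illustration of this update rule, the following minimal Python sketch simulates TOW dynamics for two machines. The fluctuation $\delta (t)$ is modelled here as a small uniform random term, and the function and variable names, the noise amplitude, and the example reward probabilities are our own illustrative choices rather than specifications from the text.

    import random

    def tow_bandit(p, omega, steps, noise=0.1, seed=0):
        """Minimal sketch of TOW dynamics for a two-armed bandit.

        p     : (P_A, P_B) reward probabilities
        omega : weighting parameter, e.g. omega_0 = gamma / (2 - gamma) with gamma = P_A + P_B
        noise : amplitude of the fluctuation delta(t)
        """
        rng = random.Random(seed)
        Q = [0.0, 0.0]                        # learning-rule estimates Q_A, Q_B (equation (2))
        total_reward = 0
        for _ in range(steps):
            delta = rng.uniform(-noise, noise)
            X_A = Q[0] - Q[1] + delta         # displacement X_A (= -X_B), equation (1)
            k = 0 if X_A > 0 else 1           # play the machine with the larger displacement
            if rng.random() < p[k]:
                Q[k] += 1.0                   # rewarded: +1 is added
                total_reward += 1
            else:
                Q[k] -= omega                 # non-rewarded: -omega is added
        return total_reward

    if __name__ == "__main__":
        P_A, P_B = 0.7, 0.3
        gamma = P_A + P_B
        print(tow_bandit((P_A, P_B), omega=gamma / (2.0 - gamma), steps=1000))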

3. Results

3.1. Solvability of the MBP

To explore the MBP solvability of TOW dynamics, let us consider a random-walk model. As shown in figure 2(a), the parameters are α (the rightward flight when rewarded) and β (the leftward flight when non-rewarded). We assume PA > PB for simplicity. After time step t, the displacement ${R}_{k}(t)$ ($k\in \{A,B\}$) can be described by

Equation (3): ${R}_{k}(t)=\alpha \{{N}_{k}(t)-{L}_{k}(t)\}-\beta {L}_{k}(t)$

The expected value of Rk can be obtained from the following equation:

Equation (4): $E({R}_{k})=\{\alpha {P}_{k}-\beta (1-{P}_{k})\}{N}_{k}$
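As a quick numerical sanity check of this expectation (a sketch with our own names and arbitrary parameter values), one can average the random walk over many runs and compare the result with $\{\alpha {P}_{k}-\beta (1-{P}_{k})\}{N}_{k}$:

    import random

    def mean_displacement(p_k, alpha, beta, n_k, runs=20000, seed=1):
        """Simulated mean of R_k versus the analytical value {alpha*P_k - beta*(1 - P_k)}*N_k."""
        rng = random.Random(seed)
        total = 0.0
        for _ in range(runs):
            r = 0.0
            for _ in range(n_k):
                r += alpha if rng.random() < p_k else -beta   # right flight alpha, left flight beta
            total += r
        return total / runs, (alpha * p_k - beta * (1.0 - p_k)) * n_k

    print(mean_displacement(p_k=0.7, alpha=1.0, beta=1.0, n_k=100))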


Figure 2. (a) Random walk: flight α when rewarded with Pk or flight $-\beta $ when non-rewarded with $1-{P}_{k}$. (b) Probability distributions of two random walks.


In the overlapping area of the two distributions shown in figure 2(b), we cannot accurately judge which machine is better. To avoid incorrect judgments, the overlapping area should decrease as Nk increases. This requirement can be expressed in the following forms:

Equation (5): $E({R}_{A})=\{\alpha {P}_{A}-\beta (1-{P}_{A})\}{N}_{A}>0$

Equation (6): $E({R}_{B})=\{\alpha {P}_{B}-\beta (1-{P}_{B})\}{N}_{B}<0$

These expressions can be rearranged into the form

Equation (7): ${P}_{B}<\frac{\beta }{\alpha +\beta }<{P}_{A}$

In other words, the parameters α and β must satisfy the above conditions so that the random walks correctly identify the machine with the higher reward probability.

We can easily confirm that the following form satisfies the above conditions:

Equation (8): $\frac{\beta }{\alpha +\beta }=\frac{{P}_{A}+{P}_{B}}{2}$

From ${R}_{k}(t)/\alpha $ = ${Q}_{k}(t)$, we obtain $\omega =\frac{\beta }{\alpha }$. From this and equation (8), we obtain

Equation (9): ${\omega }_{0}=\frac{\gamma }{2-\gamma }$

Equation (10): $\gamma ={P}_{A}+{P}_{B}$

Here, we have set the parameter ω to ${\omega }_{0}$. Therefore, we can conclude that the algorithm using the learning rule Qk with the parameter ${\omega }_{0}$ can solve the MBP correctly.
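As a simple illustration with arbitrary values: if ${P}_{A}=0.7$ and ${P}_{B}=0.3$, then $\gamma =1$, condition (8) requires $\beta /(\alpha +\beta )=0.5$, i.e. $\alpha =\beta $, and hence $\omega ={\omega }_{0}=\gamma /(2-\gamma )=1$.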

3.2. TOW principle: origin of the high performance

In many popular algorithms, such as the epsilon-greedy algorithm, an estimate of the reward probability is updated at each time t only for the machine that was played. On the other hand, in an imaginary circumstance in which the sum of the reward probabilities γ = PA + PB is known to the player, both estimates can be updated simultaneously, even though only one of the machines was played.

The top and bottom rows of table 1 provide estimates based on the knowledge that machine A was played NA times and that machine B was played NB times, respectively. Note that we can also update the estimate of the machine that was not played, owing to the given γ.

Table 1.  Estimates for each reward probability based on the knowledge that machine A was played NA times and that machine B was played NB times—on the assumption that the sum of the reward probabilities γ = PA + PB is known.

Based on the NA plays of machine A:  A: $\displaystyle \frac{{N}_{A}-{L}_{A}}{{N}_{A}}$   B: $\gamma -\displaystyle \frac{{N}_{A}-{L}_{A}}{{N}_{A}}$
Based on the NB plays of machine B:  A: $\gamma -\displaystyle \frac{{N}_{B}-{L}_{B}}{{N}_{B}}$   B: $\displaystyle \frac{{N}_{B}-{L}_{B}}{{N}_{B}}$

From the above estimates, each expected reward ${Q}_{k}^{\prime }$ ($k\in \{A,B\}$) is given as follows:

Equation (11): ${Q}_{A}^{\prime }=({N}_{A}-{L}_{A})+\{\gamma {N}_{B}-({N}_{B}-{L}_{B})\}$

Equation (12): ${Q}_{B}^{\prime }=({N}_{B}-{L}_{B})+\{\gamma {N}_{A}-({N}_{A}-{L}_{A})\}$

These expected rewards, the ${Q}_{j}^{\prime }$ s, are not the same as those given by the learning rule of TOW dynamics, the Qjs in equation (2). However, what TOW dynamics actually uses is the difference

Equation (13): ${Q}_{A}-{Q}_{B}=({N}_{A}-{N}_{B})-(1+\omega )({L}_{A}-{L}_{B})$

When we transform the expected rewards ${Q}_{j}^{\prime }{\rm{s}}$ into ${Q}_{j}^{\prime\prime }={Q}_{j}^{\prime }/(2-\gamma )$, we can obtain the difference

Equation (14): ${Q}_{A}^{\prime\prime }-{Q}_{B}^{\prime\prime }=({N}_{A}-{N}_{B})-\frac{2}{2-\gamma }({L}_{A}-{L}_{B})$

Comparing the coefficients of equations (13) and (14), the two differences always coincide when $\omega ={\omega }_{0}$ (equation (9)) is satisfied. In this way, the nearly optimal weighting parameter ${\omega }_{0}$ is obtained in terms of γ.

This derivation implies that the learning rule of TOW dynamics is equivalent to that of the imaginary system in which both estimates are updated simultaneously. In other words, TOW dynamics imitates an imaginary system that determines its next move at time $t+1$ by referring to the estimates of both machines, even though one of them was not actually played at time t. This unique feature of the learning rule, derived from the fact that the sum of the reward probabilities is given in advance, may be one of the origins of the high performance of TOW dynamics.
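The equivalence can also be checked symbolically. The short sketch below uses sympy; the explicit forms of ${Q}_{A}^{\prime }$ and ${Q}_{B}^{\prime }$ follow our reading of table 1, with each estimate weighted by the corresponding number of plays, and it confirms that the difference in equation (13) with $\omega ={\omega }_{0}$ coincides with the rescaled difference in equation (14).

    import sympy as sp

    N_A, N_B, L_A, L_B = sp.symbols('N_A N_B L_A L_B')
    gamma = sp.Symbol('gamma', positive=True)
    omega_0 = gamma / (2 - gamma)

    # Difference actually used by TOW dynamics (learning rule, equation (2)):
    tow_diff = (N_A - N_B) - (1 + omega_0) * (L_A - L_B)

    # Imaginary-system estimates (table 1 entries weighted by the number of plays),
    # rescaled by 1/(2 - gamma) as in the text:
    Qp_A = (N_A - L_A) + (gamma * N_B - (N_B - L_B))
    Qp_B = (N_B - L_B) + (gamma * N_A - (N_A - L_A))
    imaginary_diff = (Qp_A - Qp_B) / (2 - gamma)

    print(sp.simplify(tow_diff - imaginary_diff))   # prints 0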

Monte Carlo simulations were performed, and they verified that TOW dynamics with ${\omega }_{0}$ exhibits exceptionally high performance, comparable to its peak performance achieved with the optimal parameter ${\omega }_{\mathrm{opt}}$. To derive the optimal value ${\omega }_{\mathrm{opt}}$ accurately, the fluctuation and other dynamics of the terminals must be taken into account [21].

In addition, the essence of the process described here can be generalized to K-machine cases. To separate the distributions of the top Mth and the top $(M+1)$ th machines, as shown in figure 2(b), all we need is the following ${\omega }_{0}$:

Equation (15): ${\omega }_{0}=\frac{{\gamma }^{\prime }}{2-{\gamma }^{\prime }}$

Equation (16): ${\gamma }^{\prime }={P}_{(M)}+{P}_{(M+1)}$

Here, ${P}_{(M)}$ denotes the Mth-highest reward probability, and M is any integer from 1 to $K-1$. The MBP is the special case M = 1. In fact, for K-machine and M-player cases, we have designed a physical system that can determine the overall optimal state, called the 'social maximum', quickly and accurately [26].
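For concreteness, this weighting parameter can be computed from a list of reward probabilities with a small helper of our own, following equations (15) and (16):

    def omega_0_top_m(probabilities, m):
        """omega_0 separating the top-m and top-(m+1) machines (m is 1-based)."""
        p = sorted(probabilities, reverse=True)
        gamma_prime = p[m - 1] + p[m]            # P_(M) + P_(M+1)
        return gamma_prime / (2.0 - gamma_prime)

    print(omega_0_top_m([0.1, 0.4, 0.8, 0.3], m=1))   # the MBP case M = 1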

3.3. Performance characteristics

To characterize the high performance of TOW dynamics, let us consider an imaginary model for solving the MBP, called the 'cheater algorithm'. The cheater algorithm selects a machine to play according to the following estimates Sk ($k\in \{A,B\}$):

Equation (17): ${S}_{A}=\displaystyle \sum _{i=1}^{N}{X}_{A,i}$

Equation (18): ${S}_{B}=\displaystyle \sum _{i=1}^{N}{X}_{B,i}$

Here, ${X}_{k,i}$ is a random variable that takes the value 1 (rewarded) or 0 (non-rewarded). If SA > SB at time t = N, machine A is played at time $t=N+1$; if SB > SA, machine B is played; if SA = SB, a machine is chosen at random. Note that the algorithm refers to the results of both machines at time t without regard to which machine was played at time $t-1$. In other words, the algorithm 'cheats': it plays both machines and collects both results, while being counted as playing only one machine at a time.
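A minimal sketch of the cheater algorithm described above (names, seed, and reward probabilities are our own illustrative choices):

    import random

    def cheater(p, steps, seed=2):
        """Imaginary 'cheater' baseline: the outcomes of BOTH machines are observed at every
        step, while only the machine with the larger running sum S_k counts as being played."""
        rng = random.Random(seed)
        S = [0.0, 0.0]          # running sums S_A, S_B (equations (17) and (18))
        plays = [0, 0]
        for _ in range(steps):
            if S[0] > S[1]:
                k = 0
            elif S[1] > S[0]:
                k = 1
            else:
                k = rng.randrange(2)   # tie: play a machine at random
            plays[k] += 1
            for j in (0, 1):           # 'cheat': record the result of both machines
                S[j] += 1.0 if rng.random() < p[j] else 0.0
        return plays

    print(cheater((0.7, 0.3), steps=1000))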

The expected value and the variance of Xk are defined as $E({X}_{k})={\mu }_{k}$ and $V({X}_{k})={\sigma }_{k}^{2}$; here, ${\mu }_{k}$ is the same as the Pk defined earlier. By the central limit theorem, Sk has a Gaussian distribution with $E({S}_{k})={\mu }_{k}N$ and $V({S}_{k})={\sigma }_{k}^{2}N$. If we define the new variable $S={S}_{A}-{S}_{B}$, then S also has a Gaussian distribution, characterized by the following values:

Equation (19): $E(S)=({\mu }_{A}-{\mu }_{B})N$

Equation (20): $V(S)=({\sigma }_{A}^{2}+{\sigma }_{B}^{2})N$

Equation (21): $\sigma (S)=\sqrt{({\sigma }_{A}^{2}+{\sigma }_{B}^{2})N}$

From figure 3, the probability of playing machine B, which has the lower reward probability, can be described as ${\bf{Q}}(\frac{E(S)}{\sigma (S)})$, where ${\bf{Q}}(x)$ is the Q-function, the tail probability of the standard normal distribution. We obtain

Equation (22): ${\bf{Q}}\left(\frac{E(S)}{\sigma (S)}\right)={\bf{Q}}({\mu }^{\prime }\sqrt{N})$

Here,

Equation (23): ${\mu }^{\prime }=\frac{{\mu }_{A}-{\mu }_{B}}{\sqrt{{\sigma }_{A}^{2}+{\sigma }_{B}^{2}}}$


Figure 3.  ${\bf{Q}}\left(\displaystyle \frac{E(S)}{\sigma (S)}\right)$: probability of selecting the lower-reward machine using the cheater algorithm.


Using the Chernoff bound ${\bf{Q}}(x)\leqslant \frac{1}{2}\mathrm{exp}(-\frac{{x}^{2}}{2})$, we can calculate the upper bound of a measure, called the 'regret', which quantifies the accumulated losses of the cheater algorithm.

Equation (24): $\mathrm{regret}=({\mu }_{A}-{\mu }_{B})E({N}_{B})=({\mu }_{A}-{\mu }_{B})\displaystyle \sum _{N}{\bf{Q}}({\mu }^{\prime }\sqrt{N})$

Equation (25): $\leqslant \frac{{\mu }_{A}-{\mu }_{B}}{2}\displaystyle \sum _{N=1}^{\infty }\mathrm{exp}\left(-\frac{{\mu }^{\prime 2}N}{2}\right)$

Equation (26): $=\frac{{\mu }_{A}-{\mu }_{B}}{2}\;\frac{1}{\mathrm{exp}({\mu }^{\prime 2}/2)-1}$

Note that the regret becomes constant as N increases.
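The constant bound can be made concrete numerically: summing the Chernoff bound $\frac{1}{2}\mathrm{exp}(-{\mu }^{\prime 2}N/2)$ over N gives a geometric series with the finite value $1/[2(\mathrm{exp}({\mu }^{\prime 2}/2)-1)]$, independent of the horizon. The short sketch below (with illustrative values of ${\mu }_{A}$, ${\mu }_{B}$, and σ chosen by us) compares the directly summed selection-error probabilities with this closed-form bound; multiplying either quantity by $({\mu }_{A}-{\mu }_{B})$ gives the corresponding regret.

    import math

    def q_function(x):
        """Tail of the standard normal distribution: Q(x) = 0.5 * erfc(x / sqrt(2))."""
        return 0.5 * math.erfc(x / math.sqrt(2.0))

    mu_A, mu_B, sigma = 0.7, 0.3, 0.5                  # illustrative values
    mu_prime = (mu_A - mu_B) / math.sqrt(2.0 * sigma ** 2)

    direct = sum(q_function(mu_prime * math.sqrt(N)) for N in range(1, 10001))
    bound = 1.0 / (2.0 * (math.exp(mu_prime ** 2 / 2.0) - 1.0))
    print(direct, bound)   # the partial sum stays below the horizon-independent bound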

Using the 'cheated' results, we can also calculate the regret of TOW dynamics in the same way. In this case,

Equation (27): ${S}_{A}=\displaystyle \sum _{i=1}^{{N}_{A}}{X}_{A,i}-\omega {L}_{A}$

Equation (28): ${S}_{B}=\displaystyle \sum _{i=1}^{{N}_{B}}{X}_{B,i}-\omega {L}_{B}$

${X}_{k,i}$ is also a random variable that takes either 1 (rewarded) or 0 (non-rewarded). Here, we use ${L}_{k}=(1-{\mu }_{k}){N}_{k}$. Then, we obtain $E({S}_{k})=\{{\mu }_{k}-(1-{\mu }_{k})\omega \}{N}_{k}$ and $V({S}_{k})={\sigma }_{k}^{2}{N}_{k}$.

Using the new variables $S={S}_{A}-{S}_{B}$, $N={N}_{A}+{N}_{B}$, and $D={N}_{A}-{N}_{B}$, we also obtain

Equation (29): $E(S)=\frac{(1+\omega )({\mu }_{A}-{\mu }_{B})}{2}N+\frac{\gamma -(2-\gamma )\omega }{2}D$

Equation (30): $V(S)=\frac{{\sigma }_{A}^{2}+{\sigma }_{B}^{2}}{2}N+\frac{{\sigma }_{A}^{2}-{\sigma }_{B}^{2}}{2}D$

If the conditions $\omega ={\omega }_{0}$ and ${\sigma }_{A}={\sigma }_{B}$ $\equiv \sigma $ are satisfied, we then obtain

Equation (31): $E(S)=\frac{(1+{\omega }_{0})({\mu }_{A}-{\mu }_{B})}{2}N$

Equation (32): $V(S)={\sigma }^{2}N$

and

Equation (33): ${\bf{Q}}\left(\frac{E(S)}{\sigma (S)}\right)={\bf{Q}}({\mu }^{\prime\prime }\sqrt{N})$

Here,

Equation (34): ${\mu }^{\prime\prime }=\frac{(1+{\omega }_{0})({\mu }_{A}-{\mu }_{B})}{2\sigma }$

We can then calculate the upper bound of the regret for TOW dynamics

Equation (35): $\mathrm{regret}\leqslant \frac{{\mu }_{A}-{\mu }_{B}}{2}\displaystyle \sum _{N=1}^{\infty }\mathrm{exp}\left(-\frac{{\mu }^{\prime\prime 2}N}{2}\right)$

Equation (36): $=\frac{{\mu }_{A}-{\mu }_{B}}{2}\;\frac{1}{\mathrm{exp}({\mu }^{\prime\prime 2}/2)-1}$

Note that the regret for TOW dynamics also becomes constant as N increases.

It is known that optimal algorithms for the MBP, as defined by Auer et al, have a regret proportional to $\mathrm{log}(N)$ [17]. That regret has no finite upper bound as N increases, because such algorithms must keep playing the lower-reward machine so that the probability of an incorrect judgment goes to zero. The constant regret of TOW dynamics means that its probability of incorrect judgment remains non-zero, although this probability is nearly equal to zero. In actual decision-making situations, however, the reward probabilities would appear to change frequently, and long-term behavior is not crucial for many practical purposes. For this reason, TOW dynamics would be well suited to real-world applications.

4. Conclusion and discussion

In this paper, we proposed TOW dynamics for solving the MBP and analytically validated that its high efficiency in making a series of decisions that maximize the total stochastically obtained reward is embedded in any volume-conserving physical object subjected to suitable operations involving fluctuations. In conventional decision-making algorithms for solving the MBP, a parameter adjusting the 'exploration time' must be optimized, and this exploration parameter often reflects the difference between the rewarded experiences, i.e., $| {P}_{A}-{P}_{B}| $. In contrast, TOW dynamics demonstrates that higher performance can be achieved by introducing a weighting parameter ${\omega }_{0}$ that refers to the sum of the rewarded experiences, i.e., PA + PB. Owing to this property, the high performance of TOW dynamics can be reproduced when the dynamics are implemented with various volume-conserving physical objects. Thus, the proposed physics-based analog-computing paradigm would be useful for a variety of real-world applications and for understanding the information-processing principles by which biological systems exploit their underlying physics.

Acknowledgments

This work was partially undertaken when the authors belonged to the RIKEN Advanced Science Institute, which was reorganized and integrated into RIKEN as of the end of March 2013. We thank Professor Masahiko Hara for valuable discussions.
