Statistics of the Kolkata Paise Restaurant Problem

We study the dynamics of a few stochastic learning strategies for the 'Kolkata Paise Restaurant' problem, where N agents choose among N equally priced but differently ranked restaurants every evening such that each agent tries to get dinner in the best restaurant (each serving only one customer and the rest arriving there going without dinner that evening). We consider the learning strategies to be similar for all the agents and assume that each follows the same probabilistic or stochastic strategy dependent on the information of the past successes in the game. We show that some 'naive' strategies lead to much better utilization of the services than some relatively 'smarter' strategies. We also show that a service utilization fraction as high as 0.80 can result for a stochastic strategy, where each agent sticks to his past choice (independent of success achieved or not; with probability decreasing inversely in the past crowd size). The numerical results for utilization fraction of the services in some limiting cases are analytically examined.


Introduction
The Kolkata Paise Restaurant (KPR) problem [1]-[3] is a repeated game, played between a large number N of agents having no interaction with each other. In the KPR problem, prospective customers (agents) choose from N restaurants each evening simultaneously (in parallel decision mode). N is fixed. Each restaurant has the same price for a meal but a different rank (agreed upon by all customers) and can serve only one customer any evening. Information regarding customer distributions for earlier evenings is available to everyone. Each customer's objective is to go to the restaurant with the highest possible rank while avoiding the crowd so as to be able to get dinner there. If more than one customer arrives at any restaurant on any evening, one of them is randomly chosen (each of them is treated anonymously) and is served. The rest do not get dinner that evening.
In Kolkata, there were very cheap and fixed-rate 'Paise Restaurants', popular among the daily laborers in the city. During lunch hours, the laborers used to walk (to save transport costs) to one of these restaurants, and would miss lunch if they got to a restaurant where there were too many customers. Walking down to the next restaurant would mean failing to report back to work on time! Paise is the smallest Indian coin, and there were indeed some well-known rankings of these restaurants because some of them would offer tastier items compared to the others. A more general example of such a problem would be when society provides hospitals (and beds) in every locality but local patients go to hospitals of better rank (commonly perceived) elsewhere, thereby competing with the local patients of those hospitals. Unavailability of treatment in time may be considered as lack of service for those people and consequently as (social) wastage of service by those unattended hospitals.
A social planner's (or dictator's) solution to the KPR problem is the following. The planner (or dictator) asks everyone to form a queue and then assigns each one a restaurant with rank matching the sequence of the person in the queue on the first evening. Then each person is told to go to the next ranked restaurant on the following evening (for the person in the last ranked restaurant, this means going to the first ranked restaurant). This shift process then continues for successive evenings. Call this solution the fair social norm. This is clearly one of the most efficient solutions (with utilization fraction $\bar{f}$ of services by the restaurants equal to unity) and the system arrives at this solution immediately (from the first evening). However, in reality, this cannot be the true solution to the KPR problem, where each agent decides on his own (in parallel or democratically) every evening, based on complete information about past events. In this game, customers try to evolve a learning strategy to eventually get dinners at the best possible ranked restaurant, avoiding the crowd. It is seen that the evolution of these strategies takes considerable time to converge, and even then the eventual utilization fraction $\bar{f}$ is far below unity. The KPR problem has some basic features similar to the minority game problem [4,5] in that diversity is encouraged (compared to herding behavior) in both. However, it differs from (two-choice) minority games in terms of the macroscopic size of the choices.
As already shown in [1], a simple random-choice algorithm, if adopted by all the agents, can lead to a reasonable value of utilization fraction ($\bar{f} \simeq 0.63$). Compared to this, several seemingly 'more intelligent' stochastic algorithms lead to lower utilization of the services. Ghosh et al [3] studied a few more such 'smarter' algorithms, having several attractive features (including analytical estimate possibilities), but still failed to improve the overall utilization fraction beyond its random choice value. Here we develop a stochastic strategy that maintains a naive tendency (probability decreasing with past crowd size) to stick to any agent's own past choice (successful or not), leading to the highest value so far of the utilization fraction ($\bar{f} \simeq 0.80$) in the KPR problem. We also estimate analytically the $\bar{f}$ values for several such strategies.

Stochastic learning strategies
Let the symmetric stochastic strategy chosen by each agent be such that at any time $t$ the probability $p_k(t)$ of arriving at the $k$th ranked restaurant is given by

$$p_k(t) = \frac{1}{z}\left[k^{\alpha}\exp\left(-\frac{n_k(t-1)}{T}\right)\right], \qquad z = \sum_{k'=1}^{N} k'^{\alpha}\exp\left(-\frac{n_{k'}(t-1)}{T}\right), \tag{1}$$

where $n_k(t)$ denotes the number of agents arriving at the $k$th ranked restaurant in period $t$, $T > 0$ is a scaling factor and $\alpha \geq 0$ is an exponent. Note that under (1) the probability of selecting a particular restaurant increases with its rank and decreases with its popularity in the immediate past (given by the number $n_k(t-1)$). Certain properties of the strategies given by (1) are the following: (i) For $\alpha = 0$ and $T \to \infty$, $p_k(t) = 1/N$ corresponds to the complete random choice case for which we know [1] that the utilization fraction is around 0.63, that is, on an average there is 63% utilization of the restaurants (see appendix A). (ii) For $\alpha = 0$ and $T \to 0$, the agents avoid those restaurants visited last evening and choose again randomly from the remaining restaurants [1]. With appropriate simulation it was shown that the distribution of the fraction $f$ of utilization of the restaurants is Gaussian around 0.46 (see section 2.2).
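As an illustration, the choice probabilities of strategy (1) can be evaluated with a few lines of Python. This is only a sketch: it assumes the form $p_k(t) \propto k^{\alpha}\exp(-n_k(t-1)/T)$ implied by the properties listed above, and the function name `choice_probabilities` and its argument names are hypothetical, not taken from the original work.

```python
import math


def choice_probabilities(prev_counts, alpha, temperature):
    """Return [p_1, ..., p_N] for one evening.

    prev_counts[k-1] is n_k(t-1), the crowd at the k-th ranked
    restaurant on the previous evening (illustrative input format).
    Assumed form: p_k proportional to k**alpha * exp(-n_k(t-1)/T).
    temperature must be > 0; pass math.inf for the T -> infinity limit.
    """
    weights = [
        (k ** alpha) * math.exp(-n / temperature)
        for k, n in enumerate(prev_counts, start=1)
    ]
    z = sum(weights)  # normalization constant
    return [w / z for w in weights]


if __name__ == "__main__":
    # alpha = 0, T -> infinity: every restaurant is equally likely.
    print(choice_probabilities([0, 2, 1, 1], alpha=0, temperature=math.inf))
```

With a very small `temperature` and `alpha = 0`, nearly all of the weight falls on restaurants that were empty the previous evening, which mimics the $T \to 0$ limit described in (ii).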

Rank-dependent strategies
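For any natural number $\alpha$ and $T \to \infty$, an agent goes to the $k$th ranked restaurant with probability $p_k(t) = k^{\alpha}/\sum_{k'} k'^{\alpha}$, which is the $T \to \infty$ limit of (1). Let us discuss the results for such a strategy here.
If an agent selects any restaurant with equal probability $p$, then the probability that a single restaurant is chosen by $m$ agents is given by

$$\Delta(m) = \binom{N}{m} p^{m} (1-p)^{N-m}.$$

Therefore, the probability that a restaurant with rank $k$ is not chosen by any of the agents will be given by

$$\Delta_k(m=0) = \binom{N}{0}(1-p_k)^{N} \simeq \exp\left(-\frac{k^{\alpha} N}{\tilde{N}}\right) \quad \text{as } N \to \infty, \qquad p_k = \frac{k^{\alpha}}{\sum_{k'} k'^{\alpha}},$$

where

$$\tilde{N} = \sum_{k=1}^{N} k^{\alpha} \simeq \int_{0}^{N} k^{\alpha}\,dk = \frac{N^{\alpha+1}}{\alpha+1}, \qquad \text{so that} \qquad \Delta_k(m=0) = \exp\left(-\frac{k^{\alpha}(\alpha+1)}{N^{\alpha}}\right).$$

Therefore, the average fraction of agents getting dinner in the $k$th ranked restaurant is given by

$$\bar{f}_k = 1 - \Delta_k(m=0),$$

and the numerical estimates of $\bar{f}_k$ are shown in figure 1. Naturally, for $\alpha = 0$, the problem corresponds to random choice, $\bar{f}_k = 1 - e^{-1}$, giving $\bar{f} = \sum_k \bar{f}_k/N \simeq 0.63$ and, for $\alpha = 1$, $\bar{f}_k = 1 - e^{-2k/N}$, giving $\bar{f} = \sum_k \bar{f}_k/N \simeq 0.58$, as already obtained analytically earlier (see appendix B).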
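The rank-averaged utilization for this family of strategies can be checked numerically. The sketch below assumes the large-$N$ estimate $\bar{f}_k = 1 - \exp(-(\alpha+1)k^{\alpha}/N^{\alpha})$, which reduces to the two special cases quoted above ($1 - e^{-1}$ for $\alpha = 0$ and $1 - e^{-2k/N}$ for $\alpha = 1$); the function name is illustrative.

```python
import math


def rank_dependent_utilization(n, alpha):
    """Average of f_k over ranks k = 1..n in the T -> infinity limit.

    Assumes the large-N estimate
        f_k = 1 - exp(-(alpha + 1) * k**alpha / n**alpha).
    """
    total = 0.0
    for k in range(1, n + 1):
        empty_prob = math.exp(-(alpha + 1) * k ** alpha / n ** alpha)
        total += 1.0 - empty_prob  # probability that rank k serves someone
    return total / n


if __name__ == "__main__":
    for alpha in (0, 1):
        print(alpha, round(rank_dependent_utilization(10_000, alpha), 3))
```

For $\alpha = 0$ the sum reproduces $1 - e^{-1} \simeq 0.63$; for $\alpha = 1$ it gives roughly 0.57, close to the value quoted in the text.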

Strict crowd-avoiding case
We consider here the case (see also [3]) where each agent chooses on any evening ($t$) randomly among the restaurants to which nobody had gone the last evening ($t - 1$). This corresponds to the case where $\alpha = 0$ and $T \to 0$ in equation (1). Our numerical simulation results for the distribution $D(f)$ of the fraction $f$ of utilized restaurants are again Gaussian with a most probable value at $\bar{f} \simeq 0.46$. This can be explained in the following way. As the fraction $\bar{f}$ of restaurants visited by the agents on the last evening is avoided by the agents this evening, the number of available restaurants is $N(1 - \bar{f})$ for this evening, and these are chosen randomly by all $N$ agents. Hence, when fitted to equation (A.1) in appendix A, $\lambda = 1/(1 - \bar{f})$. Therefore, following equation (A.1), we can write the equation for $\bar{f}$ as

$$(1 - \bar{f})\left[1 - \exp\left(-\frac{1}{1 - \bar{f}}\right)\right] = \bar{f}.$$

The solution of this equation gives $\bar{f} \simeq 0.46$. This result agrees well with the numerical results for this limit ($\alpha = 0$, $T \to 0$).
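The self-consistency condition for the strict crowd-avoiding case can be solved numerically. The sketch below assumes the condition takes the form $(1-\bar{f})[1-\exp(-1/(1-\bar{f}))] = \bar{f}$, which follows from the argument above if the occupied fraction for $\lambda$ agents per restaurant is $1 - e^{-\lambda}$; the function names and the bisection bracket are illustrative choices.

```python
import math


def strict_avoidance_residual(f):
    """Left-hand side minus right-hand side of the assumed fixed-point
    condition (1 - f) * (1 - exp(-1 / (1 - f))) = f."""
    return (1.0 - f) * (1.0 - math.exp(-1.0 / (1.0 - f))) - f


def solve_strict_avoidance(lo=0.0, hi=0.9, tol=1e-10):
    """Bisection on [lo, hi]; the residual is positive at lo = 0 and
    negative at hi = 0.9, and decreases monotonically in between."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if strict_avoidance_residual(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    print(round(solve_strict_avoidance(), 3))
```

Bisection is used here only because the residual is monotonic on the bracket; any one-dimensional root finder would serve equally well.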

Stochastic crowd-avoiding case
In this section, we start with the following stochastic strategy. If an agent goes to restaurant $k$ in period $(t - 1)$, then the agent goes to the same restaurant in the next period with probability $p_k(t) = 1/n_k(t-1)$ and to any other restaurant $k'$ ($\neq k$) with probability $p_{k'}(t) = (1 - p_k(t))/(N - 1)$. In this process, the average utilization fraction is $\bar{f} \simeq 0.8$ and the distribution $D(f)$ is a Gaussian around $f \simeq 0.8$ (see figure 2).
Here we make an approximate estimate of $\bar{f}$. Let $a_i$ denote the fraction of restaurants where exactly $i$ agents ($i = 0, \ldots, N$) appeared on any evening and assume that $a_i = 0$ for $i \geq 3$. Therefore, $a_0 + a_1 + a_2 = 1$, $a_1 + 2a_2 = 1$ and hence $a_0 = a_2$. Given the strategy, a fraction $a_2$ of agents will make attempts to leave their respective restaurants the next evening ($t + 1$), while no intrinsic activity will occur in the restaurants where nobody came ($a_0$) or only one came ($a_1$) the previous evening ($t$). This fraction $a_2$ of agents will now get equally divided (each in the remaining $N - 1$ restaurants). Of these $a_2$, the fraction going to the vacant restaurants ($a_0$ in the earlier evening) is $a_0 a_2$. Hence, the new fraction of vacant restaurants is $a_0 - a_0 a_2$. In restaurants having exactly two agents (a fraction $a_2$ on the last evening), some vacancies will be created due to this process, and this is equal to $(a_2/4) - a_2(a_2/4)$. Steady state implies that $a_0 - a_0 a_2 + (a_2/4) - a_2(a_2/4) = a_0$ and hence, using $a_0 = a_2$, we obtain $a_0 = a_2 = 0.2$, giving $a_1 = 0.6$ and $\bar{f} = a_1 + a_2 = 0.8$. Of course, the above calculation is approximate as none of the restaurants is assumed to get more than two customers on any evening ($a_i = 0$ for $i \geq 3$). The advantage in assuming only $a_0$, $a_1$ and $a_2$ to be non-vanishing on any evening is that the activity of redistribution on the next evening starts from this fraction $a_2$ of restaurants. This, of course, affects $a_0$ and $a_1$ for the next evening, and for steady state these changes must balance. The computer simulation results also confirm that $a_i \leq 0.03$ for $i \geq 3$, and hence the above approximation does not lead to serious error.
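The stochastic crowd-avoiding rule of this section is simple enough to simulate directly. The following Python sketch is an illustrative Monte Carlo run, not the authors' code: the system size, number of evenings, burn-in length and random initial condition are all assumptions made for the example.

```python
import random


def simulate_crowd_avoiding(n_agents, n_steps, burn_in, seed=0):
    """Average utilization fraction for the stay-with-probability-1/n rule.

    Each agent stays at last evening's restaurant k with probability
    1 / n_k(t-1); otherwise it moves to one of the other N - 1
    restaurants chosen uniformly at random.
    """
    rng = random.Random(seed)
    n = n_agents
    choice = [rng.randrange(n) for _ in range(n)]  # assumed random start
    fractions = []
    for step in range(n_steps):
        counts = [0] * n
        for k in choice:
            counts[k] += 1
        if step >= burn_in:
            # A restaurant is utilized if at least one agent arrived.
            fractions.append(sum(1 for c in counts if c > 0) / n)
        new_choice = []
        for k in choice:
            if rng.random() < 1.0 / counts[k]:
                new_choice.append(k)  # stay
            else:
                other = rng.randrange(n - 1)  # uniform over the others
                if other >= k:
                    other += 1
                new_choice.append(other)
        choice = new_choice
    return sum(fractions) / len(fractions)


if __name__ == "__main__":
    print(simulate_crowd_avoiding(n_agents=500, n_steps=300, burn_in=100))
```

The returned average can be compared with the value $\bar{f} \simeq 0.8$ reported above; the exact figure depends on the chosen size and seed.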

Evolving stochastic strategy
In this section, we assume that agents have two possible exogenously given values of α : α = 0 or α = 1. We start by taking some random allocation of α over the set of N agents. The strategy followed by each agent thereafter is the following. If an agent starts with an α = 0(1) and fails to get dinner for the successive τ evenings, then the next evening the agent shifts to α = 1(0). The steady state distribution of α values in the population of agents does not depend on the initial allocation of α values in the population (see figure 3). However, as is obvious, for large values of τ N , the stability of the distribution disappears.
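The switching rule can be explored with a small simulation. The sketch below rests on several assumptions that the text does not fix: agents choose restaurants in the $T \to \infty$ limit (probability proportional to $k^{\alpha}$), the failure counter resets after a served evening or a switch, and ties at a restaurant are broken uniformly at random. All names are hypothetical.

```python
import random
from itertools import accumulate


def simulate_evolving_alpha(n_agents, tau, n_steps, initial_fraction_alpha1, seed=0):
    """Return the final fraction of agents holding alpha = 1."""
    rng = random.Random(seed)
    ranks = list(range(1, n_agents + 1))
    # Cumulative weights: uniform for alpha = 0, proportional to rank for alpha = 1.
    cum_weights = {
        0: list(accumulate(1 for _ in ranks)),
        1: list(accumulate(ranks)),
    }
    alphas = [1 if rng.random() < initial_fraction_alpha1 else 0
              for _ in range(n_agents)]
    failures = [0] * n_agents  # consecutive evenings without dinner
    for _ in range(n_steps):
        arrivals = {}
        for agent, alpha in enumerate(alphas):
            k = rng.choices(ranks, cum_weights=cum_weights[alpha])[0]
            arrivals.setdefault(k, []).append(agent)
        # One randomly chosen arrival is served at each occupied restaurant.
        served = {rng.choice(group) for group in arrivals.values()}
        for agent in range(n_agents):
            if agent in served:
                failures[agent] = 0
            else:
                failures[agent] += 1
                if failures[agent] >= tau:
                    alphas[agent] = 1 - alphas[agent]  # switch 0 <-> 1
                    failures[agent] = 0
    return sum(alphas) / n_agents


if __name__ == "__main__":
    for start in (0.1, 0.9):
        print(start, simulate_evolving_alpha(300, tau=3, n_steps=300,
                                             initial_fraction_alpha1=start))
```

Running it from different initial allocations is one way to probe the claimed independence of the steady state from the starting distribution.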

Convergence to a fair social norm with deterministic strategies
In the KPR problem, if the rational agents interact, then a fair social norm that can evolve is a periodically organized state with periodicity N, where each agent in turn gets served in all the N restaurants and all agents get served every evening. Can we find deterministic strategies (in the absence of a dictator) such that society achieves this fair social norm? There is one variant of Pavlov's win-shift lose-stay strategy (see [6]-[8]) that can be adopted to achieve the fair social norm and another variant that can be adopted to achieve the fair social norm in an asymptotic sense. Of course, these strategies are deterministic in nature.

Fair strategy
The fair strategy works as follows: It is easy to verify that this strategy gives a convergence to the fair social norm in less than or equal to N periods. Moreover, after convergence is achieved, the fair social norm is retained ever after. The difficulty with this strategy is that a myopic agent will find it hard to justify the action of going to the restaurant ranked last after getting served in the best ranked restaurant. However, if the agent is not that myopic and observes the past history of strategies played by all the agents, and can figure out that this one evening's loss is a tacit commitment devised for this kind of symmetric strategy to work, then this voluntary loss is not that implausible. Therefore, one needs to run experiments before arguing for or against this kind of symmetric deterministic strategy. More importantly, the fair strategy can be modified to take care of this justification problem, provided that one wants to achieve the fair social norm in an asymptotic sense.

Asymptotically fair strategy
The asymptotically fair strategy works as follows: (i) At time (evening) t = 0, agents can choose any restaurant either randomly or deterministically. (ii) If at time t agent i was in a restaurant ranked k and was served, then at time t + 1 the agent moves to the restaurant ranked k − 1 if k > 1 and goes to the same restaurant if k = 1. (iii) If agent i was in a restaurant ranked k at time t and was not served, then at time t + 1 the agent goes to the restaurant ranked N .
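Rules (i)-(iii) above can be sketched in Python as follows. The random initial choice and the uniform random selection of the served customer follow the problem statement; the function name and the returned quantity (number of served agents per evening) are illustrative choices.

```python
import random


def simulate_asymptotically_fair(n_agents, n_steps, seed=0):
    """Return the number of served agents on each evening.

    Rank 1 is the best restaurant and rank N the last, as in rules (ii)-(iii).
    """
    rng = random.Random(seed)
    n = n_agents
    position = [rng.randint(1, n) for _ in range(n)]  # rule (i), random start
    served_counts = []
    for _ in range(n_steps):
        arrivals = {}
        for agent, k in enumerate(position):
            arrivals.setdefault(k, []).append(agent)
        # Exactly one arrival is served at each occupied restaurant.
        served = {rng.choice(group) for group in arrivals.values()}
        served_counts.append(len(served))
        for agent, k in enumerate(position):
            if agent in served:
                position[agent] = k - 1 if k > 1 else 1  # rule (ii)
            else:
                position[agent] = n  # rule (iii)
    return served_counts


if __name__ == "__main__":
    counts = simulate_asymptotically_fair(n_agents=50, n_steps=200)
    print(counts[:5], counts[-5:])
```

In this sketch the number of served agents per evening settles at N − 1 or N once the initial crowd at the last-ranked restaurant has dispersed.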

Summary and discussion
We consider the KPR problem where the decision made by each agent in each time period $t$ is independent and is based on the information about the rank $k$ of the restaurants and their occupancy given by the numbers $n_k(t-1), \ldots, n_k(0)$. We consider in section 2 several stochastic strategies where each agent chooses the $k$th ranked restaurant with probability $p_k(t)$ given by equation (1). The utilization fraction $f_k$ of the $k$th ranked restaurants on every evening is studied and their average (over $k$) distributions $D(f)$ are shown in figure 1 for some special cases. From numerical studies, we find their distributions to be Gaussian with the most probable utilization fraction $\bar{f} \simeq 0.63$, 0.58 and 0.46 for the cases with $\alpha = 0$, $T \to \infty$; $\alpha = 1$, $T \to \infty$; and $\alpha = 0$, $T \to 0$, respectively. For the stochastic crowd-avoiding strategy discussed in section 2.3, we get the best utilization fraction $\bar{f} \simeq 0.8$. The analytical estimates for $\bar{f}$ in these limits are also given and they agree very well with the numerical observations. Finally, we suggest ways of achieving the fair social norm either exactly in the presence of the incentive problem or asymptotically in the absence of such an incentive problem. Implementing or achieving such a norm in a decentralized way is impossible when $N \to \infty$. The KPR problem has similarity with the minority game problem [5], as in both the games, herding behavior is punished and diversity is encouraged. Also, both involve learning of the agents from past successes. Of course, KPR has some simple exact solution limits, a few of which are discussed here. In none of the cases considered here are learning strategies individualistic; rather, all the agents choose following the probability given by equation (1). In a few different limits of such a learning strategy, the average utilization fraction $\bar{f}$ and their distributions are obtained and compared with the analytic estimates, which are reasonably close.
Needless to mention, the real challenge is to design algorithms of learning mixed strategies (e.g. from the pool discussed here) by the agents so that the fair social norm emerges eventually, even when everyone decides on the basis of their own information independently. As we have seen, some naive strategies give better values of $\bar{f}$ compared to most of the 'smarter' strategies like strict crowd-avoiding strategies (section 2.2). This observation in fact compares well with the earlier observation in minority games (see e.g. [9]).
It may be noted that all the stochastic strategies, being parallel in computational mode, have the advantage that they converge to a solution at smaller time steps ($\sim \sqrt{N}$ or weakly dependent on $N$), while for deterministic strategies the convergence time is typically of the order of $N$, which renders such strategies useless in the truly macroscopic ($N \to \infty$) limits. However, deterministic strategies are useful when $N$ is small and rational agents can design appropriate punishment schemes for the deviators (see [6]).
In brief, the study of the KPR problem shows that a dictated solution leads to one of the best possible solutions to the problem, with each agent getting his dinner at the best ranked restaurant with a period of $N$ evenings, and with the best possible value of $\bar{f}$ (= 1), starting from the first evening. The parallel decision strategies (employing evolving algorithms by the agents, and past information, e.g. of $n(t)$), which are necessarily parallel among the agents and stochastic (as in democracy), are less efficient ($\bar{f} < 1$; the best one discussed here, in section 2.3, gives $\bar{f} \simeq 0.8$ only). We also note that most of the 'smarter' strategies lead to much lower efficiency.
Is there an upper bound for the value of utilization fraction $\bar{f}$ (less than unity; easily achieved in the dictated solution) for such stochastic strategies employed in parallel (democratically) by the agents in KPR? If so, what is this upper bound value? Also, what is the learning time required to arrive at such a solution (compared to zero waiting time to arrive at the most efficient dictated solution) in KPR? These questions are to be investigated in future.