The chain rule implies Tsirelson's bound: an approach from generalized mutual information

In order to analyze an information-theoretical derivation of Tsirelson's bound based on information causality, we introduce a generalized mutual information (GMI), defined as the optimal coding rate of a channel with classical inputs and general probabilistic outputs. In the case where the outputs are quantum, the GMI coincides with the quantum mutual information. In general, the GMI does not necessarily satisfy the chain rule. We prove that Tsirelson's bound can be derived by imposing the chain rule on the GMI. We formulate a principle, which we call the no-supersignalling condition, which states that the assistance of nonlocal correlations does not increase the capability of classical communication. We prove that this condition is equivalent to the no-signalling condition. As a result, we show that Tsirelson's bound is implied by the nonpositivity of the quantitative difference between information causality and no-supersignalling.


Introduction
One of the most counterintuitive phenomena that quantum mechanics predicts is nonlocality. The statistics of the outcomes of measurements performed on an entangled state at two space-like separated points can exhibit strong correlations that cannot be described within the framework of local realism. This can be formulated in terms of the violation of Bell inequalities [1]. On the other hand, it is also known that quantum correlations still satisfy the no-signalling condition, i.e., they cannot be used for superluminal communication, which is prohibited by special relativity. The amount by which quantum mechanics can violate the Clauser-Horne-Shimony-Holt (CHSH) inequality [2] is limited by Tsirelson's bound [9]. In a seminal paper [3], Popescu and Rohrlich showed that Tsirelson's bound is strictly lower than the limit imposed by the no-signalling condition alone. This result raises the question of why the strength of nonlocality is limited to Tsirelson's bound in the quantum world. If we could find an operational principle, rather than a mathematical one, to answer this question, it would help us better understand why quantum mechanics is the way it is [6,7,8].
From an information-theoretical point of view, it is natural to ask if superstrong nonlocality, i.e., nonlocal correlations exceeding Tsirelson's bound, can be used to increase the capability of classical communication [4]. Suppose that Alice is trying to send classical information to distant Bob with the assistance of nonlocal correlations shared in advance. The no-signalling condition implies that, if no classical communication from Alice to Bob is performed, Bob's information gain is zero bits. In other words, zero bits of classical communication can produce no more than zero bits of classical information gain for the receiver. On the other hand, the no-signalling condition does not eliminate the possibility that $m > 0$ bits of classical communication produce more than $m$ bits of classical information gain for the receiver. Whether such an implausible situation can occur would depend on the strength of the nonlocal correlations. In particular, one might expect that Tsirelson's bound could be derived from the impossibility of such a situation.
Motivated by the foregoing considerations, information causality has been proposed as an answer to the question [4]. Information causality is the condition that in bipartite nonlocality-assisted random access coding protocols, the receiver's total information gain cannot be greater than the amount of classical communication allowed in the protocol. This condition is never violated in classical or quantum theory, whereas it is violated in all "supernonlocal" theories, i.e., theories that predict supernonlocal correlations [4]. This implies that Tsirelson's bound can be derived from this purely information-theoretical principle. Thus information causality is regarded as one of the basic informational principles at the foundations of quantum mechanics.
In [4], it is proved that information causality is never violated in any no-signalling theory in which we can define a mutual information satisfying five particular properties. This implies that in supernonlocal theories, we cannot define a function like the mutual information that satisfies all five. On the other hand, both the classical and quantum mutual information satisfy all of the five properties. It is therefore natural to ask another question: which of the five properties is lost in supernonlocal theories? We address this question to better understand the informational features of supernonlocal theories in comparison with quantum theory.
In order to answer this question, we need to define a generalization of the quantum mutual information that is applicable to general probabilistic theories. Several investigations have been made along this line. In [19,20], a generalized entropy $H$ is defined, and then a mutual information is defined in terms of this by $I(A : B) := H(A) + H(B) - H(A, B)$. Using this mutual information, it is proved that the data processing inequality is not satisfied in supernonlocal theories. Similar results are obtained in [21,22]. However, the definitions of the entropies in their approaches are mathematical, and do not have clear operational meanings. Note that in classical and quantum information theory, the operational meaning of entropy and mutual information is given by the source coding and channel coding theorems. In [20], a coding theorem analogous to Schumacher's quantum coding theorem [12] is investigated using generalized entropy. However, their consideration is only applicable under several restrictions. As discussed in [19], we need to seek generalizations based on the analysis of data compression or channel capacity. Such an approach is also studied in [11].
Motivated by these discussions, we introduce an operational definition of generalized mutual information (GMI) that is applicable to any general probabilistic theory. This is a generalization of the quantum mutual information between a classical system and a quantum system. Unlike the previous entropic approaches, we directly address the mutual information. The generalization is based on the channel coding theorem. Thus the GMI inherently has an operational meaning as a transmission rate of classical information. Our definition does not require mathematical notions such as state space or fine-grained measurement. The GMI is defined between a classical system and a general probabilistic system; it is not applicable to a pair of general probabilistic systems, but it is sufficient for analyzing the situation describing information causality. The GMI satisfies four of the five properties of the mutual information, the exception being the chain rule. We will show that violation of Tsirelson's bound implies violation of the chain rule of the GMI.
Using the GMI, we further investigate the derivation of Tsirelson's bound in terms of information causality. We formulate a principle, which we call the no-supersignalling condition, stating that the assistance of nonlocal correlations does not increase the capability of classical communication. We prove that this condition is equivalent to the no-signalling condition, and thus it is different from information causality. This result is similar to the result obtained in [20], but now becomes operationally supported. It implies that Tsirelson's bound is not derived from the condition that "m bits of classical communication cannot produce more than m bits of information gain". We show that Tsirelson's bound is derived from the nonpositivity of the quantitative difference between information causality and no-supersignalling. Our results indicate that the chain rule of the GMI imposes a strong restriction on the underlying physical theory. As an example of this fact, we show that we can derive a bound on the state space of one gbit from the chain rule.
This paper is organized as follows. In Section 2, we introduce a minimal framework for general probabilistic theories. In Section 3, we give a brief review of information causality. In Section 4, we define the generalized mutual information, and show that Tsirelson's bound is derived from the chain rule. In Section 5, we prove that the GMI is a generalization of the quantum mutual information. In Section 6, we formulate the no-supersignalling condition, and prove that the condition is equivalent to the no-signalling condition. In Section 7, we clarify the relation among no-supersignalling, information causality and Tsirelson's bound. In Section 8, we show that we can limit the state space of one gbit by assuming the chain rule. We conclude with a summary and discussion in Section 9.

General probabilistic theories
In this section we introduce a minimal framework for general probabilistic theories based on [20,23].
We associate a set of allowed states $\mathcal{S}_S$ with each physical system $S$. We assume that any probabilistic mixture of states is also a state, i.e., if $\phi_1 \in \mathcal{S}_S$ and $\phi_2 \in \mathcal{S}_S$ then $\phi_{\rm mix} = p\phi_1 + (1-p)\phi_2 \in \mathcal{S}_S$, where $p\phi_1 + (1-p)\phi_2$ denotes the state that is a mixture of $\phi_1$ with probability $p$ and $\phi_2$ with probability $1-p$.
We also associate a set of allowed measurements $\mathcal{M}_S$ with each system $S$. A set of outcomes $\mathcal{R}_e$ is associated with each measurement $e \in \mathcal{M}_S$. The state determines the probability of obtaining an outcome $r \in \mathcal{R}_e$ when a measurement $e \in \mathcal{M}_S$ is performed on the system $S$. Thus we associate each outcome $r \in \mathcal{R}_e$ with a functional $e_r : \mathcal{S}_S \to [0, 1]$, such that $e_r(\phi)$ is the probability of obtaining outcome $r$ when a measurement $e$ is performed on a system in the state $\phi$. Such a functional is called an effect. In order that the statistics of measurements on mixed states agree with our intuition, we require the linearity of each effect, i.e., $e_r(\phi_{\rm mix}) = p\,e_r(\phi_1) + (1-p)\,e_r(\phi_2)$.
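The linearity requirement on effects can be illustrated with a small numerical sketch. Here a state is represented as a tuple of real numbers and an effect as a linear functional on it; the helper names `mix` and `apply_effect` and the two-component toy states are hypothetical choices made for this illustration and are not part of the framework of [20,23].

```python
from typing import List, Sequence


def mix(p: float, phi1: Sequence[float], phi2: Sequence[float]) -> List[float]:
    """Probabilistic mixture p*phi1 + (1-p)*phi2 of two states."""
    return [p * a + (1 - p) * b for a, b in zip(phi1, phi2)]


def apply_effect(effect: Sequence[float], phi: Sequence[float]) -> float:
    """Probability e_r(phi) that a linear effect assigns to a state."""
    return sum(e * s for e, s in zip(effect, phi))


# Hypothetical toy system: a state lists the probabilities of outcomes 0 and 1
# of a single two-outcome measurement.
phi1 = [1.0, 0.0]
phi2 = [0.25, 0.75]
e0 = [1.0, 0.0]  # effect "outcome 0 occurs"

p = 0.4
# Probability of outcome 0 on the mixed state ...
lhs = apply_effect(e0, mix(p, phi1, phi2))
# ... equals the mixture of the probabilities on the component states.
rhs = p * apply_effect(e0, phi1) + (1 - p) * apply_effect(e0, phi2)
```

In this toy model both sides evaluate to the same number, which is what linearity of an effect demands.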
It may be possible to perform transformations on a system. A transformation on the system $S$ is described by a map $\mathcal{E} : \mathcal{S}_S \to \mathcal{S}_{S'}$, where $S'$ denotes the output system. We assume the linearity of transformations, i.e., $\mathcal{E}(\phi_{\rm mix}) = p\,\mathcal{E}(\phi_1) + (1-p)\,\mathcal{E}(\phi_2)$. A measurement $e \in \mathcal{M}_S$ is represented by a transformation $\mathcal{E}_M : \mathcal{S}_S \to \mathcal{S}_{T_S}$, where $T_S$ represents a classical system corresponding to the register of the measurement outcome. We assume that the composition of two allowed transformations is also an allowed transformation, and that any allowed transformation followed by an allowed measurement is an allowed measurement.
We assume that a composition of two systems is also a system. If we have two systems $A$ and $B$, we can consider a composite system $AB$ which has its own set of allowed states $\mathcal{S}_{AB}$ and its own set of allowed measurements $\mathcal{M}_{AB}$. Suppose that measurements $e_A \in \mathcal{M}_A$ and $e_B \in \mathcal{M}_B$ are performed on the systems $A$ and $B$, respectively. Such a measurement is called a product measurement and is included in $\mathcal{M}_{AB}$. We assume that a global state $\psi \in \mathcal{S}_{AB}$ determines a joint probability for each pair of effects $(e_{A,r}, e_{B,r'})$. We may also assume that the global state is uniquely specified if the joint probabilities for all pairs of effects $(e_{A,r}, e_{B,r'})$ are specified. Such an assumption is called the global state assumption. However, it is known that there exist general probabilistic theories that do not satisfy this assumption, such as quantum theory in a real Hilbert space. The arguments presented in the following sections of this paper are developed under the global state assumption, although the main results are valid without this assumption. The generalization for theories without this assumption is given in Appendix B.

Review of information causality
Information causality, introduced in [4], is the principle that the total amount of classical information gain that the receiver can obtain in a bipartite nonlocality-assisted random access coding protocol cannot be greater than the amount of classical communication that is allowed in the protocol. Suppose that a string of $n$ random and independent bits $\vec{X} = X_1, \cdots, X_n$ is given to Alice, and a random number $k \in \{1, \cdots, n\}$ is given to distant Bob. The task is for Bob to correctly guess $X_k$ under the condition that they can use a resource of shared correlations and an $m$-bit one-way classical communication from Alice to Bob (see Figure 1). To accomplish this task, Alice first performs a measurement on her part of the resource (denoted by $A$ in the figure), depending on $\vec{X}$. She then constructs an $m$-bit message $M$ from $\vec{X}$ and the measurement outcome, and sends it to Bob. Bob, after receiving $M$, performs a measurement on his part of the resource (denoted by $B$ in the figure), depending on $M$ and $k$. From the outcome of the measurement he computes his guess $G_k$ for $X_k$. The efficiency of the protocol is quantified by

$$J := \sum_{k=1}^{n} I_C(X_k : G_k), \qquad (1)$$

where $I_C(X_k : G_k)$ is the classical mutual information between $X_k$ and $G_k$. Information causality is the condition that, whatever strategy they take and whatever resource of shared correlations allowed in the theory they use,

$$J \le m \qquad (2)$$

must hold for all $m \ge 0$.

The derivation of Tsirelson's bound in terms of information causality consists of two theorems, Theorem 3.1 and Theorem 3.2, that are proved in [4]. Theorem 3.1 guarantees that both classical and quantum theory satisfy information causality. Theorem 3.2 implies that information causality is violated in all supernonlocal theories. These two theorems imply that, in any supernonlocal theory, we cannot define a function like the mutual information that satisfies all five properties.
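To give a feeling for the numbers involved, the following sketch evaluates Bob's total information gain for the simplest case $n = 2$, $m = 1$. It assumes, purely for illustration, that each guess $G_k$ is correct with probability $(1 + E)/2$ with symmetric errors, so that each term equals $1 - h((1+E)/2)$ with $h$ the binary entropy; the parameter `E` and the function names are hypothetical and not taken from [4]. This single-box case does not by itself single out Tsirelson's bound, which in [4] requires a more elaborate protocol.

```python
import math


def binary_entropy(q: float) -> float:
    """h(q) = -q log2 q - (1-q) log2 (1-q), with h(0) = h(1) = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)


def total_information_gain(E: float, n: int = 2) -> float:
    """Sum over k of I(X_k : G_k), assuming every guess is correct with
    probability (1 + E) / 2 and errors are symmetric (illustrative model)."""
    p_success = (1 + E) / 2
    return n * (1 - binary_entropy(p_success))


m = 1  # number of communicated bits

# Assumed example values of E: a perfect (PR-box-like) correlation and the
# value 1/sqrt(2) associated with Tsirelson's bound.
gain_pr = total_information_gain(1.0)
gain_q = total_information_gain(1 / math.sqrt(2))

print(f"E = 1        : gain = {gain_pr:.3f} bits, m = {m}")
print(f"E = 1/sqrt(2): gain = {gain_q:.3f} bits, m = {m}")
```

Under these assumptions the perfect correlation yields a total gain above $m$, whereas the value $E = 1/\sqrt{2}$ stays below it.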

Generalized mutual information
Suppose that we have a classical system $X$ and a system $S$ that is described by a general probabilistic theory. The states of $X$ are labeled by a finite alphabet $\mathcal{X}$. For each state $x$ of $X$, the corresponding state of $S$, denoted by $\phi_x$, is determined. The state of the composite system $XS$ is determined by a probability distribution $p(x) = \Pr(X = x)$, which represents the probability that the system $X$ is in the state $x$, and the corresponding state $\phi_x$ of $S$. Thus the state of the composite system $XS$ is identified with an ensemble $\{p(x), \phi_x\}_{x \in \mathcal{X}}$. To define a generalized mutual information $I_G(X : S)$ between the system $X$ and the system $S$ in the state $\{p(x), \phi_x\}_{x \in \mathcal{X}}$, we analyze the classical information capacity of a channel that outputs the system $S$ in the state $\phi_x$ according to the input $X = x$ (Figure 2). As usually considered in information theory, the sender Alice, who has access to $X$, tries to send classical information to the receiver Bob, who has access to $S$, by using the channel many times. Suppose that they use $l$ identical and independent copies of this channel. Let $X_1, \cdots, X_l$ be the inputs of the $l$ channels and $S_1, \cdots, S_l$ be the corresponding output systems.

Figure 2: The channel defining the mutual information between the system $X$ and the system $S$. It has a classical system as the input system and a general probabilistic system as the output system.
Alice's encoding scheme is determined by a codebook. Let $w \in \{1, \cdots, N\}$ be a message that Alice tries to communicate, and the codeword $x^l(w) = x_1(w) \cdots x_l(w)$ be the corresponding input sequence to the channels. The codebook $C$ is defined as the list of the codewords for all messages by

$$C := \begin{pmatrix} x_1(1) & x_2(1) & \cdots & x_l(1) \\ \vdots & \vdots & & \vdots \\ x_1(N) & x_2(N) & \cdots & x_l(N) \end{pmatrix}.$$
The letter frequency $f(x)$ for the codebook is defined by

$$f(x) := \frac{|\{(k, w) \mid x_k(w) = x,\ 1 \le k \le l,\ 1 \le w \le N\}|}{lN}.$$

For a given probability distribution $\{p(x)\}_{x \in \mathcal{X}}$, the tolerance $\tau$ of the code is defined in terms of the deviation of $f(x)$ from $p(x)$.

By performing a decoding measurement on the output systems $S_1, \cdots, S_l$, Bob tries to guess what the original message $w$ is. Let $D$ denote the decoding measurement. Note that, in general, the decoding measurement is not one in which Bob performs a measurement on each of $S_1, \cdots, S_l$ individually, but one in which the whole of the composite system $S_1 \cdots S_l$ is subjected to a measurement. Let $W$ and $\hat{W}$ be Alice's original message and Bob's decoding outcome, respectively. The average error probability $P_e$ is defined by

$$P_e := \frac{1}{N} \sum_{w=1}^{N} \Pr(\hat{W} \ne w \mid W = w).$$

The pair of the codebook $C$ and the decoding measurement $D$ is called an $(N, l)$ code.
The ratio $\log N / l$ is called the rate of the code, and represents how many bits of classical information are transmitted per use of the channel.
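The quantities attached to a codebook can be computed directly, as the following sketch shows. The letter frequency counts how often each letter appears among all $lN$ code letters, and the rate is $\log_2 N / l$. The function `tolerance` uses the maximum deviation $\max_x |f(x) - p(x)|$, which is an assumption made here for illustration; the toy codebook and all function names are likewise hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def letter_frequency(codebook: Sequence[str]) -> Dict[str, float]:
    """Fraction of all l*N code letters that equal each letter x."""
    counts = Counter(x for codeword in codebook for x in codeword)
    total = sum(len(codeword) for codeword in codebook)
    return {x: c / total for x, c in counts.items()}


def tolerance(codebook: Sequence[str], p: Dict[str, float]) -> float:
    """Deviation of the letter frequency from p(x).
    ASSUMPTION: measured as the maximum absolute deviation over letters."""
    f = letter_frequency(codebook)
    letters = set(f) | set(p)
    return max(abs(f.get(x, 0.0) - p.get(x, 0.0)) for x in letters)


def rate(codebook: Sequence[str]) -> float:
    """Rate log2(N) / l of an (N, l) code, in bits per channel use."""
    n_messages = len(codebook)
    length = len(codebook[0])
    return math.log2(n_messages) / length


# Toy (N, l) = (4, 4) codebook over the alphabet {0, 1}.
codebook = ["0011", "0101", "1010", "1100"]
p = {"0": 0.5, "1": 0.5}

print(letter_frequency(codebook), tolerance(codebook, p), rate(codebook))
```

For this toy codebook every letter appears in half of the positions, so the frequency matches the uniform distribution exactly and the rate is half a bit per channel use.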
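Definition 4.1 A rate $R$ is said to be achievable with $p(x)$ if there exists a sequence of $(2^{lR}, l)$ codes $(C^{(l)}, D^{(l)})$ such that (i) $P_e^{(l)} \to 0$ when $l \to \infty$, (ii) $\tau^{(l)} \to 0$ when $l \to \infty$.

Definition 4.2 The mutual information between a classical system $X$ and a general probabilistic system $S$, denoted by $I_G(X : S)$, is the function which satisfies the condition that a rate $R$ is achievable with $p(x)$ if $R < I_G(X : S)$, and is not achievable with $p(x)$ if $R > I_G(X : S)$. We also define $I_G(S : X)$ by $I_G(S : X) := I_G(X : S)$.

Theorem 4.3 $I_G(X : S)$ exists and satisfies $I_G(X : S) \le H(X)$. Here, $H(X)$ is the Shannon entropy of the system $X$ defined by $H(X) := -\sum_{x \in \mathcal{X}} p(x) \log p(x)$.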
Proof.First we prove the existence of R * := sup {R|R is achievable with p(x)}.Consider a (2 lR , l) code and suppose that Alice's message W = 1, • • • , 2 lR is uniformly distributed.Let I ′ , H ′ be the mutual information and the entropy when the input sequence is the codeword corresponding to the uniformly distributed message W .By Fano's inequality, we have e lR + 1 (7) where Here, we use the data processing inequality in the first inequality.By introducing a classical variable K that indicates k with the probability distribution P (K = k) = 1/l, we also have where X is a random variable defined by Pr(X = x k (w)) = 2 −lR /l.From ( 8) and ( 9), we obtain If R is achievable with p(x), there exists a sequence of (2 lR , l) codes satisfying Next we prove that any rate R < R * is also achievable with p(x).Let {(C * (l) , D * (l) )} l be a sequence of (2 lR * , l) codes that satisfies P * (l) e → 0 and τ * (l) → 0. For arbitrary 0 ≤ λ < 1, define another codebook C (l) by using C * (λl) for the first λl codeletters and by choosing the last (1 − λ)l codeletters arbitrarily so that the total tolerance is sufficiently small.Also define the corresponding decoding measurement D (l) as the measurement in which the output system S 1 • • • S λl is subjected to the decoding measurement D * (l) and the output systems S λl+1 , • • • , S l are ignored.The code sequence {(C (l) , D (l) )} l constructed in this way is a sequence of (2 lλR * , l) codes that satisfies P (l) e → 0 and τ (l) → 0. Thus R = λR * is achievable with p(x).Hence we obtain R * = I G (X : S).
Note that $I_G(X : S)$ is a function of the state $\Gamma := \{p(x), \phi_x\}_{x \in \mathcal{X}}$ of the composite system $XS$. To emphasize this, we sometimes use the notation $I_G(X : S)_\Gamma$. Since $R = 0$ is always achievable, $I_G(X : S)$ is nonnegative. Shannon's noisy channel coding theorem guarantees that $I_G(X : S)$ coincides with the classical mutual information $I_C(X : S)$ if $S$ is a classical system [15]. The generalized mutual information satisfies the data processing inequality as follows.
Property 4.4 Let $\mathcal{E}_{S \to S'}$ be any local transformation that maps states of a general probabilistic system $S$ into states of another general probabilistic system $S'$. If $\mathcal{E}_{S \to S'}$ contains no post-selection, the generalized mutual information does not increase under this transformation, i.e., $I_G(X : S) \ge I_G(X : S')$. Similarly, $I_G(X : S) \ge I_G(X' : S)$ under any local transformation $\mathcal{E}_{X \to X'}$ that maps states of a classical system $X$ into states of another classical system $X'$ without post-selection.
Proof. Here we only prove the former part. For the latter part, see Appendix A. Consider two channels, channel I and channel II (see Figure 3). Depending on the input $X = x$, channel I emits the system $S$ in the state $\phi_x$, and channel II emits the system $S'$ in the state $\phi'_x = \mathcal{E}_{S \to S'}(\phi_x)$. It is only necessary to verify that if a rate $R$ is achievable with $p(x)$ by channel II, $R$ is also achievable with $p(x)$ by channel I. Let $\{(C'^{(l)}, D'^{(l)})\}_l$ be a sequence of $(2^{lR}, l)$ codes for channel II with the average error probability $P'^{(l)}_e$ and the tolerance $\tau'^{(l)}$. From the code $(C'^{(l)}, D'^{(l)})$, construct a $(2^{lR}, l)$ code $(C^{(l)}, D^{(l)})$ for channel I by $C^{(l)} = C'^{(l)}$ and $D^{(l)} = D'^{(l)} \circ \mathcal{E}^{\otimes l}_{S \to S'}$. Here, $D'^{(l)} \circ \mathcal{E}^{\otimes l}_{S \to S'}$ represents a process in which first $\mathcal{E}_{S \to S'}$ is applied to each of $S_1, \cdots, S_l$ individually and then the decoding measurement $D'^{(l)}$ is performed on the total output system $S'_1 \cdots S'_l$. The average error probability and the tolerance of this code are given by $P^{(l)}_e = P'^{(l)}_e$ and $\tau^{(l)} = \tau'^{(l)}$, respectively. Hence, if $P'^{(l)}_e \to 0$ and $\tau'^{(l)} \to 0$, we also have $P^{(l)}_e \to 0$ and $\tau^{(l)} \to 0$, and thus $R$ is achievable with $p(x)$ by channel I.
In general probabilistic theories, a measurement on a system $S$ without post-selection is described by a probabilistic map $\mathcal{E}_M$ that maps states of $S$ into states of a classical system $T_S$, which represents the register of the measurement outcomes. As a special case of Property 4.4, we have $I_G(X : T_S) \le I_G(X : S)$ under $\mathcal{E}_M$, which is a generalization of Holevo's inequality. Let us define the accessible information $I_{acc}(X : S)$ by

$$I_{acc}(X : S) := \max_{\mathcal{E}_M} I_C(X : T_S),$$

where the maximization is taken over all possible measurements on $S$. Then we have $0 \le I_{acc}(X : S) \le I_G(X : S)$.
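In the quantum case this inequality reduces to Holevo's bound, which can be checked numerically on a small example. The sketch below considers an assumed ensemble of the two pure qubit states $|0\rangle$ and $|+\rangle$ with equal probabilities. It uses the standard fact that an equal mixture of two pure states with overlap $c$ has eigenvalues $(1 \pm c)/2$, so the Holevo quantity is $h((1+c)/2)$, and compares it with the classical mutual information obtained from one particular measurement (the computational basis, chosen here for illustration and not claimed to be optimal). All names are hypothetical.

```python
import math
from typing import Sequence


def shannon_entropy(dist: Sequence[float]) -> float:
    """Shannon entropy in bits, with 0 log 0 = 0."""
    return -sum(q * math.log2(q) for q in dist if q > 0)


def mutual_information(p_x: Sequence[float],
                       p_t_given_x: Sequence[Sequence[float]]) -> float:
    """Classical mutual information I(X:T) = H(T) - H(T|X)."""
    n_outcomes = len(p_t_given_x[0])
    p_t = [sum(px * row[t] for px, row in zip(p_x, p_t_given_x))
           for t in range(n_outcomes)]
    h_t_given_x = sum(px * shannon_entropy(row)
                      for px, row in zip(p_x, p_t_given_x))
    return shannon_entropy(p_t) - h_t_given_x


# Ensemble: |0> and |+> with probability 1/2 each; overlap |<0|+>| = 1/sqrt(2).
overlap = 1 / math.sqrt(2)
# The average state has eigenvalues (1 +/- overlap) / 2; the states are pure,
# so the Holevo quantity is the entropy of the average state.
holevo = shannon_entropy([(1 + overlap) / 2, (1 - overlap) / 2])

# Computational-basis measurement: |0> gives outcome 0 with certainty,
# |+> gives either outcome with probability 1/2.
info_z = mutual_information([0.5, 0.5], [[1.0, 0.0], [0.5, 0.5]])

print(f"information from this measurement: {info_z:.4f} bits")
print(f"Holevo quantity:                   {holevo:.4f} bits")
```

The information extracted by this measurement stays below the Holevo quantity, in line with $I_{acc}(X : S) \le I_G(X : S)$ for quantum outputs.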
To summarize, the generalized mutual information satisfies the following properties.
• Symmetry: $I_G(X : S) = I_G(S : X)$.
• Nonnegativity: $I_G(X : S) \ge 0$.
• Consistency: When $S$ is a classical system, $I_G(X : S) = I_C(X : S)$.
• Data Processing Inequality: $I_G(X : S) \ge I_G(X' : S')$ under local stochastic maps $\mathcal{E}_{X \to X'}$ and $\mathcal{E}_{S \to S'}$ that contain no post-selection.
Thus, from Theorem 3.1 and Theorem 3.2, we conclude that the chain rule of the generalized mutual information should be violated in any supernonlocal theory.
Throughout the rest of this paper, we use the generalized mutual information (GMI) given by Definition 4.2.
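For comparison, the classical mutual information always satisfies the chain rule, and this is easy to verify numerically. The sketch below checks the standard classical identity $I(X,Y : S) = I(X : S) + I(Y : S | X)$ on a toy joint distribution in which $S$ is the XOR of two uniform bits. The distribution and the helper names are assumptions made for illustration, and the precise form in which the chain rule is imposed on the GMI may be stated differently.

```python
import math
from collections import defaultdict
from itertools import product
from typing import Dict, Sequence, Tuple


def entropy(joint: Dict[Tuple[int, ...], float], keep: Sequence[int]) -> float:
    """Shannon entropy (bits) of the marginal on the coordinates in `keep`."""
    marginal = defaultdict(float)
    for outcome, prob in joint.items():
        marginal[tuple(outcome[i] for i in keep)] += prob
    return -sum(q * math.log2(q) for q in marginal.values() if q > 0)


# Toy joint distribution over (x, y, s): X and Y uniform bits, S = X XOR Y.
joint = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}

X, Y, S = 0, 1, 2  # coordinate indices

# I(X,Y : S) = H(X,Y) + H(S) - H(X,Y,S)
i_xy_s = entropy(joint, [X, Y]) + entropy(joint, [S]) - entropy(joint, [X, Y, S])
# I(X : S) = H(X) + H(S) - H(X,S)
i_x_s = entropy(joint, [X]) + entropy(joint, [S]) - entropy(joint, [X, S])
# I(Y : S | X) = H(X,Y) + H(X,S) - H(X,Y,S) - H(X)
i_y_s_given_x = (entropy(joint, [X, Y]) + entropy(joint, [X, S])
                 - entropy(joint, [X, Y, S]) - entropy(joint, [X]))

print(i_xy_s, i_x_s, i_y_s_given_x)
```

In this example $X$ alone carries no information about $S$, while $Y$ carries one bit once $X$ is known, so the two terms on the right-hand side add up to the one bit on the left.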

Quantum mutual information
The quantum mutual information between a classical system X and a quantum system S is defined by and H(S) is the von Neumann entropy. Note that, in quantum theory, a classical system is described by a Hilbert space in which we only consider a set of orthogonal pure states. With a slight generalization of the Holevo-Schumacher-Westmoreland theorem, it is shown that the GMI is a generalization of the quantum mutual information.
Proof.To prove this, it is only necessary to verify the following two statements: The first statement is proved in [13,14] by using random code generation, and the second statement is proved in the following way.Consider a (2 lR , l) code and suppose that Alice's message W = 1, • • • , 2 lR is uniformly distributed.Similarly to (8), we have e lR + 1 .(17) Here, we use the data processing inequality.We also have In the first line, we use the fact that the state of S k depends only on X k .The first inequality is from the subadditivity of the von Neumann entropy.The last equality holds since K → X → S forms a Markov chain.From ( 17) and ( 18), we obtain If R is achievable with p(x), there exists a sequence of (2 lR , l) codes satisfying P (l) e → 0 and I ′ Q (X : S) → I Q (X : S) ρ when l → ∞.Thus R ≤ I Q (X : S) ρ .

No-supersignalling condition
In this section, to further investigate the derivation of Tsirelson's bound from information causality, we formulate a principle that we call the no-supersignalling condition by using the GMI. Suppose that Alice is trying to send to distant Bob information about $n$ independent classical bits $\vec{X} = X_1, \cdots, X_n$, under the condition that they can only use an $m$-bit classical communication $M$ from Alice to Bob and a supplementary resource of correlations shared in advance (see Figure 4). The situation is similar to the setting of information causality described in Section 3, but now we do not introduce random access coding. Instead, we evaluate Bob's information gain by $I_G(\vec{X} : M, B)$.
We say that the no-supersignalling condition is satisfied if

$$I_G(\vec{X} : M, B) \le m \qquad (20)$$

holds for all $m \ge 0$. The condition indicates that the assistance of correlations cannot increase the capability of classical communication. It is a direct formulation of the original concept of information causality that "m bits of classical communication cannot produce more than m bits of information gain". In what follows, we prove that the no-supersignalling condition is equivalent to the no-signalling condition. This indicates that information causality and no-supersignalling are different.

Proof of Lemma 6.1. Consider a channel with an input system $X$ and two output systems $S$ and $Y$ (see Figure 5). Let $\mathcal{Z}$ be the set of all measurements on $S$, and $p(t|x, y, z)$ be the probability of obtaining the outcome $t$ when the measurement $z \in \mathcal{Z}$ is performed on the system $S$ in the state $\phi_{xy}$. To achieve $I_{acc}(X : S, Y)$, the receiver performs a measurement on $S$ possibly depending on $Y$. Let $z(y)$ be the optimal choice of the measurement when $Y = y$. The probability of obtaining the outcome $t$ when $X = x$ and $Y = y$ is given by $p_1(t|x, y) := p(t|x, y, z(y))$.
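The condition $I_{acc}(X : S) = 0$ implies that for all $z \in \mathcal{Z}$,

$$\sum_{y} p(x, y)\, p(t|x, y, z) = p(x)\, p_2(t|z),$$

where

$$p_2(t|z) := \sum_{x, y} p(x, y)\, p(t|x, y, z).$$

Thus we obtain

$$p_1(t, x, y) = p(x, y)\, p(t|x, y, z(y)) \le \sum_{y'} p(x, y')\, p(t|x, y', z(y)) = p(x)\, p_2(t|z(y)). \qquad (25)$$

The accessible information $I_{acc}(X : S, Y)$ is equal to the mutual information $I_C(X : T, Y)$ calculated for the probability distribution $p_1(t, x, y)$. Therefore

$$I_{acc}(X : S, Y) = \sum_{t, x, y} p_1(t, x, y) \log \frac{p_1(t, x, y)}{p(x)\, p_1(t, y)} \le \sum_{t, y} p_1(t, y) \log \frac{p_2(t|z(y))}{p_1(t, y)} = H(Y) + \sum_{t, y} p_1(t, y) \log \frac{p_2(t, y)}{p_1(t, y)} \le H(Y).$$

Figure 5: The channel that we consider to prove Lemma 6.1. For each pair of the input $X = x$ and the output $Y = y$, the corresponding state $\phi_{xy}$ of the output system $S$ is determined.

In the first inequality, we used (25). In the next equality we defined a probability distribution $p_2(t, y) := p_2(t|z(y))\, p(y)$. The last inequality is from the nonnegativity of the relative entropy.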
Theorem 6.2 The no-supersignalling condition defined in terms of the GMI (20) is equivalent to the no-signalling condition.
Proof.Consider a (2 lR , l) code for the channel presented in Figure 5 and let X = X, Y = M and S = B. Suppose that Alice's message is uniformly distributed.By Fano's inequality, we have By the data processing inequality, we also have From the no-signalling condition, we have I ′ acc (X l : S l ) = 0. From Lemma 6.1, we obtain and thus Hence we obtain If R is achievable with p(x), there exists a sequence of (2 lR , l) codes that satisfies e → 0 and H ′ (Y ) → H(Y ) when l → ∞.Thus, for any R that is achievable with p(x), we have R ≤ H(Y ).It implies I G (X : Y, S) ≤ H(Y ) and thus I G ( X : M , B) ≤ m.Conversely, for m = 0, the no-supersignalling condition I G (X : B) = 0 implies the no-signalling condition.

The difference between no-supersignalling and information causality
In this section, we discuss the relation among information causality, no-supersignalling, Tsirelson's bound and the chain rule. Let us define

$$\Delta_{NSS} := I_G(\vec{X} : M, B) - m, \qquad \Delta_{IC} := \sum_{k=1}^{n} I_C(X_k : G_k) - m, \qquad \Delta' := \Delta_{IC} - \Delta_{NSS} = \sum_{k=1}^{n} I_C(X_k : G_k) - I_G(\vec{X} : M, B).$$

$\Delta_{NSS}$ quantifies how much the capability of classical communication is increased by the assistance of nonlocal correlations. No-supersignalling is equivalent to $\Delta_{NSS} \le 0$, and information causality is equivalent to $\Delta_{IC} \le 0$. $\Delta'$ quantifies the difference between no-supersignalling and information causality. Theorem 3.2 states that, if Tsirelson's bound is violated, we have $\Delta_{IC} > 0$. Therefore violation of Tsirelson's bound implies at least one of $\Delta_{NSS} > 0$ and $\Delta' > 0$. Which of the two does violation of Tsirelson's bound imply? As we proved in Section 6, $\Delta_{NSS} \le 0$ is satisfied by all no-signalling theories. Thus violation of Tsirelson's bound only implies $\Delta' > 0$. Therefore, Tsirelson's bound is not derived from the condition that the assistance of nonlocal correlations does not increase the capability of classical communication. Instead, Tsirelson's bound is derived from the nonpositivity of $\Delta'$ (see Figure 6).

Let us further define a quantity $\Delta_{CR}$ for which the chain rule is equivalent to $\Delta_{CR} = 0$. By the data processing inequality, we always have $\Delta_{CR} \ge \Delta'$. Thus the chain rule implies Tsirelson's bound ‡ through imposing $\Delta' \le \Delta_{CR} = 0$.

Let $X$ and $Y$ be two classical systems and $S$ be a general probabilistic system. The chain rule of the GMI for these systems is given by (35). Each term in (35) has an operational meaning as an information transmission rate by definition. The relation is satisfied in both classical and quantum theory, but is violated in all supernonlocal theories. Thus we can conclude that this highly nontrivial relation gives a strong restriction on the underlying physical theories. However, the operational meaning of this relation is not clear so far.

‡ Another way to show this is to observe that the data processing inequality and the no-supersignalling condition imply $\Delta_{CR} \ge \Delta_{IC}$.

Restriction on one gbit state space
To investigate how the chain rule of the GMI imposes a restriction on physical theories, we consider a gbit, the counterpart of a qubit in general probabilistic theories [18].
Here, we do not make assumptions about a gbit such as the dimension of the state space, or the possibility or impossibility of various measurements and transformations.Instead, we define a gbit as the minimum unit of information in the theory, and require that the classical information capacity of one gbit is not more than one bit.Thus we require for any classical system X.When X is a classical system composed of two independent and uniformly random bits X 0 and X 1 , we have By the chain rule, we have By the data processing inequality, we also have Thus the chain rule implies We consider success probabilities of the decoding measurements on S 1gb for X 0 and X 1 .
For simplicity, we assume that the optimal measurement performed on S 1gb to decode X 0 or X 1 has two outcomes t = 0, 1.Let P (t|m, x 0 , x 1 ) be the probability of obtaining the outcome t when X 0 = x 0 , X 1 = x 1 and the measurement m is performed.The index m = 0, 1 corresponds to the optimal measurement for decoding X 0 , X 1 , respectively.The list of all probabilities {P (t|m, x 0 , x 1 )} t,m,x 0 ,x 1 =0,1 can be regarded as representing a "state".We compare the state space of a qubit and the state space determined by (40).For further simplicity, we assume that for all x 0 and x 1 , Then we have and Here, h(x) is the binary entropy defined by h(x) := −x log x − (1 − x) log (1 − x).From (40), ( 41) and (42), we have This inequality gives a restriction on the state space of one gbit (see Figure 7).It is shown in Appendix B that in the case of one qubit, the obtainable region is given by α 2 + β 2 ≤ 1.

Conclusions and discussions
We have defined a generalized mutual information (GMI) between a classical system and a general probabilistic system. Since the definition is based on the channel coding theorem, the GMI inherently has an operational meaning as an information transmission rate. We showed that the GMI coincides with the quantum mutual information if the output system is quantum. The GMI satisfies nonnegativity, symmetry, the data processing inequality, and the consistency with the classical mutual information, but does not necessarily satisfy the chain rule.
Using the GMI, we have analyzed the derivation of Tsirelson's bound from information causality defined in terms of the efficiency of nonlocality-assisted random access coding. We showed that the chain rule of the GMI, which is satisfied in both classical and quantum theory, is violated in any theory in which the existence of nonlocal correlations exceeding Tsirelson's bound is allowed. Thus we conclude that the chain rule of the GMI implies Tsirelson's bound.
We formulated a condition, the no-supersignalling condition, which states that the assistance of nonlocal correlations does not increase the capability of classical communication. We proved that this condition is equivalent to the no-signalling condition. We also clarified the relation among no-supersignalling, information causality, Tsirelson's bound and the chain rule.
The derivation of Tsirelson's bound from information causality proposed in [4] is remarkable in that Tsirelson's bound is exactly derived and that to do so we only need the five properties of the mutual information. However, information causality is different from the condition that "m bits of classical communication cannot produce more than m bits of information gain". This derivation shows that several laws of Shannon theory, represented by the five properties of the mutual information, taken together impose a strong restriction on the underlying physical theory. If we take the GMI as the definition of the mutual information, it reduces to the statement that "a law of Shannon theory, namely the chain rule of the GMI, imposes a strong restriction on the underlying physical theory".
Although the operational meaning of the GMI is clear, we have not yet succeeded in finding a clear operational meaning of the chain rule. In classical and quantum Shannon theory, the chain rule appears in many proofs of coding theorems. Therefore, investigating the meaning of the chain rule would lead us to a better understanding of the informational foundations of quantum mechanics. On the other hand, our definition of the generalized mutual information is not the only way to generalize the quantum mutual information. It would also be fruitful to seek out other operationally motivated definitions of the generalized mutual information and compare them.

Lemma A.2 $\tau^{(l)} \to 0$ in probability in the limit $l \to \infty$.
Proof. Let $f(x)^{(l)}$ and $f(x')^{(l)}$ be the letter frequencies of the codebooks $C^{(l)}$ and $C'^{(l)}$, respectively. We have [...]

Define

$$f(x, x')^{(l)} := \frac{\left|\{(k, w) \mid x_k(w) = x,\ x'_k(w) = x',\ 1 \le k \le l,\ 1 \le w \le 2^{lR}\}\right|}{l \cdot 2^{lR}}$$

for $x \in \mathcal{X}$, $x' \in \mathcal{X}'$. By using the relation [...] $f(x')^{(l)}$, (A.13) [...]

[...] $\phi_x$ according to the input $X = x$. If the input sequence is $x_1 \cdots x_l$, the state of the output system $S_1 \cdots S_l$ is $\phi_{x_1} \cdots \phi_{x_l}$. However, without the global state assumption, this does not specify the "global" state of the composite system: it only specifies the state of the composite system for product measurements. Thus it is not sufficient to determine the rate of the channel. To avoid this difficulty, we introduce the notion of "consistency" of the states.

Let $\Phi_{x_1 \cdots x_l}$ be a global state of $S_1 \cdots S_l$. We say $\Phi_{x_1 \cdots x_l}$ is consistent with $\phi_{x_1} \cdots \phi_{x_l}$ if the two states exhibit the same statistics for any product measurement. $\Phi^{(l)} := \{\Phi$ [...] With a slight abuse of terminology, we say $\Phi := \{\Phi^{(l)}\}_{l=1}^{\infty}$ is consistent with $\{\phi_x\}_{x \in \mathcal{X}}$ if $\Phi^{(l)}$ is consistent with [...] Let $\Gamma_\Phi := \{\Gamma^{(l)}_\Phi\}_{l=1}^{\infty}$ be the sequence of channels $\Gamma^{(l)}_\Phi$ that output the system $S_1 \cdots S_l$ in the state $\Phi_{x_1 \cdots x_l} \in \Phi^{(l)} \in \Phi$ according to the input [...]

Definition B.1 A rate $R$ is said to be achievable with $p(x)$ for $\Phi$ if there exists a sequence of $(2^{lR}, l)$ codes $(C^{(l)}, D^{(l)})$ for $\Gamma^{(l)}_\Phi \in \Gamma_\Phi$ such that (i) $P_e^{(l)} \to 0$ when $l \to \infty$, (ii) $\tau^{(l)} \to 0$ when $l \to \infty$.

Definition B.2 A rate $R$ is said to be achievable with $p(x)$ if $R$ is achievable with $p(x)$ for all $\Phi$ that are consistent with $\{\phi_x\}_{x \in \mathcal{X}}$.
We define the generalized mutual information by Definition 4.2, and its existence is proved by Theorem 4.3. The data processing inequality (Property 4.4) is proved as follows.

Proof. The inequality $I_G(X : S) \ge I_G(X : S')$ under a local transformation $\mathcal{E}_{S \to S'}$ is proved as follows.

$$
\begin{aligned}
I_G(X : S') &= \sup\{R \mid R \text{ is achievable for all } \Phi' \text{ that are consistent with } \{\mathcal{E}(\phi_x)\}_{x \in \mathcal{X}}\} \\
&\le \sup\{R \mid R \text{ is achievable for } \mathcal{E}(\Phi) \text{ for all } \Phi \text{ that are consistent with } \{\phi_x\}_{x \in \mathcal{X}}\} \\
&\le \sup\{R \mid R \text{ is achievable for all } \Phi \text{ that are consistent with } \{\phi_x\}_{x \in \mathcal{X}}\} \\
&= I_G(X : S).
\end{aligned}
\tag{B.1}
$$

Here, $\mathcal{E}(\Phi) := \{\mathcal{E}^{\otimes l}(\Phi^{(l)})\}_{l=1}^{\infty}$ and $\mathcal{E}^{\otimes l}(\Phi^{(l)}) := \{\mathcal{E}^{\otimes l}(\Phi$ [...] The first inequality comes from the fact that $\mathcal{E}(\Phi)$ is consistent with $\{\mathcal{E}(\phi_x)\}_{x \in \mathcal{X}}$ if $\Phi$ is consistent with $\{\phi_x\}_{x \in \mathcal{X}}$. The second inequality is proved in the same way as the proof presented on page 9.

The inequality $I_G(X : S) \ge I_G(X' : S)$ under a local transformation $\mathcal{E}_{X \to X'}$ is proved as follows.

$I_G(X' : S) = \sup\{R \mid R \text{ is achievable for all } \Phi' \text{ that are consistent with } \{\phi_{x'}\}_{x' \in \mathcal{X}'}\}$ [...]

Theorem 3.1 If we can define a function $I(A : B)$ satisfying the following five properties in the general probabilistic theory, $J \le m$ holds for all $m \ge 0$. The properties are
• Symmetry: $I(A : B) = I(B : A)$ for any systems $A$ and $B$.
• Nonnegativity: $I(A : B) \ge 0$ for any systems $A$ and $B$.

Figure 1. Nonlocality-assisted random access coding. The task is for Bob to correctly guess $X_k$, where $k$ is a random number unknown to Alice.

Figure 2. The channel defining the mutual information between the system $X$ and the system $S$. It has a classical system as the input system and a general probabilistic system as the output system.

Figure 3. Channel II defined as the combination of channel I and $\mathcal{E}_{S \to S'}$.

Figure 4. The situation that the no-supersignalling condition refers to. The amount of information about $X$ contained in $M$ and $B$ is quantified by $I_G(X, M, B)$.

Figure 6. The relations among no-supersignalling, information causality, and the chain rule. Information causality refers to the gap in (1), represented by $\Delta_{IC}$. No-supersignalling refers to the gap in (2), represented by $\Delta_{NSS}$, and is irrelevant to Tsirelson's bound. The gap in (3), represented by $\Delta'$, is crucial in the derivation of Tsirelson's bound. $\Delta'$ is bounded above by zero if the chain rule is satisfied.

Figure 7. Comparison of the state space of a qubit and the boundary given by the chain rule. The grey region indicates the state space of a qubit, given by $\alpha^2 + \beta^2 \le 1$. The black region together with the grey region indicates the region defined by (43).

Figure A1. Channel III defined as the combination of $\mathcal{E}_{X \to X'}$ and channel I. This channel as a whole is equivalent to a channel with the input $x'$ and the output $\phi_{x'}$.