Quantum learning: asymptotically optimal classification of qubit states

Pattern recognition is a central topic in learning theory, with numerous applications such as voice and text recognition, image analysis and computer diagnosis. The statistical setup in classification is the following: we are given an i.i.d. training set $(X_1, Y_1), \dots, (X_n, Y_n)$, where $X_i$ represents a feature and $Y_i \in \{0, 1\}$ is a label attached to that feature. The underlying joint distribution of $(X, Y)$ is unknown, but we can learn about it from the training set, and we aim at devising low-error classifiers $f: X \to Y$ used to predict the label of new incoming features. In this paper, we solve a quantum analogue of this problem, namely the classification of two arbitrary unknown mixed qubit states. Given a number of 'training' copies from each of the states, we would like to 'learn' about them by performing a measurement on the training set. The outcome is then used to design measurements for the classification of future systems with unknown labels. We find the asymptotically optimal classification strategy and show that typically it performs strictly better than a plug-in strategy, which consists of estimating the states separately and then discriminating between them using the Helstrom measurement. The figure of merit is the excess risk, equal to the difference between the probability of error and the probability of error of the optimal measurement for known states. We show that the excess risk scales as $n^{-1}$ and compute the exact constant of the rate.


Introduction
Statistical learning theory [1,2,3,4] is a broad research field stretching over statistics and computer science, whose general goal is to devise algorithms which have the ability to learn from data. One of the central learning problems is how to recognise patterns [5], with practical applications in speech and text recognition, image analysis, computer-aided diagnosis and data mining. The paradigm of Quantum Information theory is that quantum systems carry a new type of information with potentially revolutionary applications such as faster computation and secure communication [6]. Motivated by these theoretical challenges, Quantum Engineering is developing new tools to control and accurately measure individual quantum systems [7]. In the process of engineering exotic quantum states, statistical validation has become a standard experimental procedure [8,9], and Quantum Statistical Inference has passed from its purely theoretical status in the 70's [10,11] to a more practically oriented theory at the interface between the classical and quantum worlds [12,13,14,15]. In this paper we put forward a new type of quantum statistical problem inspired by learning theory, namely quantum state classification. Similar ideas have already appeared in the physics [16,17,18,19] and learning [20,21,22] literature, but here we emphasise the close connection with learning and we aim at going beyond the special models based on group symmetry and pure states. However, we limit ourselves to a two-dimensional state, which could be regarded as a toy model from the viewpoint of learning theory, but we hope that more interesting applications will follow.

Before explaining what quantum classification is, let us briefly mention the classical set-up we aim at generalising. In supervised learning the goal is to learn to predict an output $y \in \mathcal{Y}$, given the input (object) $x \in \mathcal{X}$, where input and output are assumed to be correlated and have an unknown joint distribution $P$ over $\mathcal{X} \times \mathcal{Y}$. To do this, we are first provided with a set of $n$ previously observed inputs with known output variables (called training examples), i.e. independent random pairs $(X_i, Y_i)$, $i = 1, \dots, n$, drawn from $P$. Using the training set, we construct a function $h_n : \mathcal{X} \to \mathcal{Y}$ to predict the output for future, yet unseen objects. When $\mathcal{Y} = \{0, 1\}$, i.e. the output is a binary variable, this is called binary classification and is the typical set-up in pattern recognition. The input space is usually considered to be a subset of the $p$-dimensional space $\mathbb{R}^p$, so that the object $x$ can be described by $p$ measurement values, often called features. This description is very general, as it can handle, for example, categorical (non-numerical) values (encoded as integer numbers), images (the measured brightness of each pixel corresponds to a separate feature), time series (features correspond to the values of the signal at given times), etc.

In this paper, we consider the classification problem in which the objects to be classified are quantum states. Simply put, we have a quantum system prepared in either of two unknown quantum states and we want to know which one it is. As in the classical case, this only makes sense if we are also provided with training examples from both states, with their respective labels, from which we can learn about the two alternatives. How could such a scenario occur? Suppose we send one bit of information through a noisy quantum channel which is not known. To decode the information (the input in this case) we need to be able to classify the output states corresponding to the two inputs. Alternatively, the binary variable may be related to a coupling of the channel which we want to detect. Needless to say, quantum systems are intrinsically statistical and can be 'learned' only by repeated preparation, so that the problem is really the quantum extension of the classical classification problem.

On the other hand, this is related to the problem of state discrimination which, in the case of two hypotheses, has an explicit solution known as the Helstrom measurement [11]. The point is that when the states are unknown, the Helstrom measurement is itself unknown and has to be learned from the training set. An intuitive solution would be a plug-in procedure: first estimate the two states, and then apply the Helstrom measurement corresponding to the estimates to any new to-be-classified state. This indeed gives a reasonable classification strategy but, as we will see, it is not the best one. The optimal strategy in the asymptotic framework is to directly estimate the Helstrom measurement without intermediate state estimation. Optimality is defined by the natural figure of merit called excess risk, which is the difference between the expected error probability and the error probability of the Helstrom measurement. We show that the excess risk converges to zero with the size of the training set as $n^{-1}$, and that the ratio between the optimal and the state estimation plug-in risk is a constant factor. Our analysis is valid for arbitrary mixed states and is performed in a pointwise, local minimax (rather than Bayesian) setting which captures the behaviour of the risk around any pair of states. The key theoretical tool is the recently developed theory of local asymptotic normality (LAN) for quantum states [23,24,25,26], which is an extension of the classical concept in mathematical statistics introduced by Le Cam [27]. Roughly, LAN says that the collective state $\rho_\theta^{\otimes n}$ of $n$ i.i.d. quantum systems can be approximated by a simple Gaussian state of some classical variables and quantum oscillators. This was used to derive optimal state estimation strategies for arbitrary mixed states of arbitrary finite dimension, and also in finding quantum teleportation benchmarks for multiple qubit states [28]. In this paper, LAN is used to identify the (asymptotically) optimal measurement on the training set as a linear measurement on two harmonic oscillators. Similarly to the case of state estimation, such collective measurements perform strictly better than the local ones [29,30]. Moreover, the optimal collective measurement for learning is different from the optimal measurement for state estimation, showing once again that, generically, different quantum decision problems cannot be solved optimally simultaneously.

Related work. Sasaki and Carlini [16] defined a quantum matching machine which aims at pairing a given 'feature' state with the closest out of a set of 'template' states. The problem is formulated in a Bayesian framework with uniform priors over the feature and template pure states, which are considered to be unknown. Bergou and Hillery [17] introduced a discrimination machine, which corresponds to our set-up in the special case when the training set is of size $n = 1$. The papers [18,19] deal with the problem of quantum state identification as defined in this paper. The special case of Bayesian risk with uniform priors over pure states was solved in [18], with the small difference that the learning and classification steps are done in a single measurement over $n + 1$ systems. However, as in the case of state estimation [31], the proof relies on the special symmetry of the prior and does not cover mixed states. Finally, the concept of quantum classification was already proposed in a series of papers [20,21,22]. However, the authors mostly focused on problem formulation, reduction between different problem classes and general issues regarding learnability. Other related papers which fall outside the scope of our investigation are [32,33].

This paper is organised as follows. Section 2 gives a short overview of the classical classification set-up and introduces its quantum analogue. Section 3 discusses the LAN theory with emphasis on the qubit case. In Section 4 we reformulate the classification problem in the asymptotic (local) framework, as an estimation problem with quadratic loss for the training set. The main result is Theorem 5.1 of Section 5, which gives the minimax excess risk for the case of known priors. The case of unknown priors is treated in Section 5.2. The optimal classifier is compared to the plug-in procedure based on optimal state estimation in Section 5.1. The geometry of the problem is captured by the Bloch ball illustrated in Figure 4. We conclude the paper with discussions.

Classical Learning
Let $(X, Y)$ be a pair of random variables with joint distribution $P$ over the measure space $(\mathcal{X} \times \{0, 1\}, \Sigma)$. In the classical setting $\mathcal{X}$ is usually a subset of $\mathbb{R}^p$ and $Y$ is a binary variable.

In a first stage we are given a training set of $n$ i.i.d. pairs $\{(X_1, Y_1), \dots, (X_n, Y_n)\}$ with distribution $P$, from which we would like to 'learn' about $P$. In the second stage we are presented with a new sample $X$ and we are asked to guess its unseen label $Y$. For this we construct a (random) classifier $\hat{h}_n : \mathcal{X} \to \{0, 1\}$ which depends on the data $(X_1, Y_1), \dots, (X_n, Y_n)$. Its overall accuracy is measured in terms of the expected error rate under the data distribution $P$,

$$P_e(\hat{h}_n) := E\left[ 1_{\hat{h}_n(X) \neq Y} \right],$$

where $1_C$ is the indicator function equal to 1 if $C$ is true, and 0 otherwise. However, the error rate itself does not give a good indication of the performance of the learning method. Indeed, even an 'oracle' who knows $P$ exactly typically has a non-zero error: in this case the optimal classifier is the Bayes classifier, which chooses the label that is more probable with respect to the conditional distribution $P(y|x)$,

$$h^*(x) = \begin{cases} 1 & \text{if } \eta(x) \geq 1/2, \\ 0 & \text{otherwise,} \end{cases}$$

where $\eta(x) := P(Y = 1|x)$. The Bayes risk is

$$P_e(h^*) = E\left[ \min\{\eta(X), 1 - \eta(X)\} \right].$$

An alternative view of the Bayes classifier, which fits more naturally in the quantum set-up, is the following. We are given data $X$ whose probability distribution is either $P_0(X) := P(X|Y = 0)$ or $P_1(X) := P(X|Y = 1)$ and we would like to test between the two hypotheses. We are in a Bayesian set-up where the hypotheses are chosen randomly with prior probabilities $\pi_i = P(Y = i)$. The optimal solution of this problem is the well-known likelihood ratio test: we choose the hypothesis with the higher likelihood,

$$h^*(X) = \begin{cases} 1 & \text{if } \pi_1 p_1(X) \geq \pi_0 p_0(X), \\ 0 & \text{otherwise,} \end{cases}$$

which can easily be verified to be identical to the previously defined Bayes classifier. The Bayes risk can be written as

$$P_e(h^*) = \frac{1}{2}\left( 1 - \| \pi_0 p_0 - \pi_1 p_1 \|_1 \right), \tag{2}$$

where $p_i$ are the densities of $P(X|Y = i)$ with respect to some common reference measure. Returning to the classification set-up where $P$ is unknown, we see that a more informative performance measure for $\hat{h}_n$ is the excess risk

$$R(\hat{h}_n) = P_e(\hat{h}_n) - P_e(h^*) \geq 0,$$

which measures how much worse the procedure $\hat{h}_n$ performs compared to the oracle classifier. In statistical learning theory one is primarily interested in consistent classifiers, for which the excess risk converges to 0 as $n \to \infty$, and then in finding classifiers with fast convergence rates [2,3]. But how can different learning procedures be compared? One can always design algorithms which work well for certain distributions and badly for others.

Here we take the statistical approach and consider that all prior information about the data is encoded in the statistical model $\{P_\theta : \theta \in \Theta\}$, i.e. the data comes from a distribution which depends on some unknown parameter $\theta$ belonging to a parameter space $\Theta$. The latter may be a subset of $\mathbb{R}^k$ (parametric) or a large class of distributions with certain 'smoothness' properties (non-parametric). One can then define the maximum risk of $\hat{h}_n$,

$$R_{\max}(\hat{h}_n) := \sup_{\theta \in \Theta} R_\theta(\hat{h}_n), \tag{4}$$

where $R_\theta$ denotes the excess risk when the underlying distribution is $P_\theta$. A procedure $\tilde{h}_n$ is called minimax if its maximum risk is no larger than that of any other procedure,

$$R_{\max}(\tilde{h}_n) = \inf_{\hat{h}_n} R_{\max}(\hat{h}_n).$$

Alternatively one can take a Bayesian approach and optimise the average risk with respect to a given prior over $\Theta$.
The crucial feature leading to exponentially small risk was the fact that the regression function $\eta(X)$ is bounded away from the critical value 1/2. This situation is rather special, but it shows that the behaviour of the excess risk depends on the properties of $\eta$ around the value 1/2. Let us look at another simple example with a different behaviour.
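for some unknown means $a < b$, and $P(Y = 0) = 1/2$. From Figure 2.1 we can see that $p_0(x) \leq p_1(x)$ if and only if $x \geq (a + b)/2$, so that the Bayes classifier is

$$h^*(x) = \begin{cases} 1 & \text{if } x \geq (a + b)/2, \\ 0 & \text{otherwise.} \end{cases}$$

The Bayes risk is equal to the orange area under the two curves. Again a natural classifier is obtained by estimating the midpoint $(a + b)/2$ and plugging the estimate into the above formula. The additional error is the area of the green triangle. Since the midpoint is estimated with an error of order $n^{-1/2}$ and the area of the triangle is quadratic in this error, the excess risk is of order $n^{-1}$, and it can be shown that this rate of convergence is optimal [34].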
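The quadratic dependence of the additional error on the threshold error can be checked numerically. The following is a minimal sketch, assuming unit-variance normal class densities and equal priors (the unit variance and the function names are assumptions made for illustration); it evaluates the exact error of a threshold classifier and its excess over the Bayes risk.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def error_probability(threshold: float, a: float, b: float) -> float:
    """Error of the rule 'predict 1 iff x >= threshold'.

    Assumes X|Y=0 ~ N(a, 1), X|Y=1 ~ N(b, 1) and P(Y=0) = 1/2
    (unit variance is an assumption of this sketch).
    """
    wrong_when_0 = 1.0 - normal_cdf(threshold - a)  # label 0, but x >= threshold
    wrong_when_1 = normal_cdf(threshold - b)        # label 1, but x < threshold
    return 0.5 * (wrong_when_0 + wrong_when_1)


def excess_risk(delta: float, a: float, b: float) -> float:
    """Excess risk when the midpoint (a + b) / 2 is misplaced by delta."""
    midpoint = 0.5 * (a + b)
    return error_probability(midpoint + delta, a, b) - error_probability(midpoint, a, b)


if __name__ == "__main__":
    for delta in (0.05, 0.1, 0.2):
        print(delta, excess_risk(delta, a=0.0, b=1.0))
```

Doubling the threshold error roughly quadruples the excess risk, so a midpoint estimate with error of order $n^{-1/2}$ gives an excess risk of order $n^{-1}$.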
From this example we see that the rate is determined by the behaviour of the regression function $\eta$ around 1/2, a property which is called the margin condition. Roughly speaking, in a parametric model satisfying the margin condition, the excess risk goes to zero as $O(n^{-1})$. In non-parametric models (which are the main focus of learning theory), arbitrarily slow rates are possible, depending on the complexity of the model and the behaviour of the regression function [34]. According to Vapnik [3], one of the principles of statistical learning is: "when solving a problem of interest, do not solve a more general problem as an intermediate step." This is interpreted as saying that learning procedures which first estimate the statistical model (or regression function) and then plug this estimate into the Bayes classifier are less efficient than methods which aim at constructing $\hat{h}(x)$ directly. Recently it has been shown [34] that this is not necessarily the case if some type of margin condition is assumed, and that plug-in estimators $\hat{h}^{\text{PLUG-IN}}$ can perform close to, or at, 'fast $n^{-1}$ rates'. In this paper we show that, at least as far as the constant in front of the rate is concerned, direct quantum learning performs better than plug-in methods based on optimal state estimation. This is a purely quantum phenomenon which stems from the incompatibility between the optimal measurements for estimation and learning.

Figure 2.1. Likelihood functions for two normal distributions with means a, b. The Bayes risk is the area of the orange triangle. The excess risk is the area of the green triangle.

Quantum Learning
We now consider the quantum counterpart of the learning problem, the classification of quantum states. In this case, $\mathcal{X}$ is replaced by a Hilbert space of dimension $d$. To find the counterpart of $P$ we write $P(dx, y) = P(dx|y)P(y)$ and replace the conditional distributions $P(dx|y = 0)$ and $P(dx|y = 1)$ by density matrices $\rho$ and $\sigma$, while $P(y)$ describes prior probabilities over the states, usually denoted by $\pi_y := P(Y = y)$. There is no direct counterpart of the object $x$, since the quantum state is identified with its description in terms of a density matrix; however, one can think of $x$ as a set of values obtained by measuring the state $\rho$.

The training set consists of $n$ i.i.d. pairs, each made of a label $Y_j$ drawn with probabilities $\pi_0, \pi_1$ and a quantum system prepared in the state $\rho$ if $Y_j = 0$ and in the state $\sigma$ if $Y_j = 1$. Thus we are randomly given copies of $\rho$ and $\sigma$ together with their labels, but we do not know what $\rho$ and $\sigma$ are. After a permutation, the joint state of the training set can be concisely written as $\rho^{\otimes n_0} \otimes \sigma^{\otimes n_1}$, where $n_y$ is the number of copies for which $Y_j = y$.

The experimenter is allowed to perform any physical operations on the training set (such as unitary evolution or measurements) and outputs a binary-valued measurement on $\mathbb{C}^2$ with POVM elements $\hat{M}_n := (\hat{P}_n, 1 - \hat{P}_n)$. This (random) POVM plays the role of the classical classifier $\hat{h}_n$: given a new copy of the quantum state whose label is unknown, we apply the measurement $\hat{M}_n$ to guess whether the state is $\rho$ or $\sigma$. The accuracy is measured in terms of the expected misclassification error

$$P_e(\hat{M}_n) = E\left[ \pi_0 \mathrm{Tr}\left(\rho (1 - \hat{P}_n)\right) + \pi_1 \mathrm{Tr}\left(\sigma \hat{P}_n\right) \right],$$

where the expectation is taken over the outcomes $\hat{P}_n$.

The Bayes classifier $M^*$ is nothing but the Helstrom measurement [11], which optimally discriminates between known states $\rho, \sigma$ with priors $\pi_0, \pi_1$. In this case $M^* = (P^*, 1 - P^*)$, where $P^*$ is the projection onto the subspace of positive eigenvalues of the operator $\pi_0 \rho - \pi_1 \sigma$, i.e. $P^* = [\pi_0 \rho - \pi_1 \sigma]_+$. Note that if both eigenvalues are of the same sign, the optimal procedure is to choose the state with higher $\pi_i$ without making any measurement at all. The Helstrom risk can be expressed as

$$P_e(M^*) = \frac{1}{2}\left( 1 - \| \pi_0 \rho - \pi_1 \sigma \|_1 \right),$$

which is the quantum extension of (2). As before, the performance of an arbitrary classifier $\hat{M}_n$ is measured by the excess risk

$$R(\hat{M}_n) := P_e(\hat{M}_n) - P_e(M^*) \geq 0, \tag{6}$$

which is expected to vanish asymptotically with $n$.
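For known states the Helstrom measurement is easy to compute. The following NumPy sketch (the function names are hypothetical and chosen for illustration) builds qubit states from Bloch vectors, constructs the projection onto the positive part of $\pi_0 \rho - \pi_1 \sigma$, and compares the resulting error probability with the trace-norm expression for the Helstrom risk.

```python
import numpy as np

# Pauli matrices sigma_x, sigma_y, sigma_z
PAULIS = np.array([
    [[0, 1], [1, 0]],
    [[0, -1j], [1j, 0]],
    [[1, 0], [0, -1]],
], dtype=complex)


def qubit_state(bloch):
    """Density matrix (1/2)(1 + r . sigma) for a Bloch vector r with |r| <= 1."""
    r = np.asarray(bloch, dtype=float)
    return 0.5 * (np.eye(2) + np.einsum("i,ijk->jk", r, PAULIS))


def helstrom_projection(rho, sigma, pi0):
    """Projection onto the positive eigenspace of pi0*rho - pi1*sigma."""
    gamma = pi0 * rho - (1.0 - pi0) * sigma
    evals, evecs = np.linalg.eigh(gamma)
    positive = evecs[:, evals > 0]       # may have 0, 1 or 2 columns
    return positive @ positive.conj().T  # zero, rank-one or identity


def error_probability(P, rho, sigma, pi0):
    """pi0 Tr(rho (1 - P)) + pi1 Tr(sigma P): outcome P means 'guess rho'."""
    identity = np.eye(2)
    value = pi0 * np.trace(rho @ (identity - P)) + (1.0 - pi0) * np.trace(sigma @ P)
    return float(np.real(value))


def helstrom_risk(rho, sigma, pi0):
    """(1/2)(1 - trace norm of pi0*rho - pi1*sigma)."""
    gamma = pi0 * rho - (1.0 - pi0) * sigma
    return 0.5 * (1.0 - float(np.abs(np.linalg.eigvalsh(gamma)).sum()))


if __name__ == "__main__":
    rho, sigma = qubit_state([0.0, 0.0, 0.8]), qubit_state([0.6, 0.0, 0.0])
    P_star = helstrom_projection(rho, sigma, pi0=0.5)
    print(error_probability(P_star, rho, sigma, 0.5), helstrom_risk(rho, sigma, 0.5))
```

When both eigenvalues have the same sign, `helstrom_projection` returns zero or the identity, which corresponds to guessing the more probable state without measuring.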
In Table 1 we summarise the analogous concepts in the classical and the quantum learning set-up. Besides these obvious correspondences we would like to point out some interesting differences. Based on the coin toss Example 2.1, one may expect that the classification of two qubit states should exhibit similar exponentially fast rates. In fact, as we will show in this paper, the rate is $n^{-1}$, as in Example 2.2 where the data is not discrete but continuous and the regression function is not bounded away from 1/2. A possible explanation is the fact that in the quantum case the 'data' to be labelled is a quantum system, and the distribution of the outcome depends on the measurement. A helpful way to think about it is illustrated in Figure 2.2. The unknown label is the input of a black box which outputs the data $X$ with conditional distribution $P(X|Y)$. In the quantum case the box has an additional input, the measurement choice, which appears as a parameter in the conditional distribution and is controlled by the experimenter. The game is to learn from the training set the optimal value of this parameter, for which the identification of the label $Y$ is easiest. This set-up resembles that of active learning [35], where the training data $X_i$ are actively chosen rather than collected randomly.

Local minimax formulation of optimality
The unknown parameters of the problem are the two states $\rho, \sigma$ and the prior $\pi_0$. We denote these parameters collectively by $\theta$, which belongs to a parameter space $\Theta \subset \mathbb{R}^k$. When some prior information is available about the model, it can be included by restricting to a sub-model of the general one. As in the classical case we denote by $R_\theta(\hat{M}_n)$ the risk of $\hat{M}_n$ at $\theta$, and we can define the maximum risk as in (4). However, assuming for the moment that the optimal rate of classification is $n^{-1}$, we use a more refined performance measure, which is the local version of the maximum risk around a fixed parameter $\theta_0$,

$$R^{(l)}_{\max}(\hat{M}_n) := \sup_{\|\theta - \theta_0\| \leq n^{-1/2+\epsilon}} n R_\theta(\hat{M}_n),$$

where $\epsilon > 0$ is a small number. Note that in the above definition the usual risk was multiplied by the inverse of its rate, $n$, so that we can expect $R^{(l)}_{\max}$ to have a non-trivial limit when $n \to \infty$. The reason for choosing the local maximum risk is that it better reflects the difficulty of the problem in different regions of the parameter space, while the maximum risk captures the worst possible behaviour over the whole parameter space. We can think of the local ball $\|\theta - \theta_0\| \leq n^{-1/2+\epsilon}$ as the intrinsic parameter space when the training set consists of $n$ samples. Indeed, a simple estimator $\theta_0$ computed on a small proportion $\tilde{n} = n^{1-\epsilon}$ of the sample locates the true parameter in such a ball with high probability (see Lemma 2.1 in [24]).
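Definition 2.1. The local minimax risk at $\theta_0$ is defined as

$$R^{(l)}_{\mathrm{minmax}}(\theta_0) := \limsup_{n \to \infty} \inf_{\hat{M}_n} R^{(l)}_{\max}(\hat{M}_n).$$

A sequence of classifiers $\{\tilde{M}_n : n \in \mathbb{N}\}$ is called locally asymptotic minimax if

$$\limsup_{n \to \infty} R^{(l)}_{\max}(\tilde{M}_n) = R^{(l)}_{\mathrm{minmax}}(\theta_0).$$

We identify two general learning strategies. The first one consists in estimating the states $\rho, \sigma$ and the prior $\pi_0$ (optimally) to get $\hat{\rho}, \hat{\sigma}, \hat{\pi}_0$, and then constructing the classifier (measurement) as

$$\hat{M}_n := (\hat{P}_n, 1 - \hat{P}_n), \qquad \hat{P}_n := [\hat{\pi}_0 \hat{\rho} - \hat{\pi}_1 \hat{\sigma}]_+ .$$

The second strategy aims at estimating the Helstrom projection $P^*$ directly from the training set without passing through state estimation. As we will see, in general the latter performs better than the former.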
In Section 3 we review the concept of local asymptotic normality, which means that locally the training set can be efficiently approximated by a simple Gaussian model consisting of displaced thermal equilibrium states and classical Gaussian random variables. In Section 4 we show how to reduce the local classification risk for qubits to an expectation of a quadratic form in the local parameters. This simplifies the problem of finding the optimal measurement of the training set to that of finding the optimal measurement of a Gaussian state for a quadratic loss function [10].

Local asymptotic normality
In a series of papers, Guţă and Kahn [23,24,25] and Guţă and Jenčová [26] developed a new approach to state estimation based on the extension of the classical statistical concept of local asymptotic normality [27]. Using this tool one can cast the problem of (asymptotically) optimal state estimation into the much simpler one of estimating the mean of a Gaussian state with known variance.

Local asymptotic normality provides a convenient description of quantum statistical models involving i.i.d. quantum states, which can also be applied to the present learning problem. In this section we give a brief introduction to this subject, in as much as it is necessary for this paper, and we refer to [25] for proofs and a more in-depth analysis.

Local asymptotic normality in classical statistics
A typical statistical problem is the estimation of some unknown parameter $\theta$ from a sample $X_1, \dots, X_n \in \mathcal{X}$ of independent, identically distributed random variables drawn from a distribution $P_\theta$ over a measure space $(\mathcal{X}, \Sigma)$. If $\theta$ belongs to an open subset of $\mathbb{R}^k$ for some finite dimension $k$ and if the map $\theta \to P_\theta$ is sufficiently smooth, then widely used estimators $\hat{\theta}_n(X_1, \dots, X_n)$ such as the maximum likelihood estimator are asymptotically optimal, in the sense that they converge to $\theta$ at rate $n^{-1/2}$ and the error has an asymptotically normal distribution

$$\sqrt{n}\,(\hat{\theta}_n - \theta) \xrightarrow{\ \mathcal{L}\ } N\left(0, I(\theta)^{-1}\right), \tag{9}$$

where $I(\theta)$ is the Fisher information matrix and the right side is the lower bound set by the Cramér-Rao inequality for unbiased estimators. To give a simple example, if $X_i \in \{0, 1\}$ is the result of a coin toss with $P[X_i = 1] = \theta$ and $P[X_i = 0] = 1 - \theta$, then the sufficient statistic $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ satisfies (9) by the Central Limit Theorem (CLT).

Naturally, the first inquiries into quantum statistics concentrated on generalising the Cramér-Rao inequality to unbiased measurements, and on finding asymptotically optimal estimators which achieve the quantum version of the Fisher information matrix [11,10,36]. However, it was found that, due to the additional uncertainty introduced by the non-commutative nature of quantum mechanics, the situation is essentially different from the classical case. A summary of these findings is: (i) the multi-dimensional version of the Cramér-Rao bound is in general not achievable; (ii) the optimal measurement depends on the loss function, i.e. the quadratic form $(\hat{\theta} - \theta)^t G (\hat{\theta} - \theta)$, and different weight matrices $G$ lead in general to incompatible measurements.
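The coin toss example can be checked exactly, without simulation. The sketch below (function names are hypothetical) uses the binomial distribution of $n \bar{X}_n$ to compute the variance of $\sqrt{n}(\bar{X}_n - \theta)$, which should equal the inverse Fisher information $\theta(1 - \theta)$, and the gap between the exact and the limiting Gaussian distribution function at a point $t$.

```python
import math


def binomial_pmf(n: int, k: int, theta: float) -> float:
    """P[k heads in n tosses], computed in log space to avoid overflow.

    Requires 0 < theta < 1.
    """
    log_p = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
             + k * math.log(theta) + (n - k) * math.log(1.0 - theta))
    return math.exp(log_p)


def rescaled_variance(n: int, theta: float) -> float:
    """Exact variance of sqrt(n) * (sample mean - theta)."""
    return sum(binomial_pmf(n, k, theta) * n * (k / n - theta) ** 2
               for k in range(n + 1))


def cdf_gap(n: int, theta: float, t: float) -> float:
    """|P[sqrt(n)(mean - theta) <= t] - Phi(t / sqrt(theta (1 - theta)))|."""
    exact = sum(binomial_pmf(n, k, theta) for k in range(n + 1)
                if math.sqrt(n) * (k / n - theta) <= t)
    std = math.sqrt(theta * (1.0 - theta))
    gaussian = 0.5 * (1.0 + math.erf(t / (std * math.sqrt(2.0))))
    return abs(exact - gaussian)


if __name__ == "__main__":
    theta = 0.3
    print(rescaled_variance(50, theta), theta * (1.0 - theta))
    print(cdf_gap(400, theta, 0.11))
```

The variance identity holds for every $n$, while the distribution function gap shrinks only as $n$ grows, which is the content of the asymptotic normality statement.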
As we will see, these issues can be overcome by adopting a more modern perspective on asymptotic statistics, provided by the technique of local asymptotic normality [27,37]. Instead of analysing particular estimation problems, the idea is to consider the structure of the statistical model underlying the data and to approximate it by a simpler model for which the statistical problems are easy to solve. In order to obtain a non-trivial limit model it makes sense to rescale the parameters according to their uncertainty, so we assume that $\theta$ is localised in a region of size $n^{-1/2}$ and we write $\theta = \theta_0 + h/\sqrt{n}$, with $\theta_0$ known and $h \in \mathbb{R}^k$ the local parameter to be estimated. Such an assumption does not restrict the generality of the problem, since one can use an adaptive two-step procedure where a rough estimate $\theta_0$ is obtained in the first step using a small part of the sample, and the rest is used for the accurate estimation of the local parameter $h$. Local asymptotic normality means that the sequence of (local) statistical models depending 'smoothly' on $h$ converges to the Gaussian shift model, where we observe a single Gaussian variable with mean $h$ and fixed and known variance. The convergence has a precise mathematical definition in terms of the Le Cam distance between two statistical models, which quantifies the extent to which each model can be 'simulated' by randomising data from the other.

Definition 3.1. A positive linear map $T : L^1(\mathcal{X}) \to L^1(\mathcal{Y})$ is called a stochastic operator (or randomisation) if $\|T(p)\|_1 = \|p\|_1$ for every $p \in L^1_+(\mathcal{X})$.

For simplicity we consider only dominated models, for which all distributions have densities with respect to some fixed reference distribution. In this case a randomisation is the classical analogue of a quantum channel.

Definition 3.2. Let $\mathcal{P} := \{P_\theta : \theta \in \Theta\}$ and $\mathcal{Q} := \{Q_\theta : \theta \in \Theta\}$ be two dominated statistical models with distributions having probability densities $p_\theta := dP_\theta/dP$ and $q_\theta := dQ_\theta/dQ$. The deficiencies $\delta(\mathcal{P}, \mathcal{Q})$ and $\delta(\mathcal{Q}, \mathcal{P})$ are defined as

$$\delta(\mathcal{P}, \mathcal{Q}) := \inf_T \sup_{\theta \in \Theta} \| T(p_\theta) - q_\theta \|_1, \qquad \delta(\mathcal{Q}, \mathcal{P}) := \inf_S \sup_{\theta \in \Theta} \| S(q_\theta) - p_\theta \|_1,$$

where the 
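infimum is taken over all randomisations $T, S$. The Le Cam distance between $\mathcal{P}$ and $\mathcal{Q}$ is

$$\Delta(\mathcal{P}, \mathcal{Q}) := \max\left( \delta(\mathcal{P}, \mathcal{Q}), \delta(\mathcal{Q}, \mathcal{P}) \right).$$

With these definitions, local asymptotic normality for i.i.d. parametric models can be formulated as follows.

Theorem 3.3. Let $\mathcal{P}^n := \{P^n_{\theta_0 + h/\sqrt{n}} : \|h\| \leq C\}$ be the sequence of local models for $n$ i.i.d. samples from a sufficiently smooth family $P_\theta$, and let $\mathcal{N} := \{N(h, I(\theta_0)^{-1}) : \|h\| \leq C\}$ be the Gaussian shift model. Then $\lim_{n \to \infty} \Delta(\mathcal{P}^n, \mathcal{N}) = 0$.

This statement can be extended to slowly increasing local neighbourhoods $\|h\| \leq n^{\epsilon}$, with a precise convergence rate for the Le Cam distance.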

Local asymptotic normality in quantum statistics
We will now describe the quantum version of local asymptotic normality for the simplest case of a family of spin states. The general result, valid for arbitrary finite dimensional systems, can be found in [25].

We are given $n$ spins, independently and identically prepared in the state

$$\rho_{\vec{r}} = \frac{1}{2}\left( 1 + \vec{r} \cdot \vec{\sigma} \right),$$

where $\vec{r}$ is the unknown Bloch vector of the state and $\vec{\sigma} = (\sigma_x, \sigma_y, \sigma_z)$ are the Pauli matrices in $M(\mathbb{C}^2)$. Following the methodology of the previous section, we concentrate on the structure of the statistical model itself rather than on optimal state estimation. The latter, and other statistical problems, can be solved easily once the convergence to a Gaussian model is established.

By measuring a small proportion $n^{1-\epsilon} \ll n$ of the systems we can devise an initial rough estimator $\rho_0 := \rho_{\vec{r}_0}$, so that with high probability the state is in a ball of size $n^{-1/2+\epsilon}$ around $\rho_0$ [23]. We label the states in this ball by the local parameter $\vec{u}$ and define the local statistical model by

$$\mathcal{Q}_n := \{ \rho^n_{\vec{u}} : \|\vec{u}\| \leq n^{\epsilon} \}, \qquad \rho^n_{\vec{u}} := \rho_{\vec{u}/\sqrt{n}}^{\otimes n}, \qquad \rho_{\vec{u}/\sqrt{n}} := \frac{1}{2}\left( 1 + (\vec{r}_0 + \vec{u}/\sqrt{n}) \cdot \vec{\sigma} \right). \tag{12}$$

By choosing a coordinate system $(\vec{a}_1, \vec{a}_2, \vec{a}_3)$ with $\vec{a}_3$ along $\vec{r}_0$ and writing $\vec{u} = u_1 \vec{a}_1 + u_2 \vec{a}_2 + u_3 \vec{a}_3$, we observe that $\rho_{\vec{u}/\sqrt{n}}$ is essentially obtained by perturbing the eigenvalues of $\rho_0$ by $u_3/2\sqrt{n}$ and rotating it with a 'small' unitary whose rotation angles are of order $u_1/\sqrt{n}$ and $u_2/\sqrt{n}$.

The splitting into 'classical' and 'quantum' parameters $u_3$ and $(u_1, u_2)$ can be intuitively explained through the 'big Bloch sphere' picture commonly used to describe spin coherent [38] and spin squeezed states [39]. Let

$$L_j := \sum_{i=1}^{n} \sigma_j^{(i)}, \qquad \sigma_j := \vec{a}_j \cdot \vec{\sigma}, \qquad j = 1, 2, 3,$$

be the collective spin components along the directions $\vec{a}_j$. By the Central Limit Theorem, the distributions of $L_j$ with respect to $\rho_0^{\otimes n}$ converge as

$$\frac{1}{\sqrt{n}}\left( L_3 - n r_0 \right) \to N(0, 1 - r_0^2), \qquad \frac{1}{\sqrt{n}} L_{1,2} \to N(0, 1),$$

with $r_0 := \|\vec{r}_0\|$, so that the joint spin state can be pictured as a vector of length $n r_0$ whose tip has a Gaussian blob of size $\sqrt{n}$ representing the uncertainty in the collective variables (see Figure 3.2). Furthermore, by a law of large numbers heuristic we estimate the commutators

$$\left[ \frac{L_1}{\sqrt{n}}, \frac{L_2}{\sqrt{n}} \right] = 2i \frac{L_3}{n} \approx 2 i r_0 \mathbf{1}, \qquad \left[ \frac{L_{1,2}}{\sqrt{n}}, \frac{L_3}{\sqrt{n}} \right] \approx 0 .$$

This suggests that $L_1/\sqrt{2 r_0 n}$ and $L_2/\sqrt{2 r_0 n}$ converge to the canonical coordinates $Q$ and $P$ of a quantum harmonic oscillator in a thermal equilibrium state

$$\Phi := (1 - s) \sum_{k=0}^{\infty} s^k |k\rangle\langle k|, \qquad s = \frac{1 - r_0}{1 + r_0},$$

where $\{|k\rangle : k \geq 0\}$ represents the Fock basis. Moreover, the (rescaled) component $(L_3 - n r_0)/\sqrt{n}$ converges to a classical Gaussian variable $X \sim N := N(0, 1 - r_0^2)$ which is independent of the quantum state. Note that the Gaussian limit state has both quantum and classical components and should be identified with the state $\Phi \otimes N$ on the von Neumann algebra $\mathcal{B}(\ell^2(\mathbb{N})) \otimes L^\infty(\mathbb{R})$.

What is the Gaussian state when the spins are in the 'perturbed' state $\rho^n_{\vec{u}}$? By applying the same argument we obtain that the variables $Q, P, X$ pick up expectations which (in the first order in $n^{-1/2}$) are proportional to the local parameters $(u_1, u_2, u_3)$, while the variances remain unchanged. More precisely, the oscillator is in a displaced thermal equilibrium state $\Phi_{\vec{u}}$ with means $\langle Q \rangle = u_1/\sqrt{2 r_0}$ and $\langle P \rangle = u_2/\sqrt{2 r_0}$,
and the classical component has distribution $N_{\vec{u}} := N(u_3, 1 - r_0^2)$.

Definition 3.4. The quantum Gaussian shift model $\mathcal{G}$ is defined by the family of quantum-classical states

$$\mathcal{G} := \{ \Phi_{\vec{u}} \otimes N_{\vec{u}} : \vec{u} \in \mathbb{R}^3 \}. \tag{13}$$

Having defined the sequence of local models $\mathcal{Q}_n$ and the Gaussian shift model, we need to define the quantum counterparts of randomisations and convergence of models. The natural analogue of a classical randomisation is a quantum channel, i.e. a completely positive, trace preserving map $C : \mathcal{T}_1(\mathcal{H}) \to \mathcal{T}_1(\mathcal{K})$, where $\mathcal{T}_1(\mathcal{H})$ represents the trace class operators on $\mathcal{H}$. However, as we saw above, a sequence of quantum statistical models may converge to a quantum-classical one. The mathematical framework covering randomisations of both classical and quantum statistical models is that of von Neumann algebras and channels between their preduals. In finite dimensions this simply means that we deal with channels between block diagonal matrix algebras. We can now define the Le Cam distance between two quantum models in the same way as in Definition 3.2, with classical randomisations replaced by quantum ones and $\|\cdot\|_1$ representing the norm on the predual, which is the trace norm in the case of density matrices.
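The matching of variances between the collective spins and the limiting oscillator can be verified numerically. The sketch below rests on two assumptions that are labelled as such: the thermal state has geometric Fock populations $(1 - s)s^k$ with $s = (1 - r_0)/(1 + r_0)$, and the canonical coordinate is normalised so that $\langle k | Q^2 | k \rangle = k + 1/2$ (that is, $[Q, P] = i$). Under these assumptions the variance of $Q$ should equal $1/(2 r_0)$, which is the variance of $L_1/\sqrt{2 r_0 n}$ in the product state.

```python
import numpy as np


def thermal_populations(r0: float, cutoff: int = 2000) -> np.ndarray:
    """Fock-basis populations (1 - s) s^k of the assumed thermal state.

    s = (1 - r0) / (1 + r0) is an assumption of this sketch; the Fock space
    is truncated at `cutoff` levels, which is harmless while s**cutoff is tiny.
    """
    s = (1.0 - r0) / (1.0 + r0)
    k = np.arange(cutoff)
    return (1.0 - s) * s ** k


def quadrature_variance(r0: float, cutoff: int = 2000) -> float:
    """Variance of Q in the thermal state, using <k|Q^2|k> = k + 1/2."""
    populations = thermal_populations(r0, cutoff)
    k = np.arange(cutoff)
    return float(np.sum(populations * (k + 0.5)))


def collective_spin_variance(r0: float) -> float:
    """Variance of L_1 / sqrt(2 r0 n) in the product state: n * 1 / (2 r0 n)."""
    return 1.0 / (2.0 * r0)


if __name__ == "__main__":
    for r0 in (0.2, 0.5, 0.9):
        print(r0, quadrature_variance(r0), collective_spin_variance(r0))
```

As $r_0 \to 1$ the populations concentrate on the vacuum and the variance tends to $1/2$, the pure-state limit.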
Theorem 3.5. Let $\mathcal{Q}_n$ be the sequence of statistical models (12) for $n$ i.i.d. local spin states, and let $\mathcal{G}_n$ be the restriction of the Gaussian shift model (13) to the range of parameters $\|\vec{u}\| \leq n^{\epsilon}$. Then

$$\lim_{n \to \infty} \Delta(\mathcal{Q}_n, \mathcal{G}_n) = 0,$$

i.e. there exist sequences of channels $T_n$ and $S_n$ such that

$$\lim_{n \to \infty} \sup_{\|\vec{u}\| \leq n^{\epsilon}} \left\| \Phi_{\vec{u}} \otimes N_{\vec{u}} - T_n(\rho^n_{\vec{u}}) \right\|_1 = 0, \qquad \lim_{n \to \infty} \sup_{\|\vec{u}\| \leq n^{\epsilon}} \left\| \rho^n_{\vec{u}} - S_n(\Phi_{\vec{u}} \otimes N_{\vec{u}}) \right\|_1 = 0,$$

where $\Phi_{\vec{u}} \otimes N_{\vec{u}}$ denotes the state of the Gaussian shift model at $\vec{u}$.

To conclude this section we would like to make a few comments on the significance of the above result. The first point is that, although it was intuitively illustrated using the Central Limit Theorem, the concept of local asymptotic normality provides a stronger characterisation of the 'Gaussian approximation'. Indeed, the convergence in Theorem 3.5 is strong (in $L^1$) rather than weak (in distribution), it is uniform over a range of local parameters rather than at a single point, and it has an operational meaning based on quantum channels. Secondly, one can exploit these features to devise asymptotically optimal measurement strategies for state estimation and prove that the Holevo bound [10] is asymptotically attainable [40]. Thirdly, the result can be applied to other quantum statistical problems involving i.i.d. qubit states, such as cloning, teleportation benchmarks and quantum learning, and can serve as a mathematical framework for analysing quantum state transfer protocols.

Local formulation of the classification problem
In this section we reformulate the problem of quantum state classification in the 'local' set-up. This allows us to replace, on the one hand, the excess error probability by a quadratic form in local parameters and, on the other hand, the training set consisting of i.i.d. spins by a simpler Gaussian shift model.

Throughout the section we restrict to the case where the priors $\pi_0, \pi_1$ are known. In Section 5.2 we show that the results for known priors can easily be extended to unknown ones by simply estimating them from the counts of $\rho$ and $\sigma$ states in the training sample.

The loss function
Recall that the classification problem is to discriminate between two unknown states ρ and σ by learning from a training set of n labelled systems prepared randomly in one of the states with probabilities π 0 and π 1 .For this we measure the training set and produce an outcome which is itself a measurement M n := ( P n , 1 − P n ) on C 2 .The accuracy of the procedure is measured by the excess risk (6): with Since any binary measurement is a mixture of projective POVM's [41], we can assume without loss of generality that P n is a projection and pull back the randomness into the definition of the training set measurement.As explained in section 3.2, the a priori unknown states ρ and σ can be localised with high probability in n −1/2+ neighbourhoods of ρ 0 and σ 0 by sacrificing a small proportion of the training set systems; this means that ρ 0 and σ 0 are known and can be used by the classification procedure.Let r 0 and s 0 be the Bloch vectors of ρ 0 and σ 0 and let us parametrise their neighbourhoods as follows Let P 0 := [π 0 ρ 0 − π 1 σ 0 ] + be the optimal projection corresponding to the pair (ρ 0 , σ 0 ) and note that it can have dimension one, or it can be zero or identity.In the second case, the optimal measurement is trivial, one can guess the state without measuring by checking whether the operator π 0 ρ 0 − π 1 σ 0 is positive or negative.for some c > 0.
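Proof. Note that the inequality is satisfied only if $\pi_0 \neq \pi_1$, and it implies that $\pi_0\rho_0 - \pi_1\sigma_0$ is a positive or negative operator, depending on the sign of $\pi_0 - \pi_1$.
Since both eigenvalues of $\pi_0\rho_0 - \pi_1\sigma_0$ are non-zero, there exists a constant $\eta > 0$ such that $\|A - (\pi_0\rho_0 - \pi_1\sigma_0)\| \leq \eta$ implies that $A$ is also a positive or negative operator. In fact, when $n$ is large enough, all $\pi_0\rho_{u/\sqrt{n}} - \pi_1\sigma_{v/\sqrt{n}}$ with $\|u\|, \|v\| \leq n^{\epsilon}$ have this property for some other constant $\eta$. Consider a simple measurement on the training set where the states are measured separately in the three bases of the Pauli matrices and the outcome averages are used to construct estimators of the states $\rho_{u/\sqrt{n}}$ and $\sigma_{v/\sqrt{n}}$. Then, by basic concentration inequalities, these estimators deviate from the true states by more than $\eta$ only with exponentially small probability, which means that, up to an exponentially small error probability, the plug-in estimator of $[\pi_0\rho_{u/\sqrt{n}} - \pi_1\sigma_{v/\sqrt{n}}]_+$ will be equal to $P^*$, which is zero or the identity.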
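To make the concentration step concrete, the following minimal sketch evaluates a Hoeffding-type bound for the Pauli-basis averages. It is only an illustration: the use of Hoeffding's inequality, the equal split of $n$ copies over the three bases, and the names `pauli_tail_bound` and `trivial_case_failure_bound` are assumptions of this sketch, not the inequality used in the proof.

```python
import math

def pauli_tail_bound(m, t):
    """Hoeffding bound on P(|sample mean - true mean| >= t) for m i.i.d. +/-1 outcomes."""
    return min(1.0, 2.0 * math.exp(-m * t * t / 2.0))

def trivial_case_failure_bound(n, eta):
    """Union bound on the probability that an estimated Bloch vector is off by more than eta.

    Assumption of this sketch: n copies of one state are split equally over the
    three Pauli bases, and each Bloch component is allowed an error eta/sqrt(3),
    so that all three components within tolerance implies a Euclidean error <= eta.
    """
    m = n // 3
    return min(1.0, 3.0 * pauli_tail_bound(m, eta / math.sqrt(3.0)))

# The bound decays exponentially in n for fixed eta.
for n in (300, 3000, 30000):
    print(n, trivial_case_failure_bound(n, 0.1))
```

For fixed $\eta$ the bound is of the form $6\exp(-n\eta^2/18)$, which is the exponential decay invoked in the proof, up to constants.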
From now on we will work under the assumption that $\|\pi_0 r_0 - \pi_1 s_0\| > |\pi_0 - \pi_1|$, so that $P_0 := [\pi_0\rho_0 - \pi_1\sigma_0]_+$ is a one-dimensional projection whose Bloch vector is $p_0 = d_0/\|d_0\|$, with $d_0 := \pi_0 r_0 - \pi_1 s_0$. The Helstrom projection $P^*$ for the pair of unknown states $(\rho, \sigma)$ has Bloch vector $p^* = d/\|d\|$, where $z := \pi_0 u - \pi_1 v$ is a relative parameter and $d := d_0 + z/\sqrt{n}$. As discussed before, we can take the estimator $M_n$ to be a projective measurement $M_n := (P_n, \mathbf{1} - P_n)$, so to minimise the risk (15) we aim at producing an estimator $P_n$ which is close to $P^*$. Since the latter is obtained by rotating $P_0$ by an angle of order $n^{-1/2+\epsilon}$, we can assume without loss of generality that $P_n$ has a Bloch vector $\hat{p}_n$ which is a small rotation of $p_0$, parametrised by a vector $\hat{z}_n = O(n^{\epsilon})$ in the plane orthogonal to $p_0$.
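The three possibilities for $P_0$ can be read off from the Bloch representation $\pi_0\rho_0 - \pi_1\sigma_0 = \frac{1}{2}[(\pi_0 - \pi_1)\mathbf{1} + d_0\cdot\vec{\sigma}]$ with $d_0 = \pi_0 r_0 - \pi_1 s_0$, whose eigenvalues are $\frac{1}{2}[(\pi_0 - \pi_1) \pm \|d_0\|]$. The sketch below is a small illustration of this case distinction; the function name `helstrom_projection`, the string labels and the numerical tolerance are choices made here for illustration only.

```python
import math

def helstrom_projection(pi0, r0, s0, tol=1e-12):
    """Classify P0 = [pi0*rho0 - pi1*sigma0]_+ for qubit states given by Bloch vectors.

    Uses pi0*rho0 - pi1*sigma0 = 0.5*((pi0 - pi1)*I + d0.sigma) with
    d0 = pi0*r0 - pi1*s0, whose eigenvalues are 0.5*((pi0 - pi1) +/- |d0|).
    Returns (kind, bloch_vector); bloch_vector is None in the trivial cases.
    """
    pi1 = 1.0 - pi0
    d0 = tuple(pi0 * a - pi1 * b for a, b in zip(r0, s0))
    norm_d0 = math.sqrt(sum(c * c for c in d0))
    lam_min = 0.5 * ((pi0 - pi1) - norm_d0)
    lam_max = 0.5 * ((pi0 - pi1) + norm_d0)
    if lam_min >= -tol:   # positive operator: always guess rho
        return "identity", None
    if lam_max <= tol:    # negative operator: always guess sigma
        return "zero", None
    # One positive and one negative eigenvalue: rank-one projection along d0.
    return "rank-one", tuple(c / norm_d0 for c in d0)

print(helstrom_projection(0.5, (0.0, 0.0, 0.8), (0.0, 0.0, -0.4)))
print(helstrom_projection(0.9, (0.0, 0.0, 0.5), (0.5, 0.0, 0.0)))
```

The rank-one case occurs exactly when $\|d_0\| > |\pi_0 - \pi_1|$; otherwise the classifier reduces to guessing the a priori more favourable label.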
Expanding (18) and (19) in powers of $n^{-1/2}$ and plugging the resulting expressions back into (15), taking into account that $\hat{z}_n$ is perpendicular to $d_0$, we obtain an excess risk which is, to leading order, of order $n^{-1}$ and quadratic in the difference between $\hat{z}_n$ and $z_\perp$, where $z_\perp$ is the projection of $z$ onto the plane orthogonal to $d_0$.
It is clear now that the rate of convergence of the excess risk (15) is $n^{-1}$, so it is meaningful to optimise the quantity $n R^{(l)}_{\max}(M_n)$, and the contribution coming from the $o(n^{-1})$ term can be dropped.
Since $M_n$ is uniquely determined by $\hat{z}_n$ through (19), we define the loss function (20) for the measurement on the training set as the corresponding quadratic form in the local variables, together with the associated renormalised risk. In conclusion, we need to find the optimal measurement strategy on the training set with respect to the above quadratic form of the local parameters.
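The $n^{-1}$ scaling and the quadratic form can be checked numerically. For a projection with Bloch vector $p$ the error probability is $\frac{1}{2}(1 - d\cdot p)$ with $d = \pi_0 r - \pi_1 s$, so the excess risk is $\frac{1}{2}(\|d\| - d\cdot p)$ in the rank-one case. The sketch below assumes the parametrisation $\hat{p}_n = (d_0 + \hat{z}_n/\sqrt{n})/\|d_0 + \hat{z}_n/\sqrt{n}\|$, which is a hypothetical stand-in for (19); under this assumption a second-order expansion gives the limit $\|z_\perp - \hat{z}_n\|^2/(4\|d_0\|)$ for $n$ times the excess risk. The function names and the numerical values are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def comb(alpha, a, beta, b):
    """Return alpha*a + beta*b for 3-vectors."""
    return tuple(alpha * x + beta * y for x, y in zip(a, b))

def project_orthogonal(w, d0):
    """Component of w in the plane orthogonal to d0."""
    return comb(1.0, w, -dot(w, d0) / dot(d0, d0), d0)

def scaled_excess_risk(n, pi0, r0, s0, u, v, z_hat):
    """n times the excess risk of the projection with Bloch vector p_hat."""
    pi1 = 1.0 - pi0
    d0 = comb(pi0, r0, -pi1, s0)
    z = comb(pi0, u, -pi1, v)
    d = comb(1.0, d0, 1.0 / math.sqrt(n), z)
    # Assumed parametrisation of the estimated projection (hypothetical form of (19)).
    q = comb(1.0, d0, 1.0 / math.sqrt(n), project_orthogonal(z_hat, d0))
    p_hat = tuple(x / norm(q) for x in q)
    # Error probability for Bloch vector p is 0.5*(1 - d.p); the optimum is 0.5*(1 - |d|).
    return n * 0.5 * (norm(d) - dot(d, p_hat))

def quadratic_limit(pi0, r0, s0, u, v, z_hat):
    """Leading-order value |z_perp - z_hat|^2 / (4 |d0|) under the assumption above."""
    pi1 = 1.0 - pi0
    d0 = comb(pi0, r0, -pi1, s0)
    z_perp = project_orthogonal(comb(pi0, u, -pi1, v), d0)
    diff = comb(1.0, z_perp, -1.0, project_orthogonal(z_hat, d0))
    return dot(diff, diff) / (4.0 * norm(d0))

args = (0.5, (0.6, 0.0, 0.3), (-0.2, 0.0, -0.5),
        (1.0, 0.5, -0.3), (-0.4, 0.2, 0.7), (0.3, -0.2, 0.1))
for n in (10**4, 10**6, 10**8):
    print(n, scaled_excess_risk(n, *args), quadratic_limit(*args))
```

As $n$ grows the two printed columns should agree up to corrections of relative order $n^{-1/2}$, consistent with dropping the $o(n^{-1})$ term.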

The training set
To solve the above problem we employ the machinery of local asymptotic normality. As before, let $\rho$ and $\sigma$ be states in local neighbourhoods of $\rho_0$ and $\sigma_0$, respectively, described by (16). We write their local Bloch vectors $(u, v)$ as $u = u_1 a_1 + u_2 a_2 + u_3 a_3$ and $v = v_1 b_1 + v_2 b_2 + v_3 b_3$, where $(a_1, a_2, a_3)$ and $(b_1, b_2, b_3)$ are two coordinate systems which satisfy the following conditions (see Figure 4): (i) $a_3$ is parallel to $r_0$; (ii) $b_3$ is parallel to $s_0$; (iii) $a_1, b_1$ are in the plane $(r_0, s_0)$; (iv) $a_2 = b_2$ is perpendicular to the plane $(r_0, s_0)$.
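Conditions (i)-(iv) determine the two frames up to sign choices. The following sketch builds one admissible pair of frames from $r_0$ and $s_0$; the orientation conventions (taking $a_2$ along $r_0 \times s_0$ and completing each frame right-handedly) and the name `local_frames` are assumptions made for illustration, and may differ from the conventions of Figure 4.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(a):
    length = math.sqrt(dot(a, a))
    return tuple(x / length for x in a)

def local_frames(r0, s0):
    """Return ((a1, a2, a3), (b1, b2, b3)) satisfying conditions (i)-(iv).

    Sign conventions are an assumption of this sketch.
    """
    normal = cross(r0, s0)
    if dot(normal, normal) == 0.0:
        raise ValueError("r0 and s0 must not be parallel")
    a2 = unit(normal)                      # (iv): common axis normal to the plane
    a3, b3 = unit(r0), unit(s0)            # (i), (ii)
    a1, b1 = cross(a2, a3), cross(a2, b3)  # (iii): in-plane, orthogonal to a3 / b3
    return (a1, a2, a3), (b1, a2, b3)

frame_a, frame_b = local_frames((0.6, 0.0, 0.3), (-0.2, 0.0, -0.5))
print(frame_a)
print(frame_b)
```

In this parametrisation $u_3$ and $v_3$ change the lengths of the Bloch vectors (the eigenvalues of the states), while the remaining components rotate them.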
With these notations the local statistical model for the training set is denoted $T_n$ and the corresponding Gaussian shift model (22) is denoted $G_n$; here $\Phi(q, p, v)$ is a displaced thermal equilibrium state with means $(q, p)$ and variance $v$.
The following technical lemma shows that local asymptotic normality can be used to transfer the problem of optimal classification from a training set consisting of qubits to a Gaussian one. The arguments are rather standard though tedious, and since the same method has been used for finding the optimal estimation procedure for qubits [24], we refer to that paper for the proof.
Lemma 4.2. Consider the problems of finding asymptotically optimal strategies for the models $T_n$ and $G_n$, respectively, with respect to the loss function (20). Then the local minimax risks of both problems converge to the same constant, which is the minimax risk of the unrestricted Gaussian shift model G (2) .
In conclusion, the measurement of the training set should be aimed at optimally estimating the two-parameter vector $z_\perp$ directly, rather than using a 'plug-in' strategy where the three-dimensional local parameters $(u, v)$ are first (optimally) estimated and then the measurement $P_n$ is constructed as in (8). We will come back to this point later on, when the two methods are compared.

Optimal classifier
In this section we formulate our main result characterising the asymptotically optimal measurement on the training set and derive the expression of the optimal excess risk. Summarising the previous section, we transformed the original problem into a parameter estimation one for the Gaussian shift model (22) with parameters $(u, v) \in \mathbb{R}^3 \times \mathbb{R}^3$. The parameter to be estimated, $z_\perp \in \mathbb{R}^2$, is a linear transformation of $(u, v)$. Since the local parameters contain both classical and quantum components, it is convenient to express the loss function $L((u, v), \hat{z})$ in terms of these components. Let $(p_0, l_0, k_0)$ be the reference frame with $l_0$ in the plane $(r_0, s_0)$. Denote by $\varphi_0, \varphi_1$ the angles between $(r_0, l_0)$ and $(s_0, l_0)$, respectively (see Figure 4). Then $z_\perp = z_l l_0 + z_k k_0$, with
$z_l = (\pi_0 \cos\varphi_0\, u_3 - \pi_1 \cos\varphi_1\, v_3) + (\pi_0 \sin\varphi_0\, u_1 + \pi_1 \sin\varphi_1\, v_1) := z_l^{(c)} + z_l^{(q)}$,
where $z_l$ was split into a contribution coming from the 'classical' parameters $(u_3, v_3)$ and another one from the 'quantum' parameters $(u_1, v_1)$. Since the classical and quantum parts of the Gaussian model are independent, it is easy to verify that the optimal estimator $\hat{z}$ can be written as $\hat{z} = (\hat{z}_l^{(c)} + \hat{z}_l^{(q)})\, l_0 + \hat{z}_k k_0$, where $\hat{z}_l^{(c)}$ is the optimal estimator of $z_l^{(c)}$, and $(\hat{z}_l^{(q)}, \hat{z}_k)$ are optimal estimators of $(z_l^{(q)}, z_k)$ obtained by (jointly) measuring the two quantum Gaussian components.
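The splitting of $z_l$ into classical and quantum contributions can be verified directly from the geometry. The sketch below computes $z_l = z\cdot l_0$ and $z_k = z\cdot k_0$ from the local components and compares them with the sum of the contributions of $(u_3, v_3)$ and $(u_1, v_1)$. The orientation of the frames (axes built from $r_0 \times s_0$) and the name `split_z_perp` are assumptions of this sketch; the signs of the sine terms therefore follow from the dot products $a_1\cdot l_0$, $b_1\cdot l_0$ rather than from the convention of Figure 4.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(a):
    length = math.sqrt(dot(a, a))
    return tuple(x / length for x in a)

def comb(alpha, a, beta, b):
    return tuple(alpha * x + beta * y for x, y in zip(a, b))

def split_z_perp(pi0, r0, s0, u_comp, v_comp):
    """Return (z_l classical part, z_l quantum part, z_k, z.l0, z.k0).

    u_comp = (u1, u2, u3) in the frame (a1, a2, a3); v_comp likewise in (b1, b2, b3).
    Frame orientations are an assumption of this sketch.
    """
    pi1 = 1.0 - pi0
    k0 = unit(cross(r0, s0))              # a2 = b2 = k0, normal to the plane (r0, s0)
    a3, b3 = unit(r0), unit(s0)
    a1, b1 = cross(k0, a3), cross(k0, b3)
    p0 = unit(comb(pi0, r0, -pi1, s0))    # Bloch vector of P0
    l0 = cross(k0, p0)                    # in-plane axis orthogonal to p0
    (u1, u2, u3), (v1, v2, v3) = u_comp, v_comp
    z_l_classical = pi0 * u3 * dot(a3, l0) - pi1 * v3 * dot(b3, l0)
    z_l_quantum = pi0 * u1 * dot(a1, l0) - pi1 * v1 * dot(b1, l0)
    z_k = pi0 * u2 - pi1 * v2
    # Direct computation of z = pi0*u - pi1*v for comparison.
    u = comb(1.0, comb(u1, a1, u2, k0), u3, a3)
    v = comb(1.0, comb(v1, b1, v2, k0), v3, b3)
    z = comb(pi0, u, -pi1, v)
    return z_l_classical, z_l_quantum, z_k, dot(z, l0), dot(z, k0)

print(split_z_perp(0.6, (0.6, 0.0, 0.3), (-0.2, 0.0, -0.5),
                   (1.0, 0.5, -0.3), (-0.4, 0.2, 0.7)))
```

Here `dot(a3, l0)` and `dot(b3, l0)` are the cosines $\cos\varphi_0$ and $\cos\varphi_1$ of the text, so the classical part reproduces the first bracket of $z_l$.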

Plug-in classifier based on optimal state estimation
Here we compute the asymptotics of the renormalised risk of the plug-in classifier based on optimal state estimation. The problem of optimal state estimation for mixed i.i.d. qubits was solved in the asymptotic local minimax setting in [24]. The optimal measurement procedure is adaptive and the first two steps are identical to those of Theorem 5.1: (i) construct rough estimators of $\rho$ and $\sigma$ by measuring $n^{1-\epsilon}$ systems; (ii) transfer the localised spin state by $T_n$ as in Theorem 3.5; (iii) perform separate heterodyne measurements on the modes $(Q_1, P_1)$ and $(Q_2, P_2)$ and observe the classical components to obtain the estimators $\tilde{u}_n$ and $\tilde{v}_n$.
Once the states (local parameters) have been estimated we can classify new states by applying

Figure 2. Quantum learning seen as classical learning with data distribution depending on an additional parameter controlled by the experimenter.

3 .
The sequence of local models (10) converges in the Le Cam distance to the Gaussian shift model (11): $\lim_{n\to\infty} \Delta(P_n, G) = 0$.

Figure 3. Big ball picture of the collective state of identical mixed spins. The total spin is represented as a vector of length $n r_0$ with a 3D uncertainty blob of size $\sqrt{n}$ in the x, y directions and $n(1 - r_0^2)$ in the z direction.

√ n − π 1
σ v/ √ n with u , v ≤ n have this property for some other constant η.Consider a simple measurement on the training set where the states are measured separately in the three bases of the Pauli matrices and the outcomes averages are used to construct a estimators of the states ρ u/ √ n and σ v/ √ n .Then by basic concentration inequalities we get

l
, z k ) obtained by (jointly) measuring the two quantum Gaussian components.The excess risk can

Figure 4. Bloch ball geometry of the learning problem. The unknown states are localised in the two yellow balls centred at $r_0$ and $s_0$ and have local vectors $u/\sqrt{n}$ and $v/\sqrt{n}$ coloured in purple. The three reference systems $(a_1, a_2, a_3)$, $(b_1, b_2, b_3)$ and $(p_0, l_0, k_0)$ are coloured in red. The green equatorial plane is orthogonal to $p_0$ and contains the estimator $\hat{z}_n$ and the vector to be estimated $z_\perp$ (coloured in purple).

Table 1. Comparison of classical and quantum learning.