Quantum mixing of Markov chains for special distributions

The preparation of the stationary distribution of irreducible, time-reversible Markov chains (MCs) is a fundamental building block in many heuristic approaches to algorithmically hard problems. It has been conjectured that quantum analogs of classical mixing processes may offer a generic quadratic speed-up in realizing such stationary distributions. Such a speed-up would also imply a speed-up of a broad family of heuristic algorithms. However, a true quadratic speed up has thus far only been demonstrated for special classes of MCs. These results often presuppose a regular structure of the underlying graph of the MC, and also a regularity in the transition probabilities. In this work, we demonstrate a true quadratic speed-up for a class of MCs where the restriction is only on the form of the stationary distribution, rather than directly on the MC structure itself. In particular, we show efficient mixing can be achieved when it is known beforehand that the distribution is monotonically decreasing relative to a known order on the state space. Following this, we show that our approach extends to a wider class of distributions, where only a fraction of the shape of the distribution is known to be monotonic. Our approach is built on the Szegedy-type quantization of transition operators.


I. INTRODUCTION
Quantum walks have, amongst other reasons, been long investigated for their capacity to speed up mixing processes -that is, speeding up the task of preparing stationary distributions of a given Markov chain (MC).Efficient mixing is a much coveted property in the context Markov Chain Monte Carlo (MCMC) approaches to many algorithmic methods for hard combinatorial problems and problems arising in statistical physics [1].In the case of time-reversible Markov chains it is well-known that the bound on mixing times is tight relative to the spectral gap δ of the Markov chain -both the lower and the upper bounds on the (approximate) mixing times are proportional to 1/δ, whereas other quantities (e.g. the allowed error of the approximation) appear only as logarithmic factors.Improvements in attaining target distributions then, in the classical case, always stem from additional constructions: e.g. by utilizing sequences of slowly evolving MCs in simulated annealing, or by using alternative MCs which mix faster toward the same distribution (e.g.graph lifting [2]).Such approaches are not proven to help generically, and their utility is established, essentially, on a case-by-case basis.
However, even without resorting to additional structures it is possible that here quantum mechanics may help generically.By employing quantum analogs of transition operators of MCs, speed-ups of mixing times already been proven [3,4] in the cases where the underlying transition graph corresponds to periodic lattices and the torus.In these works, the quantum operator employed was U t = exp(−iD P t), where D P is the discriminant operator of the (time-reversible) transition operator P , which is equal to P itself in the cases when the stationary distribution of P is uniform.The exact definition of D P will be given later.These results contribute towards a working conjecture that quantum transition operators may offer a generic quadratic speed-up in mixing times [3].
Alternative approaches to quantum mixing, based on the Szegedy quantum transition operator, have been already proposed by Richter 1 , based on observations by Childs [3,5].In particular, it has been observed that so-called hitting algorithms, which attempt to find a particular element in the state space, by starting from the stationary distribution of a MC, and using the transition operator, may be run in reverse to realize a mixing algorithm.However, to our knowledge, these approaches were not pursued further due to their inefficiency.In such an approach, the mixing time has a prohibitive dependance on the probabilities occurring in the stationary 0 , such that i π i = 1.Irreducible and aperiodic MCs mix: a sequential application of P onto any initial state in the limit yields the stationary distribution.More precisely, it holds that lim t→∞ P t σ = π, for all initial distributions σ.This property is sometimes referred to as the fundamental theorem of Markov chains.
In this work we will focus on time-reversible, irreducible and aperiodic Markov chains.A Markov chain P with stationary distribution π is said to be time-reversible if it satisfies detailed balance: More generally, for an irreducible, aperiodic Markov chain P, over the state space of size N with stationary distribution π, we define the time-reversed Markov chain P * with P * = M (π)P T M (π) −1 , where M is the diagonal matrix M = diag(π 1 , . . ., π N ). 3 Then, P is time-reversible if P = P * .The discriminant matrix D P is then defined with D P = M 1/2 (π)P T M (π) −1/2 , and it can be shown that it is always symmetric for timereversible Markov chains.Since a time-reversible transition matrix P is similar to a symmetric matrix, its eigenvalues are real, and also by the Perron-Frobenius theorem they are less or equal to 1 in absolute value (value 1 being reserved for the stationary distribution which is also the +1 eigenvector).If λ 2 denotes the second largest eigenvalue (in absolute value) then with δ = 1 − |λ 2 | we will denote the spectral gap of the Markov chain P .
Next, the mixing time τ (ǫ), within error ǫ, for P is defined as: where D(π, σ) denotes the total variational distance on distributions π, σ, so D(π, σ) = 1/2 j |π j − σ j |, or, more generally, the trace distance on density matrices π, σ: The mixing time (for the MC P , with stationary distribution π) has a tight bound, in the time-reversible case, proven by Aldous in [10], but which we present in a more detailed form derived from [11]: for π min = min i π i .
For more details on Markov chains, we refer the reader to [12].Next, we present the basic elements of Szegedy-style approaches to quantum walks.Part of the presentation follows the approach given in [7].
Szegedy-style quantum walks can be viewed as walks over a bipartite graph, realized by duplicating the graph of the original MC, specified by the transition matrix P. The basic building block is a diffusion operator U P 4 which acts on two quantum registers of N states and satisfies It is easy to see that U P establishes a walk step from the first copy of the original graph to the second.The operator U P , and its adjoint are then used to construct the following operator: where  ).In [6,7] it was shown that the operator W (P ) and P are closely related, in particular in the case P is time-reversible, which we clarify next.Given a distribution π, we will denote the coherent encoding of the distribution π with |π = N i=1 √ π i |i .For us it is convenient to define a one-step diffused version of the encoding above, specific to a particular Markov chain: |π ′ = U P |π I ⊗ |0 II , where U P is the Szegedy diffusion operator.It is easy to see that |π and |π ′ are trivially related via the diffusion map (more precisely, the isometry |π → U P |π ⊗ |0 ) and moreover that the computational measurement of the first register of |π ′ also recovers the distribution π.By abuse of notation, we shall refer to both encodings as the coherent encoding of the distribution π, and denote them both |π , where the particular encoding will be clear from context.Next, we clarify the relationship between the classical transition operator P and the Szegedy walk operator W (P ).Let π be the stationary distribution of P so P π = π.Then the coherent encoding of the stationary distribution π of P , given with |π = U P i √ π i |i |0 , is also a +1 eigenstate of W (P ), so W (P ) |π = |π .Moreover, on the subspace A + B, so-called busy subspace, it is the unique +1 eigenstate.On the orthogonal complement of the busy subspace, W (P ) acts as the identity.Moreover, the spectrum of P and W (P ) is intimately related, and in particular the spectral gap δ is quadratically smaller than the phase gap where θ denote the arguments of the eigenvalues, i.e. eigenphases, of W (P ).In other words, we have that This relationship is at the very basis of all speed up which stems from employing Szegedy-style quantum walks.We refer the reader to [6,7] for further details on Szegedy-style quantum walks.
A useful central tool in the theory of Szegedy-style quantum walks is the so-called Approximate Reflection Operator ARO(P ) ≈ 2 |π π| − ½, which approximately reflects over the state |π .The basic idea for the construction is as follows: By applying Kitaev's phase detection algorithm on W (P ) (with precision O(log(∆))), applying a phase flip to all states with phase different from zero, and by undoing the phase detection algorithm, we obtain an arbitrary good approximation of the reflection operator R(P ) = 2 |π π| − ½, for any state within The errors of the approximation can again be efficiently suppressed by iteration (by the same arguments as for the |π measurement) [7], so the cost of the approximate reflection operator is in Õ(1/∆) = Õ(1/ √ δ).Thus, the second gadget in our toolbox is the operator ARO(P ), which approximates a perfect reflection R(P ) on A + B, while incurring a cost of Õ(1/ √ δ) calls to the walk operator W (P ).The ARO(P ), along with the capacity to flip the phase of a chosen subset of the computational basis elements, suffices for the implementation of an amplitude amplification [13] algorithm which allows us to find the chosen elements.To illustrate this, assume we are given the state |π , the (ideal) reflector R(P ), and assume we are interested in finding some set of elements M ⊆ {1, . . ., N }.The subset M is typically specified by an oracular access to a phase flip operator defined withZ M = ½ − 2 i∈S |i i|.The element searching then reduces to iterated applications of Z M R(P ) (which can be understood as a generalized Grover iteration, more precisely amplitude amplification) onto the initial state |π .Let π denote the conditional probability distribution obtained by post-selecting on elements being in M from π, so π = with ǫ = j∈M π j .Let |π = U P i √ πi |i |0 denote the coherent encoding of π.Note that the measurement of the first register of |π outputs an element in M with probability 1.Thus successfully preparing this state implies that we have found a desired element from M .
As it was shown in [7], applications of Z M , and R(P ) maintain the register state in the two-dimensional subspace span({|π , |π }), and moreover, using Õ(1/ √ ǫ) applications of the two reflections will suffice to produce a state |ψ ∈ span{|π , |π }, such that | ψ| π | 2 is a large constant (say above 1/4).Measuring the first register of such a state will result in an element in M with a constant probability, which means that by iterating this process k times ensures an element in M is found with an exponentially increasing probability in k.However, since the state |ψ is also in span{|π , |π }, it is easy to see that the measured state, conditional on being in M, will also be distributed according to π.This observation was used in [14], and also in [8] to produce an element sampled from the truncated stationary distribution π, in time Õ where the δ term stems from the cost of generating the approximate reflection operator ARO(P ), and Õ(1/ √ δ) corresponds to the number of iterations which have to be applied.This is a quadratic improvement relative to using classical mixing, and position checking processes which would result in the same distribution.
However, the same process can be used in reverse to generate the state |π starting from some fixed basis state The resulting state of the reverse process is constantly close to the state |π , which is our target state.This basic idea was already observed in [3]  is the standard fidelity.This follows as the search/unsearch algorithms are in fact amplitude amplification [13] algorithms.
Finally we point out that having a constant-distance approximation of the stationary distribution is effectively all we need.Given an approximation, which is constantly far from |π , an arbitrarily good approximation of distance ǫ can be achieved in time O(1/ √ δ × log(1/ǫ)), by again running phase estimation (iteratively) of W (P ) on the approximate state, and this time measuring the phase-containing register.This |π -projective measurement is described in more detail in [15], and it follows from Theorem 6 in [7] .
We have previously used this idea in [15], to achieve more efficient mixings in the context of slowly evolving sequences of Markov chains, where the initial states were either basis states, or a state encoding the uniform distribution.
In this work, we will take this idea significantly further, by intrinsic properties of monotonically decaying distributions.Let Ω be a finite state space, and let ≤ Ω be a total order on Ω.Then a distribution d is monotonically decaying, relative to the order ≤ Ω , if for its probability mass function In this work we will be representing the state space elements with integers, and the order will be the standard order so a distribution π is monotonically decaying if i ≤ j ⇒ π i ≥ π j , monotonically increasing if i ≤ j ⇒ π i ≤ π j .Finally, a distribution π is strongly unimodal if there exists a k ∈ Ω such that for i ≤ j ≤ k we have π i ≤ π j and for k ≤ i ≤ j we have π i ≥ π j .

III. MAIN RESULT
The results of the previous section already establish that if we have an (perhaps oracular) access to good approximations of the targeted stationary distribution, arbitrarily good mixing by unsearching is efficient.Here, we will show that, for the case of monotonically decaying distributions we can construct good initial states efficiently, in time O(log(N )), independently from the shape of the distribution.Our first result is given with the following theorem: Theorem 1.Let N = 2 n , n ∈ AE be an integer, and let D ≥ N ⊆ Ê N be the convex space of monotonically decaying distributions over {1, . . ., N }.Then there exists a set of distributions S ⊆ D ≥ N , |S| = n, such that for every π ∈ D ≥ N , there exists ν ∈ S satisfying Proof.We begin by constructing an N -sized set S ′ = {σ i } N i=1 of "ladder" distributions which are extreme points of the convex space D ≥ N .They are defined as: Any distribution π ∈ D ≥ N can be represented as a convex combination of distributions in S ′ as follows: as the parentheses above contain the i th Kronecker-delta distribution.The expression above can be, by reshuffling, restated as where, for 1 ≤ i < N q i = i(π i − π i+1 ) and q N = N π N .Since the distribution π is monotonically decaying, we have that q i ≥ 0, and it is also easy to see that i q i = 1.In other words, q = (q i ) i is a distribution as well.Using the representation above, we can express the distance between π and σ k ∈ S ′ as follows: where the last inequality follows from the strong convexity of the trace distance, and in fact is a strict equality in our case.The distance between individual distributions in S ′ is easy to express explicitly: If we define the matrix V as V ij = min{i, k} max{i, k} , we can express the trace distance between π and σ k as: that is 1 minus the standard inner product between the k th column of V , denoted v k , and the probability vector q which uniquely specifies π.Next, we focus on the log-sized subset S ⊆ S ′ given with S = {σ i |i = 2 k , k = 0, . . ., n − 1}, and establish a lower bound on min q max k (v 2 k ) T q by coarse-graining.This will then yield an upper bound on the distance between an arbitrary decaying distribution π and the set S.
From the definition of the vectors v 2 k , k < n it is easy to see that the following holds: We can see this more generally as, per definition, for v l , l ≤ N/2 = 2 n−1 and l ≤ j ≤ 2l we have that Next, we introduce the coarse-graining operator L, mapping real vectors over 2 n elements to real vectors over n + 1 elements: L(q) = (q j ) n+1 j=1 with (16) q1 = q 1 , and for j > 1, qj = It is clear that L also maps distributions to coarse-grained distributions.By Eq. ( 15) we have that where the inequality is taken element-wise.The ancillary vectors w 2 k just capture the positions where the vector v 2 k has entries larger or equal to 1/2, setting those to 1/2 and the rest to zero.But then it follows, for 0 where the inequality is taken element-wise.Next, with ∆ k we denote the Kronecker-delta distribution with unit support only over the k th element, and the inequalities are element-wise, on the coarse-grained n + 1 element space.It holds that To see the first claim, note that (w T 2 0 )q = 1/2(q 1 + q 2 ), whereas ∆ 1 • L(q) = q 1 and ∆ 2 • L(q) = q 2 .For the second inequality, note that the left hand side of the inequality sums up all elements from q which lie between positions 2 k and 2 k+1 (boarder points included), and multiplies it with 1/2.The right hand side of the inequality picks out the (k + 2) nd element of the coarse-grained distribution L(q), which, by definition, sums the entries of q between the same boundaries, but not including the lower boundary.Then we have that max k∈{1,...,n+1} Then, the target min-max expression is also lower bounded by min q max k∈{1,...,n+1} which is easy to evaluate: L(q) is an arbitrary distribution over n + 1 elements, and we are free to optimize the overlap of this distribution with all Kronecker-delta distributions on the same space.The minimum is attained when all the overlaps are the same, so when L(q) is uniform over the space of n + 1 elements, and we have . This also lower bounds our target expression min q max k (v 2 k ) T .q,and proves our claim.In the proof above we have explicitly constructed the log(N ) distributions from the set S. They are the n = log 2 (N ) distributions ν k := σ 2 k , for 0 ≤ k ≤ n − 1 which have uniform support from the first element up to element at the (2 k ) th position.For them to be useful for the quantum mixing algorithm the coherent encodings of these distributions have to have an efficient construction, which is the case.Start by initializing the n−qubit register (sufficient for encoding distributions over the N = 2 n state-space) in the "all-zero" state.Then, the k th distribution is achieved by applying the Hadamard gate to the last k qubits.This realizes the state , which encodes the desired distributions, and the reverse of this process, along with the reflection over the "all zero" state realizes the reflection over ν k efficiently as well.
A few remarks are in order.First, although we have phrased the result for the case when the state space is a power of 2, this is without loss of generality -any decaying distribution over N elements is trivially a decaying distribution over the larger set of ⌈log 2 (N )⌉ elements, where we assign zero probability to the tail of the distribution.The trace distance result remains the same, once the ceiling function is applied to the log term, hence yields the same scaling.
Next, note that the Theorem 1, along with the given simple method for preparing the log-sized set of initial states already yields an efficient algorithm for the preparation of decaying stationary distributions.To see this, assume first we know which distribution ν k out of S minimizes the trace distance, bounded by 1− 1 2(log(N ) + 1) .
If |π is the coherent encoding of the target stationary distribution, by the known inequalities between the trace distance and the fidelity 5 we have that: But then, by the results of section II, we can attain the stationary distribution in time Õ , where the soft-O ( Õ) part suppresses the logarithmically contributing factor stemming from the acceptable error term.However, since we do not know which distribution minimizes the distance, the trivial solution is to sequentially run the algorithm for each one.This yields an overall log(N ) factor, yielding Õ(1/ √ δ) × O(log 2 (N )) as the total complexity.We can do slightly better by encasing the entire procedure of "searching for the correct initial distribution" in a Grover-like search algorithm, more precisely, an amplitude amplification prodecure.
To see this is possible, note that whether or not a particular distribution was the correct initial choice can be heralded -the |π −projective measurement, for instance, reveals whether we succeeded or did not.Moreover, the ARO(P ) operator itself will help realize the Grover oracle, which flips the phase of all states which are not the target distribution.The overall procedure can then be given as follows: First, we initialize the system in the state n−1 j=0 |j I |ψ j II , where |ψ j is the coherent encoding of the j th distribution from the set S. This has a complexity of O(log 2 (N )) in the state-space size, but is independent from δ.Then, in quantum parallel, we run the quantum mixing algorithm on register II (with complexity Õ(1/ √ δ) × O(log(N ))), followed by one application of the ARO(P ), followed by an un-mixing (the running of the mixing algorithm in reverse).This will, approximately (and up to a global phase of −1), introduce a relative phase of −1 at those |j terms, where the searching procedure yielded a success.This constructs the phase-flip operator.
The remainder is the operator which flips over the state n−1 j=0 |j I |ψ j II , which has a cost of O(log 2 (N )).Since at least one distribution, by the correctness of our mixing algorithm, yields the target distribution, this extra layer of amplitude amplification needs to be run in a randomized fashion, (since only the lower bound is known) [13], on the order of log(N ) times, until the correct initial distribution is found.The overall complexity, is then given with O 1/ √ δ × log 3/2 (N ) + O log 5/2 (N ) .The the error factor (multiplying both additive terms) which we have for simplicity omitted, and which guarantees that the distance from the target distribution is within ǫ in the trace distance, is given with O(log(1/ǫ) + log log(N )).The additional log log(N ) term stems from the fact that the ARO(P ) operator is applied O(log(N )) many times which accumulates errors.However, since the effective total error is given by the union bound, it will suffice to rescale the target precision by ǫ := ǫ/ log(N ), which yields the log log term [7].In practice, 1/ √ δ tends to dominate log(N ), thus we have the complexity O 1/ √ δ × log 3/2 (N ) , omitting the logarithmically contributing error terms.One of the features of our approach is that the actual output of the protocol is a particular coherent encoding of the target probability distribution.The classical probability distribution can then be recovered by a measurement of the output state.Having such a quantum output is desirable if our protocol is to be embedded in a larger algorithm where the preparation is just an initial step.Examples where this is assumed include hitting algorithms [7,8], and algorithms which require where from a (renormalized) part of the distribution [8,14].We point out that this property is not a necessary feature of all quantum algorithms for mixing -there are promising approaches which utilize decoherence to speed up mixing [3], which may preclude a coherent output.The property that the output is a coherent encoding of the target distribution is also maintained in extensions of our protocol, which we describe in the next section.

IV. LOWER BOUNDS AND EXTENSIONS
The approach we have described in the previous section trivially extends to monotonically increasing distributions as well -since the trace distance is invariant under the simultaneous permutations of the elements of the inputs, the same proof holds, where we use "ladder distributions" which are reversed in the order of the probabilities.However, the approach can be further extended to strongly unimodal distributions, and beyond, if additional knowledge about the target stationary distribution is assumed.At the end of this section, we will explain how such extensions can be obtained.Before this we will address two natural theoretical questions which arise from our approach.
First, in the previous section we have only provided an upper bound of the distance of the set S and an unknown monotonically decaying distribution.A-priori, it is not inconceivable that, for the restricted case of monotonically decaying distributions, it may be possible that there exists a significantly better choice, with a better bound -perhaps achieving a constant distance, instead of a log(N ) dependance.Here we will show that a significant improvement of our result is not possible.
Second, in our setting we have assumed a specific type of prior knowledge of the target stationary distribution.It is a fair question whether such knowledge, along with the capacity to prepare particular initial distributions, may already offer a significant speed up in the case of classical mixing.If this were the case, our result should not be considered as a true speed up of classical mixing.However, we show that the type of assumption we impose for the quantum algorithm does not help in the classical approach.

A. Lower bounds
The cornerstone of our result relies on the fact that there exists a log(N )-sized set of distributions in D ≥ N , which is no more than log 2 (N ) far (in terms of fidelity) from any distribution π in D ≥ N .It is a fair question whether the log(N ) dependence can be dropped altogether, and be replaced by a constant, in the complexity of the mixing algorithm.A necessary precondition for this, in the case of our approach, is the following claim: Claim 1 There exists a constant 0 ≤ η < 1, and a family of (arbitrary) distributions {µ (N ) } N , one for each state space size N , such that for every N ∈ AE, and for every π ∈ D ≥ N we have that D(µ (N ) , π) ≤ η.
If Claim 1 were to be true, and if the coherent encodings of distributions µ (N ) were efficiently constructable (say in time O(polylog(N ))), then this would constitute a significant improvement over our result.To get a bit of intuition, consider a generalization of Claim 1, where π k are arbitrary distributions.In this case the claim clearly does not hold.Consider any family {µ (N ) } N .Then for N ∈ AE, let µ min = min j (µ (N ) ) j be the smallest probability occurring in ν (N ) , and let j min = argmin j (µ (N ) ) j be the position of the smallest probability.Then we can choose π to be the Kronecker delta distribution at position j min which yields the distance 1 − µ min ≥ 1 − 1/N, which converges to 1 with the state space size N .
Unfortunately, similar simple arguments cannot be straightforwardly utilized in the case when π is in D ≥ N , and the proof is a bit more involved.Our theorem (stated later) is the negation of Claim 1, and we prove it by contradiction.
First, we show that Claim 1 implies a seemingly stronger claim denoted Claim 2: Claim 2 There exists a constant 0 ≤ η < 1, and a family of distributions {µ (N ) } i , µ (N ) ∈ D ≥ N defined for any state space size N , such that for every N ∈ AE, and for every π ∈ D ≥ N we have that D(µ (N ) , π) ≤ η.
The difference between Claim 1 and Claim 2 is that in Claim 2, the family µ (N ) is in the sets of monotonically decaying distributions.We prove this generically by sorting.Let Sort be a map from the set of distributions to the set of decaying distributions, which, for any distribution π outputs a distribution π ′ where the entries of π ′ are sorted in a decreasing fashion.While the sorted distributions may not be unique, we may, assume Sort picks a unique sorting which preserves the original ordering in the case multiple positions (states) have the same probability.Then the following lemma holds: Lemma 1.Let µ be an arbitrary N −state distribution and let π ∈ D ≥ N be a monotonically decaying distribution.Then D(π, µ) ≥ D(π, Sort(µ)).
Proof.We will prove that switching any two (adjacent) elements i, j in the distribution µ, which do not obey µ i ≥ µ j can only decrease the distance from π.This suffices for our lemma, as iterating such switching upon the distribution (in an, for us unimportant order) µ constitutes the well-known Bubble sort algorithm for sorting, which converges to a sorted distribution (list) in at most N 2 steps.Since each step only decreases the distance, by transitivity we have our claim.
We prove this by direct verification.Let i, j = i + 1 be such that µ i ≤ µ j , and let µ ′ be the distribution obtained from µ by switching labels i and j.Then we have: where x = µ i − π i , ǫ = µ j − µ i ≥ 0 and δ = π i − π j ≥ 0. We can without the loss of generality assume ǫ ≤ δ.
We have five possibilities: a) In the case a) the difference is zero, so the claim holds.In the case b) we have In the case c) we have: In the case d) we get Finally, in e) we have −x − x − ǫ − δ + x + ǫ + x + δ = 0.This proves the lemma.Thus, for our main goal, it will suffice to show that Claim 2 does not hold.Next, note that Claim 2 also implies the claim that the distributions ν are at most constantly far from all "ladder" distributions {σ i } i defined over the same state space.Moreover, by the extremity of these distributions, the inverse holds as well -ν being at most η far from all the {σ i } i distributions also implies that µ is at most η far from any monotonically decaying distribution.However, we shall not be using the inverse claim.
In the remainder we will assume µ is a decaying distribution over N states, and also that {σ i } i are the "ladder" distributions we have described in Eq. (9).
Recall that µ (since it is in D ≥ N ) can be represented in the convex-linear basis of {σ i } i (c.f.Eq (11)): where for 1 ≤ i < N q i = i(π i − π i+1 ) and q N = N π N .As we have seen, as µ is decaying, we have that q = (q i ) i is a probability distribution as well, uniquely specified for every µ.Retracing the steps of Theorem 1, we have also defined vectors {v k } N k=1 with (v k ) i = min{i, k} max{i, k} using which we can express the trace distance between µ and σ k as: Recall, the matrix V collects the vectors v k as rows (and columns, since it is symmetric).Then the k th row of the vector V q equals 1 − D(µ, σ k ).The claim we seek to show is that max q min k v T k q, that is the maximal overlap of the v k vectors, optimized over all probability distributions q depends on N and decays to zero in the limit of infinite state space.To establish this, we will use the specific properties of the matrix V and a few simple results from the theory of convex spaces.First, to remind the reader, the matrix V is defined as , thus it is a symmetric matrix.Moreover, as we will show later, it is also also invertible.In the language of convex spaces this simply means that a point in a convex set is uniquely specified by the distance of that point from the extreme points of the convex set.Next, intuitively, the minimum of distances from the points v k (understood as points in an N −dimensional Euclidian space) is attained by a point in the convex hull of those points, which is equally far from all of them.However, the validity of this claim can depend on the choice of the distance measure.We shall thus prove this claim for our case.For the remainder of the proof, we will have to evaluate the distance attained, that is, the value of max q min k V.q.For this, we will use the explicit inverse of the matrix V , which we give as a separate lemma for clarity.
Lemma 2. Let N ∈ AE, and let V be an N × N matrix defined with V ij = min{i, j}/ max{i, j}.Then V −1 exists, and it is a symmetric, diagonally strictly dominant, tri-diagonal matrix specified with: , and Proof.To see that V multiplied with V −1 , as defined above, is the identity can be easily checked by examining the diagonal and non-diagonal elements of V V −1 separately, so we omit this here.What remains to be seen is that V −1 is diagonally dominant.For the column (also row, since it is symmetric) i = 1 it is obvious.For 1 < i < N we have that: so the the inequality is strict.Finally, for i = N , we have that which is also a strict inequality.Thus the lemma holds.We now claim that since V −1 is strictly diagonally dominant the optimal value max q min k v T k q is attained at a q for which V q = α(1, . . ., 1), α ∈ Ê + -in this case, the distribution encoded by q is equally far from all ladder distributions.We prove this claim by contradiction.The negation of this claim implies that there exists a probability distribution q ′ such that V q ′ ≥ V q (that is, the distribution encoded by q ′ is closer to all ladder distributions), where the inequality is taken element -wise.Let V (q ′ − q) = y with y = (y 1 , . . ., y N ) T .Note, y is by assumption a non-negative vector, with at least one positive entry.Then (q − q ′ ) = V −1 y.Note that since q and q ′ are distributions, the sum of all elements of the vector (q − q ′ ) is 0.Then, by multiplying both sides of the last equality with the row (1, . . ., 1) from the left we get 0 = (1, . . ., 1)V −1 y.
Thus q such that V q = α(1, . . ., 1) T maximizes the minimal overlap (inner product between rows of V and q).Finally, we evaluate α.By applying V −1 from the left from the last equality, and multiply both sides with the row (1, . . ., 1) from the left, since q is a probability vector, we get: In other words, α is the inverse of the sum of all the elements in V −1 .We then have that: The expression above can be much simplified.We have that: where H x is the (generalized) Harmonic number function, for x > 0 defined with Thus we have that Harmonic numbers and the natural logarithm have a well-understood relationship, and for our purposes it will suffice that H N −1/2 ≥ log(N ).It follows that The last inequality immediately implies our claim as, putting all the observations together, we then have that: where D N denotes the set of all N −state distributions.That is, the optimal family of distributions, one for each state space size, will be at least 1 − 2 log(N ) far from the worst case distance from a distribution in D ≥ N .This shows not only that the Claim 1 does not hold, but that our algorithm is not far from optimal, and that if one had access to an oracle (or an efficient construction), of the coherent encodings optimal distributions, the overall algorithm would have the efficiency depending on log 1/2 (N ).In contrast, we constructively achieve log 3/2 (N ).While the first scaling is clearly better, the difference is moderate, given that both are polylog.For completeness, we phrase the results above as a theorem.
Theorem 2. For any family of distributions {µ (N ) } N ∈AE , each defined over the state space of size 1 < N ∈ AE, and for every N, there exists a monotonically decaying distribution π ∈ D ≥ N , such that We end this subsection with a remark on the V matrices, as a curiosity.We have in the proof above, shown that the inverse of a V matrix is, technically speaking, an inverse of a strictly diagonally dominant Stieltjes matrix, hence also an M-matrix.The problem of characterizing non-negative matrices, whose inverse is an M-matrix has attained significant interest in the field of matrix analysis [16].One result presented in that research is a characterization of so-called generalized ultrametric matrices, the subset of which is shown to, by inversion, yield diagonally dominant Stieltjes matrix, as is V −1 in our case.It is curious to note that, however, the V matrices we use do not fall into the class of the characterized generalized ultrametric matrices, thus may have independent interest in the field of matrix analysis.

B. Classical mixing for decaying distributions
The results of Section II show that, in the case of quantum mixing through un-searching, the overall mixing time strongly depends on the initial state -the mixing time is proportional to the inverse of the square-root of the fidelity between the initial state and the targeted stationary distribution state.A similar statement holds when we consider the trace distances between the initial distribution (encoded by the quantum state) and the stationary distribution.This is clear as the trace distance (of classical distributions), and fidelity (of their coherent encodings) are tightly connected.Moreover, this dependence of the mixing time on the distance between the initial and target state is robust -regardless of what particular initial state we pick, the mixing time just depends on the distance.
In the classical case, intuitively it is also clear that starting from a distribution, which is close to the target distribution, must speed up mixing.As an extreme example, if we wish to achieve mixing within ǫ, and we are given an initial state which is already within ǫ from the target, the mixing time (in the sense of the number of required applications of the walk operator) is zero.The question is, is the improvement as robust in the classical case, as it is in the quantum?Here we show that in the classical case, it is not, and being close to the target distribution helps just moderately.To show this we first clarify a fact about classical mixing times, the definition of which we repeat for the benefit of the reader.The mixing time τ (ǫ), within error ǫ, for Markov chain P, with stationary distribution π is defined as: where D(π, σ) denotes the total variational distance on distributions π, σ, so D(π, σ) = 1/2 j |π j − σ j |.The mixing time asks that the state P t σ be ǫ close to π for all initial states σ, that is, it looks for the worst case initial σ.By convexity, and the triangle inequality, the worst case initial state σ will be a Kronecker-delta distribution with total mass at some state space element.We can introduce an analogous mixing time quantity, relative mixing, which extends standard mixing time in that the initial state is guaranteed to be within η from the target state: Now, suppose we are given a Markov chain P , and we wish to evaluate a bound on τ η (ǫ) for this Markov chain.In order to capture robust properties, the definition above asks for the worst case as well (as the distance requirement it must hold for all σ s.t.D(σ, π) ≤ η), so to bound the relative mixing times, we can construct the following distribution ρ : where we choose σ worse to be the worst-case initial state for MC P if we wish to mix it within ǫ/η.Then we have that: so ρ is within η distance from π, as required.Now, we are looking for an integer t ≥ 0, such that: Then we have: hence, we require a t such that However, since σ worse was chosen to be the worst case state for mixing within ǫ η , we have that Thus the lower bound of the relative mixing time is just the standard mixing time, where ǫ is replaced with ǫ/η.Thus, we get the following lower bounds for relative mixing: It is now clear that the relative mixing has the same dependence on 1/δ, hence the improvement is slight.To make a fair comparison to the quantum mixing case we have shown, we can set η = 1/2 (for our algorithm, the trace distance is always larger than this), and see that the lower bound for classical mixing is lower bounded by O(1/δ log(1/(4ǫ))), which is essentially the same scaling, as for standard mixing time.
For completeness, we point out that there have been prior works asking a related question, which establish that the mixing time is an essentially robust quantity, independent from the setting (that is, in what context) the mixing is applied.We refer the reader to [17] for a collection of such results.

C. Extensions
While the main theorem we have used in our approach assumes monotonic (decaying or increasing) distributions, this can be easily further extended.Assume, for instance, we know that π, the target distribution over N = 2 n elements only decays to some element k, its behavior is unknown from that point on, and the total support up to element k is p = k i=1 π i .Consider for the moment the truncated distribution π, obtained by setting all probabilities after k to zero and re-normalizing (by multiplying with 1/p.)By Theorem 1, we know that there exist a (efficiently constructable) log-sized set S of "ladder" distributions over N, such that for at least one of them, σ, it holds that D(σ, π) ≤ 1 − (n + 1) −1 /2.But then we have, by the triangle inequality, homogeneity of the trace norm, and the fact that the maximal distance is unity, that: This implies that the total complexity of the mixing algorithm we have described, applied to this setting will be increased, multiplicatively, by p −1 .Note that the same reasoning will hold, in the mirrored case, where we know that π is increasing from some element k, with corresponding support of p.This simple observation already allows us to efficiently prepare target distributions whose probability mass functions are convex (decaying to some element, and increasing from that element).To see this note that either the mass of the distribution prior its minimum, or after, must be above or equal to 1/2.Thus, we can simply run the algorithm assuming both options which then yields just a constant multiplicative overhead of 4 (two runs ×(1/2) −1 ).
Another extension of this observation is the case where relative to a known order, the distribution is, for a known contiguous subset of the state space elements (with total weight p), decaying or increasing.In this case as well, the mixing time only suffers a 1/p pre-factor.In particular, this implies that distributions which are strictly unimodal (meaning increasing to some element, and decreasing from that element) can also be efficiently prepared, provided the mode k is known.To see this holds, note that in the strictly unimodal case as described, the total mass of the probability distribution either up to the k th element, or after, must be above 1/2.Then, the ladder distributions would only be constructed to the k th element, which again can be done with polylog(N ) overhead.Unfortunately, for this case, the knowledge of the position of the mode k is neccessary -strictly unimodal distribution also contain the Kronecker-delta distribution.For our approach the capacity to efficiently mix to (a distribution arbitrarily close to) an arbitrary Kronecker-delta distribution would immediately imply efficient mixing for all distributions.This is beyond what we can claim.

V. DISCUSSION
In this work, we have addressed the problem of attaining stationary distributions of Markov chains using a quantum approach.We have built on observations, originally made by Richter and Childs, that quantum hitting algorithms run in reverse can serve as mixing algorithms.These observations initially received little attention due to their apparent inefficiency -an a-priori square-root scaling with the system size N .We have shown, in contrast, that in the cases when it is beforehand known that the target distribution is decaying, relative to a known order on the state space, the dependency on the system size is only log 3/2 (N ).We have also shown the essential optimality of this bound for our approach, in particular, that an explicit dependence on the system size is unavoidable and logarithmic.Following this we have shown how our approach easily extends to a much wider class of distributions, including concave distributions, but also strictly unimodal distributions, where the position of the mode is known.Unfortunately, it is the case that such assumptions are often not satisfied in many physics-inspired applications which require mixing of Markov chains.For instance, in statistical physics, the mode is often the quantity explicitly sought, when the distribution is known to be unimodal.In other uses, e. g. the computation of a permanent of the matrix, the underlying state space is not simply characterized at all, and knowing the order would already imply the solution to the problem.
Nonetheless, other applications involving Markov chain mixing, such as artificial intelligence [14] and applications relying on bayesian inference (which often rely on MCMC) may have more instances where our approach may yield a genuine quantum speed-up.Moreover, the quantum algorithm we have provided realizes a coherent encoding of the stationary distribution, which can be used as a fully quantum subroutine, for instance in the preparation of initial states in e.g.hitting algorithms [7,8].
Another possibility includes settings where the shape of the target distribution is known (say Gaussian), and the mode is also, however, we are interested in learning higher moments of the distribution by mixing and sampling.For instance, all correlations of Gaussian states in quantum optics are captured by the second moments, whereas the mode and mean coincide, and reveal nothing about correlations.We leave the applications of our results for future work.From a more theoretical point of view, the results of this work highlight another difference between classical and quantum mixing, in particular, the approaches which rely on reversing hitting algorithms.In the classical mixing case, the choice of the initial state does not substantially contribute to overall mixing efficiency.In contrast, in the quantum case, improvements in the choice of the initial state can, as we have shown, radically alter the overall performance.
While the conjecture that quantum approaches to mixing can yield a generic quadratic speed-up in all cases remains open, our approach extends the class of Markov chains for which such a speed up is possible.Notably, unlike in other studied cases where speed up has been shown, our assumptions lay only on the structure of the stationary distribution of the Markov chain, rather directly on the structure (underlying digraph) of the Markov chain itself.
reflects over the state |0 .The operator ref (A) is a reflector the subspace A = span({U P |i |0 } i ).A second reflector is established by defining a second diffusion operator, realizing a walk step from the second copy of the graph back to the first: V P = SW AP I,II U P SW AP I,II .From here, we proceed analogously as in the case for the set A, to generate the ref (B) operator, reflecting over B = span({V P |0 |j } j ).The Szegedy walk operator is then defined as W (P ) = ref (B)ref (A