Efficient quantum algorithm for all quantum wavelet transforms

Wavelet transforms are widely used in various fields of science and engineering as a mathematical tool with features that reveal information ignored by the Fourier transform. Unlike the Fourier transform, which is unique, a wavelet transform is specified by a sequence of numbers associated with the type of wavelet used and an order parameter specifying the length of the sequence. While the quantum Fourier transform, a quantum analog of the classical Fourier transform, has been pivotal in quantum computing, prior works on quantum wavelet transforms (QWTs) were limited to the second and fourth order of a particular wavelet, the Daubechies wavelet. Here we develop a simple yet efficient quantum algorithm for executing any wavelet transform on a quantum computer. Our approach is to decompose the kernel matrix of a wavelet transform as a linear combination of unitaries (LCU) that are compilable by easy-to-implement modular quantum arithmetic operations and use the LCU technique to construct a probabilistic procedure to implement a QWT with a known success probability. We then use properties of wavelets to make this approach deterministic by a few executions of the amplitude amplification strategy. We extend our approach to a multilevel wavelet transform and a generalized version, the packet wavelet transform, establishing computational complexities in terms of three parameters: the wavelet order M, the dimension N of the transformation matrix, and the transformation level d. We show the cost is logarithmic in N, linear in d and superlinear in M. Moreover, we show the cost is independent of M for practical applications. Our proposed QWTs could be used in quantum computing algorithms in a similar manner to their well-established counterpart, the quantum Fourier transform.


I. INTRODUCTION
As a solid alternative to the Fourier transform, wavelet transforms are a relatively new mathematical tool with diverse utility that has generated much interest in various fields of science and engineering over the past four decades. Wavelet-like functions have existed for over a century, a prominent example being what is now known as the Haar wavelet. The interest is due to the attractive features of wavelets [1][2][3][4]. Such functions are differentiable, up to a particular order, and are local in both the real and dual spaces. They provide an exact representation for polynomials up to a certain order, and a simple yet optimal preconditioner for a large class of differential operators. Crucially, wavelets provide structured and sparse representations for vectors, functions, or operators, enabling data compression and the construction of faster algorithms. These appealing features of wavelets and their associated transforms make them advantageous for numerous applications in classical computing over their established counterpart, the Fourier transform.
With the wavelet transforms' diverse utility and extensive use in classical computing, a natural expectation is that a quantum analog of such transforms will find applications in quantum computing, especially for developing faster quantum algorithms and quantum data compression. Wavelets have already been used in quantum physics and computation [5][6][7][8][9][10][11][12]. However, prior works on developing a quantum analog for wavelet transforms are limited to a few representative cases [13][14][15][16][17]. In contrast, the quantum Fourier transform, a quantum analog of the classical Fourier transform, has been extensively used in quantum computing as a critical subroutine for many quantum algorithms.
Unlike the Fourier transform, a wavelet transform is not unique and is specified by the type of wavelet used and an order parameter. In particular, a wavelet transform is defined by a sequence of numbers, known as the filter coefficients, associated with the type of wavelet used and an even number known as the order of the wavelet that specifies the length of the sequence. Given the sequence, a unitary matrix known as the kernel matrix of the wavelet transform is constructed, the application of which on a vector yields the single-level wavelet transform of the vector. Such a transform partitions the vector into two components: a low-frequency or average component and a high-frequency or difference component (see FIG. 1). To expose the multi-scale structure of the vector, or a function for that matter, the wavelet transform is recursively applied to the low-frequency component, yielding the multi-level wavelet transform of the vector. The wavelet packet transform is a generalization of the multi-level wavelet transform, in which the wavelet transform is recursively applied to both the low- and high-frequency components. We refer to a quantum analog of the (single-) multi-level and packet wavelet transforms as the (single-) multi-level and packet QWTs, respectively.
This paper proposes and analyzes a conceptually simple and computationally efficient quantum algorithm for executing single-level, multi-level, and packet QWTs associated with any wavelet and any order on a quantum computer. Our approach is based on decomposing a unitary associated with a wavelet transform in terms of a linear combination of a finite number of simple-to-implement unitaries and using the linear combination of unitaries (LCU) technique [18] to implement the original unitary. Specifically, we decompose the kernel matrix of the wavelet transform, associated with a wavelet of order M, as a linear combination of M simple-to-implement unitaries and, by the LCU technique, construct a probabilistic procedure for implementing the single-level QWT. The success probability of this approach is a known constant by properties of the wavelet filters. We use this known success probability to make the implementation deterministic using a single ancilla qubit and a few rounds of amplitude amplification.
Having an implementation for the single-level QWT and recursive formulae describing the multi-level and packet wavelet transforms based on single-level transforms, we construct quantum algorithms for multi-level and packet QWTs. We establish the computational complexity of these transformations in terms of three parameters: the wavelet order M, the dimension of the wavelet-transform matrix N, and the level of the wavelet transform d. Without loss of generality, we assume that our main parameter of interest N is a power of two, N = 2^n, and report the computational costs with respect to n, the number of qubits that the wavelet transforms act on.
We summarize our main results on computational costs of the described transformations in the following three theorems. We establish these theorems in subsequent sections after providing a detailed description of our algorithms.
Theorem 1 (Single-level QWT with logarithmic gate cost). A single-level QWT on n qubits, associated with a wavelet of order M, can be implemented using ⌈log_2 M⌉ + 1 ancilla qubits and O(n) + O(M^{3/2}) toffoli and elementary one- and two-qubit gates.
Theorem 3 (Packet QWT).A d-level packet QWT on n qubits can be achieved using ⌈log 2 M ⌉ + 1 ancilla qubits and ) toffoli and elementary one-and two-qubit gates.
We remark that the number of levels for the multi-level or packet QWTs is upper bounded by n, i.e., d ≤ n. Hence, as a corollary of Theorems 2 and 3, the gate cost for these transformations is at most quadratic in n = log_2 N. We show that the gate costs reported in the above theorems are independent of M for practical applications and that only a few ancilla qubits suffice to implement the multi-level and packet QWTs. We compare the allowable range for the order parameter M with the values used in practical applications in the discussion section.
The rest of this paper proceeds as follows. We begin by describing the notation we use throughout the paper. Then we detail our approach for implementing a single-level QWT by simple modular arithmetic operations in §II. We describe the multi-level and packet QWT in §III, followed by detailed complexity analysis for our algorithms in §IV. Finally, we discuss our results and conclude in §V.
Notation: We refer to A ∈ C^{2^n × 2^n} as an n-qubit matrix and denote the n-qubit identity by 1_n. Throughout the paper, we use the symbol M for the wavelet order and m = ⌈log_2 M⌉. The wavelet order is an even positive number M = 2K with K a positive integer called the wavelet index; the symbol K is used for M/2. We use zero indexing for iterable mathematical objects such as vectors and matrices. Qubits of an n-qubit register are ordered from right to left, i.e., the rightmost (leftmost) qubit in |q_{n−1}, ..., q_1, q_0⟩ representing the state of an n-qubit register that encodes the binary representation of an integer q is the first (last) qubit. The first and last qubits are also referred to as the least-significant bit (LSB) and the most-significant bit (MSB). Qubits in a quantum circuit are ordered from bottom to top: the bottom qubit is the LSB and the top qubit is the MSB.

II. SINGLE-LEVEL QWT
This section describes our algorithm for executing a single-level wavelet transform on a quantum computer. Such a transformation is specified by a kernel matrix. We describe this matrix in §II A and decompose it as a linear combination of a finite number of unitaries. The decomposition enables a prepare-select-unprepare-style procedure for probabilistic implementation of the desired transformation that we cover in §II B. In §II C, we describe how purposefully reducing the success probability yields a perfect amplitude amplification. Finally, in §II D and §II E, we provide a compilation for the select and prepare operations based on simple-to-implement modular arithmetic operations.
FIG. 1. ... Eq. (32) on three-qubit basis states. Three-level wavelet transforms are shown for simplicity. In the first level, the size-N vector ψ is partitioned into two size-N/2 vectors: an average vector a_1 = Hψ and a difference vector d_1 = Gψ with H and G defined in Eq. (2). For the quantum (packet) wavelet transform, the components of ψ are amplitudes of a quantum state. The wavelet transform is recursively applied to the average vector in the multi-level wavelet transform. In contrast, the packet transform applies the wavelet transform to both the average and difference vectors.

A. The wavelet kernel matrix as a linear combination of unitaries
We begin this subsection by briefly describing the kernel matrix associated with a wavelet transform. We refer to [4, Chap. 2.1] for a review of wavelet formalism and how this matrix is constructed. The kernel matrix W of a wavelet transform is specified by the wavelet filter coefficients: a sequence of numbers (h_0, h_1, ..., h_{M−1}) that depend on the type of wavelet and satisfy Eq. (1), where the even number M is the wavelet order. Specifically, an example of the kernel matrix W for a fourth-order wavelet (M = 4) is as follows. The unitary matrix U here is a modification of the unitary W that we use for decomposing W as a linear combination of unitaries.
To this end, let us first define the circular downshift and upshift permutation operations S↓_n and S↑_n, where the matrix size is 2^n × 2^n. Note that these operations are inverse of each other and their action on the n-qubit basis state |j⟩ is S↓_n |j⟩ = |(j + 1) mod 2^n⟩ and S↑_n |j⟩ = |(j − 1) mod 2^n⟩. Upon acting on a vector with 2^n components, S↓_n / S↑_n shifts the vector's components one place downward/upward with wraparound. Similarly, when acting on a matrix with 2^n rows from the left side, S↓_n / S↑_n shifts the rows of the matrix one place downward/upward with wraparound.
To construct an LCU decomposition for the n-qubit unitary W, the kernel matrix associated with a wavelet of order M = 2K, we first transform it into another unitary U by K − 1 downshift permutations of the rows in the lower half of W. Specifically, with H and G the upper and lower halves of W, the unitary U has H as its upper half and G′ as its lower half, where G′ is obtained by K − 1 downshift permutations of the rows of G. Let us now represent K − 1 upshift permutations on n qubits by ushift_n, defined in Eq. (8) through its action on the n-qubit basis state |j⟩. Then we have Eq. (9), i.e., W is obtained by K − 1 upshift permutations of the rows in the lower half of U. We now decompose the unitary U as a linear combination of M unitaries U_ℓ as in Eq. (10), where Z is the Pauli-Z operator and the unitary P_ℓ in Eq. (11) is a permutation matrix that is obtained from U as follows: all entries of U with value ±h_ℓ are replaced with 1 and all other nonzero entries are replaced with 0. Because W is unitarily equivalent to U by Eq. (9), the LCU decomposition in Eq. (10) provides a similar LCU decomposition for W.
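The construction above can be checked numerically on a small instance. The following is a minimal sketch, not the authors' code: it assumes the standard fourth-order Daubechies filter, assumes that P_ℓ acts on basis states as a modular subtraction or addition followed by the perfect shuffle (as described in §II D), and assumes that for even ℓ the Pauli-Z acts on the most-significant qubit after the permutation. The function names are hypothetical.

```python
import numpy as np

def daubechies4():
    # Standard fourth-order Daubechies filter coefficients (M = 4).
    s3 = np.sqrt(3.0)
    return np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2.0))

def perm_index(l, j, N):
    # Assumed action of P_l on |j>: subtract (same parity) or add
    # (opposite parity) modulo N, then apply the perfect shuffle.
    if (j - l) % 2 == 0:
        return ((j - l) % N) // 2
    return N // 2 + (((j + l) % N) - 1) // 2

def lcu_terms(M, n):
    # Unitaries U_l: P_l for odd l, (Z on the MSB) * P_l for even l (assumed order).
    N = 2 ** n
    z_msb = np.diag([1.0] * (N // 2) + [-1.0] * (N // 2))
    terms = []
    for l in range(M):
        P = np.zeros((N, N))
        for j in range(N):
            P[perm_index(l, j, N), j] = 1.0
        terms.append(z_msb @ P if l % 2 == 0 else P)
    return terms

def kernel_from_lcu(h, n):
    # U = sum_l h_l U_l; W follows by K - 1 upshifts of the lower-half rows of U.
    M, N = len(h), 2 ** n
    K = M // 2
    U = sum(c * T for c, T in zip(h, lcu_terms(M, n)))
    W = U.copy()
    W[N // 2:, :] = np.roll(U[N // 2:, :], -(K - 1), axis=0)
    return U, W

h = daubechies4()
U, W = kernel_from_lcu(h, 3)
print(np.allclose(U.T @ U, np.eye(8)), np.allclose(W.T @ W, np.eye(8)))
```

Under these assumptions, the first row of W carries (h_0, h_1, h_2, h_3) and the first row of its lower half carries (h_3, −h_2, h_1, −h_0), which is the familiar layout of the Daubechies kernel matrix.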

B. Probabilistic implementation for the single-level QWT
The decomposition in Eq. (10) enables a prepare-select-unprepare-style method [18] for probabilistic implementation of U. To this end, let prep and unprep be the state-preparation operations defined in Eq. (12), where m = ⌈log_2 M⌉ is the number of ancilla qubits, and let select be the operation in Eq. (13) that applies U_ℓ, defined in Eq. (10), to the system register when the ancilla register is in the state |ℓ⟩. Then for any n-qubit state |ψ⟩, Eq. (14) holds: applying prep, select and unprep to |0^m⟩|ψ⟩ and projecting the ancilla qubits onto the |0^m⟩ state yields U|ψ⟩ on the system register, while the remaining component |⊥⟩ has no support on |0^m⟩ in the ancilla register. Equation (14) yields a probabilistic implementation for U. Because U and W are unitarily equivalent, by Eq. (9), we also have a probabilistic implementation for W with the same success probability. In particular, we define the probabilistic QWT pqwt as in Eq. (20), which satisfies Eq. (21), with |⊥⟩ as above and the |1⟩-controlled unitary Λ_1(ushift) defined in Eq. (9).
The success amplitude of this approach is known and its value is 1/h. As shown in FIG. 2(Left), the success amplitude is greater than 1/4 for a wide range of wavelet order. We now present an alternative approach for a probabilistic implementation of the single-level QWT. The state preparation of this approach is simpler and could be preferred in practical applications. Instead of preparing the state with square-root coefficients by prep in Eq. (12), in this approach we use the operation linprep defined in Eq. (22), which prepares the state with linear coefficients. For any n-qubit state |ψ⟩ we then have Eq. (23), where |⊥⟩ and select are as those in Eq. (14). This equation follows by projecting the ancilla qubits onto the |0^m⟩ state. The success amplitude of this approach is 1/√M. As shown in FIG. 2(Right), the magnitudes of wavelet coefficients h_ℓ with high index ℓ are negligibly small. Consequently, the success amplitude becomes effectively independent of M for practical applications.
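The two success amplitudes can be illustrated with a short numerical sketch. It assumes that h denotes the sum of |h_ℓ| (consistent with p_ℓ := |h_ℓ|/h in §II E), that the signs of h_ℓ are absorbed on the unprepare side in the square-root variant, and that the linear variant pairs a uniform superposition with the amplitudes h_ℓ. The cyclic shifts below are toy stand-ins for the unitaries U_ℓ, and all function names are hypothetical.

```python
import numpy as np

def sqrt_coefficients(h):
    # Square-root variant: amplitudes sqrt(|h_l| / sum|h|); signs kept on one side (assumed).
    h = np.asarray(h, dtype=float)
    p = np.abs(h) / np.sum(np.abs(h))
    return np.sqrt(p), np.sign(h) * np.sqrt(p)

def linear_coefficients(h):
    # Linear variant: uniform amplitudes on one side, h_l on the other (assumed).
    h = np.asarray(h, dtype=float)
    return np.full(len(h), 1.0 / np.sqrt(len(h))), h

def projected_block(col, row, unitaries):
    # Operator left on the system when the ancilla starts and ends in |0^m>.
    return sum(a * b * U for a, b, U in zip(row, col, unitaries))

s3 = np.sqrt(3.0)
h = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2.0))
# Toy unitaries standing in for U_l: cyclic shifts of a 4-dimensional system.
toy = [np.roll(np.eye(4), l, axis=0) for l in range(len(h))]
target = sum(c * U for c, U in zip(h, toy))

amp_sqrt = 1.0 / np.sum(np.abs(h))   # success amplitude 1/h
amp_lin = 1.0 / np.sqrt(len(h))      # success amplitude 1/sqrt(M)
block_sqrt = projected_block(*sqrt_coefficients(h), toy)
block_lin = projected_block(*linear_coefficients(h), toy)
print(amp_sqrt, amp_lin)
```

In both variants the projected operator is the target linear combination scaled by the corresponding success amplitude, which is the quantity that the amplitude amplification of §II C acts on.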

C. Reduction of success amplitude for perfect amplitude amplification
The success amplitude of the described probabilistic approaches for implementing the single-level QWT is a known constant value. For perfect amplitude amplification, we purposefully reduce the success amplitude using one extra ancilla qubit. This end is achieved by applying a rotation gate on the extra qubit initialized in |0⟩. A few rounds of amplitude amplification then yield the success state with unit probability.
The success amplitude of the probabilistic implementation pqwt in Eq. (21) is sin α. This amplitude is known and has a value greater than 1/4 as discussed in §II B. Let θ < α be the angle defined in the equation below and let R(θ) be the rotation gate defined as R(θ)|0⟩ := cos θ|0⟩ + sin θ|1⟩, then by Eq. (21) and for any n-qubit state |ψ⟩ we have where |⊥′′⟩ is an (m + n + 1)-qubit state that satisfies (⟨0^{m+1}| ⊗ 1_n)|⊥′′⟩ = 0. The success amplitude is now sin(π/14), enabling perfect amplitude amplification. Indeed, by only three rounds of amplitude amplification, W is applied on |ψ⟩ and all m + 1 ancilla qubits end up in the all-zero state.

FIG. 2. (Left) The success amplitude of the probabilistic implementation in Eq. (21) for a single-level QWT for commonly used wavelets. The success amplitude is known and is greater than 1/4 for a wide range of wavelet order; it is greater than 0.31 for the range of wavelet order used in practical applications. For perfect amplitude amplification, the former (latter) value needs three (two) rounds of amplitude amplification. (Right) The magnitude of the wavelet coefficient |h_ℓ| as a function of the index ℓ for the Daubechies wavelet with order M = 28, 30, 32, 34. The zoomed-in part shows that the wavelet coefficients with higher indexes are negligibly small, and the number of small coefficients increases with the wavelet order. The magnitude of wavelet coefficients for other wavelets has a similar pattern.
We remark that the success amplitude is greater than 0.31 for the range of wavelet order used in practical applications; see FIG. 2(Left). In this case, we reduce the success amplitude to sin(π/10) < 0.31 by setting cos θ = sin(π/10)/sin α and achieve perfect amplitude amplification with only two rounds of amplitude amplification.
We use oblivious amplitude amplification because the input state |ψ⟩ is unknown. To this end, let R_n = 2|0^n⟩⟨0^n| − 1_n be the n-qubit reflection operator with respect to the n-qubit zero state |0^n⟩ and let A be the amplitude amplification operator in Eq. (27). Then Eq. (28) holds [19, Lemma 2.2]. Therefore, the unit success probability is achieved by three executions of amplitude amplification (t = 3).
The success amplitude of the second approach based on Eq. (23) is sin α = 1/√M. In this case, we reduce the success amplitude to sin(π/(2(2t + 1))) by applying the rotation gate R(θ) on the extra qubit initialized in the |0⟩ state, with the number of rounds t and the angle θ chosen such that cos θ = sin(π/(2(2t + 1)))/sin α. Then we achieve the desired state with unit success probability by t rounds of amplitude amplification, i.e., W is applied on |ψ⟩ and all m + 1 ancilla qubits end up in the all-zero state.
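The choice of the number of rounds and of the rotation angle can be sketched as follows. This is an illustration under the assumption that t is taken as the smallest integer for which the reduced amplitude sin(π/(2(2t + 1))) does not exceed sin α, and that the rotation satisfies cos θ · sin α = sin(π/(2(2t + 1))); the function names are hypothetical.

```python
import math

def amplification_schedule(sin_alpha):
    # Smallest t with sin(pi / (2(2t+1))) <= sin_alpha, and the angle theta
    # with cos(theta) * sin_alpha equal to that reduced amplitude.
    if not 0.0 < sin_alpha <= 1.0:
        raise ValueError("sin_alpha must lie in (0, 1]")
    t = 0
    while math.sin(math.pi / (2 * (2 * t + 1))) > sin_alpha:
        t += 1
    reduced = math.sin(math.pi / (2 * (2 * t + 1)))
    theta = math.acos(reduced / sin_alpha)
    return t, theta, reduced

def amplified_amplitude(reduced, t):
    # Amplitude after t rounds: sin((2t+1) * arcsin(reduced)).
    return math.sin((2 * t + 1) * math.asin(reduced))

for s in (0.25, 0.31):
    t, theta, reduced = amplification_schedule(s)
    print(s, t, round(amplified_amplitude(reduced, t), 12))
```

For a success amplitude just above 1/4 this gives three rounds with reduced amplitude sin(π/14), and for 0.31 it gives two rounds with reduced amplitude sin(π/10), in line with the values quoted in the text.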

D. Implementing select by modular quantum arithmetic
Here we describe our approach for implementing the select operation by simple modular arithmetic operations on a quantum computer. As per Eq. (13), select applies U_ℓ on the second register |j⟩ based on the value of ℓ encoded in the first register |ℓ⟩.
If ℓ is odd, then U_ℓ = P_ℓ by Eq. (10). Otherwise, U_ℓ is a product of P_ℓ and a single Pauli-Z on the first qubit of the second register. That is to say that U_ℓ and P_ℓ are equivalent up to a |0⟩-controlled-Z operation; the control qubit is the qubit representing the least-significant bit (LSB) of ℓ and the target qubit is the one representing the most-significant bit (MSB) of j. Implementing select is therefore achieved by an implementation for P_ℓ.
The n-qubit permutation P_ℓ in Eq. (11) transforms the n-qubit basis state |j⟩ as P_ℓ|j⟩ = |((j − ℓ) mod N)/2⟩ if j and ℓ have the same parity, and P_ℓ|j⟩ = |N/2 + (((j + ℓ) mod N) − 1)/2⟩ if j and ℓ have opposite parity, with N = 2^n. This transformation can be implemented by modular quantum addition add and subtraction sub, defined as add|ℓ⟩|j⟩ := |ℓ⟩|(j + ℓ) mod N⟩ and sub|ℓ⟩|j⟩ := |ℓ⟩|(j − ℓ) mod N⟩, and by the quantum perfect shuffle transformation defined as shuffle|q_{n−1} ... q_1 q_0⟩ := |q_0 q_{n−1} ... q_1⟩, which performs the transformation |q⟩ → |q/2⟩ if q is an even number and |q⟩ → |N/2 + (q − 1)/2⟩ if q is odd (see FIG. 1).
For clarity, we remark that here and in the following |ℓ⟩ is an m-qubit basis state with m < n and |j⟩ is an n-qubit basis state.
To implement P_ℓ, we use one extra qubit, the parity qubit, and first apply the operation par in Eq. (33), which flips the parity qubit based on the parity of ℓ and j; the parity of a number is 0 if it is even and 1 otherwise. This operation can be implemented using two cnot gates, one controlled on the least-significant qubit of the register encoding ℓ and the other controlled on the least-significant qubit of the register encoding j. The target qubit for each cnot is the parity qubit.
Having computed the parity by par, we then apply sub to the last two registers if the parity qubit is |0⟩ and apply add to these registers if the parity qubit is |1⟩, followed by the shuffle operation in Eq. (32) on the last register. By these operations, the state of the parity qubit, the m-qubit register encoding ℓ, and the n-qubit register encoding j transform such that the last register encodes a value less than N/2 if the parity qubit is |0⟩ and a value of at least N/2 if the parity qubit is |1⟩, where N = 2^n. We finally erase the parity qubit to achieve an implementation for P_ℓ. To this end, we note that the parity qubit is |1⟩ only if the value encoded in the last register is at least N/2; see Eq. (30). Hence a cnot from the qubit representing the MSB of the value encoded in the system register to the parity qubit erases this qubit.
The quantum circuit in the dotted-line box in FIG. 3 gives an implementation for the select operation based on the described approach. The sequence of swap gates in this circuit gives a gate-level implementation for shuffle in Eq. (32).
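The compute-and-erase pattern for the parity qubit can be traced classically on computational basis states. The sketch below is an illustration only: it assumes that add and sub act as addition and subtraction modulo N = 2^n on the system register, and the function names are hypothetical.

```python
def shuffle(q, n):
    # Perfect shuffle |q_{n-1} ... q_1 q_0> -> |q_0 q_{n-1} ... q_1>:
    # a cyclic rotation of the n bits by one place to the right.
    return (q >> 1) | ((q & 1) << (n - 1))

def select_on_basis(l, j, n):
    # Classical trace of the permutation part of select on |par=0>|l>|j>.
    N = 1 << n
    par = (l & 1) ^ (j & 1)                     # two CNOTs from the LSBs of l and j
    j = (j + l) % N if par else (j - l) % N     # add if parities differ, else sub (assumed mod N)
    j = shuffle(j, n)                           # perfect shuffle of the system register
    par ^= j >> (n - 1)                         # CNOT from the MSB of sys erases the parity
    return par, l, j

n, M = 4, 6
for l in range(M):
    images = [select_on_basis(l, j, n) for j in range(1 << n)]
    # The parity qubit returns to 0 and the system register is permuted.
    print(l, all(p == 0 for p, _, _ in images),
          sorted(x for _, _, x in images) == list(range(1 << n)))
```

The trace shows why a single cnot suffices for uncomputation: after the shuffle, the most-significant bit of the system register equals the parity computed at the start.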

E. A compilation for state-preparation operations
Here we provide procedures for implementing the linprep and prep operations that prepare states with linear and square-root coefficients, respectively. We begin with an implementation for linprep in Eq. (22) using the rotation gate R(θ_ℓ), for some known angle θ_ℓ, and the increment gate that performs the map |ℓ⟩ → |ℓ + 1⟩ for |ℓ⟩ an m-qubit basis state. Notice that the increment gate is indeed the downshift permutation S↓_m defined in Eq. (4) and its inverse is the upshift permutation S↑_m.
The linprep operation prepares a quantum state with amplitudes given by the wavelet filter h = (h_0, ..., h_{M−1})^⊤, a column vector of M real numbers that satisfy Eq. (1). By the procedure given in Ref. [20], the wavelet filter vector h of length M = 2K can be obtained by applying a sequence of K unitaries U_{K−1} ··· U_1 U_0 to a column of the identity matrix, where e_ℓ is the ℓth column of the M-by-M identity matrix and the unitary U_ℓ is constructed from rotation gates R(θ_ℓ) as illustrated in FIG. 4(a). As an example, for M = 6 the unitaries are built from c_ℓ := cos θ_ℓ and s_ℓ := sin θ_ℓ. Having classically precomputed the rotation angles (θ_0, θ_1, ..., θ_{K−1}) by the procedure in Ref. [20], we construct a quantum circuit for linprep as follows. Let m = ⌈log_2 M⌉. For M that is not a power of 2, we pad (2^m − M)/2 zeros from left and right to the wavelet filter vector h to have a vector (0, ..., 0, h_0, ..., h_{M−1}, 0, ..., 0)^⊤. Then the unitaries U_ℓ are modified accordingly so that U_{K−1} ··· U_1 U_0 e_{2^{m−1}} yields the modified wavelet filter vector. A diagrammatic representation of this approach is shown in FIG. 4(b) for M = 6. For each θ_ℓ with even ℓ, first we shift elements of the vector one place to the right, shown in FIG. 4(c) by the right arrow, to be able to apply the rotations in parallel on consecutive pairs of the vector elements and then shift the vector elements one place to the left. Because the rotations are in parallel, we can decompose the associated unitary as a tensor product of an identity and a rotation gate as 1_{m−1} ⊗ R_ℓ. Shifting to the right (left) is implemented by the increment gate (inverse of the increment gate) on a quantum computer. The inverse of the increment gate is applied (2^m − M)/2 times at the end to achieve the desired amplitudes (h_0, ..., h_{M−1}, 0, ..., 0)^⊤. The quantum circuit in FIG. 4(d) illustrates the case where M = 6.
FIG. 3. Equivalent quantum circuits for executing a single-level QWT comprised of high-level operations. Three registers are used: the parity register par (one qubit), the ancilla register anc (m qubits), and the system register sys (n qubits). The state of the sys register is in a superposition of |j⟩ states for different values of j and the state of the anc register, after applying prep with action given in Eq. (12), is in a superposition of |ℓ⟩ states for different values of ℓ. The gates inside the dotted-line box implement the select operation in Eq. (13) as follows. The |0⟩-controlled Z is applied as per Eq. (10). The first two cnots compute the parity of j and ℓ by their LSB. Then, controlled on the parity qubit, we apply sub (parity zero) or add (parity one). The sequence of swap gates implements the shuffle operation in Eq. (32). The subsequent cnot resets the parity qubit to |0⟩ because the state of par is flipped to |1⟩ only if j and ℓ have opposite parity as per Eq. (33); otherwise it stays |0⟩. If par is |1⟩, the MSB of the sys register is in the state |1⟩ as the value encoded in sys is at least N/2 by Eq. (30), so the last cnot resets the parity qubit. The cnot has no action if par is |0⟩. This is because the value encoded in the system register is less than N/2 by Eq. (30) when j and ℓ have the same parity. Consequently, the MSB of sys is |0⟩, making the last cnot inactive. The controlled-ushift operation, with ushift given in Eq. (8), maps the unitary implemented by select to the single-level QWT W as in Eq. (9). The rotation gate R is used for the amplitude amplification A given in Eq. (27). The bottom circuit follows from the top circuit. The amplitude amplification A′ is unitarily equivalent to A.
FIG. 4. ... As the initial vector is a particular vector, the rotations represented by white boxes do not affect the vector. (d) Quantum circuit for linprep using rotation gates, the increment gate denoted by +1 and its inverse denoted by −1. The gate +1 (−1) is applied before (after) each rotation gate R_ℓ with even ℓ, as in dotted boxes.
We now describe an approach for implementing the prep operation in Eq. (12). This operation prepares the state with square-root coefficients, i.e., the state |ψ⟩ := Σ_ℓ √p_ℓ |ℓ⟩ with p_ℓ := |h_ℓ|/h. To prepare this state, first we prepare the uniform superposition state (1/√M) Σ_ℓ |ℓ⟩ and then apply the uniformly controlled rotation [21] that performs the map in Eq. (38). The output state after this operation is sin α |0⟩|ψ⟩ + cos α |⊥⟩ with the success amplitude sin α := 1/√M. As per the discussion in §II C, the state |ψ⟩ is achieved using one extra qubit and Θ(√M) rounds of amplitude amplification. We remark that the same approach can be used to implement unprep in Eq. (12).

III. MULTI-LEVEL AND PACKET QWT
We now use our implementation for the single-level QWT as a subroutine and construct quantum algorithms for multi-level and packet QWTs. To this end, we use the recursive decomposition of the d-level wavelet transform in Eq. (39). This decomposition follows from the notion of multi-level wavelet transform: at each level, the transformation is only applied on the low-frequency component (i.e., the top part) of the column vector it acts on. The wavelet packet transform, however, acts on both the low- and high-frequency components, so we have the decomposition in Eq. (40) for the wavelet packet transform. Equation (39) yields a decomposition of the d-level QWT into |0^s⟩-controlled unitary operations, for any s ∈ {1, ..., d − 1}. Similarly, Eq. (40) yields the decomposition for the d-level wavelet packet transform. These decompositions give a simple procedure for implementing a multi-level and packet QWT, shown by the quantum circuits in FIG. 5. The multi-level packet QWT is constructed from single-level QWTs that can be implemented by the method described in §II. In contrast, the multi-level QWT is constructed from multi-controlled single-level QWTs. As in FIG. 5(c), we break down these multi-controlled operations in terms of multi-bit Toffoli gates and controlled single-level QWTs. We discuss an implementation of a multi-bit Toffoli gate in §IV A and a controlled single-level QWT in §IV C, where we analyze the complexities of these operations.
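The difference between the two recursions can be made concrete with a small classical sketch. It assumes the Haar kernel (M = 2) as the single-level transform and the standard recursive structure: the multi-level transform applies a smaller kernel to the top block only, while the packet transform applies it to every block. The function names are hypothetical and the code is an illustration, not the quantum circuit itself.

```python
import numpy as np

def haar_kernel(n):
    # Single-level Haar kernel on n qubits: averages on top, differences below.
    N = 2 ** n
    W = np.zeros((N, N))
    for i in range(N // 2):
        W[i, 2 * i:2 * i + 2] = [1.0, 1.0]
        W[N // 2 + i, 2 * i:2 * i + 2] = [1.0, -1.0]
    return W / np.sqrt(2.0)

def multilevel(kernel, n, d):
    # d-level transform: level s acts only on the top 2^(n-s) components.
    N = 2 ** n
    T = np.eye(N)
    for s in range(d):
        step = np.eye(N)
        size = N >> s
        step[:size, :size] = kernel(n - s)
        T = step @ T
    return T

def packet(kernel, n, d):
    # d-level packet transform: level s acts on all 2^s blocks of size 2^(n-s).
    T = np.eye(2 ** n)
    for s in range(d):
        T = np.kron(np.eye(2 ** s), kernel(n - s)) @ T
    return T

n, d = 3, 3
Wm = multilevel(haar_kernel, n, d)
Wp = packet(haar_kernel, n, d)
print(np.round(Wm @ np.ones(2 ** n), 6))
```

In the multi-level case the step at level s is the identity outside the top block, which corresponds to a single-level transform controlled on the leading s qubits being |0^s⟩; in the packet case the Kronecker product with an identity means no control is needed.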

IV. COMPLEXITY ANALYSIS
In this section, we analyze the computational cost of executing single-level, multi-level and packet QWTs, thereby establishing Theorems 1-3. We begin by analyzing the computational cost of key subroutines in our algorithms in §IV A. We then build upon them and provide cost analysis for the single-level QWT in §IV B and for the multi-level and packet QWTs in §IV C.
In our cost analysis and in implementing the key operations, we use ancilla and "borrowed" qubits. In contrast to an ancilla qubit that starts from |0⟩ and returns to |0⟩, a borrowed qubit can start from any state and will return to its original state. The purpose of using borrowed qubits is that they enable simple implementation for complex multi-qubit operations. Our algorithm has a sufficient number of qubits on which a given key operation does not act, and we use these as borrowed qubits in implementing such operations.

A. Complexity of key subroutines
Here we analyze the cost of key subroutines used in our algorithm for a single-level QWT: prep, select and ushift, the latter of which adds a classically known constant value to the value encoded in a quantum register. We also analyze the cost of implementing a multi-qubit reflection, an operation used in the amplitude amplification part of our algorithm.
For simplicity of cost analysis, we state the cost of each key subroutine in a lemma and proceed with analyzing the cost in the proof. We begin with a lemma stating the cost of executing a multi-bit Toffoli gate, an operation that is frequently used in our algorithm and provides an implementation for the multi-qubit reflection about the all-zero state.
Lemma 1. The (m + 1)-bit Toffoli gate with m ≥ 3 can be implemented using either m − 2 borrowed qubits and O(m) toffoli gates, or one borrowed qubit and O(m) toffoli and elementary one- or two-qubit gates.
Proof. The implementation based on m − 2 borrowed qubits follows from Gidney's method [22] for implementing a multi-bit Toffoli gate and the one using one borrowed qubit follows by the method given in Ref. [23, Corollary 7.4] and also in Ref. [24]. Notice that the gate cost of the two methods scales similarly, but one uses only a single borrowed qubit. However, we sometimes use the method with m − 2 borrowed qubits due to its simplicity in implementing a multi-bit Toffoli and the availability of a sufficient number of qubits in our algorithm that can be borrowed.
We proceed with the cost of select in the following lemma.
Lemma 2. select in Eq. (13) can be executed using one ancilla and one borrowed qubit, two Hadamard and O(n) not, cnot and toffoli gates.
Proof. By FIG. 3, select is composed of one controlled-Z gate, three cnot gates, one controlled-sub, one controlled-add and n − 1 swap gates. The controlled-Z gate can be executed using two Hadamard gates and one cnot, and each swap can be executed using three cnots. By the compilation given in Ref. [25], the add itself can be implemented using one ancilla qubit and O(n) not, cnot and toffoli gates. Hence the controlled-add can be compiled using O(n) cnot, toffoli and four-bit Toffoli gates, the latter of which can be implemented using one borrowed qubit and four toffoli gates by Lemma 1.
In the next lemma, we show that the m-qubit reflection R_m about the all-zero state |0^m⟩ can be implemented using m − 2 borrowed qubits.
Proof. Using the phase kickback trick and one ancilla qubit, we can implement R_m on any m-qubit state |ψ⟩, up to an irrelevant global −1 phase factor, by means of Λ^m_1(X), the (m + 1)-bit Toffoli gate. The lemma then follows by Gidney's method [22] for implementing the (m + 1)-bit Toffoli using O(m) toffoli gates and m − 2 borrowed qubits.
We remark that the (m + 1)-bit Toffoli can be implemented using only one borrowed qubit and O(m) toffoli and elementary one- or two-qubit gates by Lemma 1. However, we use the method with m − 2 borrowed qubits due to its simplicity in implementing a multi-bit Toffoli and the availability of a sufficient number of qubits in our algorithm that can be borrowed.
The following lemma states the cost of adding a known classical value to a quantum register. We use a controlled version of this operation in our algorithm, the cost of which is stated in the following corollary.
Lemma 4. Adding a classically known m-bit constant to an n-qubit register with m < n can be achieved using m + 1 ancilla qubits and O(m) not, cnot and toffoli gates.
Proof. First, prepare m ancillae in the computational state that encodes the m-bit constant. This preparation can be achieved by applying at most m not gates. Then add this state to the state of the n-qubit register by the add operation in Eq. (31). By m < n and the compilation given in Ref. [25], add can be implemented by one ancilla qubit and O(m) not, cnot and toffoli gates.
The computational cost reported in Lemma 4 is indeed the cost of executing ushift in Eq. (8). We use a controlled version of this operation as in the circuit shown in FIG. 3. Because of the toffoli gate in Lemma 4, the controlled-ushift requires implementing a four-bit Toffoli gate, an operation that can be implemented using one borrowed qubit and four toffoli gates by Lemma 1. Therefore, we have the following cost for the controlled-ushift as a corollary of Lemma 4 and Lemma 1.

Corollary 4. The controlled-ushift operation can be executed by one borrowed qubit, m + 1 ancilla qubits, and O(m) cnot and toffoli gates.
The final lemma states the cost of the prep operation. We remark that the cost of this operation is independent of $n$, as prep generates a quantum state on a number of ancilla qubits that depends on the wavelet order $M$.

Lemma 5. linprep in Eq. (22) can be executed using $O(M \log_2 M)$ elementary gates and $\lceil \log_2 M \rceil$ borrowed qubits. prep and unprep in Eq. (12) can be executed using $O(M^{3/2})$ elementary gates and one ancilla qubit.
Proof. The linprep can be implemented using $O(M)$ rotation gates and $O(M)$ increment and inverse-increment gates by the procedure given in §II E. The increment gate on $m = \lceil \log_2 M \rceil$ qubits can be implemented using $m$ borrowed qubits and $O(m)$ elementary gates [26], so the overall gate cost of linprep is $O(M \log_2 M)$. As per §II E, the prep and unprep operations can be implemented by preparing the uniform superposition state on $m$ qubits, applying the uniformly controlled rotation in Eq. (38), and $O(\sqrt{M})$ rounds of amplitude amplification. The uniform superposition state is prepared by $m$ Hadamard gates, and the uniformly controlled rotation can be implemented by $O(M)$ cnot and rotation gates [21]. Therefore, the overall gate cost of prep and unprep is $O(M^{3/2})$.
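The two counts in this proof can be written out in one line each. The tally below is a sketch: it assumes that every amplitude-amplification round uses a constant number of calls to the preparation circuit plus reflections of cost $O(m)$, and the notation $G(\cdot)$ for gate cost follows the usage in the next subsection.

```latex
\begin{align*}
G(\text{linprep}) &= O(M)\ \text{rotations} \;+\; O(M) \times O(m)\ \text{(increments)}
                   = O(M \log_2 M), \\
G(\text{prep})    &= O(\sqrt{M}) \times \bigl[\, m + O(M) + O(m) \,\bigr]
                   = O(M^{3/2}),
\qquad m = \lceil \log_2 M \rceil .
\end{align*}
```

The $O(M)$ cost of the uniformly controlled rotation dominates each round, so the number of rounds is what raises the exponent from $1$ to $3/2$.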

B. Complexity of single-level QWT
We now build upon the computational cost of the key subroutines analyzed in the previous section to obtain the computational cost of executing a single-level QWT. To this end, we mainly use Eq. (27) and Eq. (28). By these equations, a single-level QWT is achieved by performing three rotation gates and
• Two pqwt and one pqwt†, which by Eq. (20) require performing two select and one select†; two prep and one prep†; two unprep and one unprep†; and one controlled-ushift;
• Two $(m+1)$-qubit reflections $R_{m+1}$.
Therefore, by Lemmas 2, 3, 5 and Corollary 4, the gate cost G(1qwt) for executing a single-level QWT is where $m = \lceil \log_2 M \rceil$ in our application; $M$ is the wavelet order. The number of ancilla qubits used is $m+1$: $m$ ancillae are used for the state-preparation step, and one extra ancilla is the parity qubit par, which is also used in the amplitude amplification step.
We remark that the borrowed qubits used in executing the prep, select, controlled-ushift and reflection operations in Lemmas 2-5 are borrowed from the portion of the quantum registers on which these operations do not act. For instance, the $m-2$ borrowed qubits in Lemma 3 for executing the $m$-qubit reflection $R_m$ could be any $m-2$ qubits of the $n$-qubit register on which $R_m$ does not act. For select, the borrowed qubit is needed to implement the four-bit Toffoli gate (see the proof of Lemma 2), and this qubit could be any qubit in the circuit on which the four-bit Toffoli gate does not act. We also remark that the $m+1$ ancilla qubits needed in Corollary 4 for controlled-ushift are the qubits of the single-qubit par register and the $m$-qubit anc register. This operation is executed after the amplitude amplification (see FIG. 3), when par and anc are in the all-zero state. Putting it all together, the overall gate cost for implementing the single-level QWT is $O(n) + O(M^{3/2})$ and the number of ancilla qubits is $\lceil \log_2 M \rceil + 1$. This is the computational cost reported in Theorem 1.
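For orientation, the subroutine counts listed at the start of this subsection can be summed as follows. This is a hedged tally rather than the paper's own expression: it treats an operation and its inverse as having the same cost and assumes that Lemma 2 (not reproduced here) gives $G(\text{select}) = O(n)$, which is the only $n$-dependent term consistent with the stated total.

```latex
\begin{align*}
G(\text{1qwt})
  &= 3\,G(\text{select}) + 3\,G(\text{prep}) + 3\,G(\text{unprep})
     + G(\text{controlled-ushift}) + 2\,G(R_{m+1}) + 3 \\
  &= O(n) + O(M^{3/2}) + O(M^{3/2}) + O(m) + O(m) + O(1) \\
  &= O(n) + O(M^{3/2}), \qquad m = \lceil \log_2 M \rceil .
\end{align*}
```

The $O(m)$ terms from Corollary 4 and Lemma 3 are absorbed because $m = \lceil \log_2 M \rceil$ grows more slowly than $M^{3/2}$.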

C. Complexity of multi-level and packet QWTs
Here we analyze the complexity of implementing the $d$-level and packet QWTs, thereby establishing Theorem 2 and Theorem 3. By FIG. 5, a multi-level QWT is implemented by multiply-controlled single-level QWTs. Our strategy is to break down each multiply-controlled unitary in terms of multi-bit Toffoli gates and a single-controlled unitary. We then use a compilation for a controlled single-level QWT and an ancilla-friendly compilation for multi-bit Toffoli gates to achieve an efficient yet ancilla-friendly implementation of a multi-level QWT. The packet QWT, however, is achieved by a sequence of single-level QWTs without control qubits, as shown in FIG. 5.
Before describing the specifics of our implementation strategy, we first state the complexity of the $|1\rangle$-controlled single-level QWT in the following lemma. We then build upon this complexity to establish the complexity of the multi-level QWT.

Lemma 6. The controlled single-level QWT on $n$ qubits, associated with a wavelet of order $M$, can be achieved using $\lceil \log_2 M \rceil + 2$ ancilla qubits and $O(n) + O(M^{3/2})$ elementary gates.
Proof. By the circuit in FIG. 3, a controlled single-level QWT requires performing double-controlled-sub, -add and -ushift operations, and single-controlled prep and unprep operations. Each cnot is transformed to a toffoli, each swap is transformed to three toffoli gates, and R is transformed to controlled-R. A double-controlled operation can be reduced to a single-controlled operation using two toffoli gates and one ancilla qubit. By the discussion in the proof of Lemma 2, the controlled-add (-sub) can be compiled using $O(n)$ cnot and toffoli gates. The other ancilla qubits are the $m$ qubits used for state preparation and the parity qubit. Together with Corollary 4, this proves the lemma.
We now proceed with the complexity of the $d$-level QWT. Let the integer $s$, with $1 \le s \le d$, represent the level of a QWT. Then for the level $s = r + 1$ we need to implement $|0^r\rangle$-controlled-$W_{n-r}$, where $W_{n-r}$ is the single-level QWT on $n - r$ qubits. For simplicity of cost analysis, we map all $|0\rangle$-controlled operations in FIG. 5(b) to $|1\rangle$-controlled operations; this can be achieved by $2(d-1)$ not gates for the $d$-level QWT, as in FIG. 5(c). For $r \ge 2$, we implement $|1^r\rangle$-controlled-$W_{n-r}$ by a single ancilla qubit, two $(r+1)$-bit Toffoli gates and one controlled-$W_{n-r}$, as shown in FIG. 5(c). Notice that $s = 1$ corresponds to a single-level QWT on $n$ qubits and $s = 2$ corresponds to a controlled single-level QWT on $n - 1$ qubits.
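The reduction of an $r$-fold controlled unitary to two multi-bit Toffoli gates, one ancilla and one singly controlled unitary can be verified on small matrices. The sketch below is an assumption-laden illustration: the qubit ordering (controls, then ancilla, then target), the dense-matrix construction and the function names are choices made here, and a random unitary stands in for $W_{n-r}$.

```python
import numpy as np

def multi_controlled_via_ancilla(w, r):
    """Toffoli(controls -> ancilla) . controlled-w(ancilla) . Toffoli, as a dense matrix.

    Qubit order, most to least significant: r controls, 1 ancilla, target register.
    """
    t = w.shape[0]
    c = 2 ** r
    # (r+1)-bit Toffoli: flip the ancilla iff all r controls are 1.
    tof = np.eye(c * 2 * t)
    base = (c - 1) * 2 * t
    idx0 = np.arange(base, base + t)       # controls = 1^r, ancilla = 0
    idx1 = idx0 + t                        # controls = 1^r, ancilla = 1
    tof[np.concatenate([idx0, idx1])] = tof[np.concatenate([idx1, idx0])]
    # w on the target, controlled by the ancilla only.
    block = np.eye(2 * t, dtype=complex)
    block[t:, t:] = w
    cw = np.kron(np.eye(c), block)
    return tof @ cw @ tof

def ancilla_zero_block(full, r, t):
    """Restrict `full` to inputs and outputs with the ancilla in |0>."""
    keep = np.array([k * 2 * t + x for k in range(2 ** r) for x in range(t)])
    return full[np.ix_(keep, keep)]
```

On the ancilla-zero subspace the product equals the directly built $|1^r\rangle$-controlled unitary, and no amplitude leaks to the ancilla-one subspace, which is why the ancilla can be reused at the next level.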
The gate cost for the controlled single-level QWT on $n - r$ qubits is $O(n - r)$ by Lemma 6, disregarding the cost with respect to $M$, and the gate cost for the $(r+1)$-bit Toffoli gate is $O(r)$ by Lemma 1. Hence the gate cost for each level, including the first and second levels, is $O(n)$. We also have an additional gate cost of $O(M^{3/2})$ for each level associated with the cost of implementing prep and unprep. We remark that only a single ancilla qubit is used for all levels; the ancilla qubit starts and ends in $|0\rangle$ for each level to be reused in the next level, as illustrated in FIG. 5(c). Putting it all together, we arrive at the computational cost stated in Theorem 2 for a $d$-level QWT.
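The per-level argument can be summarized in a single display. This is a sketch of the bookkeeping only; the precise statement is that of Theorem 2, which is not reproduced here, and the summed form $O(dn) + O(d\,M^{3/2})$ is simply the direct sum of the per-level costs quoted above.

```latex
\[
  \underbrace{O(n-r)}_{\text{controlled-}W_{n-r}}
  + \underbrace{2\,O(r)}_{\text{two }(r+1)\text{-bit Toffoli gates}}
  + O(M^{3/2})
  \;=\; O(n) + O(M^{3/2})
  \quad\Longrightarrow\quad
  \sum_{s=1}^{d} \bigl[\, O(n) + O(M^{3/2}) \,\bigr]
  \;=\; O(dn) + O(d\,M^{3/2}).
\]
```

The first equality uses $0 \le r \le d - 1 \le n - 1$, so the Toffoli cost gained at deeper levels is offset by the shrinking register size.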
Because the packet QWT does not have multi-controlled operations (see FIG. 5(a)), its gate cost simply follows from the cost of the single-level QWT. The single-level QWT acts on $n - r$ qubits at level $s = r + 1$ and has the gate cost $O(n - r)$ by Theorem 1. The gate cost for all levels $1 \le s \le d$ is therefore $O(dn - d(d-1)/2)$. We also have an additional gate cost of $O(M^{3/2})$ for each level associated with the cost of implementing prep and unprep, yielding the overall gate cost stated in Theorem 3. We note that the packet QWT does not need the extra ancilla qubit used in the multi-level QWT for implementing the multi-controlled operations.
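The closed form $dn - d(d-1)/2$ is just the arithmetic series $\sum_{r=0}^{d-1}(n-r)$. A minimal check, with the hidden constant in each $O(n-r)$ term set to 1 purely for illustration and with hypothetical function names:

```python
def packet_qubit_counts(n, d):
    """Number of qubits the single-level QWT acts on at levels s = 1, ..., d."""
    return [n - r for r in range(d)]

def packet_n_dependent_cost(n, d):
    """Sum of the per-level (n - r) terms, with the big-O constant set to 1."""
    return sum(packet_qubit_counts(n, d))

def closed_form(n, d):
    """The expression d*n - d*(d - 1)/2 quoted in the text."""
    return d * n - d * (d - 1) // 2
```

For example, with $n = 10$ and $d = 4$ the levels act on 10, 9, 8 and 7 qubits, which sums to 34 in agreement with $4 \cdot 10 - 6$.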

V. DISCUSSION AND CONCLUSION
Wavelets and their associated transforms have been extensively used in classical computing. The basis functions of wavelet transforms have features that make such transforms advantageous for numerous applications over their established counterpart, the Fourier transform. However, prior works on developing a quantum analog for wavelet transforms were limited to a few representative cases. In this paper, we presented quantum algorithms for executing any wavelet transform and a generalized version, the wavelet packet transform, on a quantum computer; the algorithms work for any wavelet of any order. We have established the computational complexity of our algorithms in terms of three parameters involved in wavelet transforms: the wavelet order $M$, the level $d$ of the wavelet transform, and the number of qubits $n = \log_2 N$ the QWT acts on, with $N$ the dimension of the kernel matrix associated with the wavelet transform.
The core idea of our approach is to express the kernel matrix as a linear combination of $M$ unitary operations that are simple to implement on a quantum computer and use the LCU technique to construct a probabilistic procedure for implementing the desired QWT. We then make the implementation deterministic using the known success probability of the probabilistic procedure by only a few (two or three) rounds of amplitude amplification. The gate cost of our algorithm for the single-level QWT scales optimally with $n$, the number of qubits, for the case that the wavelet order $M$ is constant. Indeed, the order parameter used in practical applications is constant, typically in the range of $2 \le M \le 20$ [3,27,28]. We also demonstrated that the wavelet filter coefficients become negligibly small for larger values of the wavelet order, making the cost of our algorithms effectively independent of $M$ for practical applications. In contrast, the transformation level $d$ scales linearly with the number of qubits, or $\log_2 N$, for practical applications. Because the value of $d$ is upper-bounded by $n$, the gate cost of multi-level and packet QWTs scales as $O(n^2)$ in the worst case. Even for the worst case, our algorithm improves the gate cost of prior works on the second- and fourth-order Daubechies QWT from $O(n^3)$ to $O(n^2)$.
We remark that our approach requires a number of ancilla qubits that scales as $\log_2 M$ with the wavelet order. The number of ancilla qubits would be a small constant number considering the range of wavelet order or the magnitude of wavelet coefficients in practical applications. A potential area for further exploration is constructing ancilla-free quantum algorithms for all QWTs. Constructing such algorithms would be valuable for early fault-tolerant quantum computers with limited qubits and is plausible because QWTs are unitary transformations. More importantly, a primary area for future research is exploring the opportunities offered by quantum wavelet transforms in quantum algorithms, particularly in simulating quantum systems [6,7,12] and image processing [29][30][31], where wavelet transforms could be advantageous over the established Fourier transform.

FIG. 1. Visualization of (Left) the multi-level wavelet transform, (Middle) the packet wavelet transform, and (Right) the action of the quantum perfect shuffle transform in Eq. (32) on three-qubit basis states. Three-level wavelet transforms are shown for simplicity. In the first level, the size-$N$ vector $\psi$ is partitioned into two size-$N/2$ vectors: an average vector $a_1 = H\psi$ and a difference vector $d_1 = G\psi$, with $H$ and $G$ defined in Eq. (2). For the quantum (packet) wavelet transform, the components of $\psi$ are amplitudes of a quantum state. The wavelet transform is recursively applied to the average vector in the multi-level wavelet transform. In contrast, the packet transform applies the wavelet transform to both the average and difference vectors.
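The difference between the two recursions described in this caption can be illustrated classically. The sketch below uses the Haar filters (order $M = 2$) as a stand-in for the general $H$ and $G$ of Eq. (2); the output ordering and the function names are assumptions made for illustration, and the input length must be divisible by $2^{\text{levels}}$.

```python
import numpy as np

def haar_step(v):
    """One level: averages a = Hv and differences d = Gv for the Haar filters."""
    a = (v[0::2] + v[1::2]) / np.sqrt(2)
    d = (v[0::2] - v[1::2]) / np.sqrt(2)
    return a, d

def multilevel(v, levels):
    """Recurse on the average part only; returns [a_L, d_L, ..., d_1] concatenated."""
    details = []
    a = np.asarray(v, dtype=float)
    for _ in range(levels):
        a, d = haar_step(a)
        details.append(d)
    return np.concatenate([a] + details[::-1])

def packet(v, levels):
    """Recurse on both the average and the difference parts at every level."""
    blocks = [np.asarray(v, dtype=float)]
    for _ in range(levels):
        blocks = [part for b in blocks for part in haar_step(b)]
    return np.concatenate(blocks)
```

Both functions preserve the Euclidean norm of the input, as expected of orthogonal transforms, and they coincide for a single level; they differ from the second level on, where `packet` also splits the difference vectors.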

FIG. 4. (a) Diagrammatic representation of the procedure producing the wavelet filters for a wavelet of order $M$ from a particular initial vector by a set of $M/2$ rotations; $M = 6$ is illustrated here. (b) Zero padding for cases where $M$ is not a power of two. (c) Rotations can be applied in parallel. The right (left) arrow represents shifting elements of the vector one place to the right (left). As the initial vector is a particular vector,

FIG. 5. Quantum circuits for (a) the $d$-level packet QWT and (b) the $d$-level QWT using single-level QWTs; $d = 4$ is illustrated here. (c) An implementation of the multi-controlled single-level QWTs needed for the multi-level QWT in (b) using multi-bit Toffoli gates, a controlled single-level QWT and one ancilla qubit that starts and ends in the $|0\rangle$ state.

Let $W^{(d)}_n$ denote the $d$-level wavelet transform of size $2^n \times 2^n$ and let $P^{(d)}_n$ denote the $d$-level wavelet packet transform of the same size. Also let $W^{(1)}_n = W_n$ for notational simplicity. The $d$-level wavelet transform can be recursively decomposed as [7, Appendix A]

implemented by either of the following computational resources: (I) $m-2$ borrowed qubits and $O(m)$ toffoli gates, or (II) one borrowed qubit and $O(m)$ toffoli and elementary one- or two-qubit gates.