A Grover-search based quantum learning scheme for classification

Yuxuan Du; Min-Hsiu Hsieh; Tongliang Liu; Dacheng Tao

doi:10.1088/1367-2630/abdefa

1. Introduction

The field of machine learning has achieved remarkable success in computer vision, natural language processing, and data mining [1]. Recently, an increasing interest from the physics community to use machine learning methods to solve complicated physics problems, e.g. classifying phases of matter and simulating quantum systems [2–4], has emerged. Besides the revolutionary influence of machine learning to the physics world, another uprising field that tightly binds machine learning with physics is quantum machine learning whose goal is to solve specific tasks beyond the reach of classical computers [5].

To better understand how quantum computing facilitates the machine learning tasks, devising quantum algorithms that have the ability to solve fundamental machine learning problems with quantum advantages is desirable [5]. For example, the proposed quantum linear systems algorithm (a.k.a., HHL algorithm) enables the linear equations to be solved with the exponential speedup over its classical counterparts [6]. By employing HHL algorithm as the subroutine, many quantum machine learning algorithms with exponential quantum speedup have been proposed, e.g. the quantum principal component analysis [7], quantum singular value decomposition [8], quantum non-negative matrix factorization [9], and the quantum regression [10]. However, those proposed quantum algorithms that possess fabulous quantum advantages can only be executed on a fault-tolerant quantum computer by using the quantum random access memory [6], which is still a rather distant dream.

When approaching the noisy intermediate-scale quantum (NISQ) era, it is intrigued to explore whether there exists any quantum algorithm that can not only solve fundamental learning problems with promised quantum advantages but can also be efficiently implemented on near-term quantum devices [11]. To achieve this goal, one of the most likely solutions is the quantum neural network (QNN), which is also called as variational quantum algorithms [12–14]. Concretely, QNN is composed of a variational quantum circuit to prepare quantum states and a classical controller to perform optimization tasks [13, 15]. Partial evidence to support this claim is the theoretical result that the probability distribution generated by the variational quantum circuit used in QNN can not be efficiently simulated by classical computers [16–18]. Driven by the strong expressive power of quantum circuits and the similar work philosophy between QNN and the classical deep neural network (DNN), its natural to exploit whether QNN can be realized on near-term quantum computers to accomplish certain machine learning tasks with better performance over classical learning algorithms.

A central application of QNN, analogous to DNN, is tackling classification tasks [1]. Many real-world problems can be categorized into the classifying scenario, e.g. the recognization of hand-written digits, the characterization of different creatures, and the discrimination of quantum states. For binary classification, given a dataset

$\begin{equation}\hat{\mathcal{D}}={\left\{\left({\boldsymbol{x}}_{i},{y}_{i}\right)\right\}}_{i=0}^{N-1}\in \left({\mathbb{R}}^{N{\times}M},{\left\{0,1\right\}}^{N}\right),\end{equation} \tag{ 1 }$

with N examples and M features in each example, QNN aims to learn a decision rule f_θ(⋅) that correctly predicts the label of the given dataset $\hat{\mathcal{D}}$ , i.e.

$\begin{equation}\underset{\boldsymbol{\theta }}{\mathrm{min}}\sum\limits _{i=0}^{N-1}{\mathbb{1}}_{{y}_{i}\ne {f}_{\boldsymbol{\theta }}\left({\boldsymbol{x}}_{i}\right)},\end{equation} \tag{ 2 }$

where θ refers to the trainable parameters and ${\mathbb{1}}_{z}$ is the indicator function that takes the value 1 if the condition z is satisfied and zero otherwise. Recently, QNNs with varied quantum circuit architectures and optimization methods have been proposed to accomplish the aforementioned classification tasks. In particular, the references [19–21] have devised the amplitude encoding based QNN to classify the Iris dataset and the hand-written digits image dataset; the references [22–24] have developed the kernel-based QNN to accomplish the synthetic datasets; and the references [25] have proposed the convolution based QNN to tackle quantum state discrimination tasks. When no confusion can arise, we use the quantum classifier in the rest of the study to specify QNNs that are used to accomplish classification tasks defined in equation (2).

Despite the promising heuristic results mentioned above, very few studies have theoretically explored the power of quantum classifiers. A noticeable theoretical result about quantum classifiers is the trade-off between the computational cost (i.e. the number of measurements) and the training performance indicated by [13]. Denote $\mathcal{L}\left({\boldsymbol{\theta }}^{\left(t\right)},\boldsymbol{z}\right)$ as the loss function employed in quantum classifiers, where θ ^(t) refers to the trainable parameters at the tth iteration and $\boldsymbol{z}={\left\{{\boldsymbol{z}}_{j}\right\}}_{j=1}^{N}$ is the given dataset with in total N samples. As shown in figure 1, when the batch gradient descent method is employed to optimize the loss function $\mathcal{L}$ , the updating rule of the trainable parameters follows

$\begin{equation}{\boldsymbol{\theta }}^{\left(t+1\right)}={\boldsymbol{\theta }}^{\left(t\right)}-\frac{\eta }{B}\sum\limits _{i=1}^{B}\nabla \mathcal{L}\left({\boldsymbol{\theta }}^{\left(t\right)},{\mathcal{B}}_{i}\right),\end{equation} \tag{ 3 }$

where η is the learning rate, ${\mathcal{B}}_{i}$ refers to the ith batch with ${\cup }_{i=1}^{B}{\mathcal{B}}_{i}=\boldsymbol{z}$ and ${\mathcal{B}}_{i}\cap {\mathcal{B}}_{j}=\varnothing$ , and B denotes the number of batches. Define

$\begin{equation}{R}_{1}=\mathbb{E}\left[{\Vert}{\nabla }_{\boldsymbol{\theta }}\mathcal{L}\left({\boldsymbol{\theta }}^{\left(t\right)}\right){{\Vert}}^{2}\right].\end{equation} \tag{ 4 }$

as the utility measure that evaluates the distance between the optimized result and the stationary point in the optimization landscape. The following theorem summarizes the utility bound R₁ of quantum classifiers.

**Figure 1.** The protocol of the batch gradient descent method. The left panel corresponds to the setting as B = N, where the N training examples ${\left\{{\boldsymbol{x}}_{i}\right\}}_{i=1}^{N}$ are iteratively fed into the variational quantum circuits to acquire the gradients that estimate ${\left\{\nabla \mathcal{L}\left({\boldsymbol{\theta }}^{\left(t\right)},{\boldsymbol{x}}_{i}\right)\right\}}_{i=1}^{N}$ . The right panel exhibits the implementation of the quantum classifier when B = 1. Specifically, a superposition state $\left\vert \phi \left(\boldsymbol{x}\right)\right\rangle =\frac{1}{\sqrt{N}}{\sum }_{i=1}^{N}{\left\vert h\left({\boldsymbol{x}}_{i}\right)\right\rangle }_{F}{\left\vert i\right\rangle }_{I}$ is prepared, where h(⋅) corresponds to the employed encoding method and the subscripts 'I' and 'F' refer to the index and feature registers, respectively. Given access to $\left\vert \phi \left(\boldsymbol{x}\right)\right\rangle$ , the trainable quantum circuit U_L( θ ) is employed to interact with its feature register subscripted with F.
Download figure:
Standard image High-resolution image

**Figure 1.** The protocol of the batch gradient descent method. The left panel corresponds to the setting as B = N, where the N training examples ${\left\{{\boldsymbol{x}}_{i}\right\}}_{i=1}^{N}$ are iteratively fed into the variational quantum circuits to acquire the gradients that estimate ${\left\{\nabla \mathcal{L}\left({\boldsymbol{\theta }}^{\left(t\right)},{\boldsymbol{x}}_{i}\right)\right\}}_{i=1}^{N}$ . The right panel exhibits the implementation of the quantum classifier when B = 1. Specifically, a superposition state $\left\vert \phi \left(\boldsymbol{x}\right)\right\rangle =\frac{1}{\sqrt{N}}{\sum }_{i=1}^{N}{\left\vert h\left({\boldsymbol{x}}_{i}\right)\right\rangle }_{F}{\left\vert i\right\rangle }_{I}$ is prepared, where h(⋅) corresponds to the employed encoding method and the subscripts 'I' and 'F' refer to the index and feature registers, respectively. Given access to $\left\vert \phi \left(\boldsymbol{x}\right)\right\rangle$ , the trainable quantum circuit U_L( θ ) is employed to interact with its feature register subscripted with F.
Download figure:
Standard image High-resolution image

Theorem 1 (Modified from theorem 1 of [13]). Quantum classifiers under the depolarization noise setting output ${\boldsymbol{\theta }}^{\left(T\right)}\in {\mathbb{R}}^{d}$ after T iterations with the utility bound

$\begin{equation*}{R}_{1}{\leqslant}\tilde {O}\left(\mathrm{p}\mathrm{o}\mathrm{l}\mathrm{y}\left(\frac{d}{T{\left(1-p\right)}^{{L}_{\mathrm{Q}}}},\frac{d}{BM{\left(1-p\right)}^{{L}_{\mathrm{Q}}}},\frac{d}{{\left(1-p\right)}^{{L}_{\mathrm{Q}}}}\right)\right),\end{equation*}$

where M is the number of measurements to estimate the quantum expectation value, L_Q is the circuit depth of variational quantum circuits, p is the rate of the depolarization noise, and B is the number of batches.

The result of theorem 1 indicates that a larger number of batches B ensures a better utility bound R₁, while the price to pay is increasing the total number of measurements. For example, when B = N, we have ${\mathcal{B}}_{i}={\boldsymbol{z}}_{i}$ for ∀i ∈ [N] and each sample z _j is sequentially fed into variational quantum circuits to acquire $\nabla \bar{\mathcal{L}}\left(\boldsymbol{\theta },{\boldsymbol{z}}_{i}\right)$ that estimates $\nabla \mathcal{L}\left(\boldsymbol{\theta },{\boldsymbol{z}}_{i}\right)$ . Once the set ${\left\{\nabla \mathcal{L}\left(\boldsymbol{\theta },{\boldsymbol{z}}_{i}\right)\right\}}_{i=1}^{N}$ is collected, the gradients $\nabla \mathcal{L}\left(\boldsymbol{\theta },\boldsymbol{z}\right)$ can be estimated by $\frac{1}{N}{\sum }_{i=1}^{N}\nabla \bar{\mathcal{L}}\left(\boldsymbol{\theta },{\boldsymbol{z}}_{i}\right)$ . Suppose that the required number of measurements to estimate the derivative of the jth parameter θ _j, i.e. ${\nabla }_{j}\mathcal{L}\left(\boldsymbol{\theta },{\boldsymbol{z}}_{i}\right)=\frac{\partial \mathcal{L}\left(\boldsymbol{\theta },{\boldsymbol{z}}_{i}\right)}{\partial {\boldsymbol{\theta }}_{j}}$ , is M, then the total number of measurements to acquire $\frac{1}{N}{\sum }_{i=1}^{N}{\nabla }_{j}\bar{\mathcal{L}}\left(\boldsymbol{\theta },{\boldsymbol{z}}_{i}\right)$ is NM. Therefore, the estimation of $\nabla \mathcal{L}\left(\boldsymbol{\theta },\boldsymbol{z}\right)$ , which includes d parameters, requires NMd measurements. Such a cost becomes unaffordable for large N. However, the trade-off between the utility R₁ and the computational efficiency caused by the varied number of batches B is not considered in previous quantum classifiers, where most of them only focused on the setting B = N. How to design a quantum classifier that can attain a good utility R₁ with a low computational cost is unknown.

Another theoretical issue toward quantum classifiers is that none of the previous results have explored their potential advantages compared with classical counterparts. This questions the necessity of employing quantum classifiers because no benefit can be offered. Under the above observations, it is highly desirable to develop a quantum classifier that can not only achieve a good utility R₁ using a low computational cost, but can also possess certain quantum advantages compared with classical classifiers.

Here we devise a Grover-search based learning scheme (GBLS) to address the above two issues under the NISQ setting. Our proposal has the following advantages. First, GBLS is a flexible and effective learning scheme, which enables the optimization of different quantum classifiers with a varied number of batches B. Note that the choice of the encoding methods and the variational ansatz used in GBLS is very flexible, which covers a wide range of the proposed quantum classifiers [20–24]. Moreover, the Grover-search based machinery is only required in the training process, and the prediction of the new input is completed by only using the optimized variational quantum circuits, which ensures its efficacy. Second, we prove that the query complexity can be quadratically reduced over its classical counterparts in the optimal setting (see theorem 2) when it is applied to accomplish specific binary classification tasks. Last, numerical simulation results demonstrate that GBLS can well accomplish binary classification tasks even when the system noise and the finite number of quantum measurements are considered (see section 3). Notably, the required number of measurements of GBLS is dramatically less than other advanced quantum classifiers [22–24] with competitive performance (see table 1). In other words, GBLS is a powerful protocol that allows quantum classifiers to achieve a good utility bound R₁ with a low computational cost.

Table 1. The basic information of different quantum classifiers. The notations T, K, M, N, and d refer to the number of epochs, the batch size (i.e. in our simulation K = 4), the number of measurements used to estimate quantum expectation value, the total number of training examples, and the total number of trainable parameters.

Methods	MSE_batch	MSE	BCE	GBLS
Number of batches B	$\frac{N}{K}$	N	N	$\frac{N}{K}$
Number of measurements	$O\left(\frac{TMNd}{K}\right)$	O(TMNd)	O(TMNd)	$O\left(\frac{TMNd}{K}\right)$

The central concept in GBLS is reformulating the classification tasks as the search problem. Note that although the advantage held by the quantum Grover-search algorithm is evident, how to transform the classification task into the search problem is inconclusive. Such a reformulation is the main technical contribution in this study. Recall that Grover-search [26] identifies the target element i* in a database of size K by iteratively applying a predefined oracle ${U}_{f}=\mathbb{I}-2\left\vert {i}^{{\ast}}\right\rangle \left\langle {i}^{{\ast}}\right\vert$ and a diffusion operator ${U}_{\text{init}}=2\left\vert \varphi \right\rangle \left\langle \varphi \right\vert -\mathbb{I}$ with $\left\vert \varphi \right\rangle =\frac{1}{\sqrt{K}}{\sum }_{i}\left\vert i\right\rangle$ to the input state. GBLS, as shown in figure 2, employs a specified variational quantum circuit ${U}_{{L}_{1}}$ and a multiple controlled qubits gate along the Z axis (MCZ) to replace the oracle U_f. In particular, the variational quantum circuit conditionally flips a flag qubit (i.e. the black dot behind ${U}_{{L}_{1}}$ highlighted by the pink region) depending on the training data. The flag qubit is then employed as a part of MCZ gate to guide a Grover-like search algorithm to identify the index of the specified example, i.e. the status of the flag qubit such as '0' or '1' determines the successful probability to identify the target index. Through optimizing the trainable parameters of the variational quantum circuits ${U}_{{L}_{1}}$ , GBLS aims to maximize the successful probability to sample the target index when the corresponding training example is positive; otherwise, GBLS minimizes the successful probability of sampling the target index. The inherited property from the Grover-search algorithm allows our proposal to achieve an advantage in terms of query complexity when the binary classification task involves the searching constraint (see section 2.3 for details). Besides the computational merit, GBLS is insensitive to noise, guaranteed by the fact that combining a variational learning approach with Grover-search can preserve a high probability of success in finding the solution under the NISQ setting [27].

**Figure 2.** The paradigm of GBLS. U defined in equation (9) is composed of unitary operators (i.e. U_data, ${U}_{{L}_{1}}$ , MCZ, and U_init) highlighted by the shadowed yellow region. The last cycle employs the unitary operation U_E defined in equation (10), highlighted by the brown region. The qubits interacted with ${U}_{{L}_{1}}$ (or U_init) form the feature (or data) register ${\mathcal{R}}_{F}$ (or ${\mathcal{R}}_{I}$ ).
Download figure:
Standard image High-resolution image

**Figure 2.** The paradigm of GBLS. U defined in equation (9) is composed of unitary operators (i.e. U_data, ${U}_{{L}_{1}}$ , MCZ, and U_init) highlighted by the shadowed yellow region. The last cycle employs the unitary operation U_E defined in equation (10), highlighted by the brown region. The qubits interacted with ${U}_{{L}_{1}}$ (or U_init) form the feature (or data) register ${\mathcal{R}}_{F}$ (or ${\mathcal{R}}_{I}$ ).
Download figure:
Standard image High-resolution image

2. Grover-search based learning scheme

The outline of this section is as follows. In subsection 2.1, we first elaborate on the implementation details of the proposed GLBS as depicted in figure 2. We then explain how to use the trained GLBS to predict the given new input with O(1) query complexity in subsection 2.2. We last explain how GBLS can solve certain learning problems with potential advantages in subsection 2.3.

2.1. Implementation

In the preprocessing stage, GBLS employs the dataset $\hat{\mathcal{D}}$ defined in equation (1) to construct an extended dataset $\mathcal{D}$ . Compared with the original dataset $\hat{\mathcal{D}}$ , the cardinality of each training example in $\mathcal{D}$ is enlarged to K. For the purpose of applying the Grover-search algorithm to locate the target index i* = K − 1, the construction rule for the kth extended training example ${\mathcal{D}}_{k}$ for all k ∈ [N] is as follows. The mathematical representation of ${\mathcal{D}}_{k}$ is

$\begin{equation}{\mathcal{D}}_{k}=\left[\left({\boldsymbol{x}}_{k}^{\left(0\right)},{y}_{k}^{\left(0\right)}\right),\left(\right. {\boldsymbol{x}}_{k}^{\left(1\right)},{y}_{k}^{\left(1\right)},\dots ,\left({\boldsymbol{x}}_{k}^{\left(K-1\right)},{y}_{k}^{\left(K-1\right)}\right)\right].\end{equation} \tag{ 5 }$

The last pair in ${\mathcal{D}}_{k}$ corresponds to the kth example of $\hat{\mathcal{D}}$ , i.e. $\left({\boldsymbol{x}}_{k}^{\left(K-1\right)},{y}_{k}^{\left(K-1\right)}\right)=\left({\boldsymbol{x}}_{k},{y}_{k}\right)$ . The first K − 1 pairs ${\left\{\left({\boldsymbol{x}}_{k}^{\left(i\right)},{y}_{k}^{\left(i\right)}\right)\right\}}_{i=0}^{K-2}$ in ${\mathcal{D}}_{k}$ are uniformly sampled from a subset of $\hat{\mathcal{D}}$ , where all labels of this subset, i.e. ${\left\{{y}_{k}^{\left(i\right)}\right\}}_{i=1}^{K-2}$ , are opposite to y_k. Note that the construction of the subset is efficient. Since y_k ∈ {0, 1}, we can construct two subsets ${\hat{\mathcal{D}}}^{\left(0\right)}$ and ${\hat{\mathcal{D}}}^{\left(1\right)}$ that only contains examples of $\hat{D}$ with label '0' and label '1', respectively, where ${\hat{\mathcal{D}}}^{\left(0\right)}\cup {\hat{\mathcal{D}}}^{\left(1\right)}=\hat{\mathcal{D}}$ . When y_k = 0, the first K − 1 pairs are sampled from ${\hat{\mathcal{D}}}^{\left(1\right)}$ ; otherwise, when y_k = 1, the first K − 1 pairs are sampled from ${\hat{\mathcal{D}}}^{\left(0\right)}$ .

As aforementioned, different quantum classifiers exploit different methods to encode ${\mathcal{D}}_{k}$ into the quantum states [12]. For ease of notation, we denote the quantum state corresponding to the kth example ${\mathcal{D}}_{k}$ as

$\begin{equation}{U}_{\text{data}}\left\vert \mathbf{0}\right\rangle {:=}{\left\vert {{\Phi}}^{k}\right\rangle }_{F,I}=\frac{1}{\sqrt{K}}\sum\limits _{i=0}^{K-1}{\left\vert h\left({\boldsymbol{x}}_{i}\right)\right\rangle }_{F}{\left\vert i\right\rangle }_{I},\end{equation} \tag{ 6 }$

where h(⋅) is an encoding operation (a possible encoding method is discussed in section 3), and the subscripts 'F' and 'I' refer to the feature register ${\mathcal{R}}_{F}$ with N_F qubits and the index register ${\mathcal{R}}_{I}$ with N_I qubits, respectively.

We now move on to explain the training procedure of GBLS. Recall that the reference [27] points out that combining a variational learning approach with Grover-search algorithm produces an additional quantum advantage than conventional Grover's algorithm such that the target solution can be located with a higher success probability. A similar idea is used in GBLS. Namely, the employed variational quantum circuits ${U}_{{L}_{1}}$ aim to learn a hyperplane that separates the last pair in ${\mathcal{D}}_{k}$ with its first K − 1 pairs. Denote ${U}_{{L}_{1}}={\prod }_{l=1}^{L}U\left({\boldsymbol{\theta }}^{l}\right)$ , where each layer U( θ ^l) contains O(poly(N_F)) parameterized single qubit gates and at most O(poly(N_F)) fixed two-qubit gates with the identical layouts. In the optimal situation, given the initial state ${\left\vert {{\Phi}}^{k}\right\rangle }_{F,I}$ in equation (6), applying ${U}_{{L}_{1}}={\prod }_{l=1}^{L}U\left({\boldsymbol{\theta }}^{l}\right)$ to the feature register ${\mathcal{R}}_{F}$ yields the following target state:

(a)
If the last pair of the input example ${\mathcal{D}}_{k}$ refers to the label y_k = 0, the target state is
$\begin{equation}\left({U}_{{L}_{1}}\otimes \mathbb{I}\right){\left\vert {{\Phi}}^{k}\left({y}_{k}=0\right)\right\rangle }_{F,I}=\frac{1}{\sqrt{K}}\sum\limits _{i=0}^{K-1}{\left\vert {\psi }_{i}^{\left(0\right)}\right\rangle }_{F}{\left\vert i\right\rangle }_{I};\end{equation} \tag{ 7 }$
(b)
Otherwise, when the last pair of the input example ${\mathcal{D}}_{k}$ refers to y_k = 1, the target state is
$\begin{equation}\left({U}_{{L}_{1}}\otimes \mathbb{I}\right){\left\vert {{\Phi}}^{k}\left({y}_{k}=1\right)\right\rangle }_{F,I}=\frac{1}{\sqrt{K}}\sum\limits _{i=0}^{K-1}{\left\vert {\psi }_{i}^{\left(1\right)}\right\rangle }_{F}{\left\vert i\right\rangle }_{I}.\end{equation} \tag{ 8 }$

We denote ${\left\vert {\psi }_{i}^{\left(0\right)}\right\rangle }_{F}$ (resp. ${\left\vert {\psi }_{i}^{\left(1\right)}\right\rangle }_{F}$ ) as the first qubit of the quantum state in the feature register ${\mathcal{R}}_{\mathcal{F}}$ being $\left\vert 0\right\rangle$ (resp. $\left\vert 1\right\rangle$ ). As shown in figure 3, once the state $\left({U}_{{L}_{1}}\otimes {\mathbb{I}}_{I}\right){\left\vert {{\Phi}}^{k}\right\rangle }_{F,I}$ is prepared, GBLS iteratively applies MCZ gate to the index register controlled by the first qubit of the feature register and the index register, uses U_data and ${U}_{{L}_{1}}$ to uncompute the feature register, and applies the diffusion operator U_init to the index register to complete the first cycle. Denote all quantum operations belong to one cycle as U, i.e.

$\begin{equation}U{:=}{U}_{\text{init}}{\circ}{U}_{\text{data}}^{{\dagger}}{\circ}{\left({U}_{{L}_{1}}\otimes \mathbb{I}\right)}^{{\dagger}}{\circ}\mathrm{M}\mathrm{C}\mathrm{Z}{\circ}\left({U}_{{L}_{1}}\otimes \mathbb{I}\right){\circ}{U}_{\text{data}}.\end{equation} \tag{ 9 }$

With a slight abuse of notation, we define ${U}_{\text{init}}={\mathbb{I}}_{F}\otimes \left(2\left\vert \varphi \right\rangle \left\langle \varphi \right\vert -{\mathbb{I}}_{I}\right)$ with $\left\vert \varphi \right\rangle =\frac{1}{\sqrt{K}}{\sum }_{i}\left\vert i\right\rangle$ in the rest of the paper. GBLS repeatedly applies U to the initial state $\left\vert \mathbf{0}\right\rangle$ except for the last cycle, where the applied unitary operations are replaced by

$\begin{equation}{U}_{E}{:=}{U}_{\text{init}}{\circ}\mathrm{M}\mathrm{C}\mathrm{Z}{\circ}\left({U}_{{L}_{1}}\otimes \mathbb{I}\right){\circ}{U}_{\text{data}},\end{equation} \tag{ 10 }$

as highlighted by the brown shadow in figure 4. Following the conventional Grover-search, GBLS queries U and U_E with in total $O\left(\sqrt{K}\right)$ times before taking quantum measurements. This completes the quantum part of GBLS.

**Figure 3.** The circuit implementation of the oracle U in equation (9).
Download figure:
Standard image High-resolution image

We next analyze how the quantum state evolves for the case y_k = 0 and y_k = 1, respectively. For the case of y_k = 0, applying ${U}_{{L}_{1}}\otimes {\mathbb{I}}_{I}$ to the input state ${\left\vert {{\Phi}}^{k}\left({y}_{k}=0\right)\right\rangle }_{F,I}$ in equation (6) will transform this state to $1/\sqrt{K}{\sum }_{i=0}{\left\vert {\psi }_{i}^{\left(0\right)}\right\rangle }_{F}{\left\vert i\right\rangle }_{I}$ as described in equation (7). Since the control qubit in the feature register is 0, applying MCZ gate does not flip the phase of the state. After uncomputing, the result state yields $1/\sqrt{K}{\sum }_{i=0}{\left\vert \mathbf{0}\right\rangle }_{F}{\left\vert i\right\rangle }_{I}$ . The positive phase for all computational basis i ∈ [K − 1] implies that applying the quantum operation ${U}_{\text{init}}{\circ}{U}_{\text{data}}^{{\dagger}}{\circ}{\left({U}_{{L}_{1}}\otimes {\mathbb{I}}_{I}\right)}^{{\dagger}}$ does not change the state as well, i.e.

$\begin{equation}\left({\mathbb{I}}_{F}\otimes \left(2\left\vert \varphi \right\rangle \left\langle \varphi \right\vert -{\mathbb{I}}_{I}\right)\right)\frac{1}{\sqrt{K}}\sum\limits _{i=0}^{K-1}{\left\vert \mathbf{0}\right\rangle }_{F}{\left\vert i\right\rangle }_{I}=\frac{1}{\sqrt{K}}\sum\limits _{i=0}^{K-1}{\left\vert \mathbf{0}\right\rangle }_{F}{\left\vert i\right\rangle }_{I}.\end{equation} \tag{ 11 }$

In other words, when we measure the index register of the output state, the probability to sample the computation basis i with i ∈ [K − 1] is uniformly distributed.

For the case of y_k = 1, the input state ${\left\vert {{\Phi}}^{k}\left({y}_{k}=1\right)\right\rangle }_{F,I}$ in equation (6) will be transformed to $1/\sqrt{K}{\sum }_{i=0}{\left\vert {\psi }_{i}^{\left(1\right)}\right\rangle }_{F}{\left\vert i\right\rangle }_{I}$ after interacting with unitary ${U}_{{L}_{1}}\otimes {\mathbb{I}}_{I}$ , as described in equation (8). With the control qubit in the feature register being 1, such a generated quantum state will evolve as Grover-search algorithm does by iteratively applying MCZ, the uncomputation operation ${U}_{\text{data}}^{{\dagger}}{\circ}{\left({U}_{{L}_{1}}\otimes \mathbb{I}\right)}^{{\dagger}}$ , and U_init. Mathematically, the result state after interacting with MCZ yields

$\begin{equation}{\hat{U}}_{f}{\left\vert {{\Phi}}^{k}\left({y}_{k}=1\right)\right\rangle }_{F,I}=\mathrm{cos}\enspace \gamma {\left\vert {\psi }_{B}^{\left(0\right)}\right\rangle }_{F}{\left\vert B\right\rangle }_{I}-\mathrm{sin}\enspace \gamma {\left\vert {\psi }_{{i}^{{\ast}}}^{\left(1\right)}\right\rangle }_{F}{\left\vert {i}^{{\ast}}\right\rangle }_{I},\end{equation} \tag{ 12 }$

where ${\hat{U}}_{f}{:=}\mathrm{M}\mathrm{C}\mathrm{Z}{\circ}\left({U}_{{L}_{1}}\otimes \mathbb{I}\right)$ , $\mathrm{cos}\enspace \gamma =\frac{\sqrt{K-1}}{\sqrt{K}}$ , ${\left\vert B\right\rangle }_{I}=\frac{1}{\sqrt{K-1}}{\sum }_{i=0}^{K-2}{\left\vert i\right\rangle }_{I}$ , and ${\left\vert {i}^{{\ast}}\right\rangle }_{I}$ refers to the computational basis $\left\vert K-1\right\rangle$ . Analogous to the U_f in Grover-search, the trainable and data-driven ${\hat{U}}_{f}$ used above conditionally flips the phase of the state $\left\vert {i}^{{\ast}}\right\rangle$ . Next, the uncomputing operation ${U}_{\text{data}}^{{\dagger}}{\circ}{\left({U}_{{L}_{1}}\otimes \mathbb{I}\right)}^{{\dagger}}$ and the diffusion operator U_init are employed to increase the probability of ${\left\vert {i}^{{\ast}}\right\rangle }_{I}$ . Mathematically, the generated state after the first cycle yields

$\begin{equation}U{\left\vert {{\Phi}}^{k}\left({y}_{k}=1\right)\right\rangle }_{F,I}=\mathrm{cos}\enspace 3\gamma {\left\vert \mathbf{0}\right\rangle }_{F}{\left\vert B\right\rangle }_{I}+\mathrm{sin}\enspace 3\gamma {\left\vert \mathbf{0}\right\rangle }_{F}{\left\vert {i}^{{\ast}}\right\rangle }_{I},\end{equation} \tag{ 13 }$

where U is defined in equation (9). The probability of sampling i* is increased to sin² 3γ, which is in accordance to Grover-search algorithm. This observation leads to the following theorem, whose proof is given in appendix A.

Theorem 2. For GBLS, under the optimal setting, the probability of sampling the outcome i* = K − 1 approaches 1 asymptotically iff the label of the last entry of ${\mathcal{D}}_{k}$ is y_k = 1.

We leverage the particular property of GBLS, in which the output distribution is varied for different label of input ${\mathcal{D}}_{k}$ as shown in theorem 2, to accomplish the binary classification task. Concisely, the output state of GBLS, i.e. ${U}_{E}{U}^{O\left(\sqrt{K}\right)}{\left\vert \mathbf{0}\right\rangle }_{F,I}$ , corresponding to y_k = 1 will contain the computational basis i = K − 1 with probability near to 1. By contrast, the output state corresponding to y_k = 0 will contain all computational bases i ∈ [K − 1] with the equal probability. Driven by this observation and the mechanism of the Grover-search algorithm, the loss function of GBLS is

$\begin{equation}\underset{\boldsymbol{\theta }}{\mathrm{min}}\enspace \mathcal{L}\left(\boldsymbol{\theta }\right){:=}\mathrm{sign}\left(1/2-{y}_{k}\right)\mathrm{Tr}\left({\Pi}\rho \left(\boldsymbol{\theta }\right)\right),\end{equation} \tag{ 14 }$

where sign(⋅) is the sign function, ${\Pi}=\left(\left\vert 1\right\rangle \left\langle 1\right\vert \right)\otimes \mathbb{I}\otimes \left(\left\vert {i}^{{\ast}}\right\rangle \left\langle {i}^{{\ast}}\right\vert \right)$ refers to the measurement operator, $\rho \left(\boldsymbol{\theta }\right)={U}_{E}U{\left(\boldsymbol{\theta }\right)}^{O\left(\sqrt{K}\right)}\left\vert \mathbf{0}\right\rangle \left\langle \mathbf{0}\right\vert {\left({U}_{E}U{\left(\boldsymbol{\theta }\right)}^{O\left(\sqrt{K}\right)}\right)}^{{\dagger}}$ is the generated quantum state, and U( θ ) is defined in equation (9) (for clearness, we use the explicit form U( θ ) instead of U). Intuitively, the minimized $\mathcal{L}\left(\boldsymbol{\theta }\right)$ corresponds to the facts that when y_k = 1 (y_k = 0), the success probability to sample i* as well as attain the first feature qubit to be '1' ('0') is maximized (minimized). GBLS employs a gradient-based method, i.e. the parameter shift rule [22], to optimize θ . Confer appendix B for the detail.

We would like to address that, GBLS can be used to conduct both the linear and nonlinear classification tasks depending on the specified quantum classifiers. For example, when GBLS adopts the proposal [23, 24] to implement U_data and ${U}_{{L}_{1}}$ , it has capability of classifying nonlinear data.

2.2. Prediction

Once the training of GBLS has finished, the trained ${U}_{{L}_{1}}$ can be directly employed to predict the label of the future instances with O(1) query complexity, where the corresponding circuit implementation is shown in figure 5. To achieve this, we devise the following prediction method. Denote the new input as $\left(\tilde {\boldsymbol{x}},\tilde {y}\right)$ . We first encode $\tilde {\boldsymbol{x}}$ into the quantum state with the identical encoding method used in the training procedure, i.e. ${\left\vert \tilde {\psi }\right\rangle }_{F}=\left\vert h\left(\tilde {\boldsymbol{x}}\right)\right\rangle$ . Applying the trained ${U}_{{L}_{1}}$ to ${\left\vert \tilde {\psi }\right\rangle }_{F}$ yields

$\begin{equation}{U}_{{L}_{1}}{\left\vert \psi \right\rangle }_{F}=\tilde {\alpha }{\left\vert {\tilde {\psi }}^{\left(0\right)}\right\rangle }_{F}+\tilde {\beta }{\left\vert {\tilde {\psi }}^{\left(1\right)}\right\rangle }_{F},\end{equation} \tag{ 15 }$

where $\vert \tilde {\alpha }{\vert }^{2}+\vert \tilde {\beta }{\vert }^{2}=1$ .

**Figure 5.** The circuit implementation of GBLS for prediction. The same encoding method used in the training process is adopted to prepare the state $\left\vert h\left(\tilde {\boldsymbol{x}}\right)\right\rangle$ . The trained variational quantum circuit U( θ ^(T)) is applied to $\left\vert h\left(\tilde {\boldsymbol{x}}\right)\right\rangle$ before the measurement.
Download figure:
Standard image High-resolution image

$\left\vert h\left(\tilde {\boldsymbol{x}}\right)\right\rangle $ — **Figure 5.** The circuit implementation of GBLS for prediction. The same encoding method used in the training process is adopted to prepare the state $\left\vert h\left(\tilde {\boldsymbol{x}}\right)\right\rangle$ . The trained variational quantum circuit U( θ ^(T)) is applied to $\left\vert h\left(\tilde {\boldsymbol{x}}\right)\right\rangle$ before the measurement.
Download figure:
Standard image High-resolution image

Denote the probability of the outcome '1' after measuring the first feature qubit of the state in equation (15) as ${p}_{1}=\vert \tilde {\beta }{\vert }^{2}$ and let the threshold be 1/2. The new input data $\tilde {\boldsymbol{x}}$ will be identified as label '0', if p₁ < 1/2; otherwise, it will be given label '1'.

2.3. Potential advantage of GBLS

Here we design a binary classification task to explore the potential advantage of GBLS in terms of query complexity. Consider the classification task that requires not only to find a decision rule in equation (2) but also to output the index j satisfying a pre-determined black-box function. Note that the identification of a target index is a common functionality in the context of database searching in the medical system, economy, and online shopping. For example, given a medical database, it is natural to expect that the trained classifier can predict whether a patient is ill or healthy based on her/his symptoms, and can identify a healthy patient with additional properties, e.g. the gender of the patient is female, which can be modeled by a black box function.

The mathematical formulation of this classification task is as follows. Given the data ${\mathcal{D}}_{k}$ in equation (5), denoted the black box as q(⋅), the task yields

$\begin{equation}\left(\underset{\boldsymbol{\theta }}{\mathrm{min}}\sum\limits _{i=0}^{K-1}{\mathbb{1}}_{{y}_{i}\ne {f}_{\boldsymbol{\theta }}\left({\boldsymbol{x}}_{i}\right)}\right)\wedge \left(\left\{j\vert q\left(j\right)=1,\enspace {y}_{j}=1\right\}\right),\end{equation} \tag{ 16 }$

where the function q(⋅) is a Boolean function with the input set $\left\{j:\forall \enspace {y}_{j}\in {\mathcal{D}}_{k},\enspace {y}_{j}=1\right\}$ . Taking GBLS implemented in the previous subsections as an example, q(⋅) has the following form, ∀j = {0, ..., K − 1}

$\begin{equation}q\left(j\right)=\begin{cases}1,\quad \text{if}\enspace j=K-1;\quad \hfill \\ 0,\quad \text{otherwise.}\quad \hfill \end{cases}.\end{equation} \tag{ 17 }$

Furthermore, q(⋅) could be implemented by the MCZ gate, which conditionally flips the phase of the computational basis corresponding to j*:=K − 1 if the state is ${\left\vert {\psi }_{j}^{\left(1\right)}\right\rangle }_{F}{\left\vert {j}^{{\ast}}\right\rangle }_{I}$ given in equation (8). In this way, the Grover-like search structure used in GBLS promises that the probability to sample j* will be maximized. We remark that GBLS can be effectively generalize to implement other forms of q(⋅) via modifying the MCZ gate. When the size of the dataset loaded by GBLS is K, a well-trained GBLS can locate the target index with $O\left(\sqrt{K}\right)$ query complexity, guaranteed by the result of theorem 2. However, given access to the well-trained classifier f_θ(⋅), both classical algorithms and previous quantum classifiers need at least O(K) query complexity to find j*. The reduced query complexity of GBLS implies a potential quantum advantage to accomplish classification tasks.

3. Numerical experiments

We now apply GBLS to classify a nonlinear synthetic dataset $\hat{\mathcal{D}}$ to evaluate its performance. The construction of $\hat{\mathcal{D}}$ follows the proposal [23]. Consider a synthetic dataset $\hat{\mathcal{D}}={\left\{{\boldsymbol{x}}_{i},{y}_{i}\right\}}_{i=0}^{N-1}$ with N = 200, where ${\boldsymbol{x}}_{i}=\left({\omega }_{1}^{\left(i\right)},{\omega }_{2}^{\left(i\right)}\right)\in {\mathbb{R}}^{2}$ , ${\omega }_{1}^{\left(i\right)},{\omega }_{2}^{\left(i\right)}\in \left(0,2\pi \right)$ . Let g(⋅) be a specific embedding function with $\left\vert g\left({\omega }_{1}^{\left(i\right)},{\omega }_{2}^{\left(i\right)}\right)\right\rangle \in {\mathbb{C}}^{4}$ for all i ∈ {0, ..., N − 1}. The label of x _i is assigned as y_i = 1 if

$\begin{equation*}\left\langle g\left({\omega }_{1}^{\left(i\right)},{\omega }_{2}^{\left(i\right)}\right)\right\vert {V}^{{\dagger}}{\Pi}V\left\vert g\left({\omega }_{1}^{\left(i\right)},{\omega }_{2}^{\left(i\right)}\right)\right\rangle {\geqslant}0.5+{\Delta},\end{equation*}$

where V ∈ SU(4) is a unitary operator, ${\Pi}=\mathbb{I}\otimes \left\vert 0\right\rangle \left\langle 0\right\vert$ is the measurement operator, and the gap Δ is set as 0.2. The label of x _i is assigned as y_i = 0 if

$\begin{equation*}\left\langle g\left({\omega }_{1}^{\left(i\right)},{\omega }_{2}^{\left(i\right)}\right)\right\vert {V}^{{\dagger}}{\Pi}V\left\vert g\left({\omega }_{1}^{\left(i\right)},{\omega }_{2}^{\left(i\right)}\right)\right\rangle {\leqslant}0.5-{\Delta}.\end{equation*}$

We illustrate the synthetic dataset $\hat{\mathcal{D}}$ in the left panel of figure 6.

At the data preprocessing stage, we split the dataset $\hat{\mathcal{D}}$ into the training datasets ${\hat{\mathcal{D}}}_{\text{train}}$ with size N_train = 100 and the test dataset ${\hat{\mathcal{D}}}_{\text{test}}$ with N_test = 100. In the training process, we follow the construction rule of GBLS to build the extended training dataset ${\mathcal{D}}_{\text{train}}$ by using ${\hat{\mathcal{D}}}_{\text{train}}$ . We set K = 4 in the following analysis, where the training example ${\mathcal{D}}_{k}\subset {\mathcal{D}}_{\text{train}}$ can be encoded into a quantum state by using four qubits with N_I = N_F = 2 (see appendix C for the detailed implementation of GBLS). Note that, at each epoch, we shuffle ${\mathcal{D}}_{\text{train}}$ and rebuild the extended dataset ${\hat{\mathcal{D}}}_{\text{train}}$ . An epoch means that an entire dataset is passed forward through the quantum learning model, e.g. when the dataset contains 1000 training examples, and only two examples are fed into the quantum learning model each time, then it will take 500 iterations to complete 1 epoch.

The numerical simulations are implemented on Python in conjunction with the PennyLane, Qiskit, and pyQuil libraries [28–30]. The hyper-parameters setting used in our experiment is as follows. The block of U_E in figure 4 is employed once for the case K = 4, according to the Grover's theorem $O\left(\sqrt{K}\right)$ . The layer number of variational quantum circuits, i.e. ${U}_{{L}_{1}}={\prod }_{l=1}^{L}U\left({\boldsymbol{\theta }}^{l}\right)$ , is set as L = 2. The number of epochs used in classical optimization is 20. For comparison, we also apply the quantum kernel classifier proposed by [23, 24] with two different loss functions, i.e. the mean squared error (MES) loss, and the binary cross entropy (BCE) loss, to learn the synthetic dataset $\hat{\mathcal{D}}$ . The selection of the quantum kernel classifiers as the reference is based on the fact that this method has achieved state-of-the-art performance to classify nonlinear data [23].

Ideal setting. We first evaluate performance of different quantum classifiers under the ideal setting, where the quantum system is noiseless and the number of measurements is infinite. The right panel of figure 6 illustrates the averaged training and testing accuracies versus the number of epochs. In particular, our proposal achieves comparable performance with the quantum kernel classifier with the BCE loss, where both the train and test accuracies converge to 99% within 2 epochs. Moreover, these two methods outperform the quantum kernel classifier with the MSE loss (B = N), whose test accuracy can only reach 95% after 10 epochs. The variance of these three quantum classifiers after 10 epochs becomes small, which implies that all of them hold stable performance under the ideal setting.

Depolarization noise setting. We next investigate performance of GBLS and the referenced quantum kernel classifiers under the realistic setting, where the quantum system noise is considered and the number of measurements is finite. Specifically, we employ the depolarization channel to model the system noise, i.e. given a quantum state $\rho \in {\mathbb{C}}^{d{\times}d}$ , the quantum depolarization channel ${\mathcal{E}}_{p}$ that acts on this state is defined as

$\begin{equation*}{\mathcal{E}}_{p}\left(\rho \right)=\left(1-p\right)\rho +p{\pi }_{d},\end{equation*}$

where p is the depolarization rate, and π_d is the maximally mixed state with ${\pi }_{d}={\mathbb{I}}_{d}/d$ . Meanwhile, to explore the trade-off between the computational cost (i.e. the total number of measurements) and the utility R₁ indicated by theorem 1, we also compare performance between GBLS and a modified quantum kernel classifier with the MSE loss, which supports to use the batch gradient descent method with B = N/4 to optimize parameters (please refer to appendix C for implementation details). Table 1 summarizes the basic information about GBLS and the referenced quantum classifiers. See appendix D about the derivation of the required number of measurements for GBLS and the quantum kernel classifier with the BCE loss.

The hyper-parameters settings applied to GBLS and other quantum classifiers are as follows. The depolarization rate is set as p = 0.05 and p = 0.25, respectively. The number of measurements is set as 10 to approximate the quantum expectation result. The parameter shift rule is used to estimate the analytic gradients [22, 31]. For each classifier, we repeat the numerical simulations with five times to collect the statistical information. Confer appendix C for other settings such as learning rates and random seeds.

The simulation results of GBLS and the referenced quantum classifiers are illustrated in figure 7. Specifically, when p = 0.05, GBLS and the other three referenced quantum classifiers achieve comparable performance after 10 epochs. Moreover, the quantum kernel classifier with the MSE loss (B = N/4 possesses a lower the convergence rate and a larger variance than the rest three classifiers. When p = 0.25, there exists a relatively large gap between the quantum kernel classifiers with the MSE_bactch method and the rest three quantum classifiers in the measure of the convergence rate. Such a difference reflects the importance to use GBLS to investigate classification tasks under the varied number of batches. We summarize the averaged training and test accuracies of GBLS and other quantum classifiers at the last epoch in table 2. Even though the measurement error and quantum gate noise are considered, GBLS can still attain stable performance, since its variance is very small (i.e. at most 0.04). This observation suggests the applicability of our proposal on NISQ machines.

Table 2. Performance of different quantum classifiers under the depolarization noise at the 20-th epoch. The labels 'MSE_batch', 'MSE', 'BCE', and 'GBLS' follow the same meanings as explained in table 1. The value 'a ± b' refers that the averaged accuracy is a and its variance is b.

Methods	MSE_batch	MSE	BCE	GBLS
p = 0.05 (train)	0.929 ± 0.037	0.978 ± 0.013	0.956 ± 0.024	0.935 ± 0.024
p = 0.25 (train)	0.846 ± 0.072	0.936 ± 0.032	0.918 ± 0.031	0.881 ± 0.025
p = 0.05 (test)	0.943 ± 0.032	0.975 ± 0.006	0.860 ± 0.089	0.945 ± 0.021
p = 0.25 (test)	0.862 ± 0.095	0.934 ± 0.009	0.791 ± 0.056	0.879 ± 0.040

We would like to emphasize the main issue considered in this study: whether there exists a quantum classifier that can attain a good utility bound R₁ by using a few number of measurements. The numerical simulation results of GBLS provide a positive response toward this issue. Recall the setting given in table 1 and the results in figure 7. Although the required number of measurements for GBLS is reduced by K = 4 times compared with quantum classifiers with the BCE loss and the MSE loss (B = N), they achieve comparable performance. This result implies a huge separation of the computational efficacy between GBLS and previous quantum classifiers with B = N when N is large.

Noise model from real quantum hardware. We further compare performance of GBLS and the referenced quantum classifiers under a noise model extracted from real quantum hardware, i.e. IBMQ_ourense, provided by the Qiskit and PennyLane python libraries [28, 29]. Notably, for all classifiers, the gate noise is only imposed on the trainable quantum circuits U_L instead of the whole circuits, since the implementation of multi-controlled gates (e.g. CCZ) used in GBLS will introduce a huge amount of noise and destroy the optimization of GBLS (see appendix C for details). Meanwhile, the measurement noise is applied to all quantum classifiers. Due to the relatively poor performance of the quantum kernel classifier with the MSE loss and B = N/4, here we only focus the comparison among GBLS and quantum kernel classifiers with the BCE loss and the MSE loss (B = N). Note that all hyper-parameters settings are identical to those used in the above numerical simulations.

The simulation results are exhibited in figure 8. Specifically, the three classifiers achieve comparable performance. Such results indicate that the efficacy of GBLS, since the required number of measurements for GBLS is reduced by four times compared with the rest two quantum classifiers.

4. Discussion and conclusion

In this study, we have proposed a GBLS for classification. Different from previous proposals, GBLS supports the optimization of a wide range of quantum classifiers with a varied number of batches. This property allows us to explore the trade-off between the computational efficiency and the utility bound R₁. Moreover, we demonstrate that GBLS possesses a potential advantage to tackle certain classification tasks in the measure of query complexity. Numerical experiments showed that GBLS can achieve comparable performance with other advanced quantum classifiers by using a fewer number of measurements. We believe that our work will provide immediate and practical applications for near-term quantum devices.

Acknowledgments

This work received support from Australian Research Council (Project FL-170100117), and the Faculty of Engineering and Information Technologies at the University of Sydney (the Engineering and Information Technologies Research Scholarship).

Data availability statement

The data that support the findings of this study are openly available at the following URL/DOI: https://github.com/yuxuan-du/.

Appendix A.: Proof of theorem 1

Proof of theorem 1 .

To achieve theorem 1, we separately discuss the situations in which the label of the last entry in ${\mathcal{D}}_{k}$ is y_k = 1 and y_k = 0, respectively.

For the case y_k = 1. Suppose that the label of the last entry in ${\mathcal{D}}_{k}$ is y_k = 1. Followed from equation (13), after the first cycle, the generated state of GBLS is

$\begin{equation*}U{\left\vert \mathbf{0}\right\rangle }_{F,I}\equiv {U}_{{c}_{1}}{\left\vert {{\Phi}}^{k}\left({y}_{k}=1\right)\right\rangle }_{F,I}={\left\vert \mathbf{0}\right\rangle }_{F}\otimes \left(\mathrm{cos}\enspace 3\gamma {\left\vert B\right\rangle }_{I}+\mathrm{sin}\enspace 3\gamma {\left\vert {i}^{{\ast}}\right\rangle }_{I}\right),\end{equation*}$

where $\mathrm{sin}\enspace \gamma =\frac{1}{\sqrt{K}}$ . This result indicates that the probability to sample the target index i* is increased from sin² γ to sin² 3γ, which is same with Grover-search.

Then, by induction as the proof of Grover-search does [32], the generated state of GBLS after applying U to ${\left\vert \mathbf{0}\right\rangle }_{F,I}$ with ℓ times yields

$\begin{equation}\prod\limits _{i=1}^{\ell }{U}^{i}{\left\vert \mathbf{0}\right\rangle }_{F,I}={\left\vert \mathbf{0}\right\rangle }_{F}\otimes \left(\mathrm{cos}\left(\left(2\ell +1\right)\gamma \right){\left\vert B\right\rangle }_{I}+\mathrm{sin}\left(\left(2\ell +1\right)\gamma \right){\left\vert {i}^{{\ast}}\right\rangle }_{I}\right).\end{equation} \tag{ A.1 }$

Note that, GBLS requires that the employed quantum operation at the last cycle is U_E as defined in equation (10) instead of U. Mathematically, the generated state is

$\begin{align}\hfill {U}_{E}\prod\limits _{i=1}^{\ell }{U}^{i}{\left\vert \mathbf{0}\right\rangle }_{F,I}& ={U}_{\text{init}}{\circ}\mathrm{M}\mathrm{C}\mathrm{Z}{\circ}\left({U}_{{L}_{1}}\otimes \mathbb{I}\right){\circ}{U}_{\text{data}}{\left\vert \mathbf{0}\right\rangle }_{F}\otimes \left(\mathrm{cos}\left(\left(2\ell +1\right)\gamma \right){\left\vert B\right\rangle }_{I}+\mathrm{sin}\left(\left(2\ell +1\right)\gamma \right){\left\vert {i}^{{\ast}}\right\rangle }_{I}\right)\hfill \\ \hfill & ={U}_{\text{init}}{\circ}\mathrm{M}\mathrm{C}\mathrm{Z}\left(\mathrm{cos}\left(\left(2\ell +1\right)\gamma \right){\left\vert {\psi }_{B}^{\left(0\right)}\right\rangle }_{F}{\left\vert B\right\rangle }_{I}+\mathrm{sin}\left(\left(2\ell +1\right)\gamma \right){\left\vert {\psi }_{B}^{\left(1\right)}\right\rangle }_{F}{\left\vert {i}^{{\ast}}\right\rangle }_{I}\left. \right)\right)\hfill \\ \hfill & ={U}_{\text{init}}\left(\mathrm{cos}\left(\left(2\ell +1\right)\gamma \right){\left\vert {\psi }_{B}^{\left(0\right)}\right\rangle }_{F}{\left\vert B\right\rangle }_{I}-\mathrm{sin}\left(\left(2\ell +1\right)\gamma \right){\left\vert {\psi }_{B}^{\left(1\right)}\right\rangle }_{F}{\left\vert {i}^{{\ast}}\right\rangle }_{I}\left. \right)\right)\hfill \\ \hfill & =\left(\mathrm{cos}\left(\left(2\ell +3\right)\gamma \right){\left\vert {\psi }_{B}^{\left(0\right)}\right\rangle }_{F}{\left\vert B\right\rangle }_{I}+\mathrm{sin}\left(\left(2\ell +3\right)\gamma \right){\left\vert {\psi }_{B}^{\left(1\right)}\right\rangle }_{F}{\left\vert {i}^{{\ast}}\right\rangle }_{I}\left. \right)\right),\hfill \end{align} \tag{ A.2 }$

where the first equality uses equation (A.1), the second equality exploits equation (13) to engineer the feature register, the third equality employs MCZ to flip the phase the state $\left\vert {i}^{{\ast}}\right\rangle$ whose first qubit in the feature register is $\left\vert 1\right\rangle$ , and last equality comes from the application of the diffusion operator ${U}_{\text{init}}={\mathbb{I}}_{F}\otimes \left(2\left\vert \varphi \right\rangle \left\langle \varphi \right\vert -{\mathbb{I}}_{I}\right)$ with $\left\vert \varphi \right\rangle =\frac{1}{\sqrt{K}}{\sum }_{i}\left\vert i\right\rangle$ to the index register.

The result of equation (A.2) indicates that, under the optimal setting, the probability to sample i* is close to 1 when $\ell \sim O\left(\sqrt{K}\right)$ , since $\mathrm{sin}\enspace \gamma \approx \gamma =1/\sqrt{K}$ and then sin ((2ℓ + 3)γ) ≈ 1.

For the case y_k = 0. We then demonstrate that, when the label of the last entry in ${\mathcal{D}}_{k}$ is y_k = 0, even if applying $U={\prod }_{i=1}^{\ell }$ and U_E to ${\left\vert \mathbf{0}\right\rangle }_{F,I}$ with $\ell \sim O\left(\sqrt{K}\right)$ , the probability to sample i* is 1/K. Followed from equation (11), after the first cycle, the generated state of GBLS is

$\begin{equation*}{U}_{{c}_{1}}{\left\vert {{\Phi}}^{k}\left({y}_{k}=0\right)\right\rangle }_{F,I}=\frac{1}{\sqrt{K}}\sum\limits _{i=0}^{K-1}{\left\vert \mathbf{0}\right\rangle }_{F}{\left\vert i\right\rangle }_{I},\end{equation*}$

where $\mathrm{sin}\enspace \gamma =\frac{1}{\sqrt{K}}$ . Due to ${U}_{{c}_{1}}{\left\vert {{\Phi}}^{k}\left({y}_{k}=0\right)\right\rangle }_{F,I}=U{\left\vert \mathbf{0}\right\rangle }_{F,I}$ , after applying U to the state $\left\vert \mathbf{0}\right\rangle$ , the probability to sample any index is identical. By induction, applying the corresponding U to the state ${\left\vert \mathbf{0}\right\rangle }_{F,I}$ with ℓ times yields

$\begin{equation}\prod\limits _{i=1}^{\ell }{U}^{i}{\left\vert \mathbf{0}\right\rangle }_{F,I}=\frac{1}{\sqrt{K}}\sum\limits _{i=0}^{K-1}{\left\vert \mathbf{0}\right\rangle }_{F}{\left\vert i\right\rangle }_{I},\end{equation} \tag{ A.3 }$

where given any positive integer ℓ, the probability to sample ${\left\vert {i}^{{\ast}}\right\rangle }_{I}$ is 1/K.

As with the case of y_k = 1, at the last cycle, we apply the unitary U_E to the state ${\prod }_{i=1}^{\ell }{U}^{i}{\left\vert \mathbf{0}\right\rangle }_{F,I}$ , and the generated state is

$\begin{align}\hfill {U}_{E}\prod\limits _{i=1}^{\ell }{U}^{i}{\left\vert \mathbf{0}\right\rangle }_{F,I}& ={U}_{\text{init}}{\circ}\mathrm{M}\mathrm{C}\mathrm{Z}{\circ}\left({U}_{{L}_{1}}\otimes \mathbb{I}\right){\circ}{U}_{\text{data}}\frac{1}{\sqrt{K}}\sum\limits _{i=0}^{K-1}{\left\vert \mathbf{0}\right\rangle }_{F}{\left\vert i\right\rangle }_{I}\hfill \\ \hfill & ={U}_{\text{init}}\left(\frac{1}{\sqrt{K}}\sum\limits _{i=0}^{K-1}\left({\left\vert {\psi }_{B}^{\left(0\right)}\right\rangle }_{F}{\left\vert B\right\rangle }_{I}+{\left\vert {\psi }_{{i}^{{\ast}}}^{\left(0\right)}\right\rangle }_{F}{\left\vert {i}^{{\ast}}\right\rangle }_{I}\right)\right)\hfill \\ \hfill & =\frac{1}{\sqrt{K}}\sum\limits _{i=0}^{K-1}\left({\left\vert {\psi }_{B}^{\left(0\right)}\right\rangle }_{F}{\left\vert B\right\rangle }_{I}+{\left\vert {\psi }_{{i}^{{\ast}}}^{\left(0\right)}\right\rangle }_{F}{\left\vert {i}^{{\ast}}\right\rangle }_{I}\right),\hfill \end{align} \tag{ A.4 }$

where the first equality uses the explicit form of U_E and equation (A.3), and the second equality is guaranteed by equation (12) (note that the only difference is replacing ${\left\vert {\psi }_{{i}^{{\ast}}}^{\left(1\right)}\right\rangle }_{F}$ with ${\left\vert {\psi }_{{i}^{{\ast}}}^{\left(0\right)}\right\rangle }_{F}$ based on the setting y_k = 0), and the last equality exploits the explicit form of U_init.

The result of equation (A.4) reflects that, under the optimal setting, the probability to sample i* can never be increased when y_k = 0. Therefore, we can conclude that, under the optimal setting, the probability to sampling the outcome i* approaches 1 asymptotically if and only if the label of the last entry of ${\mathcal{D}}_{k}$ is y_k = 1. □

Appendix B.: Variational quantum circuits and the optimizing method

In this section, we first introduce the variational quantum circuits ${U}_{{L}_{1}}\left(\boldsymbol{\theta }\right)$ used in GBLS. We then elaborate the optimization method, i.e. the parameter shift rule, that is employed to train ${U}_{{L}_{1}}\left(\boldsymbol{\theta }\right)$ .

Variational quantum circuits, which is also called parameterized quantum circuit, are composed of trainable single qubit gates and two qubits gates (e.g. CNOT or CZ). As a promising scheme for NISQ devices, variational quantum circuits have been extensively investigated for accomplishing the generative and discriminative [15, 20, 33–35] tasks via variational hybrid quantum–classical algorithms [36]. One typical variational quantum circuits is called the multiple-layer parameterized quantum circuits (MPQC), where the arrangement of quantum gates in each layer is identical [33]. Denote the operation formed by the lth layer as U( θ ^l). The generated quantum state from MPQC yields

$\begin{equation*}\left\vert {\Psi}\right\rangle =\prod\limits _{l=1}^{L}U\left({\boldsymbol{\theta }}^{l}\right){\left\vert 0\right\rangle }^{\otimes {N}_{F}},\end{equation*}$

where L is the total number of layers. GBLS employs MPQC to construct ${U}_{{L}_{1}}$ , i.e.

$\begin{equation}{U}_{{L}_{1}}\left(\boldsymbol{\theta }\right)=\prod\limits _{l=1}^{L}U\left({\boldsymbol{\theta }}^{l}\right),\end{equation} \tag{ B.1 }$

and the circuit arrangement for the lth layer U( θ ^l) is shown in figure B1. When the number of layers is L, the total number of trainable parameters for GBLS is 2N_F L.

The updating rule of GBLS at the kth iteration follows

$\begin{equation}{\boldsymbol{\theta }}^{\left(k+1\right)}={\boldsymbol{\theta }}^{\left(k\right)}-\eta \frac{\mathcal{L}\left({\boldsymbol{\theta }}^{\left(k\right)},{\mathcal{D}}_{k}\right)}{\partial \boldsymbol{\theta }},\end{equation} \tag{ B.2 }$

where η is the learning rate and ${\mathcal{D}}_{k}$ is the kth training example. By expanding the explicit form of $\mathcal{L}\left({\boldsymbol{\theta }}^{\left(k\right)},{\mathcal{D}}_{k}\right)$ given in equation (14), the gradients of $\mathcal{L}\left({\boldsymbol{\theta }}^{\left(k\right)},{\mathcal{D}}_{k}\right)$ can be rewritten as

$\begin{equation}\frac{\partial \mathcal{L}\left({\boldsymbol{\theta }}^{\left(k\right)},{\mathcal{D}}_{k}\right)}{\partial \boldsymbol{\theta }}=\mathrm{sign}\left(1/2-{y}_{k}\right)\frac{\partial \enspace \mathrm{Tr}\left({\Pi}\rho \left({\boldsymbol{\theta }}^{\left(k\right)}\right)\right)}{\partial \boldsymbol{\theta }},\end{equation} \tag{ B.3 }$

where y_k refers to the label of the last entry in ${\mathcal{D}}_{k}$ , sign(⋅) is the sign function, Π is the measurement operator, and

$\begin{equation*}\rho \left({\boldsymbol{\theta }}^{\left(k\right)}\right)={U}_{E}U{\left({\boldsymbol{\theta }}^{\left(k\right)}\right)}^{O\left(\sqrt{K}\right)}\left\vert \mathbf{0}\right\rangle \left\langle \mathbf{0}\right\vert {\left({U}_{E}U{\left({\boldsymbol{\theta }}^{\left(k\right)}\right)}^{O\left(\sqrt{K}\right)}\right)}^{{\dagger}}.\end{equation*}$

GBLS adopts the parameter shift rule proposed by [22] to attain the gradient $\frac{\partial \enspace \mathrm{Tr}\left({\Pi}\rho \left({\boldsymbol{\theta }}^{\left(k\right)}\right)\right)}{\partial \boldsymbol{\theta }}$ . Concisely, the parameter shift rule iteratively computes each entry of the gradient. Without loss of generality, here we explain how to compute $\frac{\partial \enspace \mathrm{Tr}\left({\Pi}\rho \left({\boldsymbol{\theta }}^{\left(k\right)}\right)\right)}{\partial {\boldsymbol{\theta }}_{j}}$ for j ∈ [2N_F L]. Define ${\boldsymbol{\theta }}_{{\pm}}^{\left(k\right)}$ as

$\begin{equation}{\boldsymbol{\theta }}_{{\pm}}^{\left(k\right)}=\left[{\boldsymbol{\theta }}_{0}^{\left(k\right)},\dots ,{\boldsymbol{\theta }}_{j-1}^{\left(k\right)},{\boldsymbol{\theta }}_{j}^{\left(k\right)}{\pm}\frac{\pi }{2},{\boldsymbol{\theta }}_{j+1}^{\left(k\right)},\dots ,{\boldsymbol{\theta }}_{2{N}_{F}L-1}^{\left(k\right)}\right],\end{equation} \tag{ B.4 }$

where only the jth parameter is rotated by ${\pm}\frac{\pi }{2}$ . Then the mathematical representation of the gradient for the jth entry is

$\begin{equation}\frac{\partial \enspace \mathrm{Tr}\left({\Pi}\rho \left({\boldsymbol{\theta }}^{\left(k\right)}\right)\right)}{\partial {\boldsymbol{\theta }}_{j}}=\frac{\mathrm{Tr}\left({\Pi}\rho \left({\boldsymbol{\theta }}_{+}^{\left(k\right)}\right)\right)-\mathrm{Tr}\left({\Pi}\rho \left({\boldsymbol{\theta }}_{-}^{\left(k\right)}\right)\right)}{2}.\end{equation} \tag{ B.5 }$

In conjunction with equations (B.2), (B.3) and (B.5), the updating rule of GBLS at the tth iteration for the jth entry is

$\begin{equation}{\boldsymbol{\theta }}_{j}^{\left(k+1\right)}={\boldsymbol{\theta }}_{j}^{\left(k\right)}-\eta \frac{\mathrm{Tr}\left({\Pi}\rho \left({\boldsymbol{\theta }}_{+}^{\left(k\right)}\right)\right)-\mathrm{Tr}\left({\Pi}\rho \left({\boldsymbol{\theta }}_{-}^{\left(k\right)}\right)\right)}{2}\enspace \mathrm{sign}\left(\frac{1}{2}-{y}_{k}\right).\end{equation} \tag{ B.6 }$

Appendix C.: More details of numerical simulations

In this section, we provide more details about the numerical simulations. Specifically, we first explain how to construct the employed synthetic dataset. We then elaborate on the implementation of GBLS and referenced classifiers, and their hyper-parameters settings. We next analyze the required circuit depth to implement these quantum classifiers. Last, we introduce the construction of the modified dataset used in the MSE_batch method.

The construction of the synthetic dataset. Given the training example ${\boldsymbol{x}}_{i}=\left({\omega }_{1}^{\left(i\right)},{\omega }_{2}^{\left(i\right)}\right)\in {\mathbb{R}}^{2}$ for all i ∈ [N − 1], the embedding function $g\left({\omega }_{1}^{\left(i\right)},{\omega }_{2}^{\left(i\right)}\right):{\mathbb{R}}^{2}\to {\mathbb{C}}^{4}$ that is used to encode x _i into the quantum states is formulated as

$\begin{equation}g\left({\omega }_{1}^{\left(i\right)},{\omega }_{2}^{\left(i\right)}\right)=\left({R}_{Y}\left(\phi \left({\omega }_{1}^{\left(i\right)},{\omega }_{2}^{\left(i\right)}\right)\right)\otimes {R}_{Y}\left(\phi \left({\omega }_{1}^{\left(i\right)},{\omega }_{2}^{\left(i\right)}\right)\right)\right){\left\vert 0\right\rangle }^{\otimes 2},\end{equation} \tag{ C.1 }$

where $\phi \left({\omega }_{1}^{\left(i\right)},{\omega }_{2}^{\left(i\right)}\right)={\left({\omega }_{1}^{\left(i\right)}-{\omega }_{2}^{\left(i\right)}\right)}^{2}$ is a specified mapping function. The above formulation implies that g( x _i) can be converted to a sequence of quantum operations, where its implementation is illustrated in the upper left panel of figure B2. To simultaneously encode multiple training examples into the quantum states, we should implement g( x _i) as a controlled version, where the implementation is shown in the upper right panel of figure B2.

**Figure B2.** The implementation of GBLS used in numerical simulations. The upper left panel illustrates the circuit implementation of the encoding unitary U_data corresponding to the feature map g( x _i). The lower panel demonstrates the implementation of GBLS given the input ${\mathcal{D}}_{k}=\left\{{\boldsymbol{x}}_{i},{\boldsymbol{x}}_{j},{\boldsymbol{x}}_{k},{\boldsymbol{x}}_{l}\right\}$ , where the implementation of the controlled-g( x _i) quantum operation is shown in the upper right panel.
Download figure:
Standard image High-resolution image

**Figure B2.** The implementation of GBLS used in numerical simulations. The upper left panel illustrates the circuit implementation of the encoding unitary U_data corresponding to the feature map g( x _i). The lower panel demonstrates the implementation of GBLS given the input ${\mathcal{D}}_{k}=\left\{{\boldsymbol{x}}_{i},{\boldsymbol{x}}_{j},{\boldsymbol{x}}_{k},{\boldsymbol{x}}_{l}\right\}$ , where the implementation of the controlled-g( x _i) quantum operation is shown in the upper right panel.
Download figure:
Standard image High-resolution image

The random unitary V ∈ SU(4) used in the numerical simulations is formulated as V = R_Y(ψ₁) ⊗ R_Y(ψ₂), where ψ₁ and ψ₁ are uniformly sampled from [0, 2π).

The details of GBLS, the referenced classifiers, and hyper-parameters setting. The implementation of GBLS is shown the lower panel of figure B2. In particular, the data encoding unitary U_data is composed of a set of controlled-g( x _i) quantum operations. The MPQC introduced in appendix B is employed to build ${U}_{{L}_{1}}\left(\boldsymbol{\theta }\right)$ , where each layer U( θ ^l) is composed of R_Y gates and CZ gates and the layer number is L = 2.

The basic components of the referenced quantum classifiers are identical to those used in GBLS. In particular, for all employed quantum kernel classifiers, the implementation of variational quantum circuits ${U}_{{L}_{1}}\left(\boldsymbol{\theta }\right)$ are the same with GBLS, where the layer number is L = 2 and each layer is composed of R_Y gates and CZ gates as shown in figure B2. The implementation of the encoding unitary U_data depends on the batch size B. For the quantum kernel classifiers with the BCE loss and MSE loss (B = N), following equation (C.1), the encoding unitary is

$\begin{equation}{U}_{\text{data}}={R}_{Y}\left(\phi \left({\omega }_{1}^{\left(i\right)},{\omega }_{2}^{\left(i\right)}\right)\right)\otimes {R}_{Y}\left(\phi \left({\omega }_{1}^{\left(i\right)},{\omega }_{2}^{\left(i\right)}\right)\right).\end{equation} \tag{ C.2 }$

For the quantum kernel classifier with the MSE loss (B = N/4), the implementation of the encoding unitary U_data is the same with GBLS as shown in figure B2.

The detailed hyper-parameters settings for GBLS and the referenced classifiers are as follows. The learning rate for GBLS, the quantum kernel classifier with the BCE loss, the quantum kernel classifier with the MSE loss (B = N and B = N/4) is identical, which is set as η = 1.0. Moreover, when we explore the statistical performance of different quantum classifiers under the noise setting, the random seeds are set as ${\left\{i\right\}}_{i=1}^{R}$ with R being the total number of repetitions.

The analysis of the quantum circuit depth. Here we analyze the required circuit depth to implement quantum kernel classifiers used in numerical simulations. As explained in the above subsection, the quantum kernel classifiers with B = N can be efficiently realized, since the data encoding unitary U_data and the variational quantum circuits only involve single and two qubits gates. In particular, the circuit depth to construct the unitary U_data in equation (C.2) is 1. Moreover, the circuit depth to construct U_L( θ ) as shown in figure B2 is 4. In total, when the number of batches B equals to N, the required depth for the quantum kernel classifier with the BCE or MSE loss is 5.

Compared with the setting B = N, the implementation of the quantum kernel classifier with B = N/4 and GBLS requires a relatively deep circuits. The substantial reason is that the fabrication of the data encoding unitary U_data involves multi-controlled qubits gates as shown in figure B2 (highlighted by the brown region). Specifically, when we decompose the CC–R_Y gate into single-qubit and two-qubit gates, the required circuit depth is 27. Therefore, following figure B2, the circuit depth to implement U_data is 113. Considering that the circuit depth to implement ${U}_{{L}_{1}}$ is 4, the total circuit depth to implement the quantum kernel classifier with B = N/4 is 117. As shown in figure B2, the quantum circuit in GBLS is composed of U_data, ${U}_{{L}_{1}}$ , and U_init. The implementation of U_data and ${U}_{{L}_{1}}$ is identical to the quantum kernel classifier with B = N/4. Moreover, based on Grover-search algorithm, the circuit depth to implement U_init is 15, which includes 4 Hadamard gates and 1 CCZ gate. Therefore, the total circuit depth to implement GBLS is 132.

We remark that the circuit depth of the quantum kernel classifier with B = N/4 and GBLS is dominated by the implementation of U_data, which exploits multi-controlled qubits gates to load different training examples in superposition. Such an observation implies that efficient encoding methods can dramatically reduce the required circuit depth to construct these quantum classifiers. A possible solution is proposed by [37], which constructs a target multi-qubits gate by optimizing a variational quantum circuit which consists of tunable single-qubit gates and fixed two qubits gates.

The modified training dataset for the MSE_batch method. We note that naively employing the original training dataset $\hat{\mathcal{D}}$ to optimize the quantum kernel classifier with the MSE_batch loss is infeasible. Let us illustrate a simple example. Suppose the input state is $\frac{1}{\sqrt{2}}{\sum }_{i=1}^{2}{\left\vert g\left({\boldsymbol{x}}^{\left(i\right)}\right)\right\rangle }_{F}{\left\vert i\right\rangle }_{I}$ with the batch size 2, where the subscript 'I' ('F') refers to the index (feature) register. When the trainable quantum circuits ${U}_{L}\left(\boldsymbol{\theta }\right)\otimes {\mathbb{I}}_{I}$ and the measurement operator are applied to this state, the output corresponds to the averaged predictions of the examples ${\left\{{\boldsymbol{x}}^{\left(i\right)\left. \right)}\right\}}_{i=1}^{2}$ . Such a setting is ill-posed once the labels x ⁽¹⁾ and x ⁽ⁱ⁾ of are opposite, e.g. the former is 0 and the latter is 1, since a wrong prediction (the former is 1 and the latter is 0) also leads to the averaged truth label 0.5.

To conquer the above issue, we build a modified dataset instead of $\hat{\mathcal{D}}$ to optimize the quantum kernel classifier with the MSE_batch loss. Specifically, we shuffle the given dataset $\hat{\mathcal{D}}$ and ensure that for the modified dataset, the training examples in each batch ${\mathcal{B}}_{i}$ for ∀i ∈ [B] must possess the same label. In doing so, the averaged truth label can either be 0 and 1 without any confusion.

Appendix D.: The computational complexity of GBLS and the quantum kernel classifier with the BCE loss

We now separately derive the required number of measurements, or equivalently, the computational complexity, for GBLS and the quantum kernel classifier with the BCE loss at each epoch. For both methods, the hyper-parameters setting is supposed to be identical, i.e. the size of the dataset $\hat{\mathcal{D}}$ is N, the layer number of MPQC ${U}_{{L}_{1}}$ is L, the number of qubits to load data features is N_F, the total number of trainable parameters θ is N_F L, and the number of measurements applied to estimate the quantum expectation value is M.

We say one query when the variational quantum circuit used in the quantum classifier takes the encoded data and then be measured by the measurement operator once. Following the training mechanism of the quantum classifier, its query complexity amounts to counting the total number of measurements to the variational quantum circuits to acquire the gradients in one epoch.

We now derive the required number of measurements of the quantum kernel classifier with the BCE loss in one epoch. Given the dataset $\hat{\mathcal{D}}$ , the BCE loss yields

$\begin{equation}{\mathcal{L}}_{\text{BCE}}=-\frac{1}{N}\sum\limits _{i=0}^{N-1}{y}_{i}\enspace \mathrm{log}\left(p\left({y}_{i}\right)\right)+\left(1-{y}_{i}\right)\mathrm{log}\left(1-p\left({y}_{i}\right)\right),\end{equation} \tag{ D.1 }$

where y_i is the label of the ith example and p(y_i) is the predicted probability of the label y_i, or equivalently, the output of the quantum circuit used in the quantum kernel classifier

$\begin{equation}p\left({y}_{i}\right)=\mathrm{Tr}\left({\Pi}\rho \left(\boldsymbol{\theta }\right)\right),\end{equation} \tag{ D.2 }$

where $\rho \left(\boldsymbol{\theta }\right)={U}_{{L}_{1}}\left(\boldsymbol{\theta }\right)\left\vert g\left({\boldsymbol{x}}_{i}\right)\right\rangle \left\langle g\left({\boldsymbol{x}}_{i}\right)\right\vert {U}_{{L}_{1}}{\left(\boldsymbol{\theta }\right)}^{{\dagger}}$ , ${U}_{{L}_{1}}\left(\boldsymbol{\theta }\right)$ refers to variational quantum circuits defined in equation (B.1), $\left\vert g\left({\boldsymbol{x}}_{i}\right)\right\rangle$ represents the encoded quantum state defined in equation (C.1), and Π is the measurement operator. Following the parameter shift rule, the derivative of BCE loss satisfies

$\begin{equation}\frac{\partial {\mathcal{L}}_{\text{BCE}}}{\partial {\boldsymbol{\theta }}_{j}}=\frac{1}{N}\sum\limits _{i=0}^{N-1}\left(\frac{1-{y}_{i}}{1-p\left({y}_{i}\right)}-\frac{{y}_{i}}{p\left({y}_{i}\right)}\right)\frac{\mathrm{Tr}\left({\Pi}\rho \left({\boldsymbol{\theta }}_{+}\right)\right)-\mathrm{Tr}\left({\Pi}\rho \left({\boldsymbol{\theta }}_{-}\right)\right)}{2},\end{equation} \tag{ D.3 }$

where θ _± is defined in equation (B.4). The above equation implies that to acquire the gradients of the BCE loss, it necessitates to feed the training example one by one to the quantum kernel classifier to estimate p(y_i), and then conduct the classical post-processing to compute the coefficient $\frac{1-{y}_{i}}{1-p\left({y}_{i}\right)}-\frac{{y}_{i}}{p\left({y}_{i}\right)}$ . In other words, the number of batches for this quantum classifier can only be B = N. Since the estimation of p(y_i), Tr(Πρ( θ ₊)), and Tr(Πρ( θ ₋)) are completed by using M measurements, the derivative $\partial {\mathcal{L}}_{\text{BCE}}/\partial {\boldsymbol{\theta }}_{j}$ can be estimated by using 3NM measurements. Considering that there are in total N_F L trainable parameters, the total number of measurements at each epoch for the quantum kernel classifier with the BCE loss is 3NMN_F L.

Unlike the quantum kernel classifier with the BCE loss, GBLS uses a simple loss function $\mathcal{L}$ defined in equation (14), which allows us to efficiently acquire the gradient $\partial \mathcal{L}/\partial {\boldsymbol{\theta }}_{j}$ by leveraging the superposition property. Recall equation (B.6). The gradient of GBLS satisfies

$\begin{equation*}\frac{\partial \mathcal{L}\left(\boldsymbol{\theta },{\mathcal{D}}_{k}\right)}{\partial {\boldsymbol{\theta }}_{j}}=\frac{\mathrm{Tr}\left({\Pi}\rho \left({\boldsymbol{\theta }}_{+}^{\left(k\right)}\right)\right)-\mathrm{Tr}\left({\Pi}\rho \left({\boldsymbol{\theta }}_{-}^{\left(k\right)}\right)\right)}{2}\enspace \mathrm{sign}\left(\frac{1}{2}-{y}_{k}\right),\end{equation*}$

where y_k refers to the label of the last pair in the extended training example ${\mathcal{D}}_{k}$ . The above equation indicates that the gradient for ${\mathcal{D}}_{k}$ , which contains K training examples in $\hat{\mathcal{D}}$ , can be estimated by using 2M measurements, where the first (last) M measurements aim to approximate $\mathrm{Tr}\left({\Pi}\rho \left({\boldsymbol{\theta }}_{-}^{\left(k\right)}\right)\right)$ ( $\mathrm{Tr}\left({\Pi}\rho \left({\boldsymbol{\theta }}_{+}^{\left(k\right)}\right)\right)$ ). Therefore, the total number of measurements to collect $\left\{\frac{\partial \mathcal{L}\left(\boldsymbol{\theta },{\mathcal{D}}_{k}\right)}{\partial {\boldsymbol{\theta }}_{j}}\right\}$ for all possible ${\mathcal{D}}_{k}$ is 2MB = 2MN/K. Considering that there are in total N_F L trainable parameters, the query complexity at each epoch for GBLS is 2N_F LMN/K. Note that when K → N, the required number of measurements of GBLS can be dramatically reduced.

To ease of understanding, let us illustrate an intuitive example. Define two extended training examples, where the first one includes all positive examples in $\mathcal{D}$ and one negative example, and the second one includes all negative examples in $\mathcal{D}$ and one positive example. Since these two extended examples cover the whole dataset $\mathcal{D}$ , when GBLS uses these two examples to update θ , it completes one epoch. Celebrated by the simple form of $\mathcal{L}$ , the number of measurements to estimate the gradients for the jth entry θ _j given these two extended examples is O(1). Considering there are in total O(N_F L) trainable parameters, the total number of measurements at each epoch for GBLS is O(LN_F).

A Grover-search based quantum learning scheme for classification

Article metrics

Author e-mails

Author affiliations

Author notes

ORCID iDs

Dates

Abstract

1. Introduction

2. Grover-search based learning scheme

2.1. Implementation

2.2. Prediction

2.3. Potential advantage of GBLS

3. Numerical experiments

4. Discussion and conclusion

Acknowledgments

Data availability statement

Appendix A.: Proof of theorem 1

Appendix B.: Variational quantum circuits and the optimizing method

Appendix C.: More details of numerical simulations

Appendix D.: The computational complexity of GBLS and the quantum kernel classifier with the BCE loss

A Grover-search based quantum learning scheme for classification

Article metrics

Share this article

Author e-mails

Author affiliations

Author notes

ORCID iDs

Dates

Abstract

1. Introduction

2. Grover-search based learning scheme

2.1. Implementation

2.2. Prediction

2.3. Potential advantage of GBLS

3. Numerical experiments

4. Discussion and conclusion

Acknowledgments

Data availability statement

Appendix A.: Proof of theorem 1

Appendix B.: Variational quantum circuits and the optimizing method

Appendix C.: More details of numerical simulations

Appendix D.: The computational complexity of GBLS and the quantum kernel classifier with the BCE loss