Compact quantum kernel-based binary classifier

Carsten Blank; Adenilton J da Silva; Lucas P de Albuquerque; Francesco Petruccione; Daniel K Park

doi:10.1088/2058-9565/ac7ba3

1. Introduction

As the quest for fault-tolerant quantum computers continues, noisy intermediate-scale quantum (NISQ) computers are expected to be available in the near future [1], supported by the recent technological advances in quantum computing [2–9]. Although the size of quantum circuits that NISQ devices can execute reliably is limited, the size of the quantum state space efficiently manipulated by them is much beyond what classical computers can handle. As such an interesting era is within reach, an important task in the quantum computing community is finding commercially relevant applications of the NISQ technology for which quantum advantage can be demonstrated in the near future.

Machine learning is considered a promising domain in which quantum computing can shine [10–15]. Quantum advantages in machine learning are expected, since quantum computers can in principle store and manipulate an amount of classical information that scales exponentially with the number of qubits [16–18]. Moreover, quantum computers can reduce the computational cost exponentially for solving certain basic linear algebra problems [19, 20] that often appear as subroutines in machine learning tasks, such as support vector machines [21] and principal component analysis [22]. However, the size of quantum circuits required for implementing basic linear algebra subroutines on a quantum computer is too large for near-term quantum devices.

Several quantum machine learning algorithms have been proposed to perform kernel-based classification by exploiting the ability of quantum computers to efficiently evaluate inner products in an exponentially large Hilbert space [15], without relying on expensive subroutines [21, 23–27]. In particular, references [25, 26] proposed the simplest quantum circuit for utilizing the quantum interference effect, now known as the Hadamard-test classifier (HTC), and showed that it can be used as a simple model of a kernel-based classifier for real-valued data. If an efficient state preparation routine is known, the algorithm achieves logarithmic scaling in the dimension and number of the input data with a simple setup. Namely, given a quantum state that encodes the classical data in a specific form, the algorithm only uses a Hadamard gate and the expectation measurement of a two-qubit observable to complete the labeling task. Furthermore, the algorithm is agnostic to the quantum data encoding method, such as amplitude encoding [12] and quantum feature mapping [24].

In this work, we present a kernel-based quantum binary classifier that is even simpler than HTC by introducing compact amplitude encoding (CAE) of real-valued data, which reduces the number of training steps linearly and the number of qubits by two. In CAE, training data belonging to one class is encoded as the real part of the probability amplitudes, while the remaining training data is encoded as the imaginary part. In order to utilize CAE, we show that the single-qubit interfering circuit of the HTC can be generalized to take the imaginary part of the quantum state into account. In this way, the label information encoding is not explicitly executed on a quantum circuit and two sets of data are encoded in a single quantum register, thereby eliminating the state preparation subroutine for preparing the label registers and reducing the number of index qubits. Furthermore, our classifier provides a simple method for assigning arbitrary weights to two training data sets with different labels, which broadens the application of our method to imbalanced data sets in which the number of training data points in two classes are unequal. The CAE also localizes data so that the entanglement is reduced compared to the HTC. Although this finding is only on a numerical basis, we suspect that this is a general feature. This could lead to better performance of the classifier in the NISQ era since less entanglement implies reduced circuit complexity [28–31].

The remainder of the paper is organized as follows. Section 2 provides theoretical backgrounds and reviews for this study, such as the description of the classification problem and the existing kernel-based quantum classifier that has been known to be the simplest. Section 3 explains the main classification algorithm proposed in this work. In section 4, we carry out the entanglement analysis with numerical simulations of classification on Iris and Wine data sets as examples. The simulation results show that the classifier proposed in this work is more compact than the HTC with respect to the amount of entanglement the circuit produces. Section 5 concludes and discusses future research directions.

2. Quantum kernel-based classifier

2.1. Binary classification

Classification is a canonical pattern recognition problem that aims to label a data point $\tilde{x}\in {\mathbb{R}}^{N}$ as accurately as possible given a labeled (or training) data set

$\begin{equation*}\mathcal{D}=\left\{({x}_{0},{y}_{0}),\dots ,({x}_{M-1},{y}_{M-1})\right\}\subset {\mathbb{R}}^{N}\times \left\{0,1,\dots ,L-1\right\}.\end{equation*}$

Since the training data set includes labels, this task can be addressed via a supervised machine learning technique. Among many machine learning approaches, a kernel method provides a straightforward interpretation of the classification process. It is based on choosing a feature space for the data set such that the classification score is defined as a linear function of the similarity (i.e. kernel) between each training data and the test data. This principle naturally connects quantum computing to kernel-based classification since the quantum Hilbert space can be used as the data feature space with a proper definition of the similarity [15, 24].

In this work, we focus on real-valued data as is common in practical machine learning tasks. Moreover, we focus on binary classification (i.e. L = 2) since a multi-class classification can be constructed with binary classifiers by one versus all or one versus one scheme [32]. Note that in binary classification, the two class labels are often denoted as ±1.

2.2. Amplitude encoding

The first step towards utilizing the quantum Hilbert space as the feature space is encoding classical data as a quantum state. Although the best encoding strategy remains an open problem, many previous quantum machine learning algorithms chose to represent a classical vector ${\mathbf{x}}_{j}={({x}_{0j},\dots ,{x}_{(N-1)j})}^{\mathrm{\top }}\in {\mathbb{R}}^{N}$ as probability amplitudes of a quantum state in the following form [12, 21, 22, 25, 33],

$\begin{equation}\vert {\mathbf{x}}_{j}\rangle {:=}\sum\limits _{i=0}^{N-1}{x}_{ij}\vert i\rangle ,\end{equation} \tag{ 1 }$

using ⌈log₂(N)⌉ qubits, where the input vector is normalized to have unit length, i.e. ||x_j|| = 1. The above form of data encoding is often called amplitude encoding, and it can be generalized for a set of M data points x₀, ..., x_{M−1} as

$\begin{equation}\frac{1}{\sqrt{M}}\,\sum\limits _{j=0}^{M-1}\sum\limits _{i=0}^{N-1}{x}_{ij}\vert i\rangle \otimes \vert j\rangle ,\end{equation} \tag{ 2 }$

which uses at least ⌈log₂(NM)⌉ qubits.

Hereinafter, we will omit the Kronecker product symbol (⊗) and write |i⟩ ⊗ |j⟩ = |ij⟩ whenever the meaning is clear.
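The encodings in equations (1) and (2) can be illustrated with a short NumPy sketch. This is only a classical computation of the target amplitudes, not a state preparation circuit. The function names `amplitude_encode` and `encode_dataset` are hypothetical, and the zero-padding to a power-of-two dimension is an assumption made here so that the vector fits into ⌈log₂(N)⌉ qubits.

```python
import numpy as np

def amplitude_encode(x):
    """Amplitudes of |x> as in equation (1): x is zero-padded to the next
    power of two (an assumption of this sketch) and rescaled to unit length."""
    x = np.asarray(x, dtype=float)
    dim = 1 << (len(x) - 1).bit_length()   # smallest power of two >= len(x)
    padded = np.zeros(dim)
    padded[: len(x)] = x
    return padded / np.linalg.norm(padded)

def encode_dataset(X):
    """Amplitudes of equation (2), ordered as |i>|j> (feature index i first,
    data index j second), for the rows x_j of X."""
    columns = np.stack([amplitude_encode(x) for x in X], axis=1)  # shape (N, M)
    return columns.reshape(-1) / np.sqrt(columns.shape[1])

X = [[3.0, 4.0], [1.0, 0.0]]
psi = encode_dataset(X)
print(psi, np.linalg.norm(psi))
```

The flattened vector has NM entries, which matches the ⌈log₂(NM)⌉ qubit count quoted above when N and M are powers of two.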

2.3. Hadamard-test classifier

A simple model of quantum classifier for real-valued data was introduced in reference [25], and is nowadays referred to as the HTC. This work presents improvements of the HTC in several aspects, and hence this section briefly reviews the algorithm's structure.

For the HTC, the data set $\mathcal{D}$ is encoded in a quantum state as

$\begin{equation}\vert \psi \rangle =\frac{1}{\sqrt{2}}\,\sum\limits _{j=0}^{M-1}\sqrt{{a}_{j}}\left(\vert 0\rangle \vert {\mathbf{x}}_{j}\rangle +\vert 1\rangle \vert \tilde{\mathbf{x}}\rangle \right)\vert {y}_{j}\rangle \vert j\rangle ,\end{equation} \tag{ 3 }$

where |x_j⟩ and $\vert \tilde{\mathbf{x}}\rangle$ are quantum representatives of the classical training and test data vectors obtained with an encoding of choice (a feature map). The label y_j ∈ {−1, +1} is transformed to the computational basis of the label qubit with the rule y_j → |(1 − y_j)/2⟩ ∈ {|0⟩, |1⟩}. In the original paper [25], the weights were chosen to be uniform, i.e. a_j = 1/M ∀ j, while it has been shown [27] that they can be treated as variables to be optimized, similar to the treatment in support vector machines. In essence, the HTC uses a Hadamard test as its measurement scheme: a Hadamard gate is applied to the ancillary qubit in order to interfere the training and test data states. The procedure is completed by measuring the expectation value of a two-qubit observable ${\sigma }_{z}^{(\mathrm{a})}{\sigma }_{z}^{(\mathrm{l})}$ on the ancillary and label qubits, leading to

$\begin{equation}\langle \psi \vert {H}^{(\mathrm{a})}{\sigma }_{z}^{(\mathrm{a})}{\sigma }_{z}^{(\mathrm{l})}{H}^{(\mathrm{a})}\vert \psi \rangle =\sum\limits _{j=0}^{M-1}{a}_{j}{y}_{j}\langle \tilde{\mathbf{x}}\vert {\mathbf{x}}_{j}\rangle .\end{equation} \tag{ 4 }$

The superscripts a and l are the qubits on which the corresponding operator is applied, namely the ancillary and the label qubit, respectively. The connection between the HTC and the field of kernel methods is based on this equation. One can see that the kernel function is $k({\mathbf{x}}_{j},\tilde{\mathbf{x}})=\langle {\mathbf{x}}_{j}\vert \tilde{\mathbf{x}}\rangle$ . Consequently, the right-hand side of equation (4) defines the classification score, which is denoted by f. Therefore, HTC assigns a new label to the test data by the rule

$\begin{equation}\tilde{y}=\text{sgn}\left[\sum\limits _{j=0}^{M-1}{a}_{j}{y}_{j}\langle \tilde{\mathbf{x}}\vert {\mathbf{x}}_{j}\rangle \right].\end{equation} \tag{ 5 }$
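As a sanity check of equations (3) to (5), the following NumPy sketch builds the HTC state as a dense tensor, applies the Hadamard gate to the ancilla, and evaluates the two-qubit expectation value. It is a classical statevector simulation under the assumption of unit-norm real data and plain amplitude encoding; the function name `htc_expectation` and the tensor layout are choices made for this illustration only.

```python
import numpy as np

def htc_expectation(X, y, x_test, a=None):
    """<sigma_z^(a) sigma_z^(l)> after the Hadamard gate, for the state of
    equation (3). Rows of X and x_test are assumed to have unit norm."""
    X = np.asarray(X, dtype=float)
    x_test = np.asarray(x_test, dtype=float)
    M, N = X.shape
    a = np.full(M, 1.0 / M) if a is None else np.asarray(a, dtype=float)
    # psi[ancilla, data, label, index] holds the amplitudes of equation (3)
    psi = np.zeros((2, N, 2, M))
    for j in range(M):
        label = int((1 - y[j]) // 2)          # y = +1 -> |0>, y = -1 -> |1>
        psi[0, :, label, j] = np.sqrt(a[j] / 2) * X[j]
        psi[1, :, label, j] = np.sqrt(a[j] / 2) * x_test
    hadamard = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
    psi = np.einsum("ab,bilj->ailj", hadamard, psi)
    prob = psi ** 2                            # amplitudes are real here
    z = np.array([1.0, -1.0])                  # eigenvalues of sigma_z
    return float(np.einsum("ailj,a,l->", prob, z, z))

def unit(v):
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
X = np.array([unit(v) for v in rng.normal(size=(4, 2))])
y = np.array([1, 1, -1, -1])
x_test = unit(rng.normal(size=2))
score = htc_expectation(X, y, x_test)
kernel_sum = sum(y[j] * (x_test @ X[j]) / 4 for j in range(4))
print(np.isclose(score, kernel_sum), int(np.sign(score)))
```

The simulated expectation value agrees with the weighted kernel sum on the right-hand side of equation (4), and its sign is the predicted label of equation (5).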

3. Compact classifier

3.1. Generalization

The generalization of the interference circuit of the HTC was discussed in reference [34] as a means to show that the Hadamard gate is the optimal choice for minimizing the number of samples. In this work, we take a step further and utilize the generalized interference circuit to reduce the quantum circuit cost of the HTC.

The generalized interference circuit for the HTC uses R_z(ϕ)R_y(θ₀) in place of the first Hadamard gate, and uses R_y(θ₁) in place of the last Hadamard gate, where R_z(ϕ) = cos(ϕ/2)I − i sin(ϕ/2)σ_z and R_y(θ) = cos(θ/2)I − i sin(θ/2)σ_y are the single-qubit rotation gates. As shown in reference [34], the two-qubit expectation value measured in the generalized interference circuit is

$\begin{align}\langle {\sigma }_{z}^{(\mathrm{a})}{\sigma }_{z}^{(\mathrm{l})}\rangle =\;& \sum\limits _{j=0}^{M-1}{a}_{j}{y}_{j}(\mathrm{cos}({\theta }_{0})\mathrm{cos}({\theta }_{1})-\mathrm{sin}({\theta }_{0})\mathrm{sin}({\theta }_{1})\\ & \times (\mathrm{cos}(\phi )\text{Re}\langle \tilde{\mathbf{x}}\vert {\mathbf{x}}_{j}\rangle -\mathrm{sin}(\phi )\text{Im}\langle \tilde{\mathbf{x}}\vert {\mathbf{x}}_{j}\rangle )).\end{align} \tag{ 6 }$

While the original HTC only uses the real part of the inner product (i.e., $\text{Re}\langle \tilde{\mathbf{x}}\vert {\mathbf{x}}_{j}\rangle$ ) in the classification algorithm, the generalized interference circuit opens up the possibility of harnessing the imaginary part as well. Since the goal of this paper is to utilize the imaginary part, we set θ₀ = π/2 and θ₁ = −π/2 for simplicity. In this case, equation (6) becomes

$\begin{equation}\langle {\sigma }_{z}^{(\mathrm{a})}{\sigma }_{z}^{(\mathrm{l})}\rangle =\sum\limits _{j=0}^{M-1}{a}_{j}{y}_{j}(\mathrm{cos}(\phi )\text{Re}\langle \tilde{\mathbf{x}}\vert {\mathbf{x}}_{j}\rangle -\mathrm{sin}(\phi )\text{Im}\langle \tilde{\mathbf{x}}\vert {\mathbf{x}}_{j}\rangle ).\end{equation} \tag{ 7 }$

The above result can also be obtained by applying a single-qubit rotation gate R_z(ϕ) to the ancilla qubit of the HTC in equation (3).
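As a brief worked step, the reduction from equation (6) to equation (7) can be checked by substituting θ₀ = π/2 and θ₁ = −π/2 into the two trigonometric prefactors:

$\begin{equation*}\mathrm{cos}(\pi /2)\,\mathrm{cos}(-\pi /2)=0,\qquad \mathrm{sin}(\pi /2)\,\mathrm{sin}(-\pi /2)=-1.\end{equation*}$

The first term of equation (6) therefore vanishes, and the prefactor of the second term becomes −(−1) = +1, which leaves exactly the ϕ-dependent combination of the real and imaginary parts in equation (7).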

3.2. Compact amplitude encoding

The main idea in this work is based on the observation that if cos(ϕ) and sin(ϕ) in equation (7) have the same sign, then the real and imaginary parts of the state overlap contribute with opposite signs. Thus, by encoding the training data with label +1 (−1) in the real (imaginary) part of the probability amplitudes of the quantum state, the binary classification can be done without explicitly preparing the label register. This leads to reducing the number of qubits by one. Note that when the number of training data vectors in the two classes is the same, $\mathrm{cos}(\phi )=\mathrm{sin}(\phi )=1/\sqrt{2}$ . In section 3.3, we discuss how to control ϕ for imbalanced data sets.

With this background, the CAE is introduced to utilize the imaginary part as follows. In CAE, two N-dimensional real vectors ${\mathbf{x}}_{k}^{+}={({x}_{0k}^{+},\dots ,{x}_{(N-1)k}^{+})}^{\mathrm{\top }}$ and ${\mathbf{x}}_{k}^{-}={({x}_{0k}^{-},\dots ,{x}_{(N-1)k}^{-})}^{\mathrm{\top }}$ with the corresponding labels indicated by the superscript (±) are loaded in a quantum state in the following form,

$\begin{equation}{\vert {\mathbf{x}}_{k}\rangle }_{\mathrm{c}}{:=}\sum\limits _{j=0}^{N-1}({x}_{jk}^{+}+\mathrm{i}{x}_{jk}^{-})\vert j\rangle ,\end{equation} \tag{ 8 }$

where ${\Vert}{\mathbf{x}}_{k}^{+}{{\Vert}}^{2}+{\Vert}{\mathbf{x}}_{k}^{-}{{\Vert}}^{2}=1$ to satisfy the normalization condition. Note that various scaling methods can be employed to satisfy this condition, and one of the natural ways is to scale each vector so that ${\Vert}{\mathbf{x}}_{k}^{\pm }{\Vert}=1/\sqrt{2}$ . We implicitly assume this way of normalizing vectors unless stated otherwise. The above equation shows that two N-dimensional vectors are encoded in ⌈log₂(N)⌉ qubits. The subscript c in equation (8) distinguishes the quantum state from amplitude encoding, and indicates that the classical vector is encoded via CAE. It is important to note that |⋅⟩_c means two data points are encoded in the state vector, whereas the ordinary ket vector without the subscript c means one data point is encoded. In addition, we define two more states as

$\begin{equation}\vert {\mathbf{x}}_{k}^{\pm }\rangle {:=}\frac{1}{{\Vert}{\mathbf{x}}_{k}^{\pm }{\Vert}}\,\sum\limits _{j=0}^{N-1}{x}_{jk}^{\pm }\vert j\rangle .\end{equation} \tag{ 9 }$

Then the state overlap between two quantum registers, one encoding an N-dimensional real vector $\tilde{\mathbf{x}}$ via amplitude encoding (equation (1)) and the other encoding two N-dimensional real vectors ${\mathbf{x}}_{k}^{+}$ and ${\mathbf{x}}_{k}^{-}$ via CAE (equation (8)), is

$\begin{equation}{\langle \tilde{\mathbf{x}}\vert {\mathbf{x}}_{k}\rangle }_{\mathrm{c}}=\frac{1}{\sqrt{2}}(\langle \tilde{\mathbf{x}}\vert {\mathbf{x}}_{k}^{+}\rangle +\mathrm{i}\langle \tilde{\mathbf{x}}\vert {\mathbf{x}}_{k}^{-}\rangle ).\end{equation} \tag{ 10 }$
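Equations (8) to (10) can be verified numerically with a small sketch. It assumes the normalization used in the text, where each class vector is scaled to norm 1/√2; the helper names `unit` and `cae_state` and the example vectors are hypothetical.

```python
import numpy as np

def unit(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def cae_state(x_plus, x_minus):
    """Amplitudes of |x_k>_c in equation (8), with each class vector scaled
    to norm 1/sqrt(2) as assumed in the text."""
    return (unit(x_plus) + 1j * unit(x_minus)) / np.sqrt(2)

x_plus, x_minus = [1.0, 2.0], [2.0, -1.0]
x_test = unit([1.0, 1.0])
state = cae_state(x_plus, x_minus)

# Left-hand side of equation (10): <x_test | x_k>_c
overlap = np.vdot(x_test, state)
# Right-hand side of equation (10), built from the states of equation (9)
expected = (x_test @ unit(x_plus) + 1j * (x_test @ unit(x_minus))) / np.sqrt(2)
print(np.isclose(overlap, expected), np.linalg.norm(state))
```

The real part of the overlap carries the kernel with the class +1 vector and the imaginary part carries the kernel with the class −1 vector, which is what the classifier in the next section exploits.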

3.3. Compact quantum binary classifier

The compact quantum machine learning algorithm that implements the binary classifier from equation (5) is constructed as follows. We first prepare an initial state that encodes the data set $\mathcal{D}$ as

$\begin{equation}\vert {\psi }_{i}\rangle =\frac{1}{\sqrt{2}}\,\sum\limits _{j=0}^{\frac{M}{2}-1}\sqrt{{b}_{j}}\left(\vert 0\rangle {\vert {\mathbf{x}}_{j}\rangle }_{\mathrm{c}}+{\text{e}}^{-\text{i}\phi }\vert 1\rangle \vert \tilde{\mathbf{x}}\rangle \right)\vert j\rangle ,\end{equation} \tag{ 11 }$

where the subscript c indicates that the state is prepared via CAE. It is important to note that ${\sum }_{j=0}^{M/2-1}{\,b}_{j}=1$ , and hence the set of weights is different from that of the HTC which satisfies ${\sum }_{j=0}^{M-1}{\,a}_{j}=1$ . For example, for a set of uniform weights, b_j = 2/M and a_j = 1/M. For simplicity, we assume that the number of data with label +1 denoted by M₊ is equal to the number of data with label −1 denoted by M₋. The state above is easier to prepare than the state required in HTC shown in equation (3) since the label information is not explicitly encoded and the relative phase e^−iϕ can be added by applying a single-qubit rotation gate R_z(ϕ) on the ancilla qubit. Moreover, since two training data are encoded in one quantum register, the number of terms in the summation is decreased by a factor of 2, meaning that the dimension of the index register is also decreased by a factor of 2. After the state preparation, the rest of the algorithm only requires a Hadamard gate and the measurement of the ancilla qubit in the σ_z basis. The Hadamard gate interferes the copies of the new input and the training inputs to produce a state

$\begin{equation}\vert {\psi }_{\mathrm{f}}\rangle =\frac{1}{2}\sum\limits _{j=0}^{\frac{M}{2}-1}\sqrt{{b}_{j}}[\vert 0\rangle \hspace{1pt}({\vert {\mathbf{x}}_{j}\rangle }_{\mathrm{c}}+{\text{e}}^{-\text{i}\phi }\vert \tilde{\mathbf{x}}\rangle )+\vert 1\rangle ({\vert {\mathbf{x}}_{j}\rangle }_{\mathrm{c}}-{\text{e}}^{-\text{i}\phi }\vert \tilde{\mathbf{x}}\rangle )]\vert j\rangle .\end{equation} \tag{ 12 }$

The probability to measure the ancilla qubit in |0⟩ is given as

$\begin{equation*}\mathrm{P}\mathrm{r}(0)=\frac{1}{4}\sum\limits _{j=0}^{\frac{M}{2}-1}{b}_{j}(2+{\text{e}}^{-\text{i}\phi }{}_{c}\langle {\mathbf{x}}_{j}\vert \tilde{\mathbf{x}}\rangle +{\text{e}}^{\text{i}\phi }{\langle \tilde{\mathbf{x}}\vert {\mathbf{x}}_{j}\rangle }_{\mathrm{c}}).\end{equation*}$

By denoting ${\kappa }_{j}={\langle \tilde{\mathbf{x}}\vert {\mathbf{x}}_{j}\rangle }_{\mathrm{c}}$ , the above equation can be written as

$\begin{equation}\mathrm{Pr}(0)=\frac{1}{2}\sum\limits _{j=0}^{\frac{M}{2}-1}{b}_{j}\left(1+\mathrm{cos}(\phi )\text{Re}({\kappa }_{j})-\mathrm{sin}(\phi )\text{Im}({\kappa }_{j})\right).\end{equation} \tag{ 13 }$

Similarly, the probability to measure the ancilla qubit in |1⟩ can be calculated as

$\begin{equation}\mathrm{Pr}(1)=\frac{1}{2}\sum\limits _{j=0}^{\frac{M}{2}-1}{b}_{j}\left(1-\mathrm{cos}(\phi )\text{Re}({\kappa }_{j})+\mathrm{sin}(\phi )\text{Im}({\kappa }_{j})\right).\end{equation} \tag{ 14 }$

Therefore, the expectation value of the σ_z operator measured on the ancilla qubit is

$\begin{equation}\langle {\sigma }_{z}^{(\mathrm{a})}\rangle =\sum\limits _{j=0}^{\frac{M}{2}-1}{b}_{j}\left(\mathrm{cos}(\phi )\text{Re}({\kappa }_{j})-\mathrm{sin}(\phi )\text{Im}({\kappa }_{j})\right).\end{equation} \tag{ 15 }$

By setting $\mathrm{cos}(\phi )=\mathrm{sin}(\phi )=1/\sqrt{2}$ and using equation (10), we arrive at

$\begin{equation}\langle {\sigma }_{z}^{(\mathrm{a})}\rangle =\frac{1}{2}\sum\limits _{j=0}^{\frac{M}{2}-1}{b}_{j}\left(\langle \tilde{\mathbf{x}}\vert {\mathbf{x}}_{j}^{+}\rangle -\langle \tilde{\mathbf{x}}\vert {\mathbf{x}}_{j}^{-}\rangle \right).\end{equation} \tag{ 16 }$

Since the state overlap for the training data in class +1 (−1) always has the positive (negative) sign, the above equation can be written in its final form as

$\begin{equation}\langle {\sigma }_{z}^{(\mathrm{a})}\rangle =\frac{1}{2}\sum\limits _{j=0}^{M-1}{b}_{j}^{\prime }{y}_{j}\langle \tilde{\mathbf{x}}\vert {\mathbf{x}}_{j}\rangle ,\end{equation} \tag{ 17 }$

where ${b}_{m}^{\prime }={b}_{m+M/2}^{\prime }={b}_{m}$ for m = 0, ..., M/2 − 1 and ${\sum }_{j=0}^{M-1}{b}_{j}^{\prime }=2$ . This outcome is similar to the classification score in the Hadamard classifier obtained by the two-qubit measurement. The constant factor of 1/2 is attributed to the different normalization conditions between the set of weights b_j and a_j as described below equation (11). We refer to this classifier as compact Hadamard classifier (CHC). The CHC is obtained with a single-qubit measurement implying that one can bypass the standardization of the training data set required in HTC for increasing the post-selection probability. The comparison of quantum circuits for implementing HTC and CHC is depicted in figure 1.
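The derivation from equation (11) to equation (16) can be reproduced with a classical statevector sketch. The code below is an illustration under stated assumptions, not the implementation used for the reported experiments: data are real, each class vector is scaled to norm 1/√2 inside the CAE state, weights are uniform unless given, and the function name `chc_expectation` is hypothetical.

```python
import numpy as np

def unit(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def chc_expectation(X_plus, X_minus, x_test, phi=np.pi / 4, b=None):
    """<sigma_z> on the ancilla after the Hadamard gate, for the state of
    equation (11). X_plus[j] and X_minus[j] form the j-th CAE pair."""
    K, N = np.shape(X_plus)                    # K = M/2 pairs, N features
    b = np.full(K, 1.0 / K) if b is None else np.asarray(b, dtype=float)
    x_test = unit(x_test)
    psi = np.zeros((2, N, K), dtype=complex)   # psi[ancilla, data, index]
    for j in range(K):
        cae = (unit(X_plus[j]) + 1j * unit(X_minus[j])) / np.sqrt(2)
        psi[0, :, j] = np.sqrt(b[j] / 2) * cae
        psi[1, :, j] = np.sqrt(b[j] / 2) * np.exp(-1j * phi) * x_test
    hadamard = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
    psi = np.einsum("ab,bij->aij", hadamard, psi)
    prob = np.abs(psi) ** 2
    return float(prob[0].sum() - prob[1].sum())   # Pr(0) - Pr(1)

rng = np.random.default_rng(1)
X_plus, X_minus = rng.normal(size=(2, 4)), rng.normal(size=(2, 4))
x_test = unit(rng.normal(size=4))
value = chc_expectation(X_plus, X_minus, x_test)
# Right-hand side of equation (16) with uniform weights b_j = 2/M = 1/2
rhs = 0.5 * sum(0.5 * (x_test @ unit(p) - x_test @ unit(m))
                for p, m in zip(X_plus, X_minus))
print(np.isclose(value, rhs))
```

Only the ancilla is measured in this sketch, in contrast to the two-qubit observable of the HTC, and no label register appears in the state tensor.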

**Figure 1.** The comparison of quantum circuits for the HTC (left) and CHC (right). The half-filled circles indicate that the unitary operation is the uniformly controlled gate [35, 36]. For the HTC, the filled rectangle connected with a vertical line to X applied to the label qubit indicates that the number of control qubits depends on the imbalance in the label proportion.

The reduction of quantum circuit sizes achieved with CHC, which is of critical importance in practice, is as follows. By introducing a controllable relative phase between two computational basis states of the ancilla qubit and using CAE, the number of qubits is reduced by two, one for the label register and another in the index register. Then the quantum operations for encoding the training data set are reduced from M gates controlled by log₂(M) index qubits to M/2 gates controlled by log₂(M) − 1 index qubits. Having one fewer control qubit can further reduce the quantum circuit depth by a factor of two [37]. Therefore, the number of operations for encoding the training data set x_j is reduced by a factor of four. The number of operations in CHC is also reduced due to the removal of operations for explicitly encoding the label information in a separate register. The number of gates needed for preparing the label register in HTC is 1 when the number of data belonging to each class is equal, since one controlled-NOT gate controlled by one of the index qubits and applied to the label qubit can split the Hilbert space into two subspaces with an equal number of 0's and 1's in the label qubit. But the number of operations increases linearly with the difference in the number of data with different labels. For example, if αM and βM data belong to class +1 and −1, respectively, where α + β = 1, then the number of operations necessary to encode label information grows as |α − β|M. Therefore, in total, our compact classifier can reduce the number of operations at least by a factor of four, while the reduction can be larger depending on the label distribution of the given training data set. Furthermore, the two-qubit measurement scheme used in HTC is reduced to single-qubit measurement.
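The qubit bookkeeping described above can be written out as a small arithmetic sketch. It assumes balanced classes and a number of training points M that is a power of two; the function name `qubit_counts` is hypothetical, and the counts follow the register layout in the text (one ancilla, ⌈log₂(N)⌉ data qubits, a label qubit for the HTC only, and the index register).

```python
import math

def qubit_counts(n_features, n_training):
    """Qubits used by HTC and CHC, assuming n_training is a power of two and
    the two classes are balanced."""
    data = math.ceil(math.log2(n_features))    # amplitude-encoded data register
    index = int(math.log2(n_training))         # index register of the HTC
    htc = 1 + data + 1 + index                 # ancilla + data + label + index
    chc = 1 + data + (index - 1)               # no label qubit, one fewer index qubit
    return htc, chc

print(qubit_counts(4, 8))    # four features, eight training points
print(qubit_counts(13, 4))   # thirteen features, four training points
```

In both cases the CHC count is two below the HTC count, in line with the reduction stated in the paragraph above.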

As elucidated above, the number of training data vectors in two classes may be different. In this case, the real or imaginary part of the quantum state is zero for the missing data. The difference in the number of training data vectors in two classes can later be compensated by controlling the weights between the real and imaginary parts of the state overlap by finding ϕ that satisfies

$\begin{equation*}\frac{\mathrm{sin}(\phi )}{\mathrm{cos}(\phi )}=\frac{{M}_{-}}{{M}_{+}},\end{equation*}$

where M_± is the number of training data points in ±1 class.
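A minimal sketch of how the phase could be computed from the class sizes, using only the relation sin(ϕ)/cos(ϕ) = M₋/M₊ stated above. The function name `imbalance_phase` is hypothetical, and restricting ϕ to the interval [0, π/2] so that cos(ϕ) and sin(ϕ) share the same sign is a choice made for this illustration.

```python
import math

def imbalance_phase(m_plus, m_minus):
    """Angle phi with sin(phi)/cos(phi) = M_- / M_+, taken in [0, pi/2]."""
    return math.atan2(m_minus, m_plus)

phi = imbalance_phase(30, 10)
print(phi, math.cos(phi), math.sin(phi))
print(imbalance_phase(5, 5))   # balanced case
```

For equal class sizes the function returns ϕ = π/4, which recovers the balanced setting cos(ϕ) = sin(ϕ) = 1/√2 used in equation (16).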

Table 1 presents the classification accuracy of proof-of-concept experiments. We divided Iris and Wine datasets into training and test sets with a ratio of 80/20 and two classes and used the second and third features of the Iris dataset and the two principal components of the Wine dataset. We performed a search among 40 combinations of four training patterns (two from each class) and selected the combination with the best training accuracy. We used these four selected training points to evaluate the test accuracy. The use of two features and four training points allowed us to perform experiments in a small-scale quantum device. The accuracies and standard deviations in table 1 are obtained from 30 repetitions of the random division into training and test sets. The experimental results are from ibmq_manila, a quantum computer with five superconducting qubits available on the IBM quantum cloud service.

Table 1. Classification accuracy for the compact classifier on the Iris and Wine datasets for 30 random separations obtained by simulation (sim) and experiment (exp). We used ibmq_manila, a quantum computer made available by IBM via cloud access, to obtain the experimental (exp) results. The standard deviation appears in parentheses.

Dataset	Accuracy (sim)	Accuracy (exp)
Iris class 1 & 2	0.998 (0.009)	0.952 (0.064)
Iris class 1 & 3	0.998 (0.009)	0.965 (0.550)
Iris class 2 & 3	0.922 (0.059)	0.853 (0.092)
Wine class 1 & 2	0.930 (0.046)	0.894 (0.074)
Wine class 1 & 3	0.882 (0.072)	0.829 (0.101)
Wine class 2 & 3	0.694 (0.068)	0.657 (0.081)

3.4. Smallest quantum binary classifier

Let us denote $\vert {\Psi}({\mathbf{x}}_{j},\tilde{\mathbf{x}})\rangle =(\vert 0\rangle {\vert {\mathbf{x}}_{j}\rangle }_{\mathrm{c}}+{\text{e}}^{-\text{i}\phi }\vert 1\rangle \vert \tilde{\mathbf{x}}\rangle )/\sqrt{2}$ . By allowing classical sampling from an ensemble $\left\{{a}_{j},\vert {\Psi}({\mathbf{x}}_{j},\tilde{\mathbf{x}})\rangle \right\}$ , where a_j is the probability to choose jth state, we can have the mixed state

$\begin{equation}\sum\limits _{j=0}^{\frac{M}{2}-1}{a}_{j}\vert {\Psi}({\mathbf{x}}_{j},\tilde{\mathbf{x}})\rangle \langle {\Psi}({\mathbf{x}}_{j},\tilde{\mathbf{x}})\vert .\end{equation} \tag{ 18 }$

It is easy to verify that the expectation measurement of the σ_x operator (equivalent to the application of a Hadamard gate followed by a σ_z measurement) yields the same outcome as shown in equation (15). In this approach, the index register is unnecessary, and hence we further reduce the number of qubits by ⌈log₂(M)⌉ and the number of gates by O(poly(M)). This is the smallest kernel-based binary classifier; one only requires ⌈log₂(N)⌉ qubits for encoding the data set, and a qubit for measurement.
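The claim that the mixed state of equation (18) reproduces equation (15) can be checked with a density-matrix sketch. The code assumes that the CAE states are supplied as normalized complex vectors and that the sampling probabilities sum to one; the function name `ensemble_sigma_x` and the random test data are hypothetical.

```python
import numpy as np

def unit(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def ensemble_sigma_x(cae_states, x_test, weights, phi=np.pi / 4):
    """Tr[rho (sigma_x on the ancilla)] for the mixed state of equation (18)."""
    n = len(x_test)
    rho = np.zeros((2 * n, 2 * n), dtype=complex)
    for w, c in zip(weights, cae_states):
        # |Psi> = (|0>|x_j>_c + e^{-i phi}|1>|x_test>)/sqrt(2), ancilla first
        state = np.concatenate([c, np.exp(-1j * phi) * x_test]) / np.sqrt(2)
        rho += w * np.outer(state, state.conj())
    sigma_x = np.kron(np.array([[0.0, 1.0], [1.0, 0.0]]), np.eye(n))
    return float(np.trace(rho @ sigma_x).real)

rng = np.random.default_rng(2)
x_test = unit(rng.normal(size=4))
cae_states = [(unit(rng.normal(size=4)) + 1j * unit(rng.normal(size=4))) / np.sqrt(2)
              for _ in range(3)]
weights = [0.5, 0.3, 0.2]
value = ensemble_sigma_x(cae_states, x_test, weights)
# Equation (15) with the sampling probabilities in place of the weights
kappa = [np.vdot(x_test, c) for c in cae_states]
reference = sum(w * (np.cos(np.pi / 4) * k.real - np.sin(np.pi / 4) * k.imag)
                for w, k in zip(weights, kappa))
print(np.isclose(value, reference))
```

The density matrix acts on only 1 + ⌈log₂(N)⌉ qubits, since the index j is handled by classical sampling rather than by a quantum register.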

3.5. Connection to quantum feature mapping

Under certain restrictions, the compact encoding scheme can be applied to the quantum feature mapping framework introduced in reference [24] to store two training data in one quantum register. In principle, this can be done with the state preparation

$\begin{equation}\vert {\Phi}{({\mathbf{x}}_{k}^{\pm })\rangle }_{\mathrm{c}}=\frac{\vert {\Phi}({\mathbf{x}}_{k}^{+})\rangle +\mathrm{i}\vert {\Phi}({\mathbf{x}}_{k}^{-})\rangle }{\sqrt{2}},\end{equation} \tag{ 19 }$

with a feature map U_Φ(x)|0⟩^⊗L = |Φ(x)⟩ that maps N-dimensional data to L = O(N) qubits. Of course, the above state must satisfy $\langle {\Phi}({\mathbf{x}}_{k}^{+})\vert {\Phi}({\mathbf{x}}_{k}^{-})\rangle =0$ . The feature map also needs to satisfy $\langle {\Phi}(\tilde{\mathbf{x}})\vert {\Phi}({\mathbf{x}}_{k}^{\pm })\rangle \in \mathbb{R}$ for all k. One way to prepare the above state is to apply unitary transformation

$\begin{equation*}V=\frac{{U}_{{\Phi}}({\mathbf{x}}_{k}^{+})+\mathrm{i}{U}_{{\Phi}}({\mathbf{x}}_{k}^{-})}{\sqrt{2}}\end{equation*}$

to |0⟩^⊗L. Since V is unitary, it must satisfy ${U}_{{\Phi}}({\mathbf{x}}_{k}^{+}){U}_{{\Phi}}{({\mathbf{x}}_{k}^{-})}^{{\dagger}}-{U}_{{\Phi}}({\mathbf{x}}_{k}^{-}){U}_{{\Phi}}{({\mathbf{x}}_{k}^{+})}^{{\dagger}}=0$ . Finding an appropriate feature map that satisfies the above conditions while retaining the quantum advantage is a difficult task and we leave it as an interesting open problem.

Given that the state in equation (19) can be prepared, one can follow the strategy from the previous section. More explicitly, classical sampling from a set

$\begin{equation*}\left\{\frac{\vert 0\rangle \vert {\Phi}{({\mathbf{x}}_{j}^{\pm })\rangle }_{\mathrm{c}}+{\text{e}}^{-\text{i}\phi }\vert 1\rangle \vert {\Phi}(\tilde{\mathbf{x}})\rangle }{\sqrt{2}}\right\}\end{equation*}$

with probability a_j followed by the expectation measurement of σ_x operator yields

$\begin{equation*}\frac{1}{\sqrt{2}}\sum\limits _{j=0}^{\frac{M}{2}-1}{a}_{j}(\mathrm{cos}(\phi )\langle {\Phi}(\tilde{\mathbf{x}})\vert {\Phi}({\mathbf{x}}_{j}^{+})\rangle -\mathrm{sin}(\phi )\langle {\Phi}(\tilde{\mathbf{x}})\vert {\Phi}({\mathbf{x}}_{j}^{-})\rangle ),\end{equation*}$

which is reduced to the classifier of equation (17) when ϕ = π/4.

4. Entanglement analysis

Understanding the fundamental source of the quantum advantage is of critical importance for establishing the ground for further developments of new ideas. The quantum resource we consider in this section is the entanglement of the system. Besides the fundamental perspective, entanglement is also deeply connected to the quantum circuit complexity. More specifically, it has been reported that a lower amount of entanglement required in a quantum algorithm implies a reduction in the number of entangling gates during state preparation [28–31].

The measure of entanglement we use in this work is the Meyer–Wallach [38] measure, which calculates a value linearly related to the mean single-qubit purity of the state [39],

$\begin{equation}{Q}^{(\mathrm{m})}(\vert \psi \rangle )=2\left(1-\frac{1}{n}\sum\limits _{k=1}^{n}\mathrm{Tr}[{\rho }_{k}^{2}]\right)\end{equation} \tag{ 20 }$

where ρ_k is the single-qubit density matrix obtained by partitioning |ψ⟩ into one qubit and n − 1 qubits. The Meyer–Wallach measure is also employed in quantifying the entangling capability of a variational quantum circuit due to its scalability and ease of computation [40]. Additionally, we apply the geometric measure of entanglement [41–43] with

$\begin{equation}{Q}^{(\mathrm{g})}(\vert \psi \rangle )=\underset{\vert \phi \rangle }{\mathrm{min}}{\Vert}\vert \psi \rangle -\vert \phi \rangle {\Vert}\end{equation} \tag{ 21 }$

where the minimization is over all states |ϕ⟩ that are product states, i.e., $\vert \phi \rangle ={\otimes }_{l=1}^{n}\vert {\phi }^{l}\rangle$ with each |ϕ^l⟩ being a local state. This can be efficiently approximated by Tucker decomposition [44].
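The Meyer–Wallach measure of equation (20) is straightforward to evaluate for a given state vector, as the following NumPy sketch illustrates. It assumes a pure state whose dimension is a power of two; the function name `meyer_wallach` and the two test states are chosen for this illustration. The geometric measure of equation (21) requires an optimization over product states and is not covered here.

```python
import numpy as np

def meyer_wallach(psi):
    """Equation (20): Q = 2 (1 - mean single-qubit purity) for a pure state
    whose dimension is 2**n."""
    psi = np.asarray(psi, dtype=complex)
    n = psi.size.bit_length() - 1              # number of qubits
    psi = (psi / np.linalg.norm(psi)).reshape([2] * n)
    purity_sum = 0.0
    for k in range(n):
        # Move qubit k to the front and trace out the remaining n - 1 qubits
        mat = np.moveaxis(psi, k, 0).reshape(2, -1)
        rho_k = mat @ mat.conj().T
        purity_sum += np.trace(rho_k @ rho_k).real
    return 2.0 * (1.0 - purity_sum / n)

bell = np.array([1, 0, 0, 1]) / np.sqrt(2)         # maximally entangled pair
product = np.kron([1, 0], [1, 1]) / np.sqrt(2)     # |0> tensor |+>
print(meyer_wallach(bell), meyer_wallach(product))
```

A product state gives a value close to zero and a Bell state gives a value close to one, the two extremes of the measure for two qubits.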

To compare the Meyer–Wallach and geometric entanglement of the final states of the CHC and HTC, we use the Iris and Wine data sets [45]. We are interested in the value

$\begin{equation}{{\Delta}}^{(e)}={Q}_{\text{HTC}}^{(e)}(\vert {\psi }_{\mathrm{f}}\rangle )-{Q}_{\text{CHC}}^{(e)}(\vert {\psi }_{\mathrm{f}}\rangle ),\quad e=m,g.\end{equation} \tag{ 22 }$

The final state of each classifier is created and the entanglement measure is calculated. The data sets were chosen primarily for reproducibility and for their different feature sizes: Iris has four features (N = 4) and Wine has 13 (N = 13), which we pad with zeros up to the next power of two as required by the qubit encoding. To encode the final state, we use M = 2^m samples of each class (for m = 0, 1, 2, 3, 4, 5), drawn from a random split of 2/3 training and 1/3 test data. As neither data set is binary, we limit the data to the first two classes; after all, we are interested not in the actual classification but in the entanglement structure. The numerical survey shows that both the Meyer–Wallach and the geometric entanglement of the CHC are always lower than those of the HTC for all sample sizes 2^(m+1) and for each data set. The results are shown in figure 2 and table 2. In particular, the minimum Δ of each set is always positive, hence the CHC is more resource saving in terms of entanglement. Note that there are several non-equivalent measures of entanglement, and a complete survey over all possible measures is beyond the scope of this work. Nevertheless, the trend we observed from evaluating two entanglement monotones supports the compactness of the CHC.

**Figure 2.** This figure shows the final state entanglement difference Δ of equation (22) between the HTC and the proposed compact classifier for the (a) Meyer–Wallach and (b) the geometric measure of entanglement. It shows that for two example data sets the difference is always positive, meaning that the HTC's final state always has higher entanglement. The feature dimension is 4 for the Iris data and 13 for the Wine data.

Table 2. Statistical description of the difference between the Meyer–Wallach entanglement of the HTC and that of the compact classifier proposed here, i.e. ${{\Delta}}^{(m)}={Q}_{\text{HTC}}^{(m)}(\vert {\psi }_{\mathrm{f}}\rangle )-{Q}_{\text{CHC}}^{(m)}(\vert {\psi }_{\mathrm{f}}\rangle )$ . Each column corresponds to a data type and a number of training data, indicated in the first row as (data type, number of data). The minimum is always greater than zero. The Iris data set has four features, while the Wine data set has 13 features. The table exhibits a trend that the entanglement difference decreases as the number of data register qubits increases. The rows list the mean, the standard deviation, the minimum and maximum, and the 25th, 50th and 75th percentiles.

	(Iris, 1)	(Iris, 2)	(Iris, 4)	(Iris, 8)	(Iris, 16)	(Iris, 32)	(Wine, 1)	(Wine, 2)	(Wine, 4)	(Wine, 8)	(Wine, 16)	(Wine, 32)
Mean	0.251	0.230	0.186	0.171	0.157	0.144	0.064	0.063	0.062	0.062	0.059	0.057
Std	0.083	0.061	0.037	0.026	0.017	0.016	0.028	0.020	0.016	0.011	0.010	0.008
Min	0.072	0.105	0.111	0.110	0.114	0.105	0.020	0.031	0.038	0.040	0.045	0.044
25%	0.190	0.188	0.159	0.151	0.146	0.132	0.043	0.048	0.050	0.053	0.052	0.051
50%	0.256	0.230	0.180	0.172	0.160	0.147	0.060	0.060	0.060	0.060	0.057	0.056
75%	0.312	0.278	0.211	0.185	0.167	0.157	0.078	0.072	0.071	0.070	0.065	0.063
Max	0.472	0.420	0.269	0.237	0.191	0.168	0.132	0.129	0.125	0.088	0.089	0.074

The analysis with the Iris and Wine data sets indicates that the classifier presented in this work is compact in the sense that it requires less entanglement for binary classification. The reduced entanglement has useful consequences. As mentioned at the beginning of this section, lower entanglement allows the number of entangling gates to be reduced by exploiting the resulting low Schmidt rank of bipartitions [30, 31]. Furthermore, if the entanglement is lower, an approximation with even fewer entangling operations can be found. We suspect that this is a useful trait for machine learning protocols, because approximation errors in the state preparation will likely affect only classifications near the decision boundary. Even though the methods described in references [30, 31] rely on computationally expensive subroutines such as singular value decomposition, the reduction in the state preparation complexity can be beneficial in the NISQ era if the classification error due to the approximation is smaller than that due to hardware imperfections and decoherence.
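
The role of the Schmidt rank can be illustrated with a short sketch. It is not the method of references [30, 31]; it only shows, under the assumption of a pure state given as a NumPy vector, how a singular value decomposition yields the Schmidt coefficients across a bipartition and the fidelity retained when only the largest coefficients are kept. The function names are hypothetical.

```python
import numpy as np


def schmidt_coefficients(state, n_left):
    """Schmidt coefficients of a pure n-qubit state across the cut between
    the first n_left qubits and the remaining ones."""
    psi = np.asarray(state, dtype=complex)
    n = psi.size.bit_length() - 1
    if 2 ** n != psi.size or not 0 < n_left < n:
        raise ValueError("need a 2**n vector and a cut with 0 < n_left < n")
    psi = psi / np.linalg.norm(psi)
    # Rows index the left subsystem, columns the right subsystem.
    matrix = psi.reshape(2 ** n_left, -1)
    return np.linalg.svd(matrix, compute_uv=False)


def schmidt_rank(state, n_left, tol=1e-12):
    """Number of Schmidt coefficients above a numerical tolerance."""
    return int(np.sum(schmidt_coefficients(state, n_left) > tol))


def truncation_fidelity(state, n_left, rank):
    """Fidelity between the state and its best approximation that keeps
    only the `rank` largest Schmidt coefficients (renormalised)."""
    s = schmidt_coefficients(state, n_left)
    return float(np.sum(s[:rank] ** 2))


# Illustrative two-qubit states.
product = np.array([1, 0, 0, 0])
bell = np.array([1, 0, 0, 1]) / np.sqrt(2)
rank_product = schmidt_rank(product, 1)
rank_bell = schmidt_rank(bell, 1)
fidelity_bell = truncation_fidelity(bell, 1, 1)
```

A state whose Schmidt coefficients decay quickly loses little fidelity under truncation, which is the sense in which lower entanglement can translate into a cheaper approximate state preparation.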

5. Conclusion

This work proposes a compact quantum binary classifier whose quantum circuit size is smaller than that of the HTC, which was previously considered to be the simplest kernel-based quantum classifier. Thus, our method is placed above existing kernel-based quantum classifiers as a potential application of NISQ computing. The compact quantum classifier is enabled by the CAE introduced in this work. This technique encodes one training data point from each class as the real and the imaginary part of the probability amplitude of a computational basis state. Since the label information of the training data is implicitly encoded, there is no need for a separate quantum register that explicitly encodes the label information, as required in previous methods. The removal of the label register in turn reduces the two-qubit measurement scheme required in previous methods to a single-qubit measurement. Furthermore, the ease with which unbalanced data can be encoded and the applicability of feature maps are highlighted as good traits for applications. Using binary classification tasks with the Iris and Wine data sets, we show that the quantum classifier proposed in this work is also compact in the sense that it requires less entanglement than the HTC.

State preparation is responsible for the main cost of the CHC, the HTC, and many machine learning algorithms. With a reduction in entanglement, one can attempt to create state preparation algorithms that are more efficient than the available alternatives. Reducing the circuit depth of the CHC based on this intuition, as well as the number of repetitions, in order to obtain a quantum advantage over classical algorithms for big data classification remains important future work. Another interesting research direction is to examine the possibility of creating compact versions of other machine learning data initialization methods. For instance, some kernel methods, such as qubit encoding, require one qubit for each data feature. One could investigate how to reduce the required number of qubits without reducing the accuracy of the classifier. In addition, understanding the reason behind the reduced amount of entanglement created in the CHC algorithm compared to the HTC remains an interesting open question. Answering this question could also help in designing compact versions of existing quantum machine learning algorithms, and could lead to the discovery of quantum-inspired machine learning algorithms. Furthermore, an analytical treatment of the entanglement difference (Δ) for a given number of samples (M) and features (N) remains to be done. Another interesting question is the condition on the unitary when using quantum feature maps to encode data, both in terms of expressibility and complexity. Since the HTC and CHC construct a binary classifier based on the quantum interference effect, one can speculate that quantum coherence plays a critical role. Quantum coherence can be regarded as a resource, and it has been rigorously studied within the framework of resource theory in references [46–48]. The connection between quantum coherence and non-classical correlation has also been studied in references [49, 50]. Performing a quantum resource theoretic analysis of the HTC and CHC also remains as potential future work.

Acknowledgments

This research is supported by the National Research Foundation of Korea (NRF-2019R1I1A1A01050161, NRF-2021M3H3A1038085, and NRF-2022M3E4A1074591), and the South African Research Chair Initiative of the Department of Science and Technology and the National Research Foundation.

Data availability statement

The source code and data used in this study are available from the corresponding author upon reasonable request.
