Quantum machine learning of large datasets using randomized measurements

Quantum computers promise to enhance machine learning for practical applications. Quantum machine learning for real-world data has to handle extensive amounts of high-dimensional data. However, conventional methods for measuring quantum kernels are impractical for large datasets as they scale with the square of the dataset size. Here, we measure quantum kernels using randomized measurements. The quantum computation time scales linearly with dataset size, while only the classical post-processing scales quadratically. While our method scales in general exponentially in qubit number, we gain a substantial speed-up when running on intermediate-sized quantum computers. Further, we efficiently encode high-dimensional data into quantum computers, with the number of features scaling linearly with the circuit depth. The encoding is characterized by the quantum Fisher information metric and is related to the radial basis function kernel. Our approach is robust to noise via a cost-free error mitigation scheme. We demonstrate the advantages of our methods for noisy quantum computers by classifying images with the IBM quantum computer. To achieve further speedups we distribute the quantum computational tasks between different quantum computers. Our method enables benchmarking of quantum machine learning algorithms with large datasets on currently available quantum computers.


Introduction
Quantum machine learning aims to use quantum computers to enhance the power of machine learning [1,2]. One possible route to quantum advantage in machine learning is the use of quantum embedding kernels [3][4][5][6], where quantum computers are used to encode data in ways that are difficult for classical machine learning methods [7][8][9]. Noisy intermediate-scale quantum computers [10,11] may be capable of solving tasks difficult for classical computers [12,13] and have shown promise in running proof-of-principle quantum machine learning applications [14][15][16][17][18][19][20][21][22][23][24][25]. However, currently available quantum computers are at least six orders of magnitude slower than classical computers. Furthermore, running quantum computers is comparatively expensive, necessitating methods to reduce quantum resources above all else. Thus, it is important to develop better methods to run and benchmark noisy quantum computers. Here, several bottlenecks limit quantum hardware for machine learning in practice. First, the quantum cost of measuring quantum kernels with conventional methods scales quadratically with the size of the training dataset [5]. This quadratic scaling is a severe restriction, as machine learning commonly relies on large amounts of data. Second, the data has to be encoded into the quantum computer in an efficient manner and must generate a useful quantum kernel. Various encodings have been proposed [26,27]; however, the number of features is often limited by the number of qubits [19,20] or the quantum kernel is characterized only in a heuristic manner. Finally, the inherent noise of quantum computers limits the quality of the experimental results. Error mitigation has been proposed to reduce the effect of noise [28]; however, in general this requires a large amount of additional quantum computing resources [29].
Here, we use randomized measurements to calculate quantum kernels. The quantum computing time scales linearly and the classical post-processing time quadratically with the size of the dataset. While our method scales in general exponentially in the number of qubits, a substantially smaller number of measurements is needed compared to other methods for intermediate-sized quantum computers of about ten qubits. Additionally, we can reuse the collected measurement data to effectively mitigate the noise of quantum computers. To efficiently load high-dimensional data into the quantum computer, we apply an encoding whose number of features scales linearly with the depth of parameterized quantum circuits (PQCs). The resulting quantum kernel is characterized with the quantum Fisher information metric (QFIM) and can be approximately described by the radial basis function (RBF) kernel. We introduce the natural PQC (NPQC) with an exactly known QFIM and demonstrate its usefulness for quantum machine learning. We implement our approach on the IBM quantum computer to classify handwritten images of digits with high accuracy. We experimentally demonstrate further speedups by parallelizing quantum computational tasks between different quantum computers. With our approach, currently available quantum computers can process larger datasets containing tens of thousands of entries within a feasible time, extending the range of quantum machine learning algorithms that can be run in practice.

Support vector machine
Our goal is to classify unlabeled test data by learning from labeled training data as shown in figure 1(a). The dataset for the supervised learning task {x_i, y_i}_{i=1}^L contains in total L items. The ith data item is described by an M-dimensional feature vector x_i and a corresponding label y_i. Label y_i belongs to one of C possible classes, while the feature vector consists of M real-valued entries. To learn and classify data, we use a kernel K(x_i, x_j) that is a measure of distance between feature vectors x_i and x_j [2]. The kernel corresponds to an embedding of the M-dimensional data into a higher-dimensional space, where analysis of the data becomes easier [30]. In quantum kernel learning, we embed the data into the high-dimensional Hilbert space of the quantum computer and use it to calculate the kernel (see figure 1(b)). With the kernels, we train a support vector machine (SVM) to find hyperplanes that separate two classes of data (see figure 1(c)). The SVM is optimized using the kernels of the training dataset with a semidefinite program that can be efficiently solved with classical [31] or quantum computers [32,33].
The SVM determines a weight α_i for each training vector by maximizing

max_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j),   (1)

subject to the conditions Σ_i α_i y_i = 0 and α_i ⩾ 0, where the binary labels are y_i ∈ {−1, 1}. After finding the optimal weights α*, the SVM predicts the class of a feature vector η as

y = sign(Σ_i α_i* y_i K(x_i, η) + b),

where b is calculated from the weights. One can extend this approach to distinguish C classes by solving C SVMs that separate each class from all other classes.
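The prediction rule above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's pipeline: the weights α and offset b are plugged in by hand for a hypothetical, trivially separable 1D toy problem (in practice they come from solving the dual optimization), and an ordinary RBF kernel stands in for the quantum kernel.

```python
import numpy as np

# Hypothetical toy problem: four 1D training points with labels in {-1, 1}.
X_train = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_train = np.array([-1, -1, 1, 1])

# Illustrative dual weights: only the margin points carry weight, and the
# constraint sum_i alpha_i * y_i = 0 is satisfied (-0.5 + 0.5 = 0).
alpha = np.array([0.0, 0.5, 0.5, 0.0])
b = 0.0

def rbf(u, v, gamma=0.25):
    # Stand-in kernel K(x_i, x_j); the quantum kernel would be used here.
    return np.exp(-gamma * np.sum((u - v) ** 2))

def predict(eta):
    # y = sign( sum_i alpha_i * y_i * K(x_i, eta) + b )
    s = sum(a * yi * rbf(xi, eta) for a, yi, xi in zip(alpha, y_train, X_train))
    return np.sign(s + b)
```

By symmetry of the toy data, points left of the origin are assigned −1 and points right of it +1.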
The power of the SVM highly depends on a good choice of kernel K(x i , x j ), such that it captures the essential features of the dataset. In the following, we propose a powerful class of quantum kernels that can be implemented with currently available quantum computers. Then, we show how to compute kernels for large datasets and mitigate the noise inherent in real quantum devices.

Encoding
A crucial question is how to efficiently encode a high-dimensional feature vector into a quantum computer while providing a useful kernel for machine learning. We encode the M-dimensional feature vector x_i as the M-dimensional parameter θ_i of a PQC via

θ_i = θ_r + c x_i,   (2)

where c is a scaling constant and θ_r the reference parameter. As shown in figure 1(b), we use hardware-efficient PQCs with N qubits and d layers of unitaries for the encoding [34]. The lth layer is composed of a product of parameterized single-qubit rotations R_{l,k}(θ_i^(n_{l,k})) acting on qubit k and non-parameterized entangling gates W_l, which together generate the quantum state |ψ(θ_i)⟩. Our choice of quantum kernel measures the distance between two encoding states as given by the fidelity between ρ(θ_i) and ρ(θ_j) [8,27],

K(θ_i, θ_j) = Tr(ρ(θ_i) ρ(θ_j)),   (3)

which for pure states ρ(θ) = |ψ(θ)⟩⟨ψ(θ)| reduces to |⟨ψ(θ_i)|ψ(θ_j)⟩|².
We can formalize the expressive power of our encoding with the QFIM F(θ), an M × M dimensional positive-semidefinite matrix that provides information about the kernel in the proximity of θ [35]. For a pure state |ψ⟩ = |ψ(θ)⟩ it is given by F_ij(θ) = 4[⟨∂_i ψ|∂_j ψ⟩ − ⟨∂_i ψ|ψ⟩⟨ψ|∂_j ψ⟩], where ∂_j|ψ⟩ is the gradient with respect to the jth element of θ [36]. In the limit c → 0 of the encoding equation (2), the kernel of a pure quantum state can be written as

K(θ_r, θ_r + c x_i) ≈ 1 − (c²/4) Σ_k λ_k g_k,   (4)

where λ_k is the kth eigenvalue of the QFIM F(θ_r) and g_k = |⟨x_i, μ_k⟩|² is the squared inner product of the feature vector x_i and the kth eigenvector μ_k of F(θ_r). The rank R = rank(F(θ_r)), i.e. the number of non-zero eigenvalues, is an important measure of the properties of the PQC and the encoding [35]. The M − R eigenvectors μ_k with λ_k = 0 have no effect on the kernel, with K(θ, θ + cμ_k) = 1. Thus, feature vectors x_q ∈ span{μ_1, . . . , μ_{M−R}} that lie in the space of eigenvectors with eigenvalue zero cannot be distinguished using the kernel, as they all give K(θ, θ + cx_q) = 1. Further, the size of the eigenvalues λ_k determines how strongly the kernel changes in direction μ_k of the feature space. By appropriately designing the QFIM as the weight matrix of the kernel, generalizing from data could be greatly enhanced [27,35,37]. For example, the feature subspace with eigenvalue 0 could be engineered such that it coincides with data that belongs to a particular class. Conversely, features that strongly differ between different classes could be tailored to have large eigenvalues such that they can be easily distinguished [37]. For a PQC with N qubits the rank is upper bounded by R ⩽ 2^(N+1) − 2, which is the maximal number of features that can be reliably distinguished by the kernel [35]. It has been recently shown that the kernel of pure quantum states of hardware-efficient PQCs can be approximated as a Gaussian or RBF kernel [38], one of the most popular non-linear kernels with wide application in various machine learning methods [39].

Figure 1. (a) Supervised learning to classify images of handwritten digits. By learning from a training set of labeled images, our goal is to identify previously unseen test data correctly. The support vector machine (SVM) learns using a kernel (equation (3)), which is a measure of distance between the data. (b) We learn a dataset of L images with i = {1, . . . , L}, each with M pixels n = {1, . . . , M}, where we denote each pixel with x_i^(n) ∈ R. For the ith image, we encode the M-dimensional feature vector x_i into a parameterized quantum circuit (PQC) with an M-dimensional parameter vector θ_i. The PQC has N qubits and d layers of parameterized single-qubit rotations and two-qubit entangling gates. We encode the n_{l,k} ≡ n entry of the feature vector into a single-qubit rotation acting on qubit k in layer l via equation (2), where θ_r is a fixed reference parameter and c a scaling factor. The number of encoded features scales linearly with N and d. The kernel (equation (3)) is characterized by the quantum Fisher information metric (QFIM) F(θ_r) and can be approximately described by the radial basis function kernel (equation (5)). We calculate the quantum kernel by measuring the PQC in randomized local bases of Haar random unitaries V_Haar. (c) The SVM trained with the quantum kernel draws the decision boundaries (here shown for a two-dimensional feature vector space and three possible digits) that classify each feature vector to its corresponding label.
Specifically, for small enough c with the encoding equation (2), we can approximately describe the quantum kernel as

K(θ_i, θ_j) ≈ exp(−(c²/4)(x_i − x_j)ᵀ F(θ_r)(x_i − x_j)),   (5)

which is the RBF kernel with the QFIM F(θ_r) as weight matrix [38]. While for general PQCs the QFIM is a priori not known, a type of PQC called the NPQC has the special property that the QFIM takes the simple form F(θ_r) = I, where I is the identity matrix and θ_r a particular reference parameter, which we choose in the following for the NPQC (see [40] and appendix A). The NPQC thus forms an approximately isotropic RBF kernel that can serve as a well-characterized basis for quantum machine learning. We also study another commonly used type of hardware-efficient circuit (YZ-CX PQC), composed of single-qubit rotations and CNOT gates arranged in a one-dimensional nearest-neighbor chain, with a non-trivial QFIM F(θ_r) ≠ I. For the YZ-CX PQC we choose a randomly drawn θ_r; we find that the overall performance is nearly independent of this choice. Further details on the NPQC and YZ-CX PQC are given in appendix A. The scaling factor c controls the scale of the resulting values of the quantum kernel. Too small kernel values can impede learning as the model becomes too constrained. We can bound the kernel from below, K_min ⩽ K(θ_i, θ_j) for all i, j, by choosing c as

c = 2 √( ln(1/K_min) / max_{i,j} (x_i − x_j)ᵀ F(θ_r)(x_i − x_j) ).   (6)
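The RBF approximation of the quantum kernel (equation (5)) and the choice of the scaling factor c can be illustrated with a minimal numpy sketch. The QFIM here is a hypothetical diagonal stand-in, and the data is random; the point is only that the chosen c guarantees all kernel entries stay at or above K_min.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 6
X = rng.normal(size=(10, M))             # 10 hypothetical feature vectors
F = np.diag(rng.uniform(0.5, 2.0, M))    # stand-in for the QFIM F(theta_r)

def rbf_qfim(xi, xj, F, c):
    # Equation (5): RBF kernel with the QFIM as weight matrix.
    d = xi - xj
    return np.exp(-0.25 * c ** 2 * d @ F @ d)

# The largest QFIM-weighted distance in the dataset bounds the kernel from
# below: exp(-c^2/4 * max_dist) >= K_min  =>  c = 2*sqrt(ln(1/K_min)/max_dist).
dists = [(xi - xj) @ F @ (xi - xj) for xi in X for xj in X]
K_min = 0.01
c = 2 * np.sqrt(np.log(1 / K_min) / max(dists))

K = np.array([[rbf_qfim(xi, xj, F, c) for xj in X] for xi in X])
```

The most distant pair of feature vectors then attains the kernel value K_min exactly, and all other pairs lie above it.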

Measurement
We calculate the L quantum kernels using randomized measurements [41][42][43] by measuring quantum states in r randomly chosen single-qubit bases. We first choose r sets, indexed by n = {1, . . . , r}, of single-qubit transformations u_k^(n) drawn according to the Haar measure on SU(2), acting on each qubit k. Then, we prepare the quantum state ρ(θ_i) and rotate it into a random basis with u^(n) = ⊗_k u_k^(n). Then, we measure s samples of the rotated state in the computational basis and estimate the probability P_i^(n)(v) of measuring the computational basis state v ∈ {0, 1}^N for state ρ(θ_i) and transformation n. This procedure is repeated for the r transformations and L quantum states. The kernel is then estimated via

K(θ_i, θ_j) = Tr(ρ(θ_i)ρ(θ_j)) = (2^N / r) Σ_{n=1}^r Σ_{v,v'} (−2)^{−D(v,v')} P_i^(n)(v) P_j^(n)(v'),   (7)

where D(v, v') is the Hamming distance that counts the number of bits that differ between the computational basis states v and v'.
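The randomized-measurement estimator can be checked in a small numpy simulation, assuming the standard post-processing formula Tr(ρ_i ρ_j) = 2^N Σ_{v,v'} (−2)^{−D(v,v')} ⟨P_i(v) P_j(v')⟩, with the average taken over Haar-random local bases. To keep the sketch short, exact probabilities replace the s finite measurement samples per setting, so only the basis-averaging error remains; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 3                      # qubits
dim = 2 ** N

def haar_u2():
    # Haar-random 2x2 unitary via QR of a complex Ginibre matrix,
    # with the phases of R's diagonal fixed for uniformity.
    z = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))

def rand_state():
    v = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    return v / np.linalg.norm(v)

psi_i, psi_j = rand_state(), rand_state()
K_exact = abs(np.vdot(psi_i, psi_j)) ** 2

# Hamming-distance weights (-2)^(-D(v, v')) between all basis states.
bits = (np.arange(dim)[:, None] >> np.arange(N)) & 1
D = (bits[:, None, :] != bits[None, :, :]).sum(-1)
W = (-2.0) ** (-D)

r = 5000                   # randomized measurement settings
est = 0.0
for _ in range(r):
    U = np.eye(1)
    for _ in range(N):     # independent random basis on each qubit
        U = np.kron(U, haar_u2())
    P_i = np.abs(U @ psi_i) ** 2   # exact probabilities; in experiment
    P_j = np.abs(U @ psi_j) ** 2   # these come from s samples
    est += 2 ** N * P_i @ W @ P_j
K_est = est / r
```

Crucially, each state is measured on its own; the two probability tables are only correlated in classical post-processing, which is what removes the quadratic quantum cost in L.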
To measure all entries of the kernel, we perform N_R = srL measurements in total. The error ΔK of estimating a single kernel entry scales as ΔK ∝ 1/(s√r) [42]. Thus, for a fixed error it is beneficial to choose the number of bases r relatively small compared to s. Note that for sufficient accuracy a minimal number of bases r is needed, which increases with N. Overall, the number of measurements needed to estimate the kernel scales as N_R ∝ 2^(aN) L, with a factor a ≲ 1 that depends on the type of state being measured [41,42] and can be improved by importance sampling [44]. While for large N the exponential measurement cost is prohibitive, for intermediate qubit numbers on the order of ten qubits the measurement cost is moderate. With our method, the number of measurements needed to determine the full kernel matrix scales only linearly with the dataset size, N_R ∝ L, a quadratic speedup in contrast to other methods. Other commonly used measurement strategies such as the swap test [45,46] or the inversion test [18,20] have to explicitly prepare both states ρ(θ_i) and ρ(θ_j) on the quantum computer. Thus, they scale unfavorably with the square of the dataset size, N_R ∝ L² (see appendix B). While randomized measurements require an overhead per kernel entry compared to standard methods, we find that already for relatively small datasets, L > 21, randomized measurements require fewer measurements for our experimental parameters (see appendix D). For L = 10³, we find that randomized measurements require a factor of 100 fewer measurements compared to the parameters used in previous works. A further advantage is found in error mitigation. For standard measurement methods on noisy quantum computers, error mitigation adds a substantial cost to the measurement budget [29]. In contrast, randomized measurements can mitigate errors without further measurement cost, as we show in the following.

Error mitigation
In general, quantum computers are affected by noise, which will turn the prepared pure quantum state into a mixed state and may negatively affect the capability to learn. For depolarizing noise, we can use the information gathered in the process to mitigate its effect and infer the noiseless value of the kernel.
For global depolarizing noise, with probability p_i the pure quantum state |ψ(θ_i)⟩ is replaced with the completely mixed state ρ_m = I/2^N, where I is the identity matrix. The resulting quantum state is the density matrix

ρ(θ_i) = (1 − p_i)|ψ(θ_i)⟩⟨ψ(θ_i)| + p_i I/2^N.

The purity Tr(ρ(θ_i)²) = (1 − p_i)² + (2p_i − p_i²)/2^N can be determined from the randomized measurements by reusing the same data used to compute the kernel entries. Using these purities, the depolarization probability p_i can be calculated by solving a quadratic equation [23,47]. With p_i and the measured kernel K_b(θ_i, θ_j) affected by depolarizing noise, the mitigated kernel is approximated by

K_m(θ_i, θ_j) = [K_b(θ_i, θ_j) − (1 − (1 − p_i)(1 − p_j))/2^N] / [(1 − p_i)(1 − p_j)].   (8)
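A minimal numpy sketch of this mitigation under the global depolarizing model: the purity of each depolarized state fixes its depolarization probability (the quadratic inversion), and the noisy kernel is then rescaled. The function names and the synthetic test values are illustrative, not part of the paper's code.

```python
import numpy as np

N = 8
dim = 2 ** N

def depol_purity(p):
    # Purity of rho = (1-p)|psi><psi| + p*I/dim for a pure |psi>.
    return (1 - p) ** 2 + (2 * p - p ** 2) / dim

def estimate_p(purity):
    # Invert the quadratic above, using
    # purity - 1/dim = (1-p)^2 * (1 - 1/dim), and take the physical root.
    q = np.sqrt((purity - 1 / dim) / (1 - 1 / dim))
    return 1 - q

def mitigate(K_b, p_i, p_j):
    # Undo the depolarizing channel on the measured kernel entry.
    q = (1 - p_i) * (1 - p_j)
    return (K_b - (1 - q) / dim) / q

# Consistency check on synthetic values: depolarize a known kernel entry,
# then recover it from the noisy value and the estimated probabilities.
p_i, p_j, K_true = 0.3, 0.4, 0.62
K_noisy = (1 - p_i) * (1 - p_j) * K_true + (p_i + p_j - p_i * p_j) / dim
```

Because the purities come from the same randomized-measurement data as the kernel itself, this mitigation adds no measurement overhead.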

Results
We now proceed to numerically and experimentally demonstrate our methods. First, we investigate the kernel of our encoding. In figure 2(a) we numerically simulate [48,49] two types of hardware-efficient PQCs (YZ-CX PQC and NPQC) and show that the quantum kernel is well described by an RBF kernel (equation (5), dashed line). The kernel diverges from the RBF kernel for exponentially small values of the kernel and reaches a plateau at K_min = 2^(−N), which is the fidelity of Haar random states [50]. In figure 2(b), we experimentally measure the kernel of the NPQC with an IBM quantum computer (ibmq_guadalupe [51]) using randomized measurements and error mitigation (equation (8)). We find that the mean value of the kernel matches well with the isotropic RBF kernel. See appendix E for details on the experiment and appendix C for results regarding the YZ-CX PQC.
Next we address the statistical error introduced by estimating the kernel using randomized measurements under global depolarizing noise p. In figure 3(a) we simulate the average error of the mitigated kernel K_m(θ_i, θ_j), measured with randomized measurements, with respect to its exact value K(θ_i, θ_j) as a function of the number of measurement samples s. We find that there is a threshold number of samples beyond which the error becomes minimal. This threshold depends on the choice of the number of measurement settings r and the number of qubits N. We find that the choice r = 8 provides sufficient accuracy for our experiments. We are able to mitigate depolarizing noise to a noise-free level even for high p. In figure 3(b), we show the minimal number of samples s_min required to measure the kernel with an average error of at most ΔK < 0.1 as a function of the depolarization noise p. The randomized measurement scheme works well even with substantial noise p, where we find a power law s_min ∝ (1 − p)^(−2). Now we assess the overall performance of our approach on a practical task. We learn to classify handwritten 2D images of digits ranging from 0 to 9. The dataset contains L = 1797 images of 8 × 8 pixels, where each pixel has an integer value between 0 and 16 [52]. We map the images to M = 64 dimensional feature vectors. For the YZ-CX PQC, we use all M = 64 features, whereas for the NPQC we perform a principal component analysis to reduce them to M = 36 features. We calculate the kernel of the full dataset and use a randomly drawn part of it as training data for optimizing the SVM with Scikit-learn [53]. The accuracy of the SVM is defined as the percentage of correctly classified test data, which are L_test = 200 images that have not been used for training. The dataset is rescaled using the training data such that each feature has mean value zero and variance 1/√M.
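The preprocessing for the NPQC can be sketched with numpy alone: PCA from 64 pixel features down to M = 36 components, followed by rescaling to zero mean and variance 1/√M. A random stand-in replaces the actual digits dataset [52], and for brevity the rescaling is fitted on the full set rather than on training data only.

```python
import numpy as np

rng = np.random.default_rng(3)
# Stand-in for the 1797 images of 8x8 pixels with values 0..16.
X_raw = rng.integers(0, 17, size=(1797, 64)).astype(float)

M = 36
Xc = X_raw - X_raw.mean(axis=0)
# PCA via SVD: the principal components are the top right-singular
# vectors of the centered data matrix.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X = Xc @ Vt[:M].T

# Rescale: mean 0 and variance 1, then shrink so Var = 1/sqrt(M).
X = (X - X.mean(axis=0)) / X.std(axis=0)
X = X * M ** -0.25
```

The resulting rows are the feature vectors x_i that enter the encoding of equation (2).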
We encode the feature vectors x i via equation (2) with c = 1, where for the YZ-CX PQC we choose θ r randomly and for the NPQC we define θ r such that the QFIM is given by F(θ r ) = I (see appendix A).
In figure 4(a), we classify the data by measuring the quantum kernel with a single quantum computer. We plot the accuracy of classifying test data with the SVM against the size of the training data for the YZ-CX PQC and the NPQC. As a classical baseline, we show the RBF kernel. Further, we show simulations of the exact quantum kernel (exact) and a noiseless simulation of the randomized measurements (noiseless). For experimental data, we use an IBM quantum computer (ibmq_guadalupe [51], see appendix E for more details) to perform randomized measurements with error mitigation (mitigated) and without error mitigation (unmitigated). The accuracy improves steadily with an increased number of training data for all kernels. Our error mitigation scheme (equation (8)) substantially improves the accuracy of the SVM trained with experimental data to nearly the level of the noiseless simulation of the randomized measurements. The randomized measurements have a lower accuracy compared to the exact quantum kernel as we use only a relatively small number r of randomized measurement settings. For the NPQC, the exact quantum kernel shows nearly the same accuracy as the classical RBF kernel, whereas for the YZ-CX PQC the quantum kernel performs slightly worse compared to the classical kernel, likely indicating that its QFIM does not optimally capture the structure of the data. The depolarizing probability of the IBM quantum computer is estimated as p ≈ 0.36 for the NPQC and p ≈ 0.39 for the YZ-CX. To measure the kernel of the dataset with L_train = 1597 and L_test = 200, we require in total N_R = s(L_train + L_test)r ≈ 1.2 × 10^8 measurements. For the inversion test, one would require N_R = s_c L_train(L_train − 1)/2 + s_c L_train L_test ≈ 0.8 × 10^10 experiments, where we have set the number of measurements per kernel entry to s_c = 5000 as chosen in past experiments [18]. Thus, we estimate that our method yields a reduction in total measurements by more than a factor of 60.
We find that our method already yields a lower measurement cost for L_train > 21, as shown in appendix D.

Figure 4. (a) SVM trained with the experimental quantum kernel measured on a single quantum computer (ibmq_guadalupe) with randomized measurements using error mitigation (red, equation (8)) and no error mitigation (yellow). The shaded area is the standard deviation of the accuracy. As a classical baseline, we show the isotropic radial basis function kernel (blue). Simulations of quantum kernels are the exact quantum kernel (orange) and a noiseless simulation of randomized measurements (green). (b) We distribute the measurements over two different quantum computers (ibmq_guadalupe and ibmq_toronto, purple curve) and post-process the combined measurement results with error mitigation. As reference, we show the accuracy of the quantum kernel measured on a single quantum computer for ibmq_guadalupe (red) and ibmq_toronto (light blue). We encode the data into the YZ-CX PQC with M = 64 features and the NPQC with M = 36 features. Experiments are performed using s = 8192 measurement samples, N = 8 qubits and r = 8 randomized measurement settings. The test data contains L_test = 200 images. To calculate the mean and standard deviation of the accuracy, we randomly draw test and training data from the full dataset 20 times for each training data size.
Finally, in figure 4(b) we distribute the measurements between two quantum computers. We split the dataset into two halves, where one half is measured using randomized measurements with ibmq_guadalupe and the other half with ibmq_toronto [51] (see appendix E for more details). The measurement outcomes from both machines are then combined in the post-processing on the classical computer to calculate the kernel matrix of the full dataset. Here, we also apply error mitigation. As reference, we also plot the accuracy achieved with a single quantum computer. For the YZ-CX PQC, we find nearly equal accuracy with the distributed and single quantum computer approaches. For the NPQC, the accuracy of the distributed approach is slightly lower. The performance highly depends on the noise and calibration of the IBM quantum computers, which can fluctuate over time and with when an experiment is performed. We attribute the lower performance of the distributed NPQC approach to a higher noise level present on ibmq_toronto while the experiment was performed. As the randomized measurement method correlates measurement samples between devices, differences in the respective noise models of the two quantum computers can have a negative effect on the resulting quantum kernel. In the appendices F and G, we show the accuracy on the training data and the confusion matrices.

Discussion
Our work demonstrates a practical method to learn large datasets on noisy quantum computers with intermediate qubit numbers. Randomized measurements enable a linear scaling in dataset size L, and our encoding loads high-dimensional data with the number of features scaling linearly with the quantum circuit depth d. We show that our encoding can be characterized by the QFIM and its eigenvalues and eigenvectors [35]. As the behavior of the kernel is crucial for effectively learning and generalizing from data, future work could design the QFIM to improve the capability of quantum machine learning models. We demonstrated the NPQC with a simple and exactly known QFIM, which could be a useful basis to study quantum machine learning on large quantum computers.
We encode the data in hardware-efficient PQCs, which are known to be hard to simulate classically for large numbers of qubits [12]. This type of PQC has been used in quantum machine learning experiments [18]. While sampling from these circuits is difficult to simulate on classical computers, we find that the quantum kernel closely follows the radial basis function kernel down to exponentially small kernel values [38]. Similarly, many other classes of quantum kernels have efficient classical representations [54]. The resemblance to a classical kernel implies that these quantum kernels are unlikely to achieve an advantage over classical methods [8]. However, we note that radial basis function kernels have been of interest in quantum optics [55] and can serve as a reliable benchmark of quantum machine learning methods. Further, the non-trivial weight matrix F could be of independent interest in machine learning [56].
We mitigate the noise occurring in the quantum computer by reusing data sampled during the measurement of the kernel. We find that the number s_min of measurement samples needed to mitigate depolarizing noise scales as s_min ∝ (1 − p)^(−2), allowing us to extract kernels even from very noisy quantum computers. We successfully apply this model to mitigate the noise of the IBM quantum computer. While the noise of quantum computers is known to be complicated, involving multiple types of noise sources, the depolarizing model we use is sufficient to mitigate the noise of kernels measured on IBM quantum computers [47]. This may be the result of the randomized measurements being insensitive to fixed unitary noise channels. We note that noise-induced errors can actually be beneficial to machine learning, as the capability to generalize from data can improve with increasing noise [37].
In general, the number of measurements needed for the randomized measurement scheme scales exponentially with the number N of qubits [41,42]. This makes our method currently practical only for moderate numbers of qubits. However, various approaches could extend our method to larger qubit numbers. Importance sampling can reduce the number of measurements needed [44]; for particular types of states an exponential reduction in cost has been observed. It would be worthwhile to study how importance sampling can improve the measurement cost for quantum machine learning. In other settings, adaptive measurements have been proposed to improve the scaling of measurement costs [57], as well as other approaches such as shadow tomography [58]. The choice of an effective set of measurements could be included in the machine learning task as hyper-parameters to be optimized. To reduce the number of qubits, one could combine our approach with quantum autoencoders to transform the encoding quantum state into a subspace with fewer qubits that captures the essential information of the kernel [59]. Alternatively, one could trace out most of the qubits of a many-qubit quantum state ρ(θ_i) such that a subsystem A with a lower number of qubits remains. Then, randomized measurements can efficiently determine the kernel Tr(ρ_A(θ_i)ρ_A(θ_j)). It would be worthwhile to investigate the learning power of kernels generated from subsystems of quantum states that possess quantum advantage [7,8].
Randomized measurements process each of the L quantum states of the dataset separately [42]. The full kernel matrix K(θ i , θ j ) with L 2 elements is then constructed via classical post-processing using equation (7) where the randomized measurement data for state |ψ(θ i )⟩, |ψ(θ j )⟩ is reused to calculate each entry of the matrix. This gives us the resulting speedup in quantum computational time. As a further advantage, our approach only requires preparing one quantum state at a time, reducing the number of gates by half compared to the inversion test or swap test. Further, we demonstrate how to achieve additional speedups by distributing measurements across different quantum computers.
The quantum computation time scales linearly with dataset size L, a quadratic speedup compared to conventional measurement methods such as the inversion test or swap test. Note that the classical post-processing to construct the kernel still scales as L². However, current quantum computers perform measurements at a rate of ∼5 kHz [12,13], a factor of 10^6 slower than commonly available classical computers. Further, using quantum computers is very expensive compared to classical computation. Thus, the main bottleneck for quantum machine learning algorithms on current quantum hardware lies within the quantum part, while the classical part can be easily parallelized and distributed. Therefore, our work opens up the benchmarking of quantum machine learning with large datasets on intermediate-sized quantum computers, which was impractical with previously known methods.
For our encoding equation (2), at small distances the quantum kernel in parameter space can be described by the QFIM via equation (4). We note that this relation holds for any type of PQC. The rank of the QFIM F indicates the number of independent directions in parameter space via equation (4). The maximal number of independent features M_max that can be encoded via equation (2) is thus given by the rank of the QFIM, which is upper bounded by rank(F) = M_max ⩽ 2^(N+1) − 2 [35]. Thus, even a modest number of qubits can represent a large number of parameters. The popular MNIST dataset [60] for classifying 2D images of handwritten digits has 28 × 28 pixels, which could be encoded in only N = 9 qubits.
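The rank bound can be checked with a couple of lines. The helper names are ours, but the arithmetic is just the bound M_max ⩽ 2^(N+1) − 2 from the text.

```python
import math

def max_features(N):
    # Upper bound on the QFIM rank, i.e. the number of independent
    # features a pure-state kernel on N qubits can distinguish [35].
    return 2 ** (N + 1) - 2

def qubits_needed(M):
    # Smallest N with 2^(N+1) - 2 >= M.
    return math.ceil(math.log2(M + 2)) - 1

mnist_features = 28 * 28   # 784 pixels per MNIST image
```

For MNIST, N = 9 qubits give a bound of 1022 ⩾ 784 features, while N = 8 qubits (510) would not suffice.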
Assuming 5 kHz measurement rate, s = 8192 measurement samples and r = 8 measurement settings, our method can process the full MNIST training dataset with L train = 60 000 entries in about 240 h of quantum processing time of a single quantum computer. In contrast, the inversion or swap test would require at least 10 years with s = 1000 samples on a quantum computer. With our scheme, we enable quantum machine learning with large datasets on intermediate-sized quantum computers. Future work could benchmark the performance of currently available quantum computers with datasets commonly used in classical machine learning.
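This back-of-envelope estimate is easy to reproduce; the figures below count pure sampling time only, which is why the randomized-measurement result (~218 h) comes out slightly below the ~240 h quoted above once overheads are included.

```python
# Quantum processing time for the MNIST training set, using the
# measurement rate and settings assumed in the text.
rate = 5_000          # measurements per second (~5 kHz)
s, r = 8192, 8        # samples per setting, randomized settings
L_train = 60_000

# Randomized measurements: cost linear in L.
randomized = s * r * L_train
hours = randomized / rate / 3600          # pure sampling time in hours

# Inversion/swap test: cost quadratic in L (training kernel alone).
s_c = 1000
inversion = s_c * L_train * (L_train - 1) // 2
years = inversion / rate / 3600 / 24 / 365
```

The quadratic term is what pushes the conventional approach past a decade of machine time.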

Data availability statement
The code to reproduce the experimental results presented in this paper is available from [61] and the experimental data is available from [62].
The data that support the findings of this study are openly available at the following URL/DOI: https://github.com/chris-n-self/large-scale-qml and https://doi.org/10.5281/zenodo.5211695.

Appendix A. Parameterized quantum circuits
In figure 5(a), we show the first circuit we use, which we call the NPQC. The first layer is composed of 2N single-qubit rotations, one around the y axis and one around the z axis for each qubit n, with R_α^(n)(θ) = exp(−i θ σ_α^n / 2), α ∈ {x, y, z}, where σ_x^n, σ_y^n, σ_z^n are the Pauli matrices acting on qubit n. Each additional layer l > 1 is a product U_l(a_l) of two-qubit entangling gates and N parameterized single-qubit rotations with fixed π/2 rotation angles, where CPHASE(n, m) is the controlled-σ_z gate for qubit indices n, m, and indices larger than N are taken modulo N. The entangling layer U_l(0) is shown as an example in figure 5(b).
The shift factor a_l ∈ {0, 1, . . . , N/2 − 1} for layer l is given by the following recursive rule. Initialize a set A = {0, 1, . . . , N/2 − 1} and s = 1. In each iteration, pick and remove one element r from A. Then set a_s = r and a_{s+q} = a_q for q = 1, . . . , s − 1. As the last step, set s = 2s. We repeat this procedure until no elements are left in A or a target depth d is reached. One can have maximally d_max = 2^{N/2} layers with in total M = N(d + 1) parameters.
The NPQC has a QFIM F(θ_r) = I, I being the identity matrix, at a particular reference parameter θ_r, where θ_{r,l,z}^(k) denotes the reference parameter of the z-rotation on qubit k in layer l. Close to this reference parameter, the QFIM remains approximately an identity matrix. When implementing the NPQC on the IBM quantum computer, we choose the shift factors a_l such that only nearest-neighbor CPHASE gates arranged in a chain appear. To match the connectivity of the IBM quantum computer, we removed the entangling gate, together with its corresponding single-qubit rotations, that would require a connection between the first and the last qubit of the chain.
The second type of PQC used is shown in figure 5(c), which we call YZ-CX. It consists of d layers of parameterized single-qubit y and z rotations, followed by CNOT gates. The CNOT gates are arranged in a one-dimensional chain, acting on neighboring qubits. In every layer l, the CNOT gates are shifted by one qubit. Redundant single-qubit rotations that are left over at the edges of the chain are removed.
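The recursive rule for the shift factors can be sketched in a few lines of Python. This is a minimal illustration, assuming elements are picked from A in increasing order (the rule itself leaves the choice of r in each iteration free):

```python
def npqc_shift_factors(n_qubits, depth):
    """Shift factors a_l for the NPQC entangling layers.

    Implements the recursive rule: a_s = r, a_{s+q} = a_q for
    q = 1, ..., s - 1, then s -> 2s, repeated until the pool
    A = {0, ..., N/2 - 1} is empty or the target depth is reached.
    """
    pool = list(range(n_qubits // 2))  # A = {0, 1, ..., N/2 - 1}
    a = []                             # a[0] corresponds to a_1
    while pool and len(a) < depth:
        r = pool.pop(0)  # pick and remove one element (here: smallest first)
        a = a + [r] + a  # a_s = r, then a_{s+q} = a_q for q = 1, ..., s - 1
    return a[:depth]
```

For N = 6 qubits this yields a = [0, 1, 0, 2, 0, 1, 0], i.e. 2^{N/2} − 1 = 7 entangling layers, which together with the first rotation layer gives the maximal depth d_max = 2^{N/2}.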

Appendix B. Methods to measure quantum kernels
In figure 6, we explain the different methods to measure kernels of L quantum states. In this paper, we use the randomized measurement method shown in figure 6(a). The number of measurements required to obtain the kernel for all pairs of states scales linearly with the dataset size L.
The inversion test is shown in figure 6(b). To measure the kernel between two quantum states, it applies the unitary that prepares the first state, followed by the inverse of the unitary that prepares the second state. The kernel is then given by the probability of measuring the zero state. Here, the number of measurements scales with the square L^2 of the dataset size.
The swap test is shown in figure 6(c). It prepares both states of the kernel simultaneously, requiring twice as many qubits as the other methods. Then, a SWAP gate controlled by an ancilla qubit is applied, and the kernel is given by the measurement of the ancilla. As with the inversion test, the number of required measurements scales with the square L^2 of the dataset size. Further, the controlled SWAP gate can require substantial quantum resources.
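To illustrate the classical post-processing behind the randomized measurement method, the following sketch estimates the overlap Tr[ρ1 ρ2] from the bitstring probabilities of two states measured after the same random unitaries, using the standard estimator from the randomized-measurement literature; the Haar sampling and the state preparation below are our own assumptions for the demonstration, not the experimental protocol of the paper:

```python
import numpy as np

def haar_unitary(dim, rng):
    # Sample a Haar-random unitary via QR decomposition of a complex
    # Gaussian matrix, with the standard phase correction.
    z = (rng.standard_normal((dim, dim))
         + 1j * rng.standard_normal((dim, dim))) / np.sqrt(2)
    q, r = np.linalg.qr(z)
    return q @ np.diag(np.diag(r) / np.abs(np.diag(r)))

def overlap_estimate(p1, p2, n_qubits):
    # Estimator Tr[rho1 rho2] = 2^N sum_{s,s'} (-2)^{-D(s,s')} P1(s) P2(s'),
    # valid in expectation over the random unitaries; D is the Hamming
    # distance between the bitstrings s and s'.
    dim = 2 ** n_qubits
    est = 0.0
    for s in range(dim):
        for t in range(dim):
            d = bin(s ^ t).count("1")
            est += (-2.0) ** (-d) * p1[s] * p2[t]
    return dim * est
```

Averaging the estimate over many random unitaries converges to the overlap; in an experiment, the exact probabilities p1, p2 would be replaced by empirical frequencies from a finite number of shots.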

Appendix C. Experimental kernel of YZ-CX PQC
We provide further data on the experimental quantum kernel measured on the IBM quantum computer. We measure the kernel using randomized measurements for randomly chosen feature vectors. In figure 7, we show experimental data of the kernel for the YZ-CX PQC using ibmq_guadalupe. We find that the experimental data and numerical simulations match well.

Appendix D. Measurement cost
Here, we compare the measurement cost when learning from our dataset for a varying number of training data. For randomized measurements, the number of measurements is given by N_meas^random = s r (L_train + L_test). For conventional methods such as the SWAP or inversion test, we have N_meas^inv = s_c L_train (L_train − 1)/2 + s_c L_train L_test. We now assume that L_test = L_train/5. We assume s = 8192 and r = 8, the same values as used for the experiment with N = 9 qubits in the main text. For the conventional approach, we choose s_c = 5000 as used in [18] for an experiment with a comparable feature vector size. The measurement cost is plotted in figure 8, where we find that randomized measurements are advantageous, with N_meas^random < N_meas^inv, for L_train > 21.
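The two cost formulas above are simple enough to evaluate directly. This is an illustrative sketch with the parameter values quoted above; the exact crossover point depends on how L_test = L_train/5 is rounded:

```python
def n_random(l_train, s=8192, r=8):
    # Randomized measurements: cost is linear in the dataset size.
    l_test = l_train / 5
    return s * r * (l_train + l_test)

def n_inversion(l_train, s_c=5000):
    # Inversion/SWAP test: cost is quadratic in the dataset size.
    l_test = l_train / 5
    return s_c * l_train * (l_train - 1) / 2 + s_c * l_train * l_test
```

For small datasets the conventional tests are cheaper, but the quadratic term takes over quickly, so randomized measurements win for any sufficiently large training set.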

Appendix H. Product state as analytic radial basis function kernel
As an analytic example, we show that product states form an exact radial basis function kernel. We use an N-qubit product state of single-qubit rotations, |ψ(θ)⟩ = ⊗_{n=1}^N R_y^(n)(θ_n)|0⟩^⊗N, which gives the kernel
K(θ, θ′) = |⟨ψ(θ′)|ψ(θ)⟩|^2 = ∏_{n=1}^N (1/2)(1 + cos(∆θ_n)) (H2)
where we define ∆θ = θ − θ′ as the difference between the two parameter sets. We now assume |∆θ_n| ≪ 1 and that all the differences of the parameters are equal, ∆θ_1 = · · · = ∆θ_N. In the limit of many qubits N we then find K(θ, θ′) ≈ exp(−(1/4) Σ_{n=1}^N ∆θ_n^2), which gives us the radial basis function kernel.
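The limit can be checked numerically. This sketch compares the exact product-state kernel ∏_n (1/2)(1 + cos ∆θ_n) of eq. (H2) with the radial basis function exp(−‖∆θ‖^2/4) for small parameter differences; the Gaussian limit follows from cos^2(x/2) ≈ exp(−x^2/4) for |x| ≪ 1:

```python
import numpy as np

def product_kernel(dtheta):
    # Exact product-state kernel, eq. (H2).
    return np.prod(0.5 * (1.0 + np.cos(np.asarray(dtheta))))

def rbf_kernel(dtheta):
    # Radial basis function limit, valid for |dtheta_n| << 1.
    return np.exp(-np.sum(np.asarray(dtheta) ** 2) / 4.0)
```

The agreement improves both for smaller parameter differences and, at fixed total distance ‖∆θ‖, for larger qubit number N, matching the many-qubit limit taken in the derivation.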