Quantum-classical hybrid neural networks in the neural tangent kernel regime

Recently, quantum neural networks and quantum-classical neural networks (qcNNs) have been actively studied as possible alternatives to the conventional classical neural network (cNN), but their practical and theoretically guaranteed performance remains to be investigated. In contrast, cNNs, and especially deep cNNs, rest on several solid theoretical bases; one of them is the neural tangent kernel (NTK) theory, which successfully explains the mechanism behind various desirable properties of cNNs, particularly the global convergence of the training process. In this paper, we study a class of qcNN composed of a quantum data-encoder followed by a cNN. The quantum part is randomly initialized according to unitary 2-designs, which is an effective feature extraction process for quantum states, and the classical part is also randomly initialized according to Gaussian distributions; then, in the NTK regime where the number of nodes of the cNN becomes infinitely large, the output of the entire qcNN becomes a nonlinear function of the so-called projected quantum kernel. That is, the NTK theory is used to construct an effective quantum kernel, which is in general nontrivial to design. Moreover, the NTK defined for the qcNN is identical to the covariance matrix of a Gaussian process, which allows us to analytically study the learning process. These properties are investigated in thorough numerical experiments; in particular, we demonstrate that the qcNN shows a clear advantage over fully classical NNs and qNNs for the problem of learning the quantum data-generating process.


I. INTRODUCTION
A. Background: Quantum/classical neural networks and classical neural tangent kernel
Quantum neural networks (qNNs) or quantum-classical hybrid neural networks (qcNNs) are systems that, owing to their rich expressibility in the functional space, have the potential to offer higher-performance solutions to various problems than classical means [1-15]. However, there remain two essential issues to be resolved. First, the existing qNN and qcNN models have no theoretical guarantee that their training process converges to the optimal or even a "good" solution. The vanishing gradient (or barren plateau) issue, which states that the gradient vector decays exponentially fast with respect to the number of qubits, is particularly serious [16]; several approaches to mitigate this issue have been proposed [17-26], but these are not general solutions. Second, despite the potential advantage of quantum models in expressibility, they are not guaranteed to offer a better solution than classical means, especially classical neural networks (cNNs). Regarding this point, the recent study [27] derived a condition under which the quantum kernel method may outperform a class of classical means, and proposed the projected quantum kernel as a way to satisfy this condition. Note that the quantum kernel has been thoroughly investigated in several theoretical and experimental settings [28-35]. However, designing an effective quantum kernel (including the projected quantum kernel) is a highly nontrivial task; also, the kernel method generally requires a computational complexity of $O(N_D^2)$, with $N_D$ the number of data, whereas the cNN needs only $O(N_D)$ as long as the computational cost of training does not scale with $N_D$. Therefore, it is desirable to have an easily trainable qNN or qcNN in which the above-mentioned advantage of the quantum kernel method is incorporated.
On the other hand, in the classical regime, the neural tangent kernel (NTK) [36] offers useful approaches for analyzing several fundamental properties of cNNs, and especially deep cNNs, including the convergence properties of the training process. The NTK is a time-varying nonlinear function that appears in the dynamical equation of the output function of the cNN during training. Surprisingly, the NTK becomes time-invariant in the so-called NTK regime, where the number of nodes of the cNN becomes infinitely large; further, it becomes positive-definite under random initialization of the parameters. As a result, particularly when the problem is least-squares regression, the training process is described by a linear differential (or difference) equation, and the analysis of the training process reduces to that of the spectrum of this time-invariant positive-definite matrix. The studies on the NTK that are related to our work concern the relation to Gaussian processes [37], the relation between the spectrum of the NTK and the convergence property of cNNs [38], and the NTK for classification problems [39-42].

B. Our contribution
In this paper, we study a class of qcNN that can be directly analyzed in the NTK regime. In the proposed qcNN scheme, the classical data is first encoded into the state of a quantum system and then transformed back into classical data by an appropriate random measurement, which can thus be regarded as a feature extraction process in the high-dimensional quantum Hilbert space. We then input the reconstructed classical data vector into a subsequent cNN. Finally, a cost function is evaluated using the output of the cNN, and the parameters contained in the cNN part are updated to lower the cost. Note that the quantum part is therefore fixed, implying that the vanishing gradient issue does not occur in our framework. The following is the list of results.
• The output of the qcNN becomes a Gaussian process in the infinite-width limit of the cNN part while the width of the quantum part is fixed, where the unitary gate determining the quantum measurement and the weighting parameters of the cNN are randomly chosen from unitary 2-designs and Gaussian distributions, respectively. The covariance matrix of this Gaussian process is given by a function of the projected quantum kernels mentioned in the first paragraph. That is, our qcNN indeed exploits the quantum feature space.
• In the infinite-width limit of the cNN, the training dynamics in the functional space is governed by a linear differential equation characterized by the corresponding NTK, which implies exponentially fast convergence to the global solution if the NTK is positive-definite; a condition that guarantees the positive-definiteness is also obtained. At the convergent point, the output of the qcNN takes the form of a kernel function of the NTK. Because the NTK is a nonlinear function of the above-mentioned covariance matrix composed of the projected quantum kernels, and because the computational cost of training is low, our qcNN can be regarded as a method for generating an effective quantum kernel with less computational complexity than the standard kernel method.
• Because the NTK has the explicit form of a covariance matrix, theoretical analysis of the training process and of the convergent value of the cost function is possible. Based on this analysis, we derive a sufficient condition for our qcNN model to achieve a lower value of the cost function than some other fully classical models. Note that, when the size of the quantum system is large, classical computers will have difficulty simulating the feature extraction process of the qcNN model; this may be a factor that leads to such superiority.
In addition to the above theoretical investigations, we carry out thorough numerical simulations to evaluate the performance of the proposed qcNN model, as follows.
• The numerically computed time evolution of the cost function along the training process agrees well with the analytic form of the time evolution of the cost (obtained under the assumption that the NTK is constant and positive definite), for both the regression and classification problems, when the width of the cNN is larger than 100. This shows the validity of using the NTK to analytically investigate the performance of the proposed qcNN.
• The convergence becomes faster (i.e., nearly the ideal exponentially fast convergence is observed) and the final cost becomes smaller as the width of the cNN increases. Moreover, we find that a sufficient reduction of the training cost leads to a decrease of the generalization error. That is, our qcNN has several desirable properties predicted by the NTK theory, which are indeed satisfied in many classical models.
• Both the regression and classification performance depend largely on the choice of quantum circuit ansatz for data-encoding, which is reasonable in the sense that the proposed method is essentially a kernel method. Yet we found an interesting case where the ansatz with higher expressibility (due to containing some entangling gates) reaches a lower final cost than that achieved via the ansatz without entangling gates. This implies that quantumness may have the power to enhance the performance of the proposed qcNN model, depending on the dataset or the selected ansatz.
• The proposed qcNN model shows a clear advantage over full cNNs and qNNs for the problem of learning the quantum data-generating process. A particularly notable result is that, even with far fewer parameters (compared to the full cNNs) and a smaller training cost (compared to the qNNs), the qcNN can execute the regression and classification tasks with sufficient accuracy. Also, in terms of the generalization capability, the qcNN model shows much better performance than the others, mainly thanks to the inductive bias.

C. Related works
Before finishing this section, we address related works. Recently (after a preprint version of this manuscript was submitted), the following studies on the quantum NTK have been presented. Their NTK is defined for the cost function of the output state of a qNN. In Ref. [43], the authors studied the properties of the linear differential equation of the cost (which corresponds to Eq. (12) shown later), obtained under the assumption that the NTK does not change much in time. This idea was further investigated in the subsequent paper [44], which showed in both theory and numerical simulation that the cost decays exponentially when the number of parameters is large, i.e., when the system is within the over-parametrization regime, as suggested by the conventional classical NTK theory. This behaviour was also supported by numerical simulations provided in Ref. [45]. Also, in Ref. [46], a relation between their NTK and the vanishing gradient issue was discussed; that is, to satisfy the assumption that the NTK does not change in time, the qNN has to contain $O(4^n)$ parameters with $n$ the number of qubits, which actually has the same origin as the vanishing gradient issue. In Ref. [47], the authors gave a method for mitigating this demanding requirement; they study the training dynamics in a space with effective dimension $d_{\rm eff}$ instead of the entire Hilbert space with dimension $2^n$, so that $O(d_{\rm eff}^2)$ parameters suffice to guarantee the exponential convergence. All these studies focus on fully quantum systems, while in this paper we focus on a class of classical-quantum hybrid systems where the tunable parameters are contained only in the classical part and the NTK is defined with respect to those parameters. A critical consequence of this difference is that our NTK becomes time-invariant (Theorem 5) and the output function becomes Gaussian (Theorems 3 and 4) in the over-parametrization regime, while these provable features were not reported in the above works. In particular, the time-invariance is critical to guarantee the exponential convergence of the output function; as mentioned above, the works listed above rely on the assumption that the NTK does not change much in time. It may look as if our NTK is a fully classical object and that this is why such provable facts are available, but it does extract features of the quantum part in the form of a nonlinear function of the projected quantum kernel, as mentioned above.

D. Structure of the paper
The structure of this paper is as follows. Section II reviews the theory of NTK for cNNs. Section III begins by describing our proposed qcNN model, followed by some theorems; we also discuss a possible advantage of our qcNN over some other models. Section IV is devoted to a series of numerical simulations. Section V concludes the paper.

II. PRELIMINARY: CLASSICAL NEURAL TANGENT KERNEL THEORY
The NTK theory, which was originally proposed in [36], offers a method for analyzing the dynamics of an infinitely wide cNN under the gradient-descent-based training process. In particular, the NTK theory can be used to explain why deep cNNs with many more parameters than the number of data (i.e., over-parametrized cNNs) work quite well in various machine learning tasks in terms of training error. We review the NTK theory in Sections II A to II D. Importantly, the NTK theory can also be used to conjecture when cNNs may fail. As a motivation for introducing our model, we discuss one of the failure conditions of the cNN in terms of the NTK in Section II E.

A. Problem settings of NTK theory
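The NTK theory [36] focuses on supervised learning problems. That is, we are given $N_D$ training data $\{(x^a, y^a)\}_{a=1}^{N_D}$, where $x^a$ is an input vector and $y^a$ is the corresponding output; here we assume for simplicity that $y^a$ is a scalar, though the original NTK theory can handle the case of vector output. Suppose this dataset is generated from the hidden (true) function $f_{\rm goal}$ as follows:
$$ y^a = f_{\rm goal}(x^a), \qquad a = 1, \ldots, N_D. \tag{1} $$
Then the goal is to train the model $f_{\theta(t)}$, which corresponds to the output of a cNN, so that $f_{\theta(t)}$ becomes close to $f_{\rm goal}$ in some measure, where $\theta(t)$ is the set of trainable parameters at the iteration step $t$. An example of a measure that quantifies the distance between $f_{\theta(t)}$ and $f_{\rm goal}$ is the mean squared error:
$$ L^C_t = \frac{1}{2} \sum_{a=1}^{N_D} \left( f_{\theta(t)}(x^a) - y^a \right)^2, \tag{2} $$
which is mainly used for regression problems. Another example is the binary cross entropy:
$$ L^C_t = -\sum_{a=1}^{N_D} \left[ y^a \log \sigma_s\!\left(f_{\theta(t)}(x^a)\right) + (1 - y^a) \log\!\left(1 - \sigma_s\!\left(f_{\theta(t)}(x^a)\right)\right) \right], \tag{3} $$
which is mainly used for classification problems, where $\sigma_s$ is the sigmoid function and $y^a$ is a binary label that takes either 0 or 1.
The function $f_{\theta(t)}$ is constructed by a fully connected network of $L$ layers. Let $n_\ell$ be the number of nodes (width) of the $\ell$-th layer (hence $\ell = 0$ and $\ell = L$ correspond to the input and output layers, respectively). Then the input $x^a$ is converted to the output $f_{\theta(t)}(x^a)$ in the following manner:
$$ \alpha^{(0)}(x^a) = x^a, \quad \tilde{\alpha}^{(\ell)}(x^a) = \frac{1}{\sqrt{n_{\ell-1}}} W^{(\ell)} \alpha^{(\ell-1)}(x^a) + \xi b^{(\ell)}, \quad \alpha^{(\ell)}(x^a) = \sigma\!\left(\tilde{\alpha}^{(\ell)}(x^a)\right), \quad f_{\theta(t)}(x^a) = \tilde{\alpha}^{(L)}(x^a), \tag{4} $$
where $W^{(\ell)} \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$ is the weighting matrix, $b^{(\ell)} \in \mathbb{R}^{n_\ell}$ is the bias vector in the $\ell$-th layer, and $\xi$ is the constant coefficient of the bias vector. Also, $\sigma$ is a differentiable activation function. Note that the vector of trainable parameters $\theta(t)$ is now composed of all the elements of $\{W^{(\ell)}_{jk}\}$ and $b^{(\ell)}$. The parameters are updated by the gradient descent algorithm
$$ \frac{\partial \theta_j(t)}{\partial t} = -\eta \frac{\partial L^C_t}{\partial \theta_j}, \tag{5} $$
where for simplicity we take the continuous-time regime in $t$. Also, $\eta$ is the learning rate and $\theta_j$ is the $j$-th parameter. All parameters, $\{W^{(\ell)}_{jk}\}$ and $b^{(\ell)}$, are initialized by sampling independently from the standard normal distribution.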

B. Definition of NTK
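The NTK appears in the dynamics of the output function $f_{\theta(t)}$, as follows. The time derivative of $f_{\theta(t)}$ is given by
$$ \frac{\partial f_{\theta(t)}(x)}{\partial t} = -\eta \sum_{b=1}^{N_D} K^{(L)}(x, x^b, t) \frac{\partial L^C_t}{\partial f_{\theta(t)}(x^b)}, \tag{6} $$
where $K^{(L)}(x, x', t)$ is defined by
$$ K^{(L)}(x, x', t) = \sum_{j} \frac{\partial f_{\theta(t)}(x)}{\partial \theta_j} \frac{\partial f_{\theta(t)}(x')}{\partial \theta_j}. \tag{7} $$
The function $K^{(L)}(x, x', t)$ is called the NTK. In the following, we will see that the trajectory of $f_{\theta(t)}$ can be analytically calculated in terms of the NTK in the infinite width limit.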
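As an illustration of the definition above, the following minimal sketch evaluates the finite-width (empirical) NTK Gram matrix of a one-hidden-layer network by summing products of parameter gradients. The network size, the `tanh` activation, the bias coefficient `xi = 0.1`, and all function names are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np

def init_params(n0, n1, rng):
    """Standard-normal initialization of a one-hidden-layer network."""
    return {
        "W1": rng.standard_normal((n1, n0)),
        "b1": rng.standard_normal(n1),
        "W2": rng.standard_normal(n1),
        "b2": rng.standard_normal(),
    }

def forward(params, x, xi=0.1):
    """Scalar output f(x) in the NTK parametrization (assumed here)."""
    n1, n0 = params["W1"].shape
    hidden = np.tanh(params["W1"] @ x / np.sqrt(n0) + xi * params["b1"])
    return float(params["W2"] @ hidden / np.sqrt(n1) + xi * params["b2"])

def param_gradient(params, x, xi=0.1):
    """Gradient of f(x) w.r.t. all parameters, flattened as [W1, b1, W2, b2]."""
    n1, n0 = params["W1"].shape
    hidden = np.tanh(params["W1"] @ x / np.sqrt(n0) + xi * params["b1"])
    # Backpropagated signal df/d(pre-activation) for each hidden unit.
    delta = params["W2"] / np.sqrt(n1) * (1.0 - hidden**2)
    return np.concatenate([
        np.outer(delta, x).ravel() / np.sqrt(n0),  # df/dW1
        xi * delta,                                # df/db1
        hidden / np.sqrt(n1),                      # df/dW2
        np.array([xi]),                            # df/db2
    ])

def empirical_ntk(params, X, xi=0.1):
    """Gram matrix K[a, b] = sum_j df(x^a)/dtheta_j * df(x^b)/dtheta_j."""
    jac = np.stack([param_gradient(params, x, xi) for x in X])
    return jac @ jac.T
```

Because the Gram matrix is a product of a Jacobian with its transpose, it is symmetric and positive semi-definite by construction, whatever the width.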

C. Theorems
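The key feature of the NTK is that it converges to the time-invariant and positive-definite function $\Theta^{(L)}(x, x')$ in the infinite width limit, as shown below. Before stating the theorems on these surprising properties, let us show the following lemma about the distribution of $f_{\theta(0)}$:
Lemma 1. (Proposition 1 in [36]) With $\sigma$ as a Lipschitz nonlinear function, in the infinite width limit $n_\ell \to \infty$ for $1 \le \ell \le L - 1$, the output function at initialization, $f_{\theta(0)}$, obeys a centered Gaussian process whose covariance matrix $\Sigma^{(L)}(x, x')$ is given recursively by
$$ \Sigma^{(1)}(x, x') = \frac{1}{n_0} x^T x' + \xi^2, \qquad \Sigma^{(\ell+1)}(x, x') = \mathbb{E}_{h \sim \mathcal{N}(0, \Sigma^{(\ell)})}\left[ \sigma(h(x)) \sigma(h(x')) \right] + \xi^2, \tag{8} $$
where the expectation is calculated by averaging over the centered Gaussian process with the covariance $\Sigma^{(\ell)}$.
The proof can be found in Appendix A.1 of [36]. Note that the expectation for an arbitrary function $z(h(x), h(x'))$ can be computed as
$$ \mathbb{E}_{h \sim \mathcal{N}(0, \Sigma^{(\ell)})}\left[ z(h(x), h(x')) \right] = \frac{1}{2\pi \sqrt{|\tilde{\Sigma}^{(\ell)}|}} \int dh(x)\, dh(x')\, \exp\!\left( -\frac{1}{2} \mathbf{h}^T \left(\tilde{\Sigma}^{(\ell)}\right)^{-1} \mathbf{h} \right) z(h(x), h(x')), \tag{9} $$
where $\tilde{\Sigma}^{(\ell)}$ is the $2 \times 2$ matrix
$$ \tilde{\Sigma}^{(\ell)} = \begin{pmatrix} \Sigma^{(\ell)}(x, x) & \Sigma^{(\ell)}(x, x') \\ \Sigma^{(\ell)}(x', x) & \Sigma^{(\ell)}(x', x') \end{pmatrix}, \tag{10} $$
the vector $\mathbf{h}$ is defined as $\mathbf{h} = (h(x), h(x'))^T$, and $|\tilde{\Sigma}^{(\ell)}|$ is the determinant of the matrix $\tilde{\Sigma}^{(\ell)}$. From Lemma 1, the following theorem regarding the NTK can be derived:
Theorem 1. (Theorem 1 in [36]) With $\sigma$ as a Lipschitz nonlinear function, in the infinite width limit $n_\ell \to \infty$ for $1 \le \ell \le L - 1$, the neural tangent kernel $K^{(L)}(x, x', t)$ converges to the time-invariant function $\Theta^{(L)}(x, x')$, which is given recursively by
$$ \Theta^{(1)}(x, x') = \Sigma^{(1)}(x, x'), \qquad \Theta^{(\ell+1)}(x, x') = \Theta^{(\ell)}(x, x')\, \dot{\Sigma}^{(\ell+1)}(x, x') + \Sigma^{(\ell+1)}(x, x'), \tag{11} $$
where $\dot{\Sigma}^{(\ell+1)}(x, x') = \mathbb{E}_{h \sim \mathcal{N}(0, \Sigma^{(\ell)})}\left[ \dot{\sigma}(h(x)) \dot{\sigma}(h(x')) \right]$ and $\dot{\sigma}$ is the derivative of $\sigma$.
Note that, by definition, the matrix $(\Theta^{(L)}(x^a, x^b))$ is symmetric and positive semi-definite. In particular, when $L \ge 2$, the following theorem holds:
Theorem 2. (Proposition 2 in [36]) With $\sigma$ as a Lipschitz nonlinear function, the kernel $\Theta^{(L)}(x, x')$ is positive definite when $L \ge 2$ and the input vector $x$ is normalized as $x^T x = 1$.
The above theorems on the NTK in the infinite width limit can be utilized to analyze the trajectory of $f_{\theta(t)}$, as shown in the next subsection.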
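To make the recursive structure of Lemma 1 and Theorem 1 concrete, the sketch below iterates the covariance and NTK recursions for a single pair of inputs. It assumes the ReLU activation, for which the two Gaussian expectations have the well-known arc-cosine closed forms, and a bias coefficient `xi` entering each layer as an additive `xi**2`; these choices and the function names are assumptions of the example, not the paper's setting.

```python
import math

def relu_expectations(s_xx, s_xy, s_yy):
    """Closed-form Gaussian expectations for ReLU (arc-cosine formulas).

    For (u, v) ~ N(0, [[s_xx, s_xy], [s_xy, s_yy]]) this returns
    E[relu(u) relu(v)] and E[relu'(u) relu'(v)]. Requires s_xx, s_yy > 0.
    """
    norm = math.sqrt(s_xx * s_yy)
    cos_phi = max(-1.0, min(1.0, s_xy / norm))  # clamp against rounding
    phi = math.acos(cos_phi)
    e_sigma = norm * (math.sin(phi) + (math.pi - phi) * cos_phi) / (2.0 * math.pi)
    e_sigma_dot = (math.pi - phi) / (2.0 * math.pi)
    return e_sigma, e_sigma_dot

def relu_ntk(s_xx, s_xy, s_yy, depth, xi=0.0):
    """Iterate the recursion from the first layer up to layer `depth`.

    Arguments are the first-layer covariances for the pairs (x, x), (x, x'),
    and (x', x'). Returns (Sigma, Theta) at layer `depth` for the pair (x, x').
    """
    theta = s_xy  # the recursion starts from Theta = Sigma at the first layer
    for _ in range(depth - 1):
        new_xy, dot_xy = relu_expectations(s_xx, s_xy, s_yy)
        new_xx, _ = relu_expectations(s_xx, s_xx, s_xx)
        new_yy, _ = relu_expectations(s_yy, s_yy, s_yy)
        # Theta_next = Theta * Sigma_dot + Sigma_next
        theta = theta * dot_xy + new_xy + xi**2
        s_xx, s_xy, s_yy = new_xx + xi**2, new_xy + xi**2, new_yy + xi**2
    return s_xy, theta
```

For identical unit-variance inputs and `xi = 0`, one step of the recursion halves the covariance while the kernel value stays at one, which gives a quick sanity check of an implementation.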

D. Consequence of Theorem 1 and Theorem 2
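From Theorems 1 and 2, in the infinite width limit, the differential equation (6) can be exactly replaced by
$$ \frac{\partial f_{\theta(t)}(x)}{\partial t} = -\eta \sum_{b=1}^{N_D} \Theta^{(L)}(x, x^b) \frac{\partial L^C_t}{\partial f_{\theta(t)}(x^b)}. \tag{12} $$
The solution depends on the form of $L^C_t$; of particular importance is the case where $L^C_t$ is the mean squared loss. In our case (2), the functional derivative of the mean squared loss is given by
$$ \frac{\partial L^C_t}{\partial f_{\theta(t)}(x^b)} = f_{\theta(t)}(x^b) - y^b, \tag{13} $$
and then we obtain an ordinary linear differential equation by substituting (13) into (12). This equation can be solved analytically [48] at each data point as
$$ f_{\theta(t)}(x^a) = \sum_{j=1}^{N_D} \sum_{b=1}^{N_D} V_{ja} V_{jb} \left( f_{\theta(0)}(x^b) - y^b \right) e^{-\eta \lambda_j t} + y^a, \tag{14} $$
where $V = (V_{jb})$ is the orthogonal matrix that diagonalizes $\Theta^{(L)}(x, x')$ as
$$ \sum_{a=1}^{N_D} \sum_{b=1}^{N_D} V_{ja}\, \Theta^{(L)}(x^a, x^b)\, V_{kb} = \lambda_j \delta_{jk}. \tag{15} $$
The eigenvalues $\lambda_j$ are non-negative, because $\Theta^{(L)}(x, x')$ is positive semi-definite.
When the conditions of Theorem 2 are satisfied, $\Theta^{(L)}(x, x')$ is positive definite and accordingly $\lambda_j > 0$ holds for all $j$. Thus, in the limit $t \to \infty$, the solution (14) states that $f_{\theta(t)}(x^a) = y^a$ holds for all $a$; namely, the value of the cost $L^C_t$ reaches the global minimum $L^C_t = 0$. This convergence to the global minimum explains why the over-parametrized cNN can be successfully trained.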
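The linear dynamics under the mean squared loss can be checked numerically. The sketch below is a minimal illustration, assuming a given symmetric positive semi-definite kernel matrix and the cost normalized as one half of the summed squared residuals; the function names are hypothetical. It evaluates the closed-form solution through an eigendecomposition of the kernel matrix.

```python
import numpy as np

def ntk_mse_trajectory(theta, f0, y, eta, t):
    """Closed-form solution of df/dt = -eta * Theta (f - y) on the training set.

    theta : (N, N) symmetric positive semi-definite kernel matrix
    f0    : (N,) network outputs at t = 0
    y     : (N,) training labels
    """
    lam, vecs = np.linalg.eigh(theta)     # theta = vecs @ diag(lam) @ vecs.T
    residual = vecs.T @ (f0 - y)          # initial residual in the eigenbasis
    decayed = np.exp(-eta * lam * t) * residual
    return y + vecs @ decayed

def mse_cost(f, y):
    """Mean squared loss with the 1/2 normalization assumed in this sketch."""
    return 0.5 * np.sum((f - y) ** 2)
```

Each eigen-direction of the kernel decays independently at rate `eta * lam[j]`, so the slowest direction, associated with the smallest eigenvalue, dominates the late stage of training.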
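We can also derive a useful theoretical formula for general $x$. In the infinite width limit, from Eqs. (12), (13), and (14) we have
$$ \frac{\partial f_{\theta(t)}(x)}{\partial t} = -\eta \sum_{b, c, j} \Theta^{(L)}(x, x^b)\, V_{jb} V_{jc} \left( f_{\theta(0)}(x^c) - y^c \right) e^{-\eta \lambda_j t}. $$
When all $\lambda_j$ are positive, integrating over time immediately gives
$$ f_{\theta(t)}(x) = f_{\theta(0)}(x) - \sum_{b, c, j} \Theta^{(L)}(x, x^b)\, V_{jb}\, \frac{1 - e^{-\eta \lambda_j t}}{\lambda_j}\, V_{jc} \left( f_{\theta(0)}(x^c) - y^c \right). $$
Now, if the initial parameters $\theta(0)$ are randomly chosen from a centered Gaussian distribution, the average of $f_{\theta(t)}(x)$ over such initial parameters is given by
$$ \left\langle f_{\theta(t)}(x) \right\rangle = \sum_{b, c, j} \Theta^{(L)}(x, x^b)\, V_{jb}\, \frac{1 - e^{-\eta \lambda_j t}}{\lambda_j}\, V_{jc}\, y^c. \tag{18} $$
The formula (18) can be used for predicting the output for unseen data, but it requires $O(N_D^3)$ computation to obtain $V$ by diagonalizing the NTK, which may be costly when the number of data is large. By contrast, in the case of the cNN, the computational cost of training is $O(N_D N_P)$, where $N_P$ is the number of parameters in the cNN. Thus, if $N_D$ is so large that $O(N_D^3)$ classical computation is intractable, we can use the finite-width cNN with $N_P \le O(N_D)$, rather than (18), as a prediction function. In such a case, the NTK theory can be used as a theoretical tool for analyzing the behaviour of the cNN.
Finally, let us consider the case where the cost is given by the binary cross entropy (3); the functional derivative in this case is given by
$$ \frac{\partial L^C_t}{\partial f_{\theta(t)}(x^a)} = -y^a \frac{\dot{\sigma}_s(f_{\theta(t)}(x^a))}{\sigma_s(f_{\theta(t)}(x^a))} + (1 - y^a) \frac{\dot{\sigma}_s(f_{\theta(t)}(x^a))}{1 - \sigma_s(f_{\theta(t)}(x^a))} = -y^a + \sigma_s(f_{\theta(t)}(x^a)), \tag{21} $$
where in the last equality we use the derivative formula for the sigmoid function:
$$ \dot{\sigma}_s(q) = \sigma_s(q) \left( 1 - \sigma_s(q) \right). $$
By substituting (21) into (12), we obtain
$$ \frac{\partial f_{\theta(t)}(x^a)}{\partial t} = -\eta \sum_{b=1}^{N_D} \Theta^{(L)}(x^a, x^b) \left( -y^b + \sigma_s(f_{\theta(t)}(x^b)) \right), $$
and similarly for general input $x$
$$ \frac{\partial f_{\theta(t)}(x)}{\partial t} = -\eta \sum_{b=1}^{N_D} \Theta^{(L)}(x, x^b) \left( -y^b + \sigma_s(f_{\theta(t)}(x^b)) \right). $$
These are not linear differential equations and thus cannot be solved analytically, unlike the mean squared error case; but we can solve them numerically by using standard ordinary differential equation tools [48].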

E. When may cNN fail?
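The NTK theory tells us that, as long as the condition of Theorem 2 holds, the cost function converges to the global minimum in the limit $t \to \infty$. In practice, however, we must stop the training process of the cNN at a finite time $t = \tau$. Thus, the speed of convergence is also an important factor for analyzing the behaviour of the cNN. In this subsection we discuss when the cNN may fail in terms of the convergence speed. We discuss the case where the cost is the mean squared loss.
Recall now that the speed of convergence depends on the eigenvalues $\{\lambda_j\}_{j=1}^{N_D}$. If the minimum eigenvalue, $\lambda_{\min}$, is sufficiently larger than 0, the cost function quickly converges to the global minimum within $O(1/\lambda_{\min})$ iterations. Otherwise, the speed of convergence is not determined by the spectrum alone, and the other factors in (14) need to be taken into account; actually many reasonable settings correspond to this case [38], and thus we consider this setting in the following.
First, the formula (14) can be rewritten as
$$ w_j(t) = \left( w_j(0) - g_j \right) e^{-\eta \lambda_j t} + g_j, $$
where $w_j(t) = \sum_a V_{ja} f_{\theta(t)}(x^a)$ and $g_j = \sum_a V_{ja} y^a$. Let us assume that we stop the training at $t = \tau < O(1/\lambda_{\min})$. With $S_{\eta\tau} = \{ j \mid \lambda_j < 1/\eta\tau,\ 1 \le j \le N_D \}$, if we approximate the exponential function as
$$ e^{-\eta \lambda_j \tau} \simeq \begin{cases} 1 & (j \in S_{\eta\tau}) \\ 0 & (\text{otherwise}), \end{cases} $$
then we obtain
$$ w_j(\tau) \simeq \begin{cases} w_j(0) & (j \in S_{\eta\tau}) \\ g_j & (\text{otherwise}). \end{cases} $$
By using the same approximation, the cost function at the iteration step $\tau$ can be calculated as
$$ L^C_\tau = \frac{1}{2} \sum_{j=1}^{N_D} \left( w_j(\tau) - g_j \right)^2 \simeq \frac{1}{2} \sum_{j \in S_{\eta\tau}} \left( w_j(0) - g_j \right)^2. $$
Since $w_j(0)$ is a sum of centered Gaussian distributed variables, $w_j(0)$ also obeys the centered Gaussian distribution with covariance
$$ \left\langle w_j(0) w_k(0) \right\rangle = \sum_{a, b} V_{ja} V_{kb}\, \Sigma^{(L)}(x^a, x^b). $$
Thus, we have
$$ \left\langle L^C_\tau \right\rangle \simeq \frac{1}{2} \sum_{j \in S_{\eta\tau}} \sum_{a, b} V_{ja} V_{jb}\, \Sigma^{(L)}(x^a, x^b) + \frac{1}{2} \sum_{j \in S_{\eta\tau}} g_j^2. \tag{30} $$
Since the covariance matrix can be diagonalized with an orthogonal matrix $V'$ as
$$ \sum_{a, b} V'_{ka}\, \Sigma^{(L)}(x^a, x^b)\, V'_{lb} = \lambda'_k \delta_{kl}, $$
the first term of Eq. (30) can be rewritten as
$$ \frac{1}{2} \sum_{j \in S_{\eta\tau}} \sum_{k=1}^{N_D} \lambda'_k \left( v_j \cdot v'_k \right)^2, $$
where $v_j = (V_{j1}, \ldots, V_{jN_D})^T$ and $v'_k = (V'_{k1}, \ldots, V'_{kN_D})^T$. Also, the second term of (30) can be written as
$$ \frac{1}{2} \sum_{j \in S_{\eta\tau}} \left( y \cdot v_j \right)^2, $$
where $y$ is the label vector defined by $y = \{y^a\}_{a=1}^{N_D}$. Thus, we have
$$ \left\langle L^C_\tau \right\rangle \simeq \frac{1}{2} \sum_{j \in S_{\eta\tau}} \sum_{k=1}^{N_D} \lambda'_k \left( v_j \cdot v'_k \right)^2 + \frac{1}{2} \sum_{j \in S_{\eta\tau}} \left( y \cdot v_j \right)^2. $$
The cost $L^C_\tau$ becomes large depending on the values of the first and second terms, characterized as follows: (i) the first term becomes large if the eigenvectors of $\Sigma^{(L)}(x^b, x^c)$ with large eigenvalues align with the eigenvectors of $\Theta^{(L)}(x^b, x^c)$ with small eigenvalues, and (ii) the second term becomes large if the label vector aligns with the eigenvectors of $\Theta^{(L)}(x^b, x^c)$ with small eigenvalues. Of particular importance is the condition under which statement (ii) applies. Namely, the cNN cannot be well optimized in a reasonable time if we use a dataset whose label vector aligns with the eigenvectors of $\Theta^{(L)}(x^b, x^c)$ with small eigenvalues. If such a dataset is given, an alternative method that may outperform the cNN is therefore in high demand, which is the motivation for introducing our model.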
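Condition (ii) above can be probed numerically for a given kernel matrix and label vector. The following sketch is an illustration under assumptions: the function name is hypothetical, and any constant prefactor of the cost term is omitted. It sums the squared overlaps between the label vector and the eigenvectors whose eigenvalues fall below the early-stopping threshold $1/\eta\tau$.

```python
import numpy as np

def slow_mode_label_mass(theta, y, eta, tau):
    """Sum of (y . v_j)^2 over eigen-directions with lambda_j < 1 / (eta * tau).

    theta : (N, N) symmetric kernel matrix evaluated on the training inputs
    y     : (N,) label vector
    """
    lam, vecs = np.linalg.eigh(theta)   # columns of vecs are the eigenvectors v_j
    slow = lam < 1.0 / (eta * tau)      # modes not yet learned at time tau
    overlaps = vecs.T @ y
    return float(np.sum(overlaps[slow] ** 2))
```

A value close to the squared norm of the label vector indicates that most of the labels lie in directions that are barely trained by time $\tau$, which is the failure scenario described in the text.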

Remark 1:
If some noise is added to the labels of the training data, we need not aim to decrease the cost function precisely to zero. For example, when the noise vector $\epsilon$ is added to the true label vector $\tilde{y}$ in the form $y = \tilde{y} + \epsilon$, it may be favorable to stop the optimization process at time $t = \tau$ before $\sum_{j \in S_{\eta\tau}} (\epsilon \cdot v_j)^2$ becomes small, in order to avoid overfitting to the noise; actually, the original NTK paper [36] mentions the idea of avoiding overfitting by early stopping. In this case, instead of $\sum_{j \in S_{\eta\tau}} (y \cdot v_j)^2$, we should aim to decrease the value of $\sum_{j \in S_{\eta\tau}} (\tilde{y} \cdot v_j)^2$, to construct a prediction function that has a good generalization ability.

III. PROPOSED MODEL
In this section, we introduce our qcNN model for supervised learning, which is theoretically analyzable using the NTK theory. Before describing the details, we summarize the notable points of this qcNN. The qcNN is a concatenation of a quantum circuit followed by a cNN, as illustrated in Fig. 1. As in the classical case shown in Section II D, we obtain the time-invariant NTK in the infinite width limit of the cNN part, which allows us to theoretically analyze the behaviour of the entire system. Importantly, the NTK in our model coincides with a certain quantum kernel computed in the quantum data-encoding part. This means that the output of our qcNN can represent functions of quantum states defined on the quantum feature space (Hilbert space); hence, if the quantum encoder is designed appropriately, our model may have an advantage over purely classical systems. In the following, we discuss the details of our model in Sections III A to III C, and discuss a possible advantage in Section III D.

Fig. 1. The overview of the proposed qcNN model. The first quantum part is composed of the encoding unitary $U_{\rm enc}(x^a)$ for the data $x^a$, followed by the random unitary $U_i$ and the measurement of an observable $O$ for extracting a feature of the quantum state, $f_Q(x^a)_i$. We run $n_0$ different quantum circuits to construct the feature vector $f_Q(x^a) = (f_Q(x^a)_1, f_Q(x^a)_2, \ldots, f_Q(x^a)_{n_0})^T$, which is the input vector to the classical part composed of an $n_0$-node multi-layered NN.

A. qcNN model
We consider the same supervised learning problem discussed in Section II. That is, we are given $N_D$ training data $\{(x^a, y^a)\}_{a=1}^{N_D}$ generated from a hidden function $f_{\rm goal}$ satisfying $y^a = f_{\rm goal}(x^a)$. Then the goal is to train the model function $f_{\theta(t)}$ so that $f_{\theta(t)}$ becomes closer to $f_{\rm goal}$ in some measure, by updating the vector of parameters $\theta(t)$ as a function of time $t$. Our qcNN model $f_{\theta(t)}$ is composed of the quantum part $f_Q$ and the classical part $f^C_{\theta(t)}$, which are concatenated as follows:
$$ f_{\theta(t)} = f^C_{\theta(t)} \circ f_Q. $$
Only the classical part has trainable parameters in our model, as will be stated later, and thus the subscript $\theta(t)$ is placed only on the classical part.
The quantum part first applies the $n$-qubit quantum circuit (unitary operator) $U_{\rm enc}$ that loads the classical input data $x^a$ into the quantum state as $|\psi(x^a)\rangle = U_{\rm enc}(x^a)|0\rangle^{\otimes n}$. We then apply a random unitary operator $U_i$ to the quantum state $|\psi(x^a)\rangle$ and finally measure an observable $O$ to obtain the expectation value
$$ f_Q(x^a)_i = \langle \psi(x^a) | U_i^\dagger O U_i | \psi(x^a) \rangle. $$
We repeat this procedure for $i = 1, \ldots, n_0$ and collect these quantities to construct the $n_0$-dimensional vector $f_Q(x^a) = (f_Q(x^a)_1, f_Q(x^a)_2, \ldots, f_Q(x^a)_{n_0})^T$, which is the output of the quantum part of our model. The randomizing process corresponds to extracting features of $|\psi(x^a)\rangle$, as in the machine learning method using classical shadow tomography [49,50]; our method, however, does not construct a tomographic density matrix (called the snapshot) but directly constructs the feature vector $f_Q(x^a)$, which is further processed in the classical part. Note that, as shown later, we will make $n_0$ sufficiently large so that the NTK becomes time-invariant and thereby the entire dynamics is analytically solvable. Hence the procedure for constructing the $n_0$-dimensional vector $f_Q(x^a)$ may look inefficient, but in practice a modest value of $n_0$ is acceptable, as demonstrated in the numerical simulation in Section IV C.
In this paper, we take the following setting for each component. The classical input data $x^a$ is loaded into the $n$-qubit quantum state through the encoder circuit $U_{\rm enc}$. Ideally, we should design the encoder circuit $U_{\rm enc}$ so that it reflects the hidden structure (e.g., symmetry) of the training data, as suggested in [28,51]; the numerical simulation in Section IV C considers this case. As for the randomizing unitary operator $U_i$, it is of the tensor product form
$$ U_i = U_i^1 \otimes U_i^2 \otimes \cdots \otimes U_i^{n_Q}, $$
where $m$ is an integer called the locality, and we assume that $n_Q = n/m$ is an integer; each $U_i^k$ $(k = 1, 2, \ldots, n_Q)$ is an $m$-qubit unitary operator independently sampled from unitary 2-designs and is fixed during the training. Note that a unitary 2-design is implementable on a circuit with $O(m^2)$ gates [52]. Lastly, the observable $O$ is the sum of $n_Q$ local operators:
$$ O = \sum_{k=1}^{n_Q} I_{(k-1)m} \otimes \mathcal{O} \otimes I_{(n_Q-k)m}, $$
where $I_u$ is the $2^u$-dimensional identity operator and $\mathcal{O}$ is a $2^m$-dimensional traceless operator.
Next we describe the classical part, $f^C_{\theta(t)}$. This is a cNN that takes the vector $f_Q(x^a)$ as the input and returns the output $f^C_{\theta(t)}(f_Q(x^a))$. We implement $f^C_{\theta(t)}$ as an $L$-layer fully connected cNN of the same form as that introduced in Section II, now with $f_Q(x^a)$ as the input. As in the case of the cNN studied in Section II, $W^{(\ell)}$ is the $n_{\ell+1} \times n_\ell$ weighting matrix and $b^{(\ell)}$ is the $n_\ell$-dimensional bias vector; the elements of $W^{(\ell)}$ and $b^{(\ell)}$ are initialized by sampling independently from the standard normal distribution. The parameter $\theta(t)$ is updated by the gradient descent algorithm
$$ \frac{\partial \theta_p(t)}{\partial t} = -\eta \frac{\partial L^Q_t}{\partial \theta_p(t)}, \tag{41} $$
where $L^Q_t$ is the cost function that reflects a distance between $f_{\theta(t)}$ and $f_{\rm goal}$. Also, $\eta$ is the learning rate and $\theta_p(t)$ $(p = 1, 2, \cdots, P)$ is the $p$-th element of $\theta(t)$, which corresponds to the elements of $W^{(1)}, W^{(2)}, \cdots, W^{(L-1)}$ and $b^{(1)}, b^{(2)}, \cdots, b^{(L-1)}$. The task of updating the parameters only appears in the classical part, which can thus be performed by applying some established machine learning solver given the , and the cached output from the quantum part at initialization.
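The pipeline described in this subsection can be sketched with a small state-vector simulation. Everything specific below is an assumption for illustration: the encoder is a hypothetical product of single-qubit RY rotations, Haar-random unitaries (which form an exact 2-design) stand in for the 2-design sampling, the local traceless operator is a Pauli Z on the first qubit of each block, and the classical part is a single hidden layer with `tanh`. The quantum features are computed once and can be cached, since only the classical parameters are trained.

```python
import numpy as np

def haar_unitary(dim, rng):
    """Haar-random unitary from the QR decomposition of a complex Gaussian matrix."""
    z = (rng.standard_normal((dim, dim)) + 1j * rng.standard_normal((dim, dim))) / np.sqrt(2.0)
    q, r = np.linalg.qr(z)
    d = np.diag(r)
    return q * (d / np.abs(d))  # fix column phases so the distribution is Haar

def embed_local(op, k, n_blocks, m):
    """Place an m-qubit operator on block k (0-based) out of n_blocks blocks."""
    left = np.eye(2 ** (m * k))
    right = np.eye(2 ** (m * (n_blocks - k - 1)))
    return np.kron(np.kron(left, op), right)

def build_quantum_part(n_qubits, m, n0, rng):
    """Sample n0 fixed random unitaries U_i and build the observable O."""
    n_blocks = n_qubits // m
    o_local = np.diag(np.repeat([1.0, -1.0], 2 ** (m - 1)))  # traceless (assumed choice)
    observable = sum(embed_local(o_local, k, n_blocks, m) for k in range(n_blocks))
    unitaries = []
    for _ in range(n0):
        u = np.array([[1.0 + 0j]])
        for _ in range(n_blocks):
            u = np.kron(u, haar_unitary(2 ** m, rng))  # tensor product of local unitaries
        unitaries.append(u)
    return unitaries, observable

def encode(x):
    """Hypothetical encoder: RY(x_k) on qubit k applied to |0...0>."""
    state = np.array([1.0 + 0j])
    for angle in x:
        state = np.kron(state, np.array([np.cos(angle / 2), np.sin(angle / 2)]))
    return state

def quantum_features(x, unitaries, observable):
    """Feature vector with entries <psi(x)| U_i^dagger O U_i |psi(x)>."""
    psi = encode(x)
    rotated = [u @ psi for u in unitaries]
    return np.array([np.real(np.vdot(phi, observable @ phi)) for phi in rotated])

def classical_part(features, params, xi=0.1):
    """One-hidden-layer network in the NTK parametrization (assumed form)."""
    n0, n1 = features.shape[0], params["b1"].shape[0]
    hidden = np.tanh(params["W1"] @ features / np.sqrt(n0) + xi * params["b1"])
    return float(params["W2"] @ hidden / np.sqrt(n1) + xi * params["b2"])
```

In this sketch the random unitaries are drawn once in `build_quantum_part` and reused for every input, mirroring the statement that the quantum part is fixed during training.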

B. Quantum neural tangent kernel
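As shown in Section II, when the parameters are updated via the gradient descent method (41), the output function $f_{\theta(t)}$ changes in time according to
$$ \frac{\partial f_{\theta(t)}(x)}{\partial t} = -\eta \sum_{b=1}^{N_D} K_Q(x, x^b, t) \frac{\partial L^Q_t}{\partial f_{\theta(t)}(x^b)}. \tag{42} $$
Here $K_Q(x, x', t)$ is the quantum neural tangent kernel (QNTK), defined by
$$ K_Q(x, x', t) = \sum_{p=1}^{P} \frac{\partial f_{\theta(t)}(x)}{\partial \theta_p(t)} \frac{\partial f_{\theta(t)}(x')}{\partial \theta_p(t)}. $$
It is straightforward to show that $K_Q(x, x', t)$ is positive semi-definite. The reason why we call $K_Q(x, x', t)$ the quantum neural tangent kernel will become clear in the next subsection.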

C. Theorems
We begin with the theorem stating the probability distribution of the output function $f_{\theta(0)}$ in the case $L = 1$; this setting shows how a quantum kernel appears in our model, as follows.
Theorem 3. With $\sigma$ as a Lipschitz function, for $L = 1$ and in the limit $n_0 \to \infty$, the output function $f_{\theta(0)}$ is a centered Gaussian process whose covariance matrix $\Sigma_Q^{(1)}(x, x')$ is given by
$$ \Sigma_Q^{(1)}(x, x') = \frac{\mathrm{Tr}(\mathcal{O}^2)}{2^{2m} - 1} \sum_{k=1}^{n_Q} \left( \mathrm{Tr}\!\left( \rho_x^k \rho_{x'}^k \right) - \frac{1}{2^m} \right) + \xi^2. $$
Here $\rho_x^k$ is the reduced density matrix defined by
$$ \rho_x^k = \mathrm{Tr}_k \left( U_{\rm enc}(x) |0\rangle\langle 0|^{\otimes n} U_{\rm enc}(x)^\dagger \right), \tag{45} $$
where $\mathrm{Tr}_k$ is the partial trace over the entire Hilbert space except the qubits from the $(km - m)$-th to the $(km - 1)$-th.
The proof is found in Appendix A. Note that the term $\sum_{k=1}^{n_Q} \mathrm{Tr}(\rho_x^k \rho_{x'}^k)$ coincides with one of the projected quantum kernels introduced in [27], with the following motivation. When the number of qubits (hence the dimension of the Hilbert space) becomes large, the Gram matrix composed of the inner products between pure states, $\mathrm{Tr}(\rho_x \rho_{x'}) = |\langle \psi(x) | \psi(x') \rangle|^2$, becomes close to the identity matrix under certain types of feature map [27,35,53], meaning that there is no quantum advantage in using this kernel. The projected quantum kernel may serve as a solution to this problem; that is, by projecting the density matrix in a high-dimensional Hilbert space to a low-dimensional one as in (45), the Gram matrix of kernels defined by the inner product of projected density matrices can take some quantum-intrinsic structure that differs largely from the identity matrix.
The covariance matrix $\Sigma_Q^{(1)}(x, x')$ inherits the projected quantum kernel, which can be seen more clearly from the following corollary:
Corollary 1. The covariance matrix obtained in the setting of Theorem 3 is of the form
$$ \Sigma_Q^{(1)}(x, x') = \frac{\mathrm{Tr}(\mathcal{O}^2)}{2^{2m} - 1} \sum_{k=1}^{n_Q} \mathrm{Tr}\!\left( \rho_x^k \rho_{x'}^k \right) $$
if $\xi$ is set to $\xi^2 = \dfrac{n_Q \mathrm{Tr}(\mathcal{O}^2)}{2^m (2^{2m} - 1)}$.
Namely, $\Sigma_Q^{(1)}(x, x')$ is exactly the projected quantum kernel up to a constant factor, if we suitably choose the coefficient of the bias vector given in Eq. (40).
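The way a kernel of reduced density matrices emerges from random measurements can be checked exactly on a single qubit. The sketch below relies on two standard facts rather than on the paper's derivation: the 24-element single-qubit Clifford group is a unitary 2-design, and for a $d$-dimensional 2-design and a traceless observable $\mathcal{O}$, the average of $\mathrm{Tr}(U\rho U^\dagger \mathcal{O})\,\mathrm{Tr}(U\rho' U^\dagger \mathcal{O})$ equals $\mathrm{Tr}(\mathcal{O}^2)\,(\mathrm{Tr}(\rho\rho') - 1/d)/(d^2 - 1)$. The function names are hypothetical, and the example covers only one block with $m = 1$.

```python
import numpy as np

def single_qubit_cliffords():
    """Enumerate the 24 single-qubit Clifford unitaries up to a global phase."""
    h = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
    s = np.array([[1, 0], [0, 1j]], dtype=complex)

    def key(u):
        # Remove the global phase using the first non-negligible entry.
        flat = u.ravel()
        pivot = flat[np.argmax(np.abs(flat) > 1e-9)]
        canon = u * (np.abs(pivot) / pivot)
        # Adding 0.0 turns any negative zero into a positive zero.
        return tuple(np.round(canon, 6).ravel() + 0.0)

    identity = np.eye(2, dtype=complex)
    found = {key(identity): identity}
    frontier = [identity]
    while frontier:  # breadth-first closure under the generators H and S
        nxt = []
        for u in frontier:
            for g in (h, s):
                v = g @ u
                k = key(v)
                if k not in found:
                    found[k] = v
                    nxt.append(v)
        frontier = nxt
    return list(found.values())

def twirled_second_moment(rho_a, rho_b, obs, unitaries):
    """Average of Tr(U rho_a U^dag O) * Tr(U rho_b U^dag O) over the given unitaries."""
    vals = [
        np.real(np.trace(u @ rho_a @ u.conj().T @ obs))
        * np.real(np.trace(u @ rho_b @ u.conj().T @ obs))
        for u in unitaries
    ]
    return float(np.mean(vals))

def two_design_formula(rho_a, rho_b, obs):
    """Tr(O^2) / (d^2 - 1) * (Tr(rho_a rho_b) - 1/d) for a traceless O."""
    d = rho_a.shape[0]
    return float(
        np.real(np.trace(obs @ obs)) / (d**2 - 1)
        * (np.real(np.trace(rho_a @ rho_b)) - 1.0 / d)
    )
```

The agreement between the two functions for arbitrary single-qubit states shows how the overlap $\mathrm{Tr}(\rho\rho')$ appears after averaging over the random unitary; with several blocks, the same identity applies to each reduced density matrix separately.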
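Based on the result for the case $L = 1$, we can derive the following Theorems 4 and 5. First, the distribution of $f_{\theta(0)}$ when $L > 1$ can be recursively computed as follows.
Theorem 4. With $\sigma$ as a Lipschitz function, for $L > 1$ and in the limit $n_\ell \to \infty$ for $0 \le \ell \le L - 1$, the output function $f_{\theta(0)}$ is a centered Gaussian process whose covariance matrix $\Sigma_Q^{(L)}(x, x')$ is given recursively, starting from $\Sigma_Q^{(1)}(x, x')$ of Theorem 3, by
$$ \Sigma_Q^{(\ell+1)}(x, x') = \mathbb{E}_{h \sim \mathcal{N}(0, \Sigma_Q^{(\ell)})}\left[ \sigma(h(x)) \sigma(h(x')) \right] + \xi^2, \tag{48} $$
where the expectation value is calculated by averaging over the centered Gaussian process with covariance matrix $\Sigma_Q^{(\ell)}$.
The proof is found in Appendix B. Note that the only difference between the quantum case (48) and the classical case (8) is the covariance matrix corresponding to the first layer of the entire network.
The infinite width limit of the QNTK can also be derived in a manner similar to Theorem 1, as follows.
Theorem 5. With $\sigma$ as a Lipschitz function, in the limit $n_\ell \to \infty$ for $0 \le \ell \le L - 1$, the QNTK $K_Q(x, x', t)$ converges to the time-invariant function $\Theta_Q^{(L)}(x, x')$, which is given recursively by
$$ \Theta_Q^{(1)}(x, x') = \Sigma_Q^{(1)}(x, x'), \qquad \Theta_Q^{(\ell+1)}(x, x') = \Theta_Q^{(\ell)}(x, x')\, \dot{\Sigma}_Q^{(\ell+1)}(x, x') + \Sigma_Q^{(\ell+1)}(x, x'), $$
where $\dot{\Sigma}_Q^{(\ell+1)}(x, x') = \mathbb{E}_{h \sim \mathcal{N}(0, \Sigma_Q^{(\ell)})}\left[ \dot{\sigma}(h(x)) \dot{\sigma}(h(x')) \right]$ and $\dot{\sigma}$ is the derivative of $\sigma$.
The proof is in Appendix C. Note that the above two theorems can be proven in almost the same manner as in [36].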
When L = 1, the QNTK directly inherits the structure of the quantum kernel, and this is the reason why we call K Q (x, x ′ , t) the quantum NTK.Also, such inherited structure in the first layer propagates to the subsequent layers when L > 1; the resulting kernel is then of the form of a nonlinear function of the projected quantum kernel.Considering the fact that designing an effective quantum kernel is in general quite nontrivial, it is useful for us to have a method to automatically generate a nonlinear kernel function appearing when L > 1.Note that, when the ReLU activation function is used, the analytic form of As in the classical case, Theorem 5 is the key property that enables us to analytically study the training process of the qcNN.In particular, let us recall Theorem 2 and the discussion below Eq. ( 15), showing the importance of positive semi-definiteness or definiteness of the kernel ) Actually, we now have an analogous result to Theorem 2 as follows.
Theorem 6. For a non-constant Lipschitz function σ, the QNTK $\Theta^{(L)}_Q(x, x')$ is positive definite unless there exists $\{c_a\}_{a=1}^{N_D}$ such that (i) $\sum_a c_a \rho^k_{x^a} = 0$ (∀k), $\sum_a c_a = 0$, and $c_a \neq 0$ (∃a), or (ii) $\xi = 0$, $\sum_a c_a \rho^k_{x^a} = I_m/2^m$ (∀k), and $\sum_a c_a = 1$.
We give the proof in Appendix E. Note that condition (i) can be interpreted as the reduced density matrices of the embedded data being linearly dependent, which can be avoided by removing redundant data. It is difficult to give a proper interpretation of condition (ii), but it can still be avoided by setting ξ larger than zero.
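Condition (i) can be checked numerically as a rank test. The sketch below assumes that the reduced density matrices are available as a NumPy array of shape (N_D, K, d, d), with K the number of local subsystems and d = 2^m; the function name and the array layout are hypothetical choices for illustration. Because every reduced density matrix has unit trace, the constraint that the coefficients sum to zero already follows from the first constraint, and the row of ones is kept only to mirror the statement of the theorem. Condition (ii) is not covered here.

```python
import numpy as np

def has_linear_dependence(reduced_states, tol=1e-10):
    """Return True if real coefficients c (not all zero) exist with
    sum_a c_a rho^k_{x^a} = 0 for every k and sum_a c_a = 0.
    reduced_states: complex array of shape (N_D, K, d, d) (assumed layout)."""
    n_data = reduced_states.shape[0]
    flat = reduced_states.reshape(n_data, -1)
    # One column per data point: real parts, imaginary parts, and a trailing 1.
    columns = np.concatenate(
        [flat.real, flat.imag, np.ones((n_data, 1))], axis=1
    ).T
    rank = np.linalg.matrix_rank(columns, tol=tol)
    return bool(rank < n_data)
```

If the function returns True, removing redundant data points until the rank equals the number of remaining points restores the linear independence discussed above.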
Based on the above theorems, we can theoretically analyze the learning process and, moreover, the resulting performance. In the infinite-width limit of the cNN part, the dynamics of the output function f θ(t) (x) given by Eq. (42) takes the form Because the only difference between this dynamical equation and that for the classical case, Eq. (12), is in the form of the NTK, the discussion in Section II D can be directly applied. In particular, if the cost L Q t is the mean squared error (2), the solution of Eq. (50) is given by where V Q is the orthogonal matrix that diagonalizes ), which is generally positive semi-definite. If Theorem 6 holds, then ) is positive definite or equivalently {λ Q j } are all positive; then Eq. (51) shows f θ(t) (x a ) → y a as t → ∞ and thus the learning process perfectly completes. Note that, if the cost is the binary cross-entropy (3), then we have

D. Possible advantage of the proposed model
In this subsection, we discuss two scenarios in which the proposed qcNN may have an advantage over other models.
Possible advantage over pure classical models

First, we discuss a possible advantage of our qcNN over classical models. For this purpose, recall that our QNTK contains features of quantum states in the form of a nonlinear function of the projected quantum kernel, as proven in Theorem 5. Hence, under the assumption of the classical intractability of the projected quantum kernel [27], our QNTK may also be a classically intractable object. As a result, the output function (51) or (53) may achieve a training error or a generalization error smaller than any classical means can reach. Now, considering that designing an effective quantum kernel is in general quite nontrivial, it is useful to have an NN-based method for synthesizing a nonlinear kernel function that really outperforms any classical means for a given task.
To elaborate on the above point, let us study the situation where a quantum advantage would appear in the training error. More specifically, we investigate the condition where holds. Here we assume that the time τ is sufficiently large such that further training does not change the cost. Also, F is the set of differentiable Lipschitz functions, L is the number of layers of the cNN, and the average is taken over the initial parameters. If (54) holds, we can say that our qcNN model is better than the pure classical model regarding the training error. To interpret the condition (54) analytically, let us further assume that the cost is the mean squared error. Then, the condition (54) is approximately rewritten by using Eq. (34) as where k=1 are pairs of the eigenvalues and eigenvectors of ητ and S Q ητ are the sets of indices where λ C j < 1/ητ and λ Q j < 1/ητ , respectively; we call the eigenvectors corresponding to the indices in S C ητ or S Q ητ the bottom eigenvectors. That is, the condition (54) is now converted to the condition (55), which is represented in terms of the eigenvectors of the covariance matrices and the NTKs. Of particular importance are the second terms on both sides. These terms depend only on how well the bottom eigenvectors of Θ (L) (x, x ′ ) or Θ (L) Q (x, x ′ ) align with the label vector y. Therefore, if the bottom eigenvectors of the classically intractable QNTK do not align with y at all, while those of the classical counterparts align with y, Eq. (55) is likely to be satisfied, meaning that we may have an advantage in using our qcNN model over classical models. This discussion also suggests the importance of the structure of the dataset for obtaining a quantum advantage; see Section 7 of the Supplemental Materials of Ref. [27].
In our case, we may even manipulate y so that j∈S C ητ (y 2 for all possible classical models and thereby obtain a dataset advantageous in the qcNN model.A comprehensive study is definitely important for clarifying practical datasets and corresponding encoders that achieve (54), which is left for future work.
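The alignment between the label vector and the bottom eigenvectors discussed above can be evaluated directly from a kernel matrix. The sketch below assumes that the kernel (classical NTK or QNTK) has already been evaluated on the training inputs as a symmetric matrix; the function name and the argument `eta_tau` (the product ητ) are illustrative assumptions. It returns the sum of squared projections of y onto the eigenvectors whose eigenvalues lie below 1/(ητ).

```python
import numpy as np

def bottom_alignment(kernel, y, eta_tau):
    """Sum over j with lambda_j < 1/(eta*tau) of (y . v_j)^2, where
    (lambda_j, v_j) are eigenpairs of the symmetric kernel matrix."""
    lam, vecs = np.linalg.eigh(kernel)
    bottom = lam < 1.0 / eta_tau          # indices of the bottom eigenvectors
    projections = vecs.T @ y              # components of y in the eigenbasis
    return float(np.sum(projections[bottom] ** 2))
```

A small value means the labels are carried mostly by the fast-converging modes; comparing this quantity for a classical NTK and for the QNTK on the same labels gives a rough numerical handle on the comparison described in the text.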

Note on the quantum kernel method
The proposed qcNN model has a merit in terms of the computational complexity of the training process, compared to the quantum kernel method. As shown in [33], by using the representer theorem [54], the quantum kernel method in general is likely to give better solutions, in terms of the training error, than the standard variational method (i.e., one in which the data-encoding unitary is used just once). However, the quantum kernel method scales poorly, as does its classical counterpart; that is, $O(N_D^2)$ computation is needed to calculate the quantum kernel. In contrast, our qcNN is exactly the kernel method in the infinite-width limit of the classical part, and the computational complexity to learn the approximator is $O(N_D T)$ with T the number of iterations. Therefore, as long as the number of iterations satisfies $T \ll N_D$, our qcNN model can be regarded as a scalable quantum kernel method.
Specific setting where our model outperforms pure quantum or classical models

Secondly, we discuss the possible advantage of the proposed qcNN model over some other models, in terms of the training error, in the following feature prediction problem of quantum states. That is, we are given the training set {ρ(x a ), y a }, where ρ(x a ) is an unknown quantum state with x a the characteristic input label such as temperature, and y a is the output mean value of an observable such as the total magnetization; the problem is, based on this training set, to construct a predictor of y for a new label x or equivalently ρ(x). Let us now assume that the proposed model can directly access ρ(x a ); then clearly it gives a better approximator to the training dataset and thereby a better predictor compared to any classical model that can only use {x a , y a }. Also, as shown below Theorem 5, our model can represent a nonlinear function of the projected quantum kernel and thus presumably approximates the training dataset better than any full-quantum model that can also access ρ(x a ) yet is limited to producing a linear function y = Tr[AU (θ)ρ(x)U † (θ)] with an observable A. These advantages are numerically demonstrated in Section IV C. Moreover, Ref. [50] proposed a model that makes a random measurement on ρ(x a ) to generate a classical shadow for approximating ρ(x a ) and then constructs a function of the shadows to predict y for a new input ρ(x). Note that our model constructs an approximator directly using the randomized measurement without constructing the classical shadows and thus includes the class of systems proposed in [50]; hence the former can perform better than the latter. Importantly, Ref. [50] identifies the class of problems that can be efficiently solved by their model; hence, in principle, this class of problems can also be solved by our model. Lastly, Ref. [55] identifies a class of similar feature-prediction problems that can be solved via a specific quantum model with a constant number of training data but via any classical model only with an exponential number of training data. We will try to identify the setting that realizes this provable quantum advantage in our qcNN framework.

IV. NUMERICAL EXPERIMENT
The aim of this section is to numerically answer the following three questions:
• How fast is the convergence of the QNTK stated in the theorems of the previous section? In other words, how large is the gap between the training dynamics of an actual finite-width qcNN and that of the theoretical infinite-width qcNN?
• How much does the locality m (i.e., the size of randomization in the qcNN for extracting the features of encoded data) affect the training of the qcNN?
• Is there any clear merit of using our proposed qcNN over fully-classical or fully-quantum machine learning models?
To examine these questions, we perform the following three types of numerical experiments. As for the first question, in Sec. IV A we compare the performance of a finite-width qcNN with that of the infinite-width qcNN in specific regression and classification problems; in particular, various types of quantum data-encoders will be studied. We then examine the second question for a specific regression problem, in Sec. IV B. Finally, in Sec. IV C, we compare the performance of a finite-width qcNN with a fully-quantum NN (qNN) as well as a fully-classical NN (cNN), in special types of regression and classification problems such that the dataset is generated through a certain quantum process. Throughout our numerical experiments, we use qulacs [56] to simulate the quantum circuits.

A. Finite-width qcNN vs infinite-width qcNN
In this subsection, we compare the performance of an actual finite-width qcNN with that of the theoretical infinite-width qcNN, in a regression task and a classification task with various types of quantum data-encoders.

Experimental settings
Choices of the quantum circuit. For the quantum data-encoding part, we employ 5 types of quantum circuit U enc (x) whose structural properties are listed in Table I together with Fig. 2. In all 5 cases, the circuit is composed of n qubits, and Hadamard gates are first applied to each qubit, followed by RZ-gates that encode the data element x i ∈ [−1, 1] in the form RZ(x i ) = exp(−2πix i ); here, the data vector is $x = (x_1, \ldots, x_n)$, meaning that the dimension of the data vector is equal to the number of qubits. The subsequent quantum circuit is categorized as type-A or type-B as follows. As for the type-A encoders, we consider three types of circuits named Ansatz-A, Ansatz-A4, and Ansatz-A4ne (Ansatz-A4 is constructed by repeating Ansatz-A 4 times); they contain additional data-encoders composed of RZ-gates with cross-terms of data values, i.e., x i x j (i, j ∈ [1, 2, • • • , n]). On the other hand, the type-B encoders, Ansatz-B and Ansatz-Bne, which also employ RZ-gates for encoding the data variables, do not have such cross-terms, implying that the type-A encoders have higher nonlinearity than the type-B encoders. Another notable difference between the circuits is the existence of CNOT gates; that is, Ansatz-A, Ansatz-A4, and Ansatz-B contain CNOT gates, while Ansatz-A4ne and Ansatz-Bne do not ("ne" stands for "non-entangled"). In general, a large quantum circuit with many CNOT gates may be difficult to classically simulate, and thus Ansatz-A, Ansatz-A4, and Ansatz-B are expected to show better performance than the other two circuits for some specific tasks. The structure of the subsequent classical NN part will be shown in the following subsection.
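To make the common first stage of the encoders concrete, the following sketch builds the state obtained by applying a Hadamard gate and then an RZ rotation to each qubit, with no CNOT gates and no cross-terms (i.e., resembling the simplest non-entangled structure). It is an illustration only: the function name is hypothetical, the convention RZ(θ) = diag(e^{−iθ/2}, e^{iθ/2}) with θ = 2πx_i is an assumption about the angle normalization, and the actual circuits are those specified in Table I and Fig. 2.

```python
import numpy as np

def encode_product_state(x):
    """State vector of prod_i RZ(2*pi*x_i) H |0> over all qubits.
    Assumes RZ(theta) = diag(exp(-i*theta/2), exp(+i*theta/2))."""
    state = np.array([1.0 + 0.0j])
    for xi in x:
        theta = 2.0 * np.pi * xi
        # H|0> = (|0> + |1>)/sqrt(2); RZ then attaches opposite phases.
        qubit = np.array([np.exp(-0.5j * theta), np.exp(0.5j * theta)]) / np.sqrt(2.0)
        state = np.kron(state, qubit)
    return state
```

The resulting vector has length 2^n for an n-dimensional data vector, which reflects the one-feature-per-qubit encoding described above.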
Training method for the classical neural network. In our framework, the trainable parameters are contained only in the classical part (cNN), and they are updated via a standard optimization method. First, we compute the outputs of the quantum circuit, , for all the training data {(x a , y a )}, a ∈ [1, 2, . . ., N D ]; see Fig. 1. The outputs are generated through n 0 randomized unitaries {U 1 , U 2 , . . ., U n0 }, where U i is sampled from unitary 2-designs with the locality m = 1 [57]. We calculate the expectation of U † i OU i directly using the state vector simulator instead of sampling (the effect of shot noise is analyzed in Sec. IV C), and these values are forwarded as the inputs to the cNN (recall that n 0 corresponds to the width of the first layer of the cNN). The training of the cNN is performed by using standard gradient descent methods, whose type and hyper-parameters such as the learning rate are appropriately selected for each task, as will be described later. The parameters at t = 0 are randomly chosen from the normal distribution N (0, 2/N param ), where N param is the number of parameters in each layer (here N (µ, σ) is the normal distribution with mean µ and standard deviation σ).

FIG. 2 (caption fragment): ... ) are encoded into the angles of the RZ-gates. They are followed by the entangling gate composed of CNOT-gates in (a) and (c). Also, (a) and (b) have RZ-gates whose rotating angles are the product of two data values, which are called "Cross-term" in Table I. Note that the rotating angle of RZ(x) is 2πx in (a) and (b), and the dashed rectangle (shown as "Depth=1") is repeated 4 times in both Ansatz-A4 and Ansatz-A4ne.
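The feature-extraction step described above (random local unitaries followed by exact expectation values) can be sketched with plain NumPy state vectors. The sketch makes several assumptions that are not taken from the paper: single-qubit Haar-random unitaries are used as a stand-in for a unitary 2-design with m = 1, each U_i is taken to be a tensor product over all qubits, and the observable is passed in by the caller. All function names are hypothetical; the experiments themselves use qulacs.

```python
import numpy as np

def haar_unitary(dim, rng):
    """Haar-random unitary via QR decomposition with phase correction."""
    z = (rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))) / np.sqrt(2.0)
    q, r = np.linalg.qr(z)
    d = np.diagonal(r)
    return q * (d / np.abs(d))  # rescale each column by a unit phase

def local_random_unitary(n_qubits, rng):
    """Tensor product of independent single-qubit random unitaries (m = 1)."""
    u = np.array([[1.0 + 0.0j]])
    for _ in range(n_qubits):
        u = np.kron(u, haar_unitary(2, rng))
    return u

def quantum_features(state, observable, unitaries):
    """Exact expectation values <psi| U_i^dagger O U_i |psi> for each U_i."""
    feats = []
    for u in unitaries:
        rotated = u @ state
        feats.append(np.real(np.vdot(rotated, observable @ rotated)))
    return np.array(feats)

def init_weights(n_param, rng):
    """Initial cNN parameters drawn from N(0, 2/N_param), reading the second
    argument as the standard deviation, as stated in the text."""
    return rng.normal(loc=0.0, scale=2.0 / n_param, size=n_param)
```

Drawing n_0 unitaries once and reusing them for every training input yields the fixed n_0-dimensional feature vector that is fed into the cNN; only the classical weights are then trained.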

Results
Result of the regression task. For the regression task, we consider the 1-dimensional hidden function f goal (x) = sin(x) + ϵ, where ϵ is i.i.d. stochastic noise drawn from the normal distribution N (0, 0.05). The 1-dimensional input data x is embedded into the 4-dimensional vector Here the number of training data points is chosen as N D = 100. Also, the number of qubits is set to n = 4. We use the mean squared error for the cost function and stochastic gradient descent (SGD) with learning rate $10^{-4}$ for the optimizer. The cNN has a single hidden layer (i.e., L = 1) with the number of nodes $n_0 = 10^3$, which is equal to the number of inputs and outputs of cNN.
The time-evolution of the cost function during the learning process obtained by the numerical simulation with $n_0 = 10^3$ and its theoretical expression assuming n 0 → ∞ are shown in the left "Simulation" and the right "Theory" figures, respectively, in Fig. 3. The curves illustrated in the figures are the best results among 100 trials in total of choosing {U i } as well as the initial parameters of the cNN. Notably, the convergent values obtained in the simulation agree well with those of the theoretical prediction. This means that the performance of the proposed qcNN model can be analytically investigated for various quantum circuit settings.
Another important fact is that the type-B encoders show better performance than the type-A encoders. This might be because the type-A encoders have too high expressibility for fitting the simple hidden function, which can be systematically analysed as demonstrated in [58,59]. That is, the number of repetitions of the encoding circuit determines the distribution of Fourier coefficients of the model function; if the model function contains more frequency components, then it has a larger expressibility for fitting the target function. From this perspective, it is reasonable that the type-B encoders (which have only a single-layer encoding block) show better performance than the type-A encoders (which have a 4-times-repeated encoding block), since the target hidden function is the single-frequency sin function in our setting. This observation is actually supported by another result showing that Ansatz-A4 shows the best performance for a somewhat complicated hidden function $f_{\rm goal}(x) = (x - 0.2)^2 \sin(12x)$. In summary, the encoder strongly affects the overall performance and thus should be designed by carefully tuning its expressibility.

Result of the classification task. For the classification task, we use a dataset available at [60], which was used to demonstrate that the quantum support vector machine has some advantage over the classical counterpart [61]. Each input data vector x is 2-dimensional, and thus the number of qubits in the quantum circuit is set as n = 2. The default number of inputs into the cNN, or equivalently the width of the cNN, is chosen as $n_0 = 10^3$; in addition, we will test the cases $n_0 = 10^2$ and $n_0 = 10^4$ for the case of Ansatz-A4ne. Also, we study two different cases of the number of layers of the cNN, L = 1 and L = 2. As for the activation function in the cNN, we employ the sigmoid function $\sigma(q) = 1/(1 + e^{-q})$ for the output layer of both the L = 1 and L = 2 cases, and ReLU σ(q) = max(0, q) for the input layer of the L = 2 case; also, the number of nodes is $n_0 = 10^3$ for the L = 1 case and $n_0 = n_1 = 10^3$ for the L = 2 case. The number of output labels y is two, and correspondingly the model yields the output label according to the following rule: if f C θ(t) (f Q (x a )) is bigger than 0.5, then the output label is "1"; otherwise, the output label is "0". The number of training data is n D = 50 for each class. As the optimizer for the learning process, Adam [62] with learning rate $10^{-3}$ is used, and the binary cross entropy (3) is employed as the cost function.
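The output rule and the cost used in the classification task can be written in a few lines. This is a minimal sketch: the function names are hypothetical, the mean over data points in the cross entropy is an assumed normalization that may differ from the constant factor used in Eq. (3), and the clipping constant is only a numerical safeguard.

```python
import numpy as np

def sigmoid(q):
    """Sigmoid activation used at the output node."""
    return 1.0 / (1.0 + np.exp(-q))

def predict_labels(outputs, threshold=0.5):
    """Label 1 if the network output exceeds the threshold, else label 0."""
    return (np.asarray(outputs) > threshold).astype(int)

def binary_cross_entropy(outputs, labels, eps=1e-12):
    """Mean binary cross entropy between outputs in (0, 1) and 0/1 labels."""
    p = np.clip(np.asarray(outputs, dtype=float), eps, 1.0 - eps)
    y = np.asarray(labels, dtype=float)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))
```

The failure rate on a test set is then simply the fraction of entries where `predict_labels` disagrees with the true labels.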
The time-evolution of the cost function during the learning process obtained by the numerical simulation and its theoretical prediction corresponding to the infinite-width cNN are shown in Fig. 4. The curves illustrated in the figures are the best results among 100 trials in total of choosing {U i } as well as the initial parameters of the cNN. Clearly, the time-evolution trajectories in the Simulation and Theory figures for the same ansatz are similar, particularly in the case of L = 1. However, there is a notable difference between Ansatz-A4 and Ansatz-A4ne; in the Theory figures, the former reaches a final value lower than that achieved by the latter, while in the Simulation figures this ordering is reversed. Now recall that Ansatz-A4 is the ansatz containing CNOT gates, which induce a classically intractable quantum state. In this sense, it is interesting that Ansatz-A4 outperforms Ansatz-A4ne, although this is observed only in the case (b) L = 1 Theory.
In addition, to see the effect of enlarging the width of the cNN, we compare three cases where the quantum part is fixed to Ansatz-A4ne and the width of the cNN varies as $n_0 = 10^2, 10^3, 10^4$, in the case of (a) L = 1 Simulation. (Recall that the curve in the Theory figure corresponds to the limit n 0 → ∞.) The result is that the convergence becomes faster and the final cost becomes smaller as n 0 becomes larger, which is indeed consistent with the NTK theory.
In figures (c, d) for L = 2, the trajectory from the simulation closely mirrors that of the theory. In particular, the theoretical result successfully predicts which encoder is effective. We also observe that the convergence speed of the theoretical result when L = 2 is significantly slower than that for L = 1, due to small eigenvalues in the QNTK. Consequently, the training using a finite-width DNN does not converge within our 10,000-iteration experiment. This results in a large discrepancy between the final cost values in Simulation and Theory in the cases of type-B. In the long-iteration limit, we anticipate that the final cost values of Simulation and Theory will almost align. Moreover, although the trajectories of type-A from Simulation reach lower values in fewer iterations than in Theory, this does not necessarily imply that the convergence speed of the Simulation is faster. Even if the convergence speed is not faster than that in Theory, the cost values may still reach smaller values within a small number of steps if the final cost values at convergence in Simulation are smaller than those in Theory. To examine these convergence properties, simulations with more steps are required, which will be addressed in future research.
Finally, to see the generalization error, we input 100 test data into the trained qcNN models. Figure 5 shows the failure rate, which can be regarded as the generalization error, for some types of ansatz. Because the failure rate obtained when using the classical kernel method presented in [63] is 45%, Ansatz-A4 and Ansatz-A4ne achieve better performance. This indicates that a qcNN with enough expressibility could have higher performance than the classical method. As another important fact, the result is consistent with that of the training error; that is, the ansatz achieving the lower training error shows the lower test error. This might seem inconsistent with the following general feature in machine learning: too much expressibility leads to overfitting and eventually degrades the machine learning performance. However, our model is a function of the projected quantum kernel, which may have a good generalization capability as suggested in [27]. Hence our qcNN model achieving a small training error would have a good generalization capability. Further work comparing the performance achieved by full-quantum and full-classical methods will be presented in Section IV C.

Figure caption fragment: Figures (c, d) depict the results in the case of L = 1 and L = respectively. The dataset is used for each ansatz.

Effect of the locality on the machine learning performance
Here we focus on the locality m, i.e., the size of the unitary gate. In our framework, this is regarded as a hyper-parameter, which determines the dimension of the Hilbert space, $2^m$. Note that the system performance may degrade if m is too large, as pointed out in [27], and thus m should be carefully chosen. Also, m affects the eigenvalue distribution of the QNTK, which closely relates to the convergence speed of the learning dynamics. Considering that a random circuit may extract an essential quantum effect in addition to the above-mentioned practical aspect, in this subsection we study a specific system and an ML task to analyze how much the locality m affects the convergence speed and the resultant performance. The ML task is the classification problem for the Heart Disease dataset [64]. This dataset has 12 features, meaning that we use a 12-qubit system to encode one feature into one qubit. The goal is to use the training dataset to construct a model system that predicts whether a patient would have a heart disease. The number of training data is 100, half of which are the data of patients having a heart disease. We take the qcNN model with L = 1 and several values of m; in particular, we examine the cases m = 1, 2, 3, 4, 6 for the same dataset. The other settings, including the cost function, are the same as those used in the previous classification experiment discussed in Sec. IV A.
We use the theoretical expression of the training process given in Eq. (51), which is obtained for the infinite-width qcNN, rather than simulating the cost by performing actual training. The learning curves are shown in Fig. 6. As expected, the convergence speed and the value of the final cost change largely depending on m. To understand the mechanism of this result, let us first recall that the training curve is characterized by the eigenvalues of the QNTK Θ (1) Q (x, x ′ ), where x is the data vector. More precisely, as explicitly shown in Eq. (51), the dynamical component of the index j with a large eigenvalue λ j converges rapidly, while a component with a small eigenvalue converges slowly. As a result, the distribution of eigenvalues of the QNTK determines the entire convergence property of the training dynamics. In particular, the ratio of small eigenvalues is a key to characterizing the convergence speed. In our simulation, we observe that the magnitude of the eigenvalues gets smaller overall with larger m; this implies that the overall convergence speed would decrease, and Fig. 6 indeed shows this trend. On the other hand, the variance of the eigenvalue distribution also gets smaller with larger m; as a result, the minimum eigenvalue when m = 2 is larger than that when m = 1, implying that the training dynamics with m = 2 may have better overall performance in searching for the minimum of the cost than that with m = 1. Therefore, there should be a trade-off in m. Indeed, Fig. 6 clearly shows that, overall, the case of m = 2 or m = 3 leads to better performance in training. This further leads us to conjecture that, in general, a larger value of m may not lead to better performance and there is an appropriate value of m; considering that a large random quantum circuit is difficult to classically simulate, this observation implies a limitation of the genuine quantum part in the proposed qcNN model.
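The role of the eigenvalues in the convergence can be illustrated with the standard closed-form solution of kernel gradient flow under a squared-error cost. The sketch below is written under assumptions: the kernel is evaluated on the training inputs as a symmetric positive semi-definite matrix, the learning rate η enters as exp(−η λ_j t), and constant factors coming from the normalization of the cost are ignored, so Eq. (51) may differ by such factors. The function names are hypothetical.

```python
import numpy as np

def kernel_flow_outputs(kernel, y, f0, eta, t):
    """Outputs on the training inputs at time t: each eigen-component of the
    residual f0 - y decays as exp(-eta * lambda_j * t)."""
    lam, vecs = np.linalg.eigh(kernel)
    decay = np.exp(-eta * lam * t)
    return y + vecs @ (decay * (vecs.T @ (f0 - y)))

def mode_convergence_times(kernel, eta, tol=1e-2):
    """Time for each eigen-component to shrink by the factor tol, in ascending
    order of eigenvalue; small eigenvalues give long times."""
    lam = np.linalg.eigvalsh(kernel)
    return np.log(1.0 / tol) / (eta * lam)
```

Comparing `mode_convergence_times` for kernels obtained with different m makes the trade-off explicit: a smaller overall magnitude of eigenvalues slows every mode, whereas a larger minimum eigenvalue shortens the slowest one.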
We further note that the value of the final cost, which determines the prediction capability for classifying unseen data, changes largely depending on the type of ansatz. It is particularly notable that Ansatz-A4ne or Ansatz-Bne achieves the best score for all m. These are the ansatzes that contain no CNOT gate, and thus the corresponding quantum states are classically simulatable. That is, for the Heart Disease dataset, it seems that the genuine quantum property, including entanglement, is not effectively used for enhancing the classification performance of the qcNN model. This fact is consistent with the claim given in [27], stating that all quantum machine learning systems will improve the performance for some specific dataset. In the next subsection, therefore, we will show another learning task with a special type of dataset such that the proposed qcNN containing CNOT gates has a certain advantage.

Here we study a regression task and a classification task for the dataset generated by a quantum process, to demonstrate a possible quantum advantage of the proposed qcNN model as discussed in Section III D. Our experimental setting is based on the concept suggesting that a quantum machine learning model, which is appropriately constructed by carefully taking into account the dataset, may have good learnability and generalization capability over classical means. In particular, there are some arguments discussing quantum advantage for datasets generated through a quantum process; see for example [28]. We show that our qcNN model has such a desirable property and actually shows better performance, even with far fewer parameters and thus a smaller training cost, compared to fully classical and fully quantum means for learning the quantum data-generating process.

Machine learning task and models
First, we explain the meaning of data generated from a quantum process, which we simply call "quantum data" (the case described here is a concrete example of the setting addressed in Section III D). Typically, quantum data is the output state of a quantum system driven by a Hamiltonian H(x a ), i.e., $\rho(x^a) = e^{-iH(x^a)} \rho_0 e^{iH(x^a)}$, where the input x a represents some characteristics of the state such as a controllable temperature. The state ρ(x a ) further evolves through an unknown quantum process including a measurement process. Finally, the output y a is obtained by measuring some observables; thus, y a represents a feature of the process or ρ(x a ) itself. Given such a training dataset $\{x^a, y^a\}_{a=1}^{N_D}$, the task is to construct a function that approximates this input-output mapping with good generalization capability for unseen input data. This problem is related to the general quantum phase recognition (QPR) problem [12,65,66] in condensed-matter physics [67], which is inherently hard classically but which some quantum machine learning methods may solve efficiently [68][69][70].
Here we study a specialized version of the above-described problem such that the training dataset $\{x^a, y^a\}_{a=1}^{N_D}$ is provided as follows. The input dataset $\{x^a\}_{a=1}^{N_D}$ is simply generated from the n-dimensional uniform distribution on $[0, 2\pi]^n$. Then ρ(x a ) is generated via an unknown quantum dynamical process $U_{\rm enc}(x) = e^{-iH(x)}$; in the simulation, we assume that this process is given by the quantum circuit shown in Fig. 7 (a), composed of single-qubit RX-rotation gates followed by a random multi-qubit unitary operator U random , the detail of which is shown in Appendix G. The output y a is determined depending on the task. For the regression task, it is given by y a = cg(x a ) + ϵ a , where g(x) = Tr [ρ(x)O] and ϵ is a Gaussian noise with $\mathrm{Var}[\epsilon] = 10^{-4}$. This measurement process may contain some uncertainties, and thus we assume that the observable O is unknown to the algorithms. Also, c is the normalization constant introduced to satisfy Var [g(x)] = 1. For the classification task, if g(x a ) ≥ n/2 then y a = 1, and otherwise y a = 0, where again g(x) = Tr [ρ(x)O]. In the simulation, we take $O = \sum_{i=1}^{n} (\sigma_z^{(i)} + 1^{(i)})/2$, where $\sigma_z^{(i)}$ is the Pauli z operator and $1^{(i)}$ is the identity operator on the i-th qubit. The number of training data is chosen as N D = 1000 for the regression task and N D = 3000 for the classification task. Moreover, we evaluate the generalization capability using N test = 100 test data, which is common to both tasks. We employ three types of learning models, which are shown in Fig. 7.
First, figure (b) shows our qcNN model. The point is that this model contains the same encoder U enc (x) as that used for generating the training data; that is, the qcNN model has direct access to the quantum data ρ(x a ). This process is then followed by the random unitary operator given by the product of single-qubit gates (i.e., m = 1). The output of the quantum circuit is generated by measuring the observable $O = \sum_{i=1}^{n} (\sigma_z^{(i)} + 1^{(i)})/2$. Note that this is the same observable as that used for generating the training data, which is assumed to be unknown to the algorithm. However, the random unitary process before the measurement makes this assumption still valid; indeed, we found that choosing an observable other than O does not largely change the final performance. Then the expectation of the measurement results is transferred as the input to the single-layer (i.e., L = 1) cNN. The activation function of the output node is chosen depending on the task; we employ the identity function for the regression task, and the sigmoid function $\sigma(q) = 1/(1 + e^{-q})$ for the classification task. Finally, the cNN generates y pred as the raw output for the regression task or y pred as the binarized output of the cNN for the classification task; that is, as for the latter, y pred = 1 if the output of the cNN is bigger than the threshold 0.5 and y pred = 0 if the output is below 0.5. Note that Ref. [50] uses the random measurement to generate a classical shadow for approximating ρ(x a ) and then constructs a classical machine learning model in terms of the shadows to predict y for unseen ρ(x); in contrast, our approach constructs a machine learning model directly using the randomized measurement result without constructing the classical shadows and thus has a clear computational advantage.
The second model is the quantum neural network (qNN) depicted in Fig. 7 (c).This model also has the quantum data ρ(x a ) directly as input, as in the qcNN model; that is, in the simulation, the data vector x a is encoded to the same quantum circuit U enc (x) shown in Fig. 7 (a).Then a parametrized quantum circuit depicted in the dotted box follows the encoder; this circuit is repeated L q times, where each block contains different parameters.The output of qNN is computed as y pred = wTr [ρ ′ (x)O ′ ], where ρ ′ (x) is the output state of the entire quantum circuit and O ′ is chosen as z .Lastly, w is a scalar parameter; the parameter w is optimized together with the circuit parameters {θ i,j }, to adjust the output range in the case of the regression task, while w is fixed to 1 in the case of the classification task.As in the first model, we use y pred for the regression task and its binarized version for the classification task.
The third model is the 3-layer cNN depicted in Fig. 7 (d), composed of the n-node input layer, the n 0 -node hidden layer, and a single-node output layer. The input and the hidden layers are fully connected; the output node is also fully connected to the hidden layer. We input the data vector x a so that its i-th component x a i is the input to the i-th node. The activation function is chosen depending on the layer and the task; we employ the sigmoid function $\sigma(q) = 1/(1 + e^{-q})$ in the hidden layer and the identity function in the output node for the regression task, while we use ReLU σ(q) = max{0, q} in the hidden layer and the sigmoid function in the output node for the classification task. Hence, this pure classical model knows neither the quantum data ρ(x a ) nor the observable O (or the model does not have enough power for computing these possibly large matrices). In all the above three models, we will test four cases of the number of qubits, n ∈ {2, 3, 4, 5}. Also, the number of nodes in the cNN is chosen as $n_0 = 10^3$ for the qcNN model and the pure cNN model. The expectation value of the observable is calculated using the statevector. Adam [62] is used for optimizing the parameters.
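The construction of the regression and classification targets described in this subsection can be sketched as follows. The values g(x^a) = Tr[ρ(x^a)O] are assumed to be supplied by the caller (for example from a state-vector simulation); the function name is hypothetical, and estimating the normalization constant c from the empirical standard deviation of the supplied values is an assumption about how the unit-variance condition is imposed.

```python
import numpy as np

def make_targets(g_values, n_qubits, rng, noise_var=1e-4):
    """Build targets from g(x^a) = Tr[rho(x^a) O].
    Regression:     y = c * g + eps, with c chosen so that c * g has unit
                    (empirical) variance and eps ~ Gaussian(0, noise_var).
    Classification: y = 1 if g >= n/2, else 0."""
    g = np.asarray(g_values, dtype=float)
    c = 1.0 / np.std(g)
    noise = rng.normal(loc=0.0, scale=np.sqrt(noise_var), size=g.shape)
    y_regression = c * g + noise
    y_classification = (g >= n_qubits / 2.0).astype(int)
    return y_regression, y_classification
```

With the observable range [0, n] implied by the threshold n/2, the classification labels split the data at the midpoint of the possible expectation values.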

Results
Result of the regression task. The resulting performance of the regression task, for both the training and test processes, is shown in Fig. 8. We plot the root mean squared error (RMSE) between the predicted value $y_{\rm pred}$ and the true value, versus the number of qubits $n$. For each $n$, we performed five trials and computed the mean and the standard deviation of the RMSE. Note that the three models have different numbers of parameters and measured observables (the latter applies only to the qcNN and qNN models); detailed information is given in Table II. In particular, the number of parameters of the qcNN is much smaller than that of the cNN, although they have the same width ($n_0$). Also, the number of measured observables required for optimizing the qcNN is much smaller than that for the qNN, because the qcNN model does not need to optimize the quantum part by repeatedly measuring the output quantum state.
That is, in total, the qcNN is a compact machine learning model compared to the other two. Nonetheless, the qcNN achieves the best performance for all numbers of qubits and on both the training and test datasets, as shown in Fig. 8. This is mainly thanks to the "inductive bias" [28], meaning that the qcNN model has an inherent advantage of taking the quantum data itself as input. This bias is also given to the qNN model, but that model fails to approximate the training and test data as the number of qubits increases; this is presumably because the model does not have sufficient expressive power to represent the target function in the Hilbert space. However, even if the qNN model had such power, it would still suffer from several difficulties in the learning process, such as the barren plateau issue and the increasing number of measurements. It is also notable that both the qcNN and the qNN show almost the same performance on the training and test datasets, meaning that they do not overfit to the target data, while the performance of the cNN model becomes worse on the test data. This difference might arise because the quantum models can access the quantum state of the data; this is indeed an inductive bias, which may be effectively used to obtain a good generalization capability, as suggested in [28].
Result of the classification task. The resulting performance of the classification task is shown in Fig. 9, which plots the Accuracy versus the number of qubits $n$. We performed five trials and computed the mean and the standard deviation of the Accuracy. In this task, we observed a performance trend similar to that of the regression task: the qcNN shows the best performance for all $n$, presumably for the same reasons discussed above, namely the inductive bias in the qcNN model and the lack of expressibility of the qNN. In addition to the two-class classification task, we also executed a multiclass classification task with the same type of data and models, and observed a similar performance trend (see Appendix F). Just for reference, we also plot the Accuracy of the qcNN for the case of $N_{\rm ite} = 3000$ and $n_0 = 3000$, indicated by 'qcNN(tuned)' in Fig. 9 (recall that the Accuracies of the other three are obtained with $N_{\rm ite} = 1000$ and $n_0 = 1000$). This shows that the performance of the qcNN model can be improved by modifying only the classical part.

Effect of shot noise. Finally, we study how the regression/classification performance of the qcNN model changes with respect to the number of shots (measurements); recall that the previous numerical simulations shown in Figs. 8 and 9 use the statevector simulator, meaning that the number of shots is effectively infinite. The problem settings, including the type of dataset and the learning model, are the same as those studied in the previous subsections. The result is summarized in Fig. 10, showing (a) the RMSE for the regression task and (b) the Accuracy for the classification task. In both cases, we examined different numbers of qubits $n$, where the number of layers of the cNN part is 1; we also examined the 2-layer cNN, only for the case of $n = 5$, where the widths of the first and second layers are $n_0 = n_1 = 10^3$.
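The finite-shot setting described above can be emulated classically. The sketch below is an illustration under stated assumptions, not the simulation code used in the paper: it assumes a single observable with outcomes $\pm 1$ whose exact expectation value is known, draws the number of $+1$ outcomes from a binomial distribution, and measures the root-mean-square error of the resulting estimate. The names `sample_expectation` and `estimator_rms_error` are hypothetical. Note that it quantifies only the statistical error of the measured expectation value fed to the cNN, not the RMSE of the learning task itself.

```python
import numpy as np


def sample_expectation(exact_value, n_shots, rng, size=None):
    """Estimate <O> of a +/-1-valued observable from n_shots simulated measurements."""
    p_plus = 0.5 * (1.0 + exact_value)              # probability of outcome +1
    n_plus = rng.binomial(n_shots, p_plus, size=size)
    return 2.0 * n_plus / n_shots - 1.0             # sample mean of the +/-1 outcomes


def estimator_rms_error(exact_value, n_shots, n_repeats, rng):
    """Root-mean-square deviation of the finite-shot estimate from the exact value."""
    estimates = sample_expectation(exact_value, n_shots, rng, size=n_repeats)
    return float(np.sqrt(np.mean((estimates - exact_value) ** 2)))
```

For such an estimator the standard deviation is $\sqrt{(1 - \langle O \rangle^2)/N_{\rm shot}}$, so increasing the number of shots by a factor of 100 should reduce the error returned by `estimator_rms_error` by roughly a factor of 10.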
In the regression task, we observe a clear statistical trend between the RMSE $\epsilon$ and the number of shots $N_{\rm shot}$, namely $\epsilon \sim O(1/\sqrt{N_{\rm shot}})$, except for the case of the 2-layer cNN denoted as 'n = 5(L2)'. The figure suggests that $10^5$ shots offer a performance comparable to that of the ideal statevector simulator. It is notable that the 2-layer cNN significantly reduces the RMSE, especially when the number of shots is relatively small; this suggests that the higher nonlinearity of the cNN compensates for the shot noise. As for the classification task shown in Fig. 10 (b), it is notable that the necessary number of shots is much smaller than in the regression case. In particular, $10^2$ shots achieve a performance comparable to that of the ideal statevector simulator. This means that the shot noise, or equivalently the noise contained in the input data to the cNN part, does not strongly affect the performance in the classification task.

II. The first and second rows are calculated based on the structure of the models, and the third row is the specific $N_p$ in our numerical experiments. Note that $N_p$ differs depending on the model. The value of $N_p$ of the qNN seems to be much smaller than the others, but this was chosen so that the computational cost for training the parameters in the quantum part becomes practical (recall that the operation speed of the quantum device is slower than that of state-of-the-art classical computers). Table III shows the number of iterations for training, $N_{\rm ite}$, together with $N_{\rm qc}$ that is calculated by substituting $N_{\rm ite}$ for $N_{\rm qc}$ in Table II.

In this paper, we studied a qcNN composed of a quantum data-encoder followed by a cNN, such that the seminal NTK theory can be directly applied. With appropriate random initialization in both parts and by taking the large-width limit of the cNN, the QNTK defined for the entire system becomes time-invariant, and accordingly the dynamics of the training process can be explicitly analyzed. Moreover, we find that the output of the entire qcNN becomes a nonlinear function of the projected quantum kernel. That is, the proposed qcNN system functions as a nontrivial quantum kernel that can process regression and classification tasks with less computational complexity than the conventional quantum kernel method. Also, thanks to the analytic expression of the training process, we obtained a condition on the dataset such that the qcNN may perform better than its classical counterparts. In addition, for the problem of learning the quantum data-generating process, we gave a numerical demonstration showing that the qcNN has a clear advantage over fully classical NNs and qNNs, the latter of which is somewhat nontrivial.
As deduced from the results in Section IV as well as those of the existing studies on the quantum kernel method, the performance heavily depends on the design of the data-encoder and the structure of the dataset. Hence, given a dataset, the encoder should be carefully designed so that the resulting performance is quantum-enhanced. A straightforward approach is to replace the fixed data-encoding quantum part with a qNN and train it together with the subsequent data-processing cNN part. Indeed, deep learning is generally viewed as using a neural network composed of a data-encoding (or feature extraction) part and a subsequent data-processing part. Hence, such a qNN-cNN hybrid system might have a functionality similar to that of deep learning, implying that it would lead to better prediction performance and, hopefully, achieve some quantum advantage. However, training the qNN part may suffer from the vanishing gradient issue; hence a relatively small qNN might be a good choice. We leave this problem for future work.
Then the elementwise QNTK is computed as where the last line is derived in the proof of Theorem 3. Therefore $K^{(\ell)}_{Qjk}(x, x', t) \to \Theta^{(1)}_Q(x, x')$ is proved for $\ell = 1$.
For $\sum_a c_a = \beta \neq 0$, the left hand side is proportional to $\beta^2$, thus we can obtain the general condition that (E4) is satisfied even if we set $\beta = 1$. Let us define $\rho^k \equiv \sum_a c_a \rho^k_{x^a}$. Then $\rho^k$ is Hermitian with $\mathrm{Tr}(\rho^k) = 1$. Therefore, given the eigenvalues of $\rho^k$ as $\{\gamma^k_i\}_{i=1}^{2^m}$, where equality is attained when $\gamma^k_i = 1/2^m$, meaning that $\mathrm{Tr}[(\rho^k)^2] \geq 1/2^m$ and the equality is satisfied when $\rho^k = I_m/2^m$. Thus by using the equality condition, we see that if and only if $\sum_a c_a \rho^k_{x^a} = I_m/2^m$. Therefore (E4) is satisfied unless $\xi^2 = 0$ and there exists $c$ that satisfies $\sum_a c_a = 1$ and $\sum_a c_a \rho^k_{x^a} = I_m/2^m$, which corresponds to the condition (ii). Since Σ

FIG. 3: Cost function versus the iteration steps for the regression problem. The time evolution of the cost function obtained by the numerical simulation with $n_0 = 10^3$ and its theoretical expression assuming $n_0 \to \infty$ are shown in the left ("Simulation") and right ("Theory") figures, respectively.

FIG. 4: Cost function versus the iteration steps for the classification problem. Figures (a, b) and Figures (c, d) depict the results in the case of L = 1 and L = respectively. The dataset is used for each ansatz.

FIG. 5: Failure rate for the test data ($L = 1$, $n_0 = 10^3$). Colored bars represent the median, and the lower (upper) edge of the error bar represents the best (worst) score over 100 trials in total. Each score is calculated with 100 test data. The dashed horizontal line shows the score of random guessing, which is 50% since this is a 2-class classification task.

FIG. 6: Theoretical prediction of the learning curve of the qcNN for the classification task with the Heart Disease Data Set. Figures (a)-(e) correspond to different locality $m$. The quantum circuit is composed of 12 qubits. We use the classical NN with $L = 1$. The other settings, including the cost (the binary cross entropy), are the same as those studied in the classification task in Section IV A.

FIG. 7: The models used in Sec. IV C: (a) Quantum circuit for generating the quantum dataset, which is used for the simulation purpose. (b) Quantum-classical hybrid neural network (qcNN), (c) quantum neural network (qNN), and (d) pure classical neural network (cNN). We use $L_q = 10$ in figure (c), i.e., a 10-layer qNN. In the figure, the box M depicts the measurement and $U_3(\alpha, \beta, \gamma)$ depicts the generic single-qubit rotation gate with 3 Euler angles.

FIG. 8: The root mean squared errors (RMSE) versus the number of qubits, for (a) the training dataset and (b) the test dataset, for the regression task.

FIG. 9: The accuracy versus the number of qubits, for (a) the training dataset and (b) the test dataset, for the classification task.

FIG. 10: (a) The root mean squared errors (RMSE) versus the number of shots for the training dataset, for the regression task.(b) The accuracy versus the number of shots (measurements) for the training dataset, for the classification task.

FIG. 11: The accuracy versus the number of qubits, for (a) the training dataset and (b) the test dataset, for the multiclass classification task.

TABLE I: Specific structural properties of $U_{\rm enc}(x)$.
… (d) Ansatz-Bne

FIG. 2: Configuration of $U_{\rm enc}(x)$. First, Hadamard gates are applied to each qubit. Then, the normalized data values x i

TABLE III: The number of iterations and different quantum circuits for the training process in the problem studied in Sec. IV C.