On the expressivity of embedding quantum kernels

One of the most natural connections between quantum and classical machine learning has been established in the context of kernel methods. Kernel methods rely on kernels, which are inner products of feature vectors living in large feature spaces. Quantum kernels are typically evaluated by explicitly constructing quantum feature states and then taking their inner product, here called embedding quantum kernels. Since classical kernels are usually evaluated without using the feature vectors explicitly, we wonder how expressive embedding quantum kernels are. In this work, we raise the fundamental question: can all quantum kernels be expressed as the inner product of quantum feature states? Our first result is positive: Invoking computational universality, we find that for any kernel function there always exists a corresponding quantum feature map and an embedding quantum kernel. The more operational reading of the question is concerned with efficient constructions, however. In a second part, we formalize the question of universality of efficient embedding quantum kernels. For shift-invariant kernels, we use the technique of random Fourier features to show that they are universal within the broad class of all kernels which allow a variant of efficient Fourier sampling. We then extend this result to a new class of so-called composition kernels, which we show also contains projected quantum kernels introduced in recent works. After proving the universality of embedding quantum kernels for both shift-invariant and composition kernels, we identify the directions towards new, more exotic, and unexplored quantum kernel families, for which it still remains open whether they correspond to efficient embedding quantum kernels.


I. INTRODUCTION
Quantum devices carry the promise of surpassing classical computers in certain computational tasks [1][2][3][4][5][6]. With machine learning playing a crucial role in predictive tasks based on training data, it is natural to ask to what extent quantum computers may assist in tackling machine learning (ML) tasks. Indeed, such tasks are among the potential applications foreseen for near-term and intermediate-term quantum devices [7][8][9][10][11][12].
In the evolving field of quantum machine learning (QML), researchers have explored the integration of quantum devices to enhance learning algorithms [13][14][15][16][17][18][19]. The most-studied approach to QML relies on learning models based on parametrized quantum circuits (PQCs) [20], sometimes referred to as quantum neural networks. When considering learning tasks with classical input data, PQCs must embed data into quantum states. This way, PQCs are built from encoding and trainable parts, and real-valued outputs are extracted from measuring certain observables. Since the inception of the field, a strong parallelism has been drawn between PQC-based QML models and kernel methods [14,15,21].
Kernel methods, like neural networks, have been used in ML for solving complex learning tasks. Yet, unlike neural networks, kernel methods reach the solution by solving a linear optimization task on a larger feature space, onto which input data is mapped. Consequently, the kernel approach is very well suited to our ability to map classical data onto the Hilbert space of quantum states. These maps are called quantum feature maps, and they lead to quantum kernel methods. Although kernel methods are more costly to implement than neural networks, they are guaranteed to produce optimal solutions. Much the same way, with quantum kernel methods, we are guaranteed to find better solutions than with other PQC-based models (where "better" means solutions which perform better on the training set; see Ref. [22] for a discussion on when this guarantee is not enough to ensure a learning advantage).

* emgilfuster@gmail.com
For quantum kernel methods, plenty of knowledge is inherited from classical ML, including kernel selection tools [19,23], optimal solution guarantees [21,22], generalization bounds [24], and approximation protocols [25][26][27]. Nevertheless, there is one large difference between quantum and classical kernel methods, namely one that affects the cornerstone of these techniques: the kernel function (or just kernel). Formally, all kernels correspond to the inner product of a pair of feature vectors. Yet, first constructing the feature vectors and then evaluating their inner product is often inefficient. Fortunately, many cases are known in which the inner product can be evaluated efficiently, by means other than constructing the feature map explicitly. The Gaussian kernel is a prominent instance of this case. This is sometimes called "the kernel trick", and as a result it is often the case that practitioners do not even specify the feature vectors when using kernel methods. In contrast to this, it is fair to say that quantum kernels hardly ever use this trick, with some exceptions [28,29].
Almost all quantum kernels conceived in the literature are constructed explicitly from a quantum feature map, or quantum embedding [21,30], as discussed below. Specifically, one considers as quantum kernel κ the inner product κ(x, x′) := ⟨ψ(x), ψ(x′)⟩, where x → ψ(x) is a representation of a quantum state, either a state vector or a density operator, and ⟨•, •⟩ is the appropriate inner product. In particular, quantum embeddings map classical data onto the Hilbert space of quantum states, or said otherwise, onto the Hilbert space of quantum computations. We call embedding quantum kernels (EQKs) the kernels which come from quantum embeddings.

FIG. 1. Illustration of the main question of this paper. Embedding Quantum Kernels (EQKs) have the form of an explicit inner product on the Hilbert space of quantum density matrices, which is evaluated using a quantum circuit. The box "Kernel functions" indicates that EQKs correspond to an inner product of feature vectors on a Hilbert space. The box "Efficient Quantum functions" restricts EQKs to functions that can be evaluated using a quantum computer in polynomial time; for instance, these would include preparing a data-dependent state ρ(x, x′) and then measuring the expectation value of an observable M on the data-dependent state. The box "Efficient Embedding Quantum Kernels" then clearly lives in the intersection of the two other boxes. The question we address here is then whether EQKs cover the whole intersection. Said otherwise, can every efficient quantum kernel function be expressed as an efficient EQK? Or, on the contrary, do there exist efficient quantum kernels which are not expressible as efficient EQKs?
This difference between quantum and classical kernels raises some interesting questions, as for example: Are EQKs the whole story for quantum kernel methods? Can all quantum kernels be expressed as EQKs?
In this manuscript we analyze what families of kernels are already covered by EQKs, see Fig. 1. Our contributions are the following:
3. We introduce a new class of kernels, called composition kernels, containing also non-shift-invariant kernels. We prove that efficient EQKs are universal in the class of efficient composition kernels, from where we can show that the projected quantum kernel from Ref. [28] can in fact also be realized as an EQK efficiently.
In all, we unveil the universality of EQKs in two important function domains.
The rest of this work is organized as follows. The mathematical background and relevant definitions appear in Section II. Related and prior work is elucidated in Section III. Next, we prove the universality of embedding quantum kernels and formally state Question 1 in Section IV. Our results on shift-invariant kernels are in Section V, and the extension to composition kernels and the projected quantum kernel in Section VI. Finally, Section VII contains a collection of questions left open. A summary of the manuscript and closing remarks constitute Section VIII.

II. PRELIMINARIES
In this section we fix notation and introduce the necessary bits of mathematics on quantum kernel methods.

A. Notation
For a vector $v = (v_i)_i \in \mathbb{R}^m$, we call the 1-norm (or $\ell_1$-norm) $\|v\|_1 = \sum_{i=1}^m |v_i|$ and the 2-norm $\|v\|_2 = \sqrt{\sum_{i=1}^m v_i^2}$. We denote as $\ell_1^m$ the set of $m$-dimensional unit vectors with respect to the 1-norm, and similarly $\ell_2^m$ the set of vectors that are normalized to have a unit 2-norm.
For a Hilbert space $\mathcal{H}$, we use $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ to denote the inner product on that space. In the case of Euclidean spaces, we drop the subscript. Let $A, B \in \mathbb{C}^{m \times m}$ be square complex matrices. Then, we call the Hilbert-Schmidt inner product of $A$ and $B$
$$\langle A, B \rangle_{\mathrm{HS}} = \mathrm{tr}\{A^\dagger B\} = \sum_{i,j=1}^m A_{i,j}^* B_{i,j}.$$
For Hermitian matrices, we have $A = A^\dagger$, and so the HS inner product becomes just
$$\langle A, B \rangle_{\mathrm{HS}} = \mathrm{tr}\{AB\} = \sum_{i,j=1}^m A_{j,i} B_{i,j}.$$
The Frobenius norm $\|\cdot\|_F$ of a matrix $A$ is the square root of the sum of the magnitude square of all its entries, which is also equal to the root of the Hilbert-Schmidt inner product of the matrix with itself, as
$$\|A\|_F = \sqrt{\sum_{i,j=1}^m |A_{i,j}|^2} = \sqrt{\langle A, A \rangle_{\mathrm{HS}}}.$$
In what follows, we call $\mathcal{X} \subseteq \mathbb{R}^d$ a $d$-dimensional compact subset of the reals. We reserve $n$ to denote qubit numbers.
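As a small numerical illustration of the notation above, the following Python sketch evaluates the Hilbert-Schmidt inner product and the Frobenius norm on 2×2 matrices stored as nested lists. The helper names `hs_inner` and `frobenius_norm`, and the example matrices, are our own choices for illustration and do not appear elsewhere in this work.

```python
import math

def hs_inner(A, B):
    """Hilbert-Schmidt inner product tr(A^dagger B) = sum_ij conj(A_ij) * B_ij."""
    return sum(complex(a).conjugate() * b
               for row_a, row_b in zip(A, B)
               for a, b in zip(row_a, row_b))

def frobenius_norm(A):
    """Square root of the sum of squared magnitudes of all entries."""
    return math.sqrt(sum(abs(a) ** 2 for row in A for a in row))

# Two Pauli matrices and an arbitrary Hermitian matrix (illustrative choices).
X = [[0, 1], [1, 0]]
Y = [[0, -1j], [1j, 0]]
H = [[1, 1j], [-1j, 2]]

print(hs_inner(X, Y))                       # distinct Paulis are HS-orthogonal
print(frobenius_norm(H))                    # Frobenius norm of H
print(math.sqrt(hs_inner(H, H).real))       # same value via the HS inner product
```

The last two printed values agree, which is the relation between the Frobenius norm and the Hilbert-Schmidt inner product stated above.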
As we explain below, we use k to refer to arbitrary kernel functions, while κ is used exclusively for embedding quantum kernels.
When talking about efficient and inefficient approximations, we consider sequences of functions $\{k_s\}_{s \in \mathbb{N}}$. We refer to $s$ as the scale parameter. Scale parameters can correspond to different qualities of the sequence, as for example the dimension of the input data, or the number of qubits involved in evaluating a function. Efficiency then means at most polynomial scaling in $s$, and inefficiency means at least exponential scaling in $s$, hence the name scale parameter.

B. Kernel methods
Kernel methods solve ML tasks as linear optimization problems on large feature spaces, sometimes implicitly. The connection between the input data and the feature space comes from the use of a kernel function. In this work we do not busy ourselves with how the solution is found, but rather we focus on our ability to evaluate kernel functions, which are defined as follows.
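Definition 1 (Kernel function). A kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a map from pairs of inputs onto the reals fulfilling two properties:
1. Symmetry under exchange: $k(x, x') = k(x', x)$ for every $x, x' \in \mathcal{X}$.
2. Positive semi-definiteness (PSD): for any $m \in \mathbb{N}$, any inputs $\{x_i\}_{i=1}^m \subseteq \mathcal{X}$, and any real coefficients $\{a_i\}_{i=1}^m$, it holds that $\sum_{i,j=1}^m a_i a_j k(x_i, x_j) \geq 0$.
This is equivalent to saying that the Gram matrix $K := [k(x_i, x_j)]_{i,j=1}^m$ is positive semi-definite for any $m$ and any $\{x_i\}_{i=1}^m$.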
Other standard definitions exclude the PSD property. In this work we do not study indefinite, non-PSD kernels, although the topic is certainly of interest. The common optimization algorithms used in kernel methods (SVM, KRR) require kernels to be PSD. Even though we said we do not deal with the optimization part in this manuscript, we study only the kernels that would be used in a (Q)ML context.
Symmetry and PSD are properties usually linked to inner products, which partly justifies our definition of a Gram matrix as the one built from evaluating the kernel function on pairs of inputs. Indeed, by Mercer's Theorem (detailed in Appendix A), for any kernel function k, there exists a Hilbert space H and a feature map ϕ : X → H such that, for every pair of inputs x, x′ ∈ X, it holds that evaluating the kernel is equivalent to computing the H-inner product of pairs of feature vectors, k(x, x′) = ⟨ϕ(x), ϕ(x′)⟩_H. We say every kernel k has an associated feature map ϕ, which turns each datum x into a feature vector ϕ(x), living in a feature space H.
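To make the correspondence between kernels and feature maps concrete, here is a minimal Python sketch using a textbook example that is not taken from this work: the degree-2 polynomial kernel on two-dimensional inputs, k(x, x′) = (x · x′)², whose explicit feature map is ϕ(x) = (x₁², √2 x₁x₂, x₂²). The function names and the random test points are illustrative assumptions.

```python
import math
import random

def kernel(x, y):
    """Degree-2 polynomial kernel, evaluated without building feature vectors."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def feature_map(x):
    """Explicit feature vector whose Euclidean inner product reproduces the kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(5)]

# Gram matrix built from pairwise kernel evaluations.
gram = [[kernel(p, q) for q in points] for p in points]

# Largest discrepancy between the kernel and the explicit inner product.
max_gap = max(abs(kernel(p, q) - dot(feature_map(p), feature_map(q)))
              for p in points for q in points)

# PSD check on one random coefficient vector: a^T K a should be non-negative.
a = [rng.uniform(-1, 1) for _ in points]
quad_form = sum(a[i] * a[j] * gram[i][j] for i in range(5) for j in range(5))

print(max_gap, quad_form)
```

The quadratic form equals the squared norm of the linear combination of feature vectors, which is why it cannot be negative up to rounding.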
One notable remark is that different kernel functions have different feature maps and feature spaces associated to them. Each learning task comes with a specific data distribution. The ultimate goal is to, given the data distribution, find a map onto a feature space where the problem becomes solvable by a linear model. The model selection challenge for kernel methods is to find a kernel function whose associated feature map and feature space fulfill this linear separation condition. Intuitively, in a classification task, we would want data from the same class to be mapped to the same corner of Hilbert space, and data from different classes to be mapped far away from one another. Our project can then be framed within the quantum kernel selection problem: we want to identify previously unexplored classes of quantum kernels.

C. Embedding quantum kernels
In the following, we introduce the relevant concepts in quantum kernel methods. Our presentation largely follows the lines of Ref. [21]. We use the name "embedding quantum kernels" following Ref. [19]; many other works use simply "quantum kernels" when referring to the same concept. Hopefully our motives are clear by now.
Learning models based on PQCs are the workhorses in much of today's approaches to QML. With some notable exceptions, these PQC-based models comprise a two-step process:
1. Prepare a data-dependent quantum state. For every $x \in \mathcal{X}$, produce $|\phi(x)\rangle = U(x)|0\rangle$ with some data-embedding unitary $U(x)$.
2. Measure the expected value of a variationally-tunable observable on the data-dependent state. Given a parametrized observable $\mathcal{M}(\vartheta)$, evaluate
$$\langle \mathcal{M}(\vartheta) \rangle_{\phi(x)} = \langle \phi(x) | \mathcal{M}(\vartheta) | \phi(x) \rangle.$$
For instance, for binary classification, one would then consider labeling functions like $h(x) = \mathrm{sign}\left(\langle \mathcal{M}(\vartheta) \rangle_{\phi(x)} + b\right)$ and optimize the variational parameters $\vartheta$ and $b$.
Examples of what U(x) could look like in practice include amplitude encoding [21], different data re-uploading encoding strategies [17,31,32], or the IQP ansatz [14]. In turn, variationally-tunable observables are usually realized as a fixed "easy" observable (like a single Pauli operator), preceded by a few layers of brickwork-like 2-local trainable gates. In our current presentation, we do not restrict ourselves to a particular form for U(x) or M(ϑ), but rather allow for any general form possible, as long as it can be implemented on a quantum computer.
This approach can also be understood through the lens of kernel methods. In this case, the feature space is explicitly chosen to be the Hilbert space of Hermitian matrices (of which density matrices are a subset, so also quantum states). Given a data-dependent state preparation $x \mapsto |\phi(x)\rangle$, one could choose to call it a quantum feature map $\rho$ and promote it to quantum density operators, or quantum states, as
$$\rho(x) = |\phi(x)\rangle\langle\phi(x)|.$$
Another fitting name would be quantum feature state. In a slightly more general view, we call data embedding any map from classical data onto quantum density matrices of fixed dimension, $\rho : \mathcal{X} \to \mathrm{Herm}(2^n)$ for $n$-qubit systems. With this we abandon the need for a unitary gate applied to the $|0\rangle$ state vector.
Together with the Hilbert-Schmidt inner product, quantum feature maps give rise to an important family of kernel functions:
Definition 2 (Embedding quantum kernel (EQK)). Given a data embedding $x \mapsto \rho(x)$ used to encode classical data $x$ in a quantum density operator $\rho(x)$, we call embedding quantum kernel (EQK) $\kappa_\rho$ the Hilbert-Schmidt inner product of pairs of quantum feature vectors:
$$\kappa_\rho(x, x') = \mathrm{tr}\{\rho(x)\rho(x')\}.$$
Fig. 2 illustrates this construction.
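A minimal sketch of an EQK, simulated classically in Python, may help fix ideas. We assume a hypothetical single-qubit embedding |ϕ(x)⟩ = cos(x/2)|0⟩ + sin(x/2)|1⟩ (a rotation-type encoding chosen only for illustration, not an embedding used in this work), build the density operator as the outer product of the state vector with itself, and evaluate the kernel as the trace of the product of two density operators.

```python
import math

def feature_state(x):
    """Hypothetical single-qubit embedding with real amplitudes."""
    return [math.cos(x / 2), math.sin(x / 2)]

def density_matrix(psi):
    """rho = |psi><psi| (no conjugation needed, amplitudes are real here)."""
    return [[a * b for b in psi] for a in psi]

def eqk(x, y):
    """Embedding quantum kernel: tr(rho(x) rho(y))."""
    rho, sigma = density_matrix(feature_state(x)), density_matrix(feature_state(y))
    return sum(rho[i][j] * sigma[j][i] for i in range(2) for j in range(2))

x, y = 0.3, 1.1
print(eqk(x, y))                    # kernel value from density matrices
print(math.cos((x - y) / 2) ** 2)   # closed form for this particular embedding
```

For pure states the trace reduces to the squared overlap of the state vectors, which for this embedding is cos²((x − x′)/2); the two printed numbers agree.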
Since $\kappa_\rho$ is defined explicitly as an inner product for any $\rho$, it follows that it is a PSD, symmetric function. For ease of notation, we will write just $\kappa$ whenever $\rho$ is unimportant or clear from context. There are, from the outset, a few good reasons to consider EQKs as QML models:
1. It is possible to construct EQK families that are not classically estimable unless BQP = BPP, thus opening the door to quantum advantages [33].
2. A core necessity for successful kernel methods is to map complex data to high-dimensional spaces where feature vectors become linearly separable. The Hilbert space of quantum states thus becomes a prime candidate due to its exponential dimension, together with our ability to estimate inner products efficiently¹.
3. Since EQKs are explicitly defined from the data embeddings, we are free to design embeddings with specific desired properties.
And, a priori, other than the shot noise², EQKs do not add drawbacks to the list of general issues for kernel methods. So, EQKs are a well-founded family of quantum kernel functions. Nevertheless, their reliance on a specific data embedding could be a limiting factor a priori.

FIG. 2. Schematic of the different ingredients that form embedding quantum kernels. The data input is mapped onto the "quantum feature space" of quantum density operators via a quantum embedding. There the Embedding Quantum Kernel is defined as the Hilbert-Schmidt inner product of pairs of quantum features.
In the case of classical computations we are not accustomed to thinking in terms of the Hilbert space of the computation (even though it does exist). There are numerous examples of kernels which have been designed without focusing on the feature map. The most prominent example of a kernel function used in ML is the radial basis function (RBF) Gaussian kernel (or just Gaussian kernel, for short), given by
$$k_\sigma(x, x') = \exp\left(-\frac{\|x - x'\|_2^2}{2\sigma^2}\right),$$
with $\sigma > 0$ a bandwidth parameter. On the one hand, we know the Gaussian kernel is a PSD function by virtue of Bochner's theorem (stated in Section V as Theorem 4), which explains it via the Fourier transform (FT).
On the other hand, though, one can prove that the Gaussian kernel corresponds to an inner product in the Hilbert space of all monomials of the components of x, from where it follows that the Gaussian kernel can learn any continuous function on a compact domain [34], provided enough data is given. The feature map corresponding to the Gaussian kernel is infinite-dimensional. This fact alone makes the Gaussian kernel not immediately identifiable as a reasonable EQK, which motivates the question: can all quantum kernels be expressed as embedding quantum kernels?
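The monomial feature space of the Gaussian kernel can be made tangible in one dimension. The Python sketch below assumes the standard form k(x, y) = exp(−(x − y)²/2) with unit bandwidth and uses the series exp(xy) = Σ_k (xy)^k / k! to write the kernel as an inner product of feature vectors with entries exp(−x²/2) x^k / √(k!). Truncating the series at m terms gives a finite-dimensional approximation. The function names and the test inputs are illustrative assumptions.

```python
import math

def gaussian_kernel(x, y):
    """One-dimensional Gaussian kernel with unit bandwidth (assumed form)."""
    return math.exp(-(x - y) ** 2 / 2)

def truncated_features(x, m):
    """First m entries of the infinite monomial feature vector of the Gaussian kernel."""
    return [math.exp(-x * x / 2) * x ** k / math.sqrt(math.factorial(k))
            for k in range(m)]

def approx_kernel(x, y, m):
    """Euclidean inner product of the truncated feature vectors."""
    return sum(a * b for a, b in zip(truncated_features(x, m),
                                     truncated_features(y, m)))

x, y = 0.7, -0.4
errors = {m: abs(gaussian_kernel(x, y) - approx_kernel(x, y, m))
          for m in (2, 5, 10, 20)}
for m, err in errors.items():
    print(m, err)
```

The error decreases rapidly with the truncation dimension on bounded inputs, which is the kind of finite-dimensional approximation that an embedding into finitely many qubits would need.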
In this section, we have introduced kernel functions, embedding quantum kernels, and the question of whether EQKs are all there is to quantum kernels. In the following sections, we first see that EQKs are universal, then we formalize a question about the expressivity of efficient EQKs, and finally we answer the question by proving universality of EQKs in two restricted kernel families.

III. RELATED WORK
An introduction to quantum kernel methods can be found in the note [21] and in the review [30]. We refrain from including here a full compendium of all quantum kernel works to date, as good references can already be found in Ref. [12]. Instead, we make an informed selection of papers that bear relation with the object of our study: which quantum kernels are embedding quantum kernels (EQKs).
Kernel methods use different optimization algorithms depending on each task, the most prominent examples being support vector machines and kernel ridge regression. The early years of QML owed much of their activity to the HHL algorithm [35]. The quantum speed-up for linear algebra tasks was leveraged to propose a quantum support vector machine (qSVM) algorithm [36]. The qSVM algorithm is listed among the first steps of QML historically. Since qSVM is a quantum application for kernel methods, it is no wonder that the term quantum kernel methods was introduced for concepts around it. This, however, is not what we mean by quantum kernel methods in this work. Instead, we occupy ourselves with kernel methods where the kernel function itself requires a quantum computer to be evaluated, independently of the nature of the optimization algorithm. The optimization step comes only after the kernel function has been evaluated on the training data, so the object of study of qSVM is not the same as in this manuscript. We study the expressivity of a known kind of kernel functions, and not how to speed up the optimization of otherwise classical algorithms.
Among the first references mentioning the evaluation of kernel functions with quantum computers is Ref. [13], where the differences between quantum kernels and qSVM have been made explicit. References [14,15] have showcased the link between quantum kernels and PQCs and have demonstrated their implementation experimentally. In them, the authors have mentioned the parallelisms between quantum feature maps and the kernel trick. In particular, Ref. [15] has coined the distinction between an implicit quantum model (quantum kernel method), and an explicit quantum model (PQC model with gradient-based optimization). Implicit models are in this way an analogous name for EQKs, where emphasis is made on the distinction from other PQC-based models.
Quantum kernels have enjoyed increasing attention since they have been used to prove an advantage of QML over classical ML in Ref. [33]. The authors have morphed the discrete logarithm problem into a learning task and then used a quantum kernel to solve it efficiently, which cannot be done classically according to well-established cryptographic assumptions. In that, the approach taken is similar to that of Ref. [37] (showing a quantum advantage in distribution learning). As such, this has been among the first demonstrations of quantum advantage in ML, albeit in an artificially constructed learning task. Importantly, the quantum kernel used in this work has explicitly been constructed from a quantum embedding, so it is an EQK in the sense of this work.
One important difference between EQKs and other PQC-based approaches is that, for EQKs, the only design choice is the data embedding itself. Ref. [18] has wondered how to construct optimal quantum feature maps using measurement theory. The work has proposed constructing embeddings specific to learning tasks, which resonates with the idea that some feature maps are better than others for practically relevant problems. Ref. [19], where the term EQKs has been introduced for the first time, has presented the possibility of optimizing the feature map variationally, drawing bridges to ideas from data re-uploading [17,31]. Other than the trainable kernels of Ref. [19], the data re-uploading framework did not fit with the established quantum kernel picture up until this point in time.
In Ref. [22], the differences between explicit models, implicit models, and re-uploading models have been analyzed. On the one hand, the authors have found that the optimality of kernel methods might not be optimal enough, since they show a learning task in which kernels perform much worse than explicit models when evaluated on data outside the training set. On the other hand, a rewriting algorithm has been devised to convert re-uploading circuits into equivalent encoding-first circuits, so re-uploading models are explicit models. By construction, each explicit model has a quantum kernel associated to it. So, for the first time, using the rewriting via explicit models, Ref. [22] made data re-uploading models fit in the quantum kernel framework. This reference takes the explicit versus implicit distinction from [15], which means that it also only considers EQKs when it comes to kernels.
Some examples of different kinds of quantum kernels can be found in Refs. [28,29], which upon first glance do not resemble the previous proposals. In Ref. [28] a type of kernel functions based on the classical shadows protocol has been proposed, under the name of projected quantum kernel. These functions have been analyzed in the context of a quantum-classical learning separation, now considering real-world data, as opposed to the results of Ref. [33]. Even though the projected quantum kernel uses an explicit feature map which requires a quantum computer, the feature vectors are only polynomial in size, and they are stored in classical memory. Next, the kernel function is the Gaussian kernel evaluated on pairs of feature vectors, and not their Euclidean or Hilbert-Schmidt inner product. These two differences set the projected quantum kernel apart from the rest. In turn, the authors of Ref. [29] set out to address a looming issue for EQKs called exponential kernel concentration [38], or vanishing similarity, which is tantamount to the barren plateau problem [39] that could arise also for kernels. They show that the new construction, called "anti-symmetric logarithmic derivative quantum Fisher kernel", does not suffer from the vanishing similarity problem. Here, the classical input data is mapped onto an exponentially large feature space. Given a PQC, classical inputs are mapped to a long array, with as many entries as trainable parameters in the PQC. Each entry in this array is the product of the unitary matrix implemented by the PQC and its derivatives with respect to each variational parameter. Interestingly, this kernel can be seen as the Euclidean inner product of a flattened vector of unitaries, with a metric induced by the initial quantum state. In this way, classical data is not mapped onto the Hilbert space of quantum states, but the inner product used is the same as in regular EQKs. This realization paves the way to expressing these kernels as EQKs.
Recent efforts in de-quantizing PQC-based models via classical surrogates [40] have touched upon quantum kernels. In Refs. [25][26][27], the authors propose using a classical kernel-approximation protocol based on random Fourier features (RFF) to furnish classical learning models capable of approximating the performance of PQC-based architectures. The techniques used in these works are not only very promising for de-quantization of PQC-based learning models, but they are also relevant to our discussion. Below, we also use the RFF approach, albeit in quite a different way, as it is not our goal to find classical approximations of quantum functions, but rather quantum approximations of other quantum functions. The goal of the study in Refs. [25][26][27] is to, given a PQC-based model, construct a classical kernel model with guarantees that they are similarly powerful. In this way, the input to the algorithms is a PQC (either encoding-first, or data re-uploading), and the output is a classical kernel. Conversely, in our algorithms, the input is a kernel function, and the output is an EQK-based approximation of the same function.
In Ref. [41], we find an earlier study of RFFs in a QML scenario. The authors focus on a more advanced version of RFFs, called optimized Fourier Features, which involves sampling from a data-dependent distribution. In the classical literature, it was found that sampling from this distribution could be "hard in practice" [42], so the authors of Ref. [41] set out to propose an efficient quantum algorithm for sampling. This way, a quantum algorithm is proposed to speed up a training algorithm for a classical learning architecture, similar to the case of qSVM [36].
Similarly, Ref. [43] considered combining notions from EQKs, the projected quantum kernel, and their respective approximations with the RFF algorithm. The authors explored the options of combining distance-based and inner-product-based kernels in an attempt to address the vanishing similarity problem. In turn, this paper tackles the suboptimal scaling of kernel methods by making use of RFFs. Some of the techniques introduced in this reference are very much in line with the composition kernels we introduce in Section VI. The main similarity between our work and Ref. [43] is the identification of feature maps for different quantum kernels, including the use of RFFs and the study of the projected quantum kernel. The main difference is the perspective in which these objects are studied. In Ref. [43] the authors look for kernel constructions that have advantageous properties when it comes to solving learning tasks. Conversely, in this work we want to establish what are the ultimate limits in expressivity of using kernel functions based on quantum feature maps.

IV. THE UNIVERSALITY OF QUANTUM FEATURE MAPS
In the previous section we saw how we can define quantum kernel functions explicitly from a given data embedding. This section, together with the following two sections, contains the main results of this manuscript. The statements shift around different notions of efficiency and different classes of kernel functions. In Fig. 3 we provide a small sketch of how each of our results relates to one another from a zoomed-out perspective. The leitmotiv is that we can progressively tighten the restrictions we impose when constructing EQK-based approximations of kernels. We find all kernels can be approximated as EQKs if our only restriction is to use finitely many quantum resources. But then, we specialize the search for kernels which can be approximated efficiently as EQKs, with a distinction between space-efficiency (number of qubits required) and total run-time efficiency. It should also be said that when we talk about time efficiency we always consider "quantum time", so we assume we are always allowed access to a quantum computer. This way, the analysis from now on departs from the usual quantum-classical separation mindset.
From the outset, two basic results combined, namely Mercer's feature space construction (elaborated in Appendix A) together with the universality of quantum circuits, already certify that all kernel functions can be realized as EQKs. If we demanded mathematical equality, Mercer's construction could require infinite-dimensional Hilbert spaces, and thus quantum computers with infinitely many qubits. Instead, for practical purposes, from now on we do not talk about "evaluating" functions in one or another way, but rather about "approximating" them to some precision. With this, Theorem 1 confirms that we can always approximate any kernel function as an EQK to arbitrary precision with finitely many resources.
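For this universality statement, we allow for extra multiplicative and additive factors. Instead of talking about ε-approximation in the sense of
$$\left| k(x, x') - \mathrm{tr}\{\rho_n(x)\rho_n(x')\} \right| < \varepsilon,$$
we consider
$$\left| k(x, x') - 2^n\, \mathrm{tr}\{\rho_n(x)\rho_n(x')\} + 1 \right| < \varepsilon.$$
These extra factors come from Lemma 3, introduced right below, and they do not represent an obstacle against universality.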
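Theorem 1 (Approximate universality of finite-dimensional quantum feature maps). Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a kernel function. Then, for any $\varepsilon > 0$ there exists $n \in \mathbb{N}$ and a data embedding $\rho_n$ onto the Hilbert space of quantum states of $n$ qubits such that
$$\left| k(x, x') - 2^n\, \mathrm{tr}\{\rho_n(x)\rho_n(x')\} + 1 \right| < \varepsilon \qquad (11)$$
for almost all $x, x' \in \mathcal{X}$.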
The statement that Eq. (11) holds for almost all x, x′ ∈ X comes from measure theory, and it is a synonym of "except in sets of measure 0", or equivalently "with probability 1". That means that although there might exist individual adversarial instances of x, x′ ∈ X for which the inequality does not hold, these "bad" instances are sparse enough that the event of drawing them from the relevant probability distribution has associated probability 0.
Theorem 1 says every kernel function can be approximated as an EQK up to a multiplicative and an additive factor using finitely many qubits. Before we give the proof of the theorem, we introduce the useful Algorithm 1 to map classical vectors to quantum states that we can then use to evaluate Euclidean inner products as EQKs. Then, Lemma 2 contains the correctness statement and runtime complexity of Algorithm 1. Notice that the output of Algorithm 1, $\frac{1}{2^n}(I \pm P)$, is a single (pure) eigenstate of a Pauli operator $P$ with eigenvalue ±1. Nevertheless, as Line 3 involves drawing an individual index $i \in \{1, \ldots, 4^n - 1\}$, we see Algorithm 1 as a random algorithm, which prepares a mixed state as a classical mixture of pure states.
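The classical sampling step described above can be sketched in Python. This is only a sketch under stated assumptions, not the listing of Algorithm 1 itself: we assume the index i is drawn with probability |r_i| from the zero-padded vector, and that the sign in (I ± P_i)/2^n is the sign of r_i. The function names are hypothetical, and indices are 0-based in the code while the text counts from 1. Preparing the corresponding quantum state is not simulated here.

```python
import random

def num_qubits(d):
    """Smallest n with 4**n - 1 >= d, that is n = ceil(log_4(d + 1))."""
    n = 0
    while 4 ** n - 1 < d:
        n += 1
    return n

def sample_pauli_label(padded, rng):
    """Draw index i with probability |r_i| (assumed rule) and return (i, sign of r_i)."""
    weights = [abs(v) for v in padded]
    i = rng.choices(range(len(padded)), weights=weights, k=1)[0]
    return i, (1 if padded[i] > 0 else -1)

rng = random.Random(0)
r = [0.5, -0.3, 0.0, 0.2, 0.0]            # unit 1-norm, d = 5
n = num_qubits(len(r))                     # n = 2 qubits, 15 non-identity Paulis
padded = r + [0.0] * (4 ** n - 1 - len(r))

# Averaging the signed indicator of the drawn index recovers r entry by entry.
shots = 20000
estimate = [0.0] * len(padded)
for _ in range(shots):
    i, sign = sample_pauli_label(padded, rng)
    estimate[i] += sign / shots

print(n, [round(v, 3) for v in estimate[:len(r)]])
```

The empirical averages approach the entries of r, which is the classical counterpart of saying that the mixture over draws has Pauli coefficients given by r.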

Lemma 2 (Correctness and runtime of Algorithm 1). Let $r \in \ell_1^d \subseteq \mathbb{R}^d$ be a unit vector with respect to the 1-norm, $\|r\|_1 = 1$. Take $n = \lceil \log_4(d + 1) \rceil$ and pad $r$ with 0s until it has length $4^n - 1$. Let $(P_i)_{i=1}^{4^n - 1}$ be the set of all Pauli matrices on $n$ qubits without the identity. Then Algorithm 1 prepares the following state as a classical mixture:
$$\rho_r = \frac{1}{2^n}\left(I + \sum_{i=1}^{4^n - 1} r_i P_i\right).$$
The total runtime complexity $t$ of Algorithm 1 fulfills $t \in O(\mathrm{poly}(d))$.
Lemma 3 (Euclidean inner products). Let $r, r' \in \mathbb{R}^d$ be unit vectors with respect to the 1-norm, $\|r\|_1 = \|r'\|_1 = 1$. Then, for $\rho_r, \rho_{r'}$ as produced in Algorithm 1, the following identity holds:
$$\langle r, r' \rangle = 2^n\, \mathrm{tr}\{\rho_r \rho_{r'}\} - 1.$$
Proof of Lemmas 2 and 3. The proofs are presented in Appendix B.
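The relation between Euclidean inner products and Hilbert-Schmidt overlaps can be checked numerically on a small instance. The Python sketch below assumes the mixture prepared by Algorithm 1 is ρ_r = (I + Σ_i r_i P_i)/2^n and that the identity to verify is ⟨r, r′⟩ = 2^n tr{ρ_r ρ_r′} − 1, which matches the multiplicative 2^n factor and the additive factor mentioned in the text. The ordering of the Pauli strings and all function names are our own choices.

```python
from itertools import product

I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Y = [[0, -1j], [1j, 0]]
Z = [[1, 0], [0, -1]]

def kron(A, B):
    """Kronecker product of two matrices given as nested lists."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def pauli_strings(n):
    """All n-qubit Pauli strings except the identity (4**n - 1 matrices)."""
    strings = []
    for factors in product([I2, X, Y, Z], repeat=n):
        P = [[1]]
        for F in factors:
            P = kron(P, F)
        strings.append(P)
    return strings[1:]  # the first string is the all-identity one

def embed(r, n):
    """Assumed mixture rho_r = (I + sum_i r_i P_i) / 2**n, with r zero-padded."""
    dim = 2 ** n
    paulis = pauli_strings(n)
    padded = list(r) + [0.0] * (len(paulis) - len(r))
    rho = [[(1.0 if i == j else 0.0) / dim for j in range(dim)] for i in range(dim)]
    for coeff, P in zip(padded, paulis):
        for i in range(dim):
            for j in range(dim):
                rho[i][j] += coeff * P[i][j] / dim
    return rho

def hs_overlap(rho, sigma):
    """tr(rho sigma) for Hermitian inputs; the result is real."""
    dim = len(rho)
    total = sum(rho[i][j] * sigma[j][i] for i in range(dim) for j in range(dim))
    return complex(total).real

r = [0.5, -0.3, 0.0, 0.2, 0.0]
r_prime = [0.1, 0.4, -0.25, 0.0, 0.25]
n = 2  # ceil(log_4(5 + 1)) = 2

euclidean = sum(a * b for a, b in zip(r, r_prime))
from_states = 2 ** n * hs_overlap(embed(r, n), embed(r_prime, n)) - 1
print(euclidean, from_states)
```

The check relies on distinct Pauli strings being orthogonal under the Hilbert-Schmidt inner product, with tr{P_i P_i} = 2^n.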
Remark (Prefactors). The $2^n$ multiplicative factor is of no concern, since $n \in O(\log(d))$ and we are interested in methods that are allowed to scale polynomially in $d$. In general $\rho_r$ will be a mixed quantum state which can be efficiently prepared. Also, the map is injective, but not surjective.
We are now in the position to prove Theorem 1. We note that the first step in the proof is to construct a "classical" feature map, and in a second step we turn the resulting feature vector into a quantum state. This way, Theorem 1 makes no pretense at quantum advantage, but rather establishes the ultimate expressivity of embedding quantum kernels. We also point out that not imposing space efficiency breaks the division between quantum and classical computing: every quantum circuit running in finite time can be fully simulated by a classical circuit also running in finite time. This way, the universality statement coming from Theorem 1 is not unique to quantum kernels, since formally at this point there is no distinction between classical and quantum functions.
Proof of Theorem 1. We prove this statement directly using a corollary of Mercer's theorem and the universality of quantum computing. First, we invoke Corollary A.3 in Appendix A, which ensures the existence of a finite-dimensional feature map $\Phi_m : \mathcal{X} \to \mathbb{R}^m$ whose Euclidean inner product approximates $k$ to the desired precision for almost all $x, x' \in \mathcal{X}$. Without loss of generality, we assume $\|\Phi_m(x)\|_1 = 1$ for all $x \in \mathcal{X}$. Now, we can prepare the quantum state $\rho_{\Phi_m(x)}$, which uses $\lceil \log_4(m+1) \rceil$ qubits. By preparing two such states, one for $\Phi_m(x)$ and one for $\Phi_m(x')$, we can compute their inner product as the Hilbert-Schmidt inner product of the quantum states as in Lemma 3. For reference, notice $\operatorname{tr}\{\rho_{\Phi_m(x)} \rho_{\Phi_m(x')}\}$ could be estimated using the SWAP test, to an additive precision set by the number of shots. With this, we can ensure good approximation to polynomial additive precision efficiently for almost every $x, x' \in \mathcal{X}$. This completes the proof.
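To illustrate the remark on the SWAP test, the sketch below simulates its measurement statistics classically. It relies on the standard fact that the ancilla of a SWAP test on $\rho \otimes \sigma$ yields outcome 0 with probability $(1 + \operatorname{tr}\{\rho\sigma\})/2$; the function name `swap_test_estimate` and the two single-qubit example states are illustrative assumptions, and no quantum circuit is actually executed.

```python
import numpy as np

def swap_test_estimate(rho, sigma, shots, rng):
    """Simulate SWAP-test outcomes: P(ancilla = 0) = (1 + tr(rho sigma)) / 2."""
    overlap = float(np.real(np.trace(rho @ sigma)))
    p_zero = 0.5 * (1.0 + overlap)
    zeros = rng.binomial(shots, p_zero)  # number of shots returning 0
    return 2.0 * zeros / shots - 1.0     # invert the relation to estimate tr(rho sigma)

# Two example single-qubit mixed states (illustrative choices).
Z = np.diag([1.0, -1.0])
X = np.array([[0.0, 1.0], [1.0, 0.0]])
rho = 0.5 * (np.eye(2) + 0.6 * Z)
sigma = 0.5 * (np.eye(2) + 0.6 * X)

exact = float(np.trace(rho @ sigma))
rng = np.random.default_rng(1)
estimate = swap_test_estimate(rho, sigma, shots=100_000, rng=rng)
print(exact, estimate)
```

The statistical error of the estimate shrinks as the inverse square root of the number of shots, which is the sense in which the additive precision is controlled by the shot budget.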
Notice this is a conceptually motivated existence result only. Noteworthy is also that no claims are made about practicality in either Theorem 1 or Algorithm 1. These results only aim at establishing the ultimate existence of EQKs for any kernel, and they are not claims that computing these kernels as EQKs should lead to any form of advantage over the inner product of the classical feature vectors.
The only statement is that there exists an EQK using a finite number of qubits; the theorem does not address how quickly the number of qubits grows for increasingly computationally complicated kernel functions $k$, or for increased required precision $\varepsilon > 0$. The number of qubits $n$ will depend on some properties of the kernel $k$ and the approximation error $\varepsilon$, and if, for example, the required number of qubits scaled exponentially in some of these quantities, then Theorem 1 would bring no practical application. Similarly, one should also consider the time it would take to find such an EQK approximation, independently of the memory and run-time requirements of preparing the feature vectors and computing their inner product.
Let us take the anti-symmetric logarithmic derivative quantum Fisher kernel of Ref. [29] as an example. Upon first inspection, evaluating that kernel does not seem to rely on the usual quantum feature map and Hilbert-Schmidt inner product combination. Yet, a feature map can be identified by rewriting some of the variables involved as a vector of exponential length. That means the scaling in the number of qubits required to encode the feature vector is at worst polynomial, according to Lemma 3. The normalization requirement from Lemma 3 could still prevent the feature map from being encoded using the construction we propose, but for the sake of illustration let us assume normalization is not a problem. Then, we have found a way of realizing the same kernel as an EQK. In this case, even though the scaling of the qubit number is no more than polynomial, the scaling in total run-time of the EQK approximation would still be at worst exponential.
The message remains that, although all kernel functions can be realized as EQKs, there could still exist kernel functions which cannot be realized as EQKs efficiently. Pointing back to Fig. 1, this explains why we added the word "efficient" to the sets of quantum functions and EQKs. In order to talk about efficiency we need to replace individual functions $k$ by function sequences $\{k_s\}_{s \in \mathbb{N}}$, where $s$ is the scale parameter. When we refer to an efficient $\varepsilon$-approximation, we refer to an algorithm making use of up to $\mathrm{poly}(s, 1/\varepsilon)$ resources, for each $k_s$ in the sequence. We now formally present the question we aim at answering.
Question 1 (Expressivity of efficient EQKs). Let $\{k_s\}_{s \in \mathbb{N}}$ be a sequence of kernel functions, let $\varepsilon > 0$ be a precision parameter, and consider the properties:
1. Quantum efficiency: There is an algorithm that takes a specification of $k_s$ as input and produces an $\varepsilon$-approximation of $k_s$ with a quantum computer efficiently in $s$ and $1/\varepsilon$.
2. Embedding-quantum efficiency: There is an algorithm that takes a specification of $k_s$ as input and produces an $\varepsilon$-approximation of $k_s$ as an EQK efficiently in $s$ and $1/\varepsilon$.
3. Classical inefficiency: Any algorithm that takes a specification of $k_s$ as input and produces an $\varepsilon$-approximation of $k_s$ with a classical computer must be inefficient in either $s$ or $1/\varepsilon$.
Then, assuming $\{k_s\}_s$ fulfills classical inefficiency, does quantum efficiency imply embedding-quantum efficiency?
The question above contains a few moving pieces which still need to be made fully precise, as for instance: the meaning of the scaling parameter $s$, the sequence of domains $\mathcal{X}_s$ from which each $k_s$ takes its input, any restrictions on the functions $k_s$, in what form the functions $k_s$ must be specified, the choice of notion of $\varepsilon$-approximation, and the choice between space and time efficiency. These are left open on purpose to admit diverse approaches to studying the question.
We could have required a stronger sense of inefficiency, namely that there cannot exist an efficient and uniform construction for $\varepsilon$-approximating $k_s$ with classical computers. Instead, we judge it enough to require that, even if such an efficient approximation could exist, it would be impossible to find efficiently, conditional on $\mathsf{BQP} \not\subseteq \mathsf{P/poly}$. We added classical inefficiency because otherwise, in principle, one could have all: quantum efficiency, embedding-quantum efficiency, and classical efficiency, in which case the result would not be interesting for QML. An interesting question is then to imagine how classical kernel functions could also be embedding classical kernels, in the sense of data being mapped onto the Hilbert space of classical computation.
In later sections, we fix all of these in our answers to Question 1. For instance, $s$ for us is the dimension of the domains $\mathcal{X}_s$ from which data is taken, the kernel functions can be specified either as black-boxes or as the description of circuits, we take infinity-norm approximation almost everywhere, and we alternate between qubit number efficiency and run-time efficiency.
As an aside, Question 1 also invites research into the existence of quantum kernels beyond EQKs, that is, efficient quantum kernels which do not admit efficient EQK-based approximations. At the moment of writing, the authors are not aware of any concrete example of a quantum efficient kernel function which is provably not embedding-quantum efficient. Moreover, we do not know any quantum efficient kernel for which we do not have an explicit way of constructing the efficient quantum embedding. The outlook in Section VII, and Appendices A and E contain several directions in which we expect candidate kernels beyond efficient EQKs to appear. It should be said that searching for quantum kernels beyond EQKs is slightly contradictory to the foundational philosophy of QML in the beginnings, which actively sought to express everything in terms of inner products in the Hilbert space of density operators. The foundational works [13][14][15] have motivated the use of embedding quantum kernels precisely because the inner product could be taken directly efficiently. Nevertheless, we point out the possibility of alternative constructions, keeping focus on methods which could still harbour quantum advantages.

V. THE UNIVERSALITY OF EFFICIENT SHIFT-INVARIANT EMBEDDING QUANTUM KERNELS
In this section we present our second result: all shift-invariant kernels admit a space-efficient EQK approximation provided they are smooth enough. We also give sufficient conditions for a constructive time-efficient EQK approximation. We arrive at these results in two steps: we first prove an upper bound on the Hilbert space dimension required for an approximation as an explicit inner product, classically. We next construct an EQK based on this classical approximation.
FIG. 3. Venn diagram with set relations of different classes of kernels and quantum kernels. Each of the arrows represents a reduction found in this manuscript; they should be read as "for any element of the first set, there exists an element of the second set which is a good approximation." In the case of Theorem 1, elements are individual functions. In every other case, elements are sequences of kernel functions, for which the notions of efficiency make sense. In summary, we find that efficient embedding quantum kernels (EQK) can approximate two important classes of kernels: shift-invariant and composition kernels.
Shift-invariant kernels have enjoyed significant attention in the ML literature. On the one hand, the Gaussian RBF kernel (arguably the most well-known shift-invariant kernel) has been found useful in a range of data-driven tasks. On the other hand, as we see in this section, shift-invariant kernels are more amenable to analytical study than other classes of kernels. The property of shift-invariance, combined with exchange symmetry and PSD, allows for deep mathematical characterization of functions.
Let us first introduce the class of functions of interest: a kernel $k$ is shift-invariant if $k(x + \xi, x' + \xi) = k(x, x')$ for all $x$, $x'$, and $\xi$. As is standard, we then write shift-invariant kernels as a function of a single argument, $k(x - x')$ (which amounts to taking $\xi = -x'$). We define $\Delta := x - x'$ and then talk about $k(\Delta)$.
One motivation for using shift-invariant kernels, as a more restricted function family, is that they have a few useful properties that ease their analysis. Also, it is difficult to decide whether an arbitrary function is PSD, so characterizing general kernel functions is difficult. Conversely, for shift-invariant functions, Bochner's theorem gives a condition that is equivalent to being PSD.
Theorem 4 (Bochner [46]). Let $k$ be a continuous, even, shift-invariant function on $\mathbb{R}^d$. Then $k$ is PSD if and only if $k$ is the Fourier transform (FT) of a non-negative measure $p$,
$$k(\Delta) = \int_{\mathbb{R}^d} p(\omega)\, e^{i \omega \cdot \Delta}\, \mathrm{d}\omega.$$
Furthermore, if $k(0) = 1$, then $p$ is a probability distribution, $\int_{\mathbb{R}^d} p(\omega)\, \mathrm{d}\omega = 1$.
Bochner's theorem is also a central ingredient in a powerful kernel-approximation algorithm known as random Fourier features (RFF) [44], presented here as Algorithm 2.
Algorithm 2: Random Fourier features (RFF) [44].
The input to Algorithm 2 is any shift-invariant kernel function $k$, and the output is a feature map $z$ such that the same kernel can be $\varepsilon$-approximated as an explicit inner product. In Step 1, the inverse Fourier transform $p$ of the kernel $k$ is produced. In Step 2, $D/2$ samples $\omega_i$ are drawn i.i.d. according to $p$. In Step 3, the feature map $z$ is constructed using the samples $\omega_i$ and the sine and cosine functions. That the inner product of $z$ gives a good approximation to $k$ is further elucidated in Appendix A.
With RFF, given a kernel function, Algorithm 2 produces a randomized feature map such that the kernel corresponding to the inner product of pairs of such maps is an unbiased estimator of the initial kernel.The vital question is how large the dimension D has to be in order to ensure ε approximation error, which is the object of study of Theorem 5.
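The three steps of Algorithm 2 can be sketched as follows for the Gaussian RBF kernel $k(\Delta) = \exp(-\|\Delta\|^2 / (2\sigma^2))$, whose inverse FT is the Gaussian distribution $\mathcal{N}(0, \sigma^{-2}\mathbb{I})$. The choice of kernel, the normalization $\sqrt{2/D}$, and the function names are assumptions made for this illustration, in the spirit of Ref. [44]; the sketch is not meant as the reference implementation.

```python
import numpy as np

def gaussian_kernel(x, xp, sigma):
    """Exact Gaussian RBF kernel exp(-|x - x'|^2 / (2 sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(xp, dtype=float)
    return float(np.exp(-diff @ diff / (2.0 * sigma**2)))

def make_rff_map(d, D, sigma, rng):
    """Return z: R^d -> R^D such that <z(x), z(x')> approximates the kernel."""
    if D % 2:
        raise ValueError("D must be even: one cosine and one sine per frequency")
    # Steps 1 and 2: the inverse FT of the Gaussian kernel is N(0, I / sigma^2);
    # draw D/2 frequencies i.i.d. from it.
    omegas = rng.normal(loc=0.0, scale=1.0 / sigma, size=(D // 2, d))

    # Step 3: build the feature map from cosines and sines.
    def z(x):
        phases = omegas @ np.asarray(x, dtype=float)
        return np.sqrt(2.0 / D) * np.concatenate([np.cos(phases), np.sin(phases)])

    return z

rng = np.random.default_rng(0)
d, D, sigma = 3, 4000, 1.0
z = make_rff_map(d, D, sigma, rng)
x = rng.uniform(-1.0, 1.0, size=d)
xp = rng.uniform(-1.0, 1.0, size=d)

approx = float(z(x) @ z(xp))
exact = gaussian_kernel(x, xp, sigma)
print(approx, exact, np.linalg.norm(z(x)))
```

With this normalization every feature vector has unit 2-norm, and the inner product averages $\cos(\omega_i \cdot (x - x'))$ over the sampled frequencies, which is an unbiased estimator of $k(x - x')$.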
Theorem 5 (Random Fourier features, Claim 1 in Ref. [44]). Let $\mathcal{X} \subseteq \mathbb{R}^d$ be a compact data domain. Let $k$ be a continuous shift-invariant kernel acting on $\mathcal{X}$, fulfilling $k(0) = 1$. Then, for the probabilistic feature map $z(\cdot) : \mathcal{X} \to \mathbb{R}^D$ produced by Algorithm 2, the inner product $\langle z(x), z(x') \rangle$ deviates from $k(x, x')$ by more than $\varepsilon$ only with a probability that decays exponentially in $D\varepsilon^2/d$ and grows polynomially in $\sigma_p\, \mathrm{diam}(\mathcal{X})/\varepsilon$; this holds for almost every $x, x' \in \mathcal{X}$. Here $\mathrm{diam}(\mathcal{X}) = \sup_{x, x' \in \mathcal{X}} \{\|x - x'\|\}$ is the diameter of $\mathcal{X}$, and $\sigma_p^2$ is the variance of the inverse FT of $k$ interpreted as a probability distribution, $\sigma_p^2 = \mathbb{E}_{\omega \sim p}[\|\omega\|^2]$. In particular, it follows that for any constant success probability, there exists an $\varepsilon$-approximation of $k$ as a Euclidean inner product where the feature space dimension $D$ satisfies
$$D \in \Omega\left(\frac{d}{\varepsilon^2} \log\frac{\sigma_p\, \mathrm{diam}(\mathcal{X})}{\varepsilon}\right).$$
Table caption (opening words not recovered): [...] [44]) or introduced in this manuscript (Algorithm 1; QRFF, as Algorithm 3; RFF pp, as Algorithm 4; and QRFF pp, as Algorithm 5). The details for each of the three families are elucidated in the corresponding sections: the universality of EQKs for general kernels is explained in Section IV, with Theorem 1; the universality of efficient EQKs for shift-invariant kernels appears in Section V, studied formally in Corollaries 6 and 8; finally, composition kernels are introduced in Section VI, with Proposition 9 stating their efficient approximation as EQKs, and Proposition 11 confirming that composition kernels contain the so-called projected quantum kernels presented in Refs. [28, 45].
The RFF construction can fail to produce a good approximation with a certain probability, but the failure probability can be pushed down arbitrarily close to 0 efficiently. This theorem can be understood as a probabilistic existence result for efficient embedding-quantum approximations of kernel functions, which we present next as our second main contribution. We consider the input dimension $d$ as our scaling parameter, which plays the role of $s$. We further take $\varepsilon$-approximation to be the supremum of the pointwise difference almost everywhere, which we inherit from Theorem 5. In this result we do not need to fix how the input kernel sequence $\{k_s\}_{s \in \mathbb{N}}$ is specified, but we assume it is specified in a way that allows us to approximate it using a quantum computer. When it comes to what definition of efficiency we need to use, we consider first space efficiency, since we talk about the required number of dimensions to approximate the kernel as an explicit inner product. Later on we pin down the time complexity as well.
In the following, for clarity of presentation, we decouple the construction of EQKs via RFFs as a two-step process: first produce the (classical) feature map $z$, and second realize the inner product as an EQK as in Lemma 3. As of now, Algorithm 2 produces a classical approximation via random Fourier features, not an EQK yet. Next we add a smoothness assumption to Theorem 5 to produce Corollary 6. Further below we introduce Algorithm 3, which takes the feature map $z$ from Algorithm 2 and produces an EQK from it using Algorithm 1.
Corollary 6 (Smooth shift-invariant kernels). Let $\mathcal{X}_d \subseteq [-R, R]^d$ be a compact domain. For $\varepsilon > 0$, let $k_d$ be a continuous shift-invariant kernel function. Assume the kernel fulfills $k_d(0) = 1$ and has second derivatives at the origin bounded in magnitude by a constant $B$. Then Algorithm 2 produces an $\varepsilon$-approximation of $k_d$ as an explicit inner product. In particular, the required dimension $D$ of the (probabilistic) feature map scales at most polynomially in $d$ and $1/\varepsilon$, with $R$ and $B$ entering only inside a logarithm.
Proof. From Bochner's theorem (Theorem 4), we know that $k_d$ is the FT of a probability distribution $p_d$. Next, a standard Fourier identity allows us to relate the variance $\sigma_d^2$ of $p_d$ and the trace of the Hessian $H(k_d)$ at the origin $\Delta = 0$ (see, e.g., Refs. [44, 47]):
$$\sigma_d^2 = -\operatorname{tr}\big(H(k_d)\big)\Big|_{\Delta = 0}.$$
Using the assumption that the second derivatives at the origin are bounded by $B$, we obtain an upper bound on $\sigma_d^2$. In parallel, we can upper bound the diameter of $\mathcal{X}_d$ by the diameter of $[-R, R]^d$, which is $2R\sqrt{d}$. By plugging the bounds on $\sigma_d^2$ and $\mathrm{diam}(\mathcal{X}_d)$ into the bound of Theorem 5 we obtain the claimed result.
In Appendix C, we provide Corollary C.1, which fixes the number of bits of precision required to achieve a close approximation to the second derivative using finite difference methods.
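The two ingredients of the proof can be checked numerically on a concrete case. The sketch below assumes the Gaussian kernel $k(\Delta) = \exp(-\|\Delta\|^2/(2\sigma^2))$ as a stand-in for $k_d$, estimates the trace of its Hessian at the origin by central finite differences, and compares it with the second moment of samples from its inverse FT, $\mathcal{N}(0, \sigma^{-2}\mathbb{I})$. The values of $d$, $\sigma$, $R$, the step size `h`, and the sample count are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(delta, sigma):
    """Shift-invariant Gaussian kernel as a function of delta = x - x'."""
    delta = np.asarray(delta, dtype=float)
    return float(np.exp(-delta @ delta / (2.0 * sigma**2)))

def hessian_trace_at_origin(k, d, h=1e-3):
    """Central finite-difference estimate of sum_i d^2 k / d delta_i^2 at delta = 0."""
    origin = np.zeros(d)
    total = 0.0
    for i in range(d):
        step = np.zeros(d)
        step[i] = h
        total += (k(step) - 2.0 * k(origin) + k(-step)) / h**2
    return total

d, sigma, R = 4, 0.7, 2.0

# Trace of the Hessian at the origin; analytically -d / sigma^2 for this kernel.
trace_hessian = hessian_trace_at_origin(lambda delta: gaussian_kernel(delta, sigma), d)

# Second moment E[|omega|^2] of the inverse FT, estimated by Monte Carlo.
rng = np.random.default_rng(0)
omegas = rng.normal(scale=1.0 / sigma, size=(200_000, d))
second_moment = float(np.mean(np.sum(omegas**2, axis=1)))

# Euclidean diameter of the hypercube [-R, R]^d: distance between opposite corners.
diameter = 2.0 * R * np.sqrt(d)

print(trace_hessian, second_moment, diameter)
```

For this kernel both quantities equal $d/\sigma^2$ up to sign, so the variance entering Theorem 5 is controlled directly by the curvature of the kernel at the origin.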
Indeed, Corollary 6 ensures that the Hilbert-space dimension required to $\varepsilon$-approximate any shift-invariant kernel fulfilling mild conditions of smoothness scales at most polynomially in the relevant scale parameters, if $R$ and $B$ are considered to be constant. This does not provide an answer to Question 1 yet, as the result only talks about existence of a space-efficient approximation, and not about the complexity of finding such an approximation only from the specification of each kernel in the sequence $\{k_d\}_d$. Noteworthy is that so far we have not made any assumptions on the complexity of finding the inverse FT of $k_d$, nor the complexity of sampling from it.
Before going forward, one could enquire whether the upper bound from Corollary 6 could be forced to scale exponentially in $d$. In this direction, one would, e.g., consider cases where $R$ and $B$ are not fixed, but rather also depend on $d$. But, since both $R$ and $B$ appear inside a logarithm, in order to achieve a scaling exponential in $d$ overall, we would need to let either $R$ or $B$ grow doubly-exponentially. While that remains a possibility, we point out that in such a case we would easily lose the ability to $\varepsilon$-approximate each of the kernels $k_d$ quantum-efficiently to begin with, so we judge these scenarios as less relevant. And even then, it should be noted that Theorem 5 offers only an upper bound to the required dimension $D$. In order to discuss whether $D$ can be forced to scale exponentially, we would also need a lower bound. The same reasoning applies for $\varepsilon$.
Notice in Corollary 6 we talk about the required feature dimension, not the number of qubits. Indeed, we can encode the feature vectors (which are nicely normalized) onto quantum states, as presented in Algorithm 3, which we call quantum random Fourier features (QRFF). What QRFF does is first obtain the probabilistic map $z$ from RFF, and then encode it into quantum states and take their inner product as in Lemma 3. By construction, the feature maps produced by RFF are unit vectors with respect to the 2-norm, $\|z(x)\|_2 = 1$ for any $x \in \mathcal{X}$. For Algorithm 1 we require unit vectors with respect to the 1-norm, so we need to renormalize the vectors, and this introduces another multiplicative factor, which now depends on the input vectors.
Lemma 7. Let $r, r' \in \mathbb{R}^d$ be unit vectors with respect to the 2-norm. Then the identity
$$\langle r, r' \rangle = \|r\|_1 \|r'\|_1 \left(2^n \operatorname{tr}\{\rho_{\tilde{r}}\, \rho_{\tilde{r}'}\} - 1\right)$$
holds, where $\tilde{r}^{(\prime)} = r^{(\prime)} / \|r^{(\prime)}\|_1 \in \ell_1^d$ corresponds to renormalizing with respect to the 1-norm. Here $\rho_{\tilde{r}}$ refers to encoding $\tilde{r}$ onto a quantum state using Algorithm 1.
Proof. The proof is given in Appendix B.
Remark (Bounded pre-factors). The re-normalization is of no concern, since the fact that $r, r'$ are 2-norm unit vectors implies that their 1-norm is bounded, $1 \le \|r\|_1 \le \sqrt{d}$. This explains the extra factor $g(x) g(x')$ appearing in Algorithm 3, where we have $g(x) = \|z(x)\|_1$.
Algorithm 3: Quantum random Fourier features (QRFF).
The number of qubits required $n$ scales logarithmically in the dimension of the feature vector, $n \in \mathcal{O}(\log(D))$. So the number of qubits $n$ necessary for $\varepsilon$-approximating shift-invariant kernel functions as EQKs has scaling $n \in \tilde{\mathcal{O}}(\log(d/\varepsilon^2))$, where the tilde hides doubly-logarithmic contributions.
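Putting the pieces together, the following sketch mimics the QRFF pipeline on a small instance: draw an RFF map, renormalize each feature vector to unit 1-norm, encode it as a Pauli mixture, and rescale the Hilbert-Schmidt inner product by the prefactor $g(x) g(x')$. It assumes $g(x) = \|z(x)\|_1$, the encoding $\rho_r = 2^{-n}(\mathbb{I} + \sum_i r_i P_i)$, and a Gaussian kernel; all function names are hypothetical, and a deliberately small $D$ is used so that dense matrices remain cheap, at the cost of a poor kernel approximation.

```python
import itertools
from functools import reduce
import numpy as np

PAULIS = [
    np.eye(2, dtype=complex),
    np.array([[0, 1], [1, 0]], dtype=complex),
    np.array([[0, -1j], [1j, 0]], dtype=complex),
    np.array([[1, 0], [0, -1]], dtype=complex),
]

def pauli_basis(n):
    """Non-identity n-qubit Pauli matrices in a fixed (assumed) order."""
    return [
        reduce(np.kron, [PAULIS[label] for label in labels])
        for labels in itertools.product(range(4), repeat=n)
        if any(labels)
    ]

def encode(r):
    """Encode a 1-norm unit vector as (I + sum_i r_i P_i) / 2**n."""
    n = 1
    while 4**n - 1 < len(r):
        n += 1
    rho = np.eye(2**n, dtype=complex)
    # zip stops at len(r), which is equivalent to padding r with zeros.
    for coeff, pauli in zip(r, pauli_basis(n)):
        rho += coeff * pauli
    return rho / 2**n, n

def make_rff_map(d, D, sigma, rng):
    """RFF map for the Gaussian kernel; outputs have unit 2-norm."""
    omegas = rng.normal(scale=1.0 / sigma, size=(D // 2, d))

    def z(x):
        phases = omegas @ np.asarray(x, dtype=float)
        return np.sqrt(2.0 / D) * np.concatenate([np.cos(phases), np.sin(phases)])

    return z

def qrff_kernel(z, x, xp):
    """Evaluate <z(x), z(x')> through density matrices and the prefactor g(x) g(x')."""
    zx, zxp = z(x), z(xp)
    gx, gxp = np.abs(zx).sum(), np.abs(zxp).sum()  # assumed g(x) = |z(x)|_1
    rho_x, n = encode(zx / gx)
    rho_xp, _ = encode(zxp / gxp)
    overlap = float(np.real(np.trace(rho_x @ rho_xp)))
    return gx * gxp * (2**n * overlap - 1.0), n, (gx, gxp)

rng = np.random.default_rng(0)
d, D, sigma = 2, 14, 1.0
z = make_rff_map(d, D, sigma, rng)
x = rng.uniform(-1.0, 1.0, size=d)
xp = rng.uniform(-1.0, 1.0, size=d)

value, n, (gx, gxp) = qrff_kernel(z, x, xp)
print(n, value, float(z(x) @ z(xp)), gx, gxp)
```

The quantum-side value reproduces the classical inner product $\langle z(x), z(x') \rangle$ exactly, so any approximation error with respect to the kernel comes only from the RFF step. The prefactors stay within $[1, \sqrt{D}]$, as stated in the remark.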
With these, we can almost conclude that Corollary 6 results in a positive answer to Question 1 for shift-invariant kernels provided they are smooth, with smoothness quantified as the magnitude of the second derivative at the origin. We are still missing the complexity of producing samples from the inverse FT of each $k_d$, as in Step 2 of Algorithm 2. The efficiency criterion taken here would be the required number of qubits. Nevertheless, it is true that for any smooth shift-invariant kernel, there exists an $\varepsilon$-approximation as an EQK using at most logarithmically many qubits following Algorithm 3.
The complexity of Algorithm 2 could become arbitrarily large based on the difficulty of sampling from the distribution $p_d$ corresponding to the inverse FT of $k_d$. In order to ensure embedding-quantum efficiency, we need to add the requirement of efficient sampling from the inverse FT of the kernel. With this, we proceed to state our main result: efficient EQKs are universal within the class of smooth shift-invariant kernels whose inverse FT is efficiently sampleable.
In turn, the first step of Algorithm 3 is calling Algorithm 2, which we just saw takes time polynomial in $d$. Step 2 uses Algorithm 1, which also runs in time polynomial in $d$. With this it is also clear that the total run-time complexity of Algorithm 3 is at most polynomial in $d$.
With the added condition of producing samples in polynomial time, we obtain a positive answer to Question 1. Namely, we show the existence of time-efficient EQK approximations for any kernel fulfilling the smoothness and efficient sampling conditions from Corollaries 6 and 8. Notice our result does not even require quantum efficiency of $k_d$, so in fact we have proved something stronger.
One point to address is whether assuming quantum efficiency for evaluating $k_d$ directly implies the ability to sample efficiently from the inverse FT of $k_d$. In the cases where this is true, adding quantum efficiency to the assumptions of Corollary 6 suffices to prove that efficient EQKs are universal within the class of efficient shift-invariant kernels. On the face of it, it is unclear whether for all reasonable ways of specifying $k_d$ the capacity to efficiently evaluate it using a quantum computer implies an efficient algorithm for sampling from the distribution obtained by the inverse FT. If the task were to sample from $k_d$ itself (in case $k_d$ were a probability distribution), then we know that being able to evaluate $k_d$ does not imply the capacity to efficiently sample from $k_d$. Then, it is unclear why sampling from the inverse FT of $k_d$ would be easy in every case, especially, e.g., in the black-box model. Resolving this question is left as an open problem, and we note it is not an entirely new one [49, 50].
Finally, it should be noted that, although the feature map produced by Algorithm 2 can be stored classically (assuming $D \in \mathcal{O}(d/\varepsilon^2)$), this is not a de-quantization algorithm. Algorithm 2 requires sampling from the inverse FT of the input kernels $k_d$. Reason dictates that, if the kernel is quantum efficient and classically inefficient to $\varepsilon$-approximate, in general sampling from its inverse FT should also be at least classically inefficient. This is what we meant earlier when we said we consider quantum time: even though the intermediate variable $z(x)$ can be stored classically, producing it requires usage of a quantum computer.
In this section, we have shown that smooth shift-invariant kernels admit space-efficient EQK approximations. We have also given sufficient conditions for the same result to hold for embedding-quantum efficiency in total runtime. In the next section, we see how the same ideas extend to another class of kernel functions, beyond shift-invariant ones.

VI. COMPOSITION KERNEL AND PROJECTED QUANTUM KERNEL
In this section we introduce a new class of quantum kernels which can still be turned into EQKs using another variant of Algorithm 2. We show that the new class also admits time-efficient approximations as EQKs, and that the projected quantum kernel from Ref. [28] belongs to this class. During the publication phase of this manuscript we were made aware of the similarities between the composition kernels we introduce here and the "distance kernel with RDM" proposed and studied in Ref. [43].
Again, we separate the EQK-based approximation into two steps: first we propose a variant of RFF producing a classical feature map, and second we construct an EQK that evaluates the inner product of pairs of features.
First we introduce the new class. Consider the usual Gaussian kernel with parameter $\sigma > 0$ defined as
$$k_\sigma(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right),$$
only now we allow for some pre-processing of the inputs $x \mapsto f(x)$, resulting in
$$k_f(x, x') = \exp\left(-\frac{\|f(x) - f(x')\|^2}{2\sigma^2}\right). \qquad (31)$$
The introduction of $f$ breaks shift invariance in general, hence the need to specify both arguments independently, $x, x' \in \mathcal{X}$. Since $k_f$ is a PSD kernel for any function $f$, we refer to such constructions as composition kernels. Next we propose a generalization of Algorithm 2 that also works for composition kernels, as Algorithm 4, which we call simply random Fourier features with pre-processing (RFF pp).
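Algorithm 4: Random Fourier features with pre-processing (RFF pp).
1: $z \leftarrow$ output of Algorithm 2 (RFF) for the Gaussian kernel on the domain $f(\mathcal{X})$
2: $z_f \leftarrow z \circ f$ ▷ Apply the pre-processing function.
3: return $z_f$
Since the core feature of Algorithm 4 is to call Algorithm 2, we inherit the $\varepsilon$-approximation guarantee. Notice in line 1 of Algorithm 4 we invoke Algorithm 2, but taking as input domain the range of the pre-processing function, $f(\mathcal{X})$, instead of the original domain $\mathcal{X}$. If we assume $f$ to be continuous, it follows that $f(\mathcal{X})$ is also compact, which we require for the application of Theorem 5.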
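A minimal sketch of RFF pp, under the assumption that the composition kernel is the Gaussian kernel evaluated on pre-processed inputs, $k_f(x, x') = \exp(-\|f(x) - f(x')\|^2/(2\sigma^2))$. The pre-processing function `f` used here (a fixed linear map followed by `tanh`) is a purely hypothetical, classically cheap stand-in chosen so the example runs; in the setting of interest $f$ would be a quantum-efficient function.

```python
import numpy as np

def make_rff_map(dim, D, sigma, rng):
    """RFF map for the Gaussian kernel on R^dim (frequencies ~ N(0, I / sigma^2))."""
    omegas = rng.normal(scale=1.0 / sigma, size=(D // 2, dim))

    def z(v):
        phases = omegas @ np.asarray(v, dtype=float)
        return np.sqrt(2.0 / D) * np.concatenate([np.cos(phases), np.sin(phases)])

    return z

# Hypothetical pre-processing f: R^3 -> [-1, 1]^2 (illustrative weights).
W = np.array([[1.0, -0.5, 0.2], [0.3, 0.8, -1.0]])

def f(x):
    return np.tanh(W @ np.asarray(x, dtype=float))

def make_rff_pp_map(f, out_dim, D, sigma, rng):
    """Run RFF on the range of f, then compose with f."""
    z = make_rff_map(out_dim, D, sigma, rng)  # sampling does not depend on f
    return lambda x: z(f(x))                   # apply the pre-processing function

def composition_kernel(x, xp, sigma):
    """Exact composition kernel exp(-|f(x) - f(x')|^2 / (2 sigma^2))."""
    diff = f(x) - f(xp)
    return float(np.exp(-diff @ diff / (2.0 * sigma**2)))

rng = np.random.default_rng(0)
sigma, D = 1.0, 4000
z_f = make_rff_pp_map(f, out_dim=2, D=D, sigma=sigma, rng=rng)
x = rng.uniform(-1.0, 1.0, size=3)
xp = rng.uniform(-1.0, 1.0, size=3)

approx = float(z_f(x) @ z_f(xp))
exact = composition_kernel(x, xp, sigma)
print(approx, exact)
```

Note that the frequencies are drawn before $f$ is ever evaluated, which reflects the observation that the sampling step involves only a Gaussian distribution.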
Proposition 9 (Performance guarantee of Algorithm 4). Let $f : \mathcal{X} \to [-B, B]^{g_1(d)}$ be a pre-processing function, and let $k_f$ be the Gaussian kernel composed with $f$, as introduced in Eq. (31). Let finally the parameter of the Gaussian kernel be $\sigma = g_2(d)$. If $g_1(d) \in \mathcal{O}(\mathrm{poly}(d))$ and $g_2(d) \in \Omega(\mathrm{poly}(d)^{-1})$, then Algorithm 4 produces an $\varepsilon$-approximation of $k_f(x, x')$ as an explicit inner product. In particular, the required dimension $D$ of the randomized feature map is at most polynomial in the input dimension $d$ and the inverse error $1/\varepsilon$.
Proof. The proof is provided in Appendix D.
Notice Proposition 9 shares a deep similarity with Corollary 6. Both are direct applications of Theorem 5 to kernels fulfilling different properties. They have nevertheless one stark difference. The assumptions in Proposition 9 are sufficient to guarantee we can sample efficiently from the inverse FT of the kernels in the sequence $k_d$. This is because the probability distributions involved, $p_d$, are nothing but Gaussian distributions themselves. As shown in Algorithm 4, the pre-processing function $f$ shows up only after the sampling step. So, unlike Corollary 6, Proposition 9 already represents a direct positive answer to Question 1, relying on Algorithm 5, which follows.
In Algorithm 5, quantum random Fourier features with pre-processing (QRFF pp), we take the output of Algorithm 4 and convert it into a quantum feature map with Algorithm 1, leading to the corresponding EQK approximation. Again, the multiplicative term $g(f(x)) g(f(x'))$ appears from the renormalization in Lemma 7.
The scaling in the number of qubits is once more logarithmic in the scaling of the number of required dimensions given in Proposition 9. For the number of qubits $n$, the scaling is $n \in \tilde{\mathcal{O}}(\log(d/\varepsilon^2))$. Moreover, the run-time complexity of Algorithm 5 is polynomial in $d$ and in the run-time complexity of evaluating $f$, since now the sampling step corresponds to sampling from a product Gaussian distribution.
For completeness, it would be good also to rule out the possibility of classically efficient approximation. Although it might be counter-intuitive, in principle there could exist pre-processing functions $f$ that are hard to evaluate but which result in a composition kernel $k_f$ that is not hard to evaluate. In the following we just confirm that there exist pre-processing functions which are hard to evaluate classically and which result in composition kernels that are also hard to evaluate classically.

Proposition 10 (No efficient classical approximation). There exists a function $f : \mathcal{X} \to [0, 1]^d$ which can be $\varepsilon$-approximated quantum efficiently in $d$ for which the composition kernel $k_f$ cannot be $\varepsilon$-approximated classically efficiently in $d$, with
$$k_f(x, x') = \exp\left(-\|f(x) - f(x')\|^2\right).$$
We take $\sigma^2 = 1/2$ for simplicity.
Proof. The proof is presented in Appendix D.
By selecting pre-processing functions that are quantum efficient but classically inefficient, we reach a class of quantum kernels that are not shift-invariant, but for which the RFF pp construction still applies, from Algorithm 4. Next, we show that this class of kernels contains the recently introduced projected quantum kernel [28].
Going back to the quantum feature map $x \mapsto \rho(x)$, the authors of Ref. [28] proposed a mapping from the exponentially-sized $\rho(x)$ to an array of reduced density matrices $(\rho_k(x))_{k=1}^N$, for an $N$-qubit quantum state. According to our definitions, we would not call this a quantum feature map, since the data is not mapped onto the Hilbert space of quantum states. Rather, it is natural to think of this as a quantum pre-processing function. A nice property of this alternative mapping is that the feature vectors can be efficiently stored in classical memory, even though obtaining each of the $\rho_k(x)$ matrices can only be done efficiently using a quantum computer in general. Once these are stored, the projected quantum kernel $k_{\mathrm{PQ}}$ is the composition kernel as we introduced earlier, just setting $f$ to be the function that computes the entries of all the reduced density matrices,
$$k_{\mathrm{PQ}}(x, x') = \exp\left(-\gamma \sum_{k=1}^{N} \|\rho_k(x) - \rho_k(x')\|_F^2\right),$$
where $\gamma > 0$ can be safely taken as $\gamma = 1/(2\sigma^2)$, $\rho_k(x)$ is the reduced density matrix of the $k$th qubit of $\rho$, and recall $\|\cdot\|_F$ is the Frobenius norm. The number of qubits $N$ is left as a degree of freedom, but for $d$-dimensional input data, reason would say the number of qubits used would be $N \propto d$, or at most $N \in \mathcal{O}(\mathrm{poly}(d))$.
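For concreteness, the projected quantum kernel can be evaluated on small state vectors as in the sketch below, assuming the form $k_{\mathrm{PQ}}(x, x') = \exp(-\gamma \sum_k \|\rho_k(x) - \rho_k(x')\|_F^2)$ with single-qubit reduced density matrices. The feature map `feature_state` (one $R_Y$ rotation per qubit followed by a chain of CZ gates) is a hypothetical toy circuit chosen only so that the reduced states are non-trivially mixed; it is not the embedding used in Ref. [28].

```python
import numpy as np

def feature_state(x):
    """Toy feature map (assumption): RY(x_k)|0> on each qubit, then CZ on neighbours."""
    N = len(x)
    psi = np.array([1.0 + 0j])
    for angle in x:
        qubit = np.array([np.cos(angle / 2.0), np.sin(angle / 2.0)], dtype=complex)
        psi = np.kron(psi, qubit)
    psi = psi.reshape([2] * N)  # one tensor axis per qubit
    for k in range(N - 1):
        index = [slice(None)] * N
        index[k] = 1
        index[k + 1] = 1
        psi[tuple(index)] *= -1.0  # CZ flips the sign when both qubits are |1>
    return psi

def single_qubit_rdms(psi):
    """Reduced density matrix of every qubit of the pure state psi."""
    rdms = []
    for k in range(psi.ndim):
        m = np.moveaxis(psi, k, 0).reshape(2, -1)  # rows: qubit k, columns: the rest
        rdms.append(m @ m.conj().T)                # trace out all other qubits
    return rdms

def projected_kernel(x, xp, gamma=1.0):
    """exp(-gamma * sum_k |rho_k(x) - rho_k(x')|_F^2)."""
    total = 0.0
    for a, b in zip(single_qubit_rdms(feature_state(x)),
                    single_qubit_rdms(feature_state(xp))):
        total += np.linalg.norm(a - b, ord="fro") ** 2
    return float(np.exp(-gamma * total))

rng = np.random.default_rng(0)
x = rng.uniform(0.0, np.pi, size=4)
xp = rng.uniform(0.0, np.pi, size=4)

k_same = projected_kernel(x, x)
k_xy = projected_kernel(x, xp)
k_yx = projected_kernel(xp, x)
traces = [float(np.real(np.trace(r))) for r in single_qubit_rdms(feature_state(x))]
print(k_same, k_xy, k_yx)
```

In this picture the pre-processing function $f$ simply collects the entries of the $N$ reduced density matrices into one classical vector, whose length grows linearly in $N$.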
The projected quantum kernel enjoys valuable features. On the one hand, it was used in an effort to prove quantum-classical learning separations in the context of data from quantum experiments [28]. On the other hand, the projected quantum kernel is less vulnerable to the exponential kernel concentration problem [38]. The projected quantum kernel is also deeply related to the shadow tomography formalism, and guarantees can be placed on its performance for quantum phase recognition [45], among others.
Proposition 11 (Projected quantum kernel as an efficient EQK). The projected quantum kernel $k_{\mathrm{PQ}}$ fulfills the assumptions of Proposition 9, so it can be efficiently approximated as an EQK, with the required number of dimensions $D$ bounded as in Proposition 9.
Proof. The proof of this statement is presented in Appendix D.
Remark (General projected quantum kernels). In the proof, we have assumed that the projected quantum kernel is taken with respect to the subsystems being each individual qubit. A general definition of the projected quantum kernel would allow for other subsystems, but the requirement that the feature map must be efficient to store classically prevents the number of subsystems from scaling more than polynomially, and the local dimension of each subsystem from scaling more than logarithmically, in which case we still have $g_1(d) \in \mathcal{O}(\mathrm{poly}(d))$.
The projected quantum kernel poses a clear example of a quantum kernel that is not constructed from a feature map onto the Hilbert space of density operators. Its closeness to the classical shadow formalism has earned it attention from the moment it was proposed. Thus, proving that the projected quantum kernel also admits an efficient approximation as an EQK results in another important kernel family that is covered by EQKs.
In order to identify scenarios in which the scaling of $D$ could become super-polynomial in the relevant parameters $d$, $1/\varepsilon$, one would require for example $g_1(d) \propto \exp(d)$, or $g_2(d) \propto 1/\exp(\exp(d))$, using the definitions from Proposition 9. Indeed, in those scenarios we would not be able to guarantee the space efficiency of Algorithm 4 while keeping good $\varepsilon$-approximation. Nevertheless, the conditions $g_1(d) \propto \exp(d)$ or $g_2(d) \propto 1/\exp(\exp(d))$ alone would not be sufficient to prove that Algorithm 4 must fail. Said otherwise, this is again because from Theorem 5 we only have an upper bound on the required space complexity, and not a lower bound. At the same time, though, both cases would prevent us from being able to quantum-efficiently $\varepsilon$-approximate the projected quantum kernel in the first place, so these scenarios must be ruled out as they break the hypotheses. And even if they were allowed, recall that we can store classical vectors in quantum states using only logarithmically many qubits in the length of the classical vector. That means that single exponential $g_1(d)$ and inverse doubly-exponential $g_2(d)$ would not be enough to require exponential embedding-quantum complexity in the number of qubits. In order to have the number of qubits be, e.g., $n \in \mathcal{O}(\exp(d))$, we would require either $g_1(d) \in \mathcal{O}(\exp(\exp(d)))$, or $g_2(d) \in \Omega(1/\exp(\exp(\exp(d))))$, which would prevent our ability to evaluate the projected kernel even further.
With these considerations, we have shown that projected quantum kernels, despite their using a different feature map and inner product, can be efficiently realized as EQKs. As a recapitulation, Fig. 3 summarizes all our contributions as a collection of set inclusions in a Venn diagram. With this, to the best of our knowledge, we conclude that all quantum kernels in the literature are either EQKs directly, or they can be efficiently realized as EQKs.

VII. OUTLOOK
This manuscript so far offers restricted positive answers to Question 1. In this section, we list some promising directions to search for other restricted answers to the same question. Answering Question 1 negatively would involve proving a non-existence result, for which one needs a different set of tools than the ones we have used so far. In particular, one would need lower bounds to establish non-existence, which in this corner of the literature appear to be trickier to find.

A. Time efficient EQKs
It should be possible to come up with a concrete construction that also ensures time efficiency of Algorithm 2 while applying Corollary 8. For that to happen, it would be enough to give not only the kernel function, but also an efficient algorithm to sample from its inverse FT. We recognize this question as the clearest next step in this new research direction.

B. Non-stationary kernels and indefinite kernels
The theory of integral kernel operators and their eigenvalues has already been extensively developed in the context of functional analysis [51][52][53], also with applications to random processes and correlation functions [54]. In particular, for 1-dimensional kernel functions, it is known that reasonable smoothness assumptions lead to kernel spectra which are concentrated on few eigenfunctions. This fact alone hints strongly toward the existence of low-dimensional EQK-based approximations for a larger class of kernel functions than the ones we explored here. While we restricted ourselves to shift-invariant and composition kernels, the theory of eigenvalues of integral kernel operators seems to suggest that similar results could be obtained for any smooth kernel function. That being said, the generalization from 1 to d-dimensional data might well come with factors which grow exponentially with d, somewhat akin to the well-known curse of dimensionality.
Alternatively, instead of arbitrary PSD kernel functions, one might also consider other restricted classes of known kernels. For example, rotation-invariant kernels, of which polynomial kernels are a central example, offer an interesting starting point. Polynomial kernels are not shift-invariant, but they are often derived with an explicit feature map in mind. A potentially interesting research line would be to generalize polynomial kernels in a way that is less straightforward to turn into an EQK.
Our results relied strongly on the random Fourier features (RFF) algorithm of Ref. [44]. The RFF approximation in turn rests on a sampling protocol that owes its success to Bochner's theorem for shift-invariant PSD functions (Theorem 4). Yaglom's theorem (Theorem A.4 in Appendix A) is referred to as the generalization of Bochner's theorem to functions that are not shift-invariant. It would be interesting to see whether there are other, non-shift-invariant kernels for which Yaglom's theorem could be used to furnish an explicit feature map with approximation guarantees akin to those of Theorem 5.
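To make the sampling protocol behind RFF concrete, here is a minimal NumPy sketch of the standard random Fourier feature construction for one specific shift-invariant kernel. It assumes the Gaussian kernel k(x, x′) = exp(−∥x − x′∥²/2), whose spectral distribution is a standard normal; the function name `rff_features`, the dimensions, and the random seed are illustrative choices, and the sketch is not claimed to match Algorithm 2 step by step.

```python
import numpy as np

def rff_features(X, omegas, phases):
    """Map rows of X to D-dimensional random Fourier features."""
    D = omegas.shape[0]
    return np.sqrt(2.0 / D) * np.cos(X @ omegas.T + phases)

rng = np.random.default_rng(0)
d, D = 3, 20000
# Assumed kernel: k(x, x') = exp(-||x - x'||^2 / 2); its inverse FT is N(0, I).
omegas = rng.standard_normal((D, d))
phases = rng.uniform(0.0, 2.0 * np.pi, size=D)

X = rng.uniform(-1.0, 1.0, size=(5, d))
Z = rff_features(X, omegas, phases)

# Inner products of feature vectors approximate the kernel values.
approx = Z @ Z.T
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
exact = np.exp(-0.5 * sq_dists)
max_err = np.abs(approx - exact).max()
print(f"max abs error with D={D}: {max_err:.4f}")
```

The approximation error shrinks roughly like 1/√D, in line with the Monte Carlo nature of the estimate.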
The extension from PSD to non-PSD functions could be similar to the extension from shift-invariant to smooth functions we just described. Intuitively, finding an EQK approximation for a kernel is similar to finding its singular value decomposition in an infinite-dimensional function space. In this space, for PSD functions the "left"- and "right"-hand matrices in the singular value decomposition are equal (up to conjugation). Conversely, for non-PSD functions, the left and right feature maps would differ.
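The finite-dimensional analogue of this intuition can be sketched on Gram matrices. The example below factors a matrix K as A Bᵀ through its SVD and checks whether a single feature matrix suffices. The indefinite kernel sin(3(x + x′)), the grid, and the helper name `feature_maps` are hypothetical choices made only for illustration.

```python
import numpy as np

def feature_maps(K):
    """Factor K = A @ B.T through the SVD; rows of A, B act as feature vectors."""
    U, s, Vt = np.linalg.svd(K)
    root = np.sqrt(s)
    return U * root, Vt.T * root

x = np.linspace(-1.0, 1.0, 20)
# PSD example: Gaussian kernel. Indefinite example: hypothetical sin(3(x + x')).
K_psd = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
K_indef = np.sin(3.0 * (x[:, None] + x[None, :]))

for name, K in (("PSD", K_psd), ("indefinite", K_indef)):
    A, B = feature_maps(K)
    two_sided = np.allclose(A @ B.T, K, atol=1e-8)
    one_sided = np.allclose(A @ A.T, K, atol=1e-8)
    print(f"{name}: K = A B^T holds: {two_sided}; K = A A^T holds: {one_sided}")
```

For the PSD matrix one feature map reproduces K on both sides, whereas the indefinite matrix can only be recovered with distinct left and right factors, since A Aᵀ is always PSD.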
In this direction, we identify four promising research lines: first, generalize our study to the general class of 1-dimensional smooth functions; second, study restricted classes of higher-dimensional kernel functions for which the integral kernel operator retains desirable qualities for approximation; third, find restricted cases where a kernel approximation based on Yaglom's theorem is possible; and fourth, extend similar results to indefinite kernel functions. We comment on these directions in Appendix A.

C. The variational QML lens
When studying EQK-based approximations to given kernel functions, we have not restricted ourselves to PQC-based functions. Even though the literature on quantum kernel methods up to this point has not dealt exclusively with variational circuits, a large body of work has indeed focused on functions estimated as the expectation value of a fixed observable with respect to some parametrized quantum state. In this sense, combining the results from Ref. [31] and the rules for generating PSD functions from Ref. [47], one can develop a framework for how to approach quantum kernel functions holistically. We give some first steps explicitly in Appendix E. We expect many discoveries to result from an earnest study of the main recipes to construct PQC-based quantum kernels other than by the established EQK principles.

VIII. CONCLUSION
In this manuscript we raise and partially answer a fundamental question regarding the expressivity of mainstream quantum kernels. We identify that most quantum kernel approaches are of a restricted type, which we call embedding quantum kernels (EQKs), and we ask whether this family covers all quantum kernel functions. If we leave notions of efficiency aside, we show that all kernel functions can be realized as EQKs, proving their universality. Universality is an important ground fact to establish, as it softly supports the usage of EQKs in practice.
Learning whether EQKs are indeed everything we need for QML tasks or whether we should look beyond them is important in the context of model selection. In ML for kernel methods, model selection boils down to choosing a kernel. For ML to be successful, it has long been known that models should be properly aligned with the data; for different learning tasks, different models should be used. This notion is captured by the concept of inductive bias [55,56], which is different for every kernel. Given a learning task with data, the first step should always be to gather information about potential structures in the data in order to select a model with a beneficial inductive bias. In this scenario, it becomes crucial to then have access to a model with good data-alignment. In Ref. [55] some simple EQKs were found to possess inductive biases which are uninteresting for practically relevant data sets. These findings fueled the need to ask whether EQKs can also have interesting inductive biases, which is softly confirmed based on their universality. When searching for new quantum advantages in ML, the more interesting classes of kernels we have access to, the better.
We propose Question 1 as a new line of research in quantum machine learning (QML). Characterizing a relation of order between general quantum kernel functions and the structured EQK functions when considering computational efficiency could have an important impact in the quest for quantum advantages. Indeed, if it were found that all practically relevant kernels can be realized as EQKs, then researchers in quantum kernel methods would not need to look for novel, different models to solve learning tasks with. Nevertheless, for now there is still room for practically interesting kernels beyond efficient EQKs to exist.
Operationally, the results presented in this manuscript aim at confirming that the class of kernels currently used in practice already provably covers large, promising families of kernel functions. The question we focus on is what is ultimately possible with efficient EQKs, and not so much what is the best way of evaluating kernel functions, nor how to construct new, essentially different quantum kernels. For instance, given a kernel function and an efficient algorithm to evaluate it, it is not our goal to provide a new evaluation algorithm which is more efficient in some sense. Instead, we are interested in knowing whether there exists another similarly efficient algorithm which fulfills the added structural restriction of being based on an efficient quantum embedding. Whether the embedding-based evaluation algorithm has a slightly higher or lower computational cost is beyond our scope; we only require the embedding-based algorithm to be efficient by itself.
After raising Question 1, we give an answer for the restricted case of shift-invariant kernels. In Corollary 6 we show that, under reasonable assumptions, all shift-invariant kernels admit a memory-efficient approximation as EQKs. Shift-invariant kernels are also widely used in classical ML, so showing that EQKs are still universal for shift-invariant kernels with efficiency considerations is an important milestone.
While shift-invariant kernels enjoy a privileged position in the classical literature, they have not been as instrumental in QML so far. Indeed, many well-known quantum feature maps lead to EQK functions which are not shift-invariant. Also, the milestone work [28] has introduced the so-called projected quantum kernel, which has been used to prove learning separations between classical and quantum ML models. The projected quantum kernel is not shift-invariant. In Proposition 9, we adapt our result for shift-invariant kernels so that it carries over to another class of kernel functions, which we call composition kernels, in which the projected quantum kernel is contained. This way, we present the phenomenon that even kernel functions that did not arise from an explicit quantum feature map can admit an efficient approximation as EQKs.
In all, we have seen that two relevant classes of kernels, namely shift-invariant and composition kernels, always admit an efficient approximation as EQKs. Our results are clear manifestations of the expressive power of EQKs. Nevertheless, many threads remain open to fully characterize the landscape of all efficient quantum kernel functions. We invite researchers to join us in this research line by listing promising approaches as outlook in Section VII.

only this time with different sampled frequencies for the left and right sides. Nevertheless, this would once again allow us to exploit the link between Monte Carlo approximation and finite sample approximation. Sampling (ω_1, ω′_1), . . . , (ω_D, ω′_D) ∼ f, and defining z and z′ to be the left and right probabilistic feature vectors, we recover the familiar formula
A similar construction could be furnished if some notion of sampling from f were available even if f were not a probability distribution. Notice this approximation requires two different feature maps, z and z′. So even if k is guaranteed to be PSD by assumption, we would be using different left and right feature maps, similarly to what we described for non-PSD kernels earlier in this appendix. Nevertheless, it would be interesting to find restricted cases where such an approximation is possible, as they could potentially hint towards further de-quantization protocols for QML models.

Towards diagonalizing smooth and indefinite kernel functions
Earlier in this appendix we showcased Mercer's construction for kernels as explicit inner products. Here we add considerations from the literature [51][52][53] that aim at quantifying the required resources of Mercer's finite-dimensional kernel approximation, as well as generalize it to indefinite (non-PSD) kernels. Drawing parallels to matrices, we refer to "the number of dimensions used to approximate a kernel function as an explicit inner product of feature vectors" as the rank of said kernel approximation. The thesis of Mercer's theorem is that there always exist finite-rank approximations for any kernel function. Yet, the relevant question is whether there always exist low-rank approximations for kernel functions. The following claims indicate there always exist low-rank approximations for any kernel function on one-dimensional input data, assuming smoothness. These could be taken as a starting point for further studying the existence of quantum kernel functions beyond embedding quantum kernels for non-shift-invariant kernels.
Let
then there exist sequences (s_l)_l ⊆ R_{≥0} of non-negative numbers and (U_l(x))_l ⊆ R^X of functions such that:
where (s_l)_l is monotonically decaying and has finite norm (which means its only accumulation point is 0), and (U_l(x))_l form an orthonormal basis of R^X.
The following is known about how quickly the eigenvalues (s l ) l must decay, which confirms the existence of low-rank approximations: then for sufficiently large l we have the asymptotic scaling s , where the "little-o" notation means this is an upper bound that is strictly not tight.
Truncating after the first L terms of this series gives the optimal L-rank approximation to the function in 2-norm.
Low-rank approximation. Having quickly decaying eigenvalues is a sufficient condition to ensure the existence of a good low-rank approximation. For any orthonormal function basis (U_l)_l, each kernel function in L^2 admits an exact series expansion. When truncating the series to finitely many terms, orthonormality implies that the optimal approximation comes from keeping the terms with the largest eigenvalues. Since the sequence of eigenvalues has finite norm, there is a correspondence between the decay speed and the relative magnitude of the largest eigenvalues with respect to the smaller ones. For instance, if the decay is exponential, we know that keeping logarithmically many eigenvalues suffices to ensure a constant approximation error.
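The truncation argument can be illustrated numerically on a discretized kernel. The sketch below assumes a Gaussian kernel on a grid over [−1, 1] as a stand-in for a smooth 1-dimensional kernel, and uses the Frobenius norm of the Gram matrix as the discrete analogue of the 2-norm; the grid size and the helper name `truncate` are illustrative choices.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 200)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)  # assumed smooth 1-d kernel

# Eigendecomposition, sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(K)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

def truncate(L):
    """Rank-L approximation keeping the L largest eigenvalues."""
    return (eigvecs[:, :L] * eigvals[:L]) @ eigvecs[:, :L].T

for L in (1, 2, 4, 8):
    err = np.linalg.norm(K - truncate(L)) / np.linalg.norm(K)
    print(f"rank {L}: relative Frobenius error {err:.2e}")
```

The error of each truncation equals the norm of the discarded eigenvalues, so a fast spectral decay translates directly into a small rank for a fixed target accuracy.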
Proof. We prove this directly by invoking Lemma 3 using the re-normalized vectors r, r′, to get
This completes the proof.

Then, for any ε > 0, the number of bits of precision required to achieve

Proof. For the finite difference second derivative, we take the central version
where ê_i is the i-th basis vector. We drop the subscript referring to the i-th second derivative for ease of notation; the following holds for any i ∈
Following our assumptions, we have |∂^(4) k_d(ξ_+) + ∂^(4)
then the approximation error of the finite difference partial derivative is
Now, since we can represent inputs with up to P bits of precision, we can afford to set the finite difference to be machine precision h = 2^(−P), which results in
Notice that in practice we are interested in second derivatives at the origin ∆ = 0, so we need the bound on the magnitude to hold only in a small neighborhood around ∆ = 0.
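The central finite-difference estimate discussed in this proof can be checked numerically. The sketch below uses the textbook central second difference and the standard error bound L h²/12, which follows from adding the Taylor expansions at ∆ + h and ∆ − h so that the odd-order terms cancel; this bound is the generic one and is not claimed to reproduce the exact constants of Corollary C.1. The 1-dimensional Gaussian test kernel and the names `k`, `second_derivative_fd`, `EXACT`, and `L_BOUND` are illustrative assumptions.

```python
import math

def k(delta: float) -> float:
    """Assumed test kernel: 1-d Gaussian k(delta) = exp(-delta**2 / 2)."""
    return math.exp(-0.5 * delta * delta)

def second_derivative_fd(f, delta: float, h: float) -> float:
    """Central finite-difference estimate of f''(delta)."""
    return (f(delta + h) - 2.0 * f(delta) + f(delta - h)) / (h * h)

EXACT = -1.0   # k''(0) for the Gaussian above
L_BOUND = 3.0  # sup |k''''| for this Gaussian, attained at delta = 0

for P in (2, 3, 4, 5):
    h = 2.0 ** (-P)  # step size set by P bits of precision
    err = abs(second_derivative_fd(k, 0.0, h) - EXACT)
    bound = L_BOUND * h * h / 12.0
    print(f"P={P}  h={h:.5f}  error={err:.3e}  bound={bound:.3e}")
```

Each additional bit of precision halves h and therefore reduces the bound by a factor of four, as long as h stays large enough for floating-point round-off to be negligible.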
Appendix D: Proofs from Section VI

Proposition 9 (Performance guarantee of Algorithm 4). Let f : X → [−B, B]^{g_1(d)} be a pre-processing function, and let k_f be the Gaussian kernel composed with f, as introduced in Eq. (31). Let the parameter of the Gaussian kernel be σ = g_2(d).
If g_1(d) ∈ O(poly(d)) and g_2(d) ∈ Ω(poly(d)^{−1}), then k_f(x, x′) can be ε-approximated efficiently in the number of feature dimensions. In particular, the required dimension of a (probabilistic) feature map D is at most
Proof. In order to reach the claim, we only need to use the random Fourier feature approach of Algorithm 2, but taking f(X) as input domain for the kernel. Indeed, we give the Gaussian kernel with parameter σ = g_2(d) as input to the algorithm, and we obtain the probabilistic feature map
is a probability distribution. One can check that p(ω) > 0 everywhere, and that ∫_{R^D} p(ω) dω = 1. Also, one can compute the variance of the distribution p, which we now do.
Lemma D.1 (Gaussian kernel variance). Let p be the inverse FT of the Gaussian kernel with parameter σ. Let σ_p^2 = E_p[∥ω∥^2] be the variance of the distribution p. It holds that σ_p^2 = d/σ.
Proof. We prove the lemma directly, plugging in the formulas. We need the two identities:
The proof is complete.
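The statement of Lemma D.1 can be sanity-checked by sampling. Since Eq. (31) is not reproduced here, the sketch below assumes the parametrization k(∆) = exp(−∥∆∥²/(2σ)), which is the one under which the spectral distribution is N(0, I/σ) and hence E_p[∥ω∥²] = d/σ; under a different convention for σ the constants would change. All numerical values and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d, sigma, n_samples = 5, 0.5, 200_000

# Assumption: k(delta) = exp(-||delta||^2 / (2 * sigma)), so p = N(0, I / sigma).
omegas = rng.normal(0.0, np.sqrt(1.0 / sigma), size=(n_samples, d))

# Empirical second moment of the sampled frequencies versus d / sigma.
empirical = (omegas ** 2).sum(axis=1).mean()
predicted = d / sigma
print(f"empirical E||w||^2 = {empirical:.3f}, d / sigma = {predicted:.3f}")

# Sanity check that p is the spectral distribution of the assumed kernel.
delta = np.array([0.3, -0.2, 0.1, 0.0, 0.4])
mc_kernel = np.cos(omegas @ delta).mean()
exact_kernel = np.exp(-delta @ delta / (2.0 * sigma))
print(f"Monte Carlo kernel = {mc_kernel:.4f}, exact kernel = {exact_kernel:.4f}")
```

The second check confirms that the sampled frequencies reproduce the assumed kernel through E_p[cos(ω·∆)], which is the property the RFF construction relies on.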

FIG. 4. Conceptual sketch of three constructions to approximate kernel functions as Embedding Quantum Kernels (EQKs). The three parts correspond to different kernel families: (a) General kernels refers to any PSD kernel function, as introduced in Definition 1; (b) Shift-invariant kernels are introduced in Definition 3; and (c) Composition kernels are a new family that we introduce in Section VI. The boxes refer to routines taken either from the existing literature (Mercer, from Corollary A.3 in Appendix A; and RFF, from Theorem 5 originally in Ref. [44]) or introduced in this manuscript (Algorithm 1; QRFF, as Algorithm 3; RFF pp, as Algorithm 4; and QRFF pp, as Algorithm 5). The details for each of the three families are elucidated in the corresponding sections: the universality of EQKs for general kernels is explained in Section IV, with Theorem 1; the universality of efficient EQKs for shift-invariant kernels appears in Section V, studied formally in Corollaries 6 and 8; finally, composition kernels are introduced in Section VI, with Proposition 9 stating their efficient approximation as EQKs, and Proposition 11 confirming that composition kernels contain the so-called projected quantum kernels presented in Refs. [28,45].

Corollary 8 (Polynomial run-time). Under the assumptions of Corollary 6, let p_d be the inverse FT of the kernel k_d. If obtaining samples from p_d can be done in time t ∈ O(poly(d)), then both Algorithms 2 and 3 have total run-time complexity polynomial in d.
Proof. For Algorithm 2, Step 1 is only a mathematical definition; Step 2 runs in time linear in D and polynomial in d by assumption; finally, Step 3 also runs in time linear in D and polynomial in d. Corollary 6 further states that D is at most essentially linear in d. In total, Algorithm 2 has run-time complexity at most polynomial in d.

Appendix C: Precision required for estimating second derivatives with finite difference methods

Corollary C.1 (Finite precision derivative accuracy). Let X_d ⊆ [−R, R]^d be a compact domain, and let X_{d,P} ⊆ X_d be a subset which we can represent with up to P bits of precision. For ε > 0, let k_d be a continuous kernel function such that it can be quantum-efficiently ε-approximated on X_{d,P}. Assume the fourth derivatives of k_d have bounded magnitude
$$\left|\partial_i^{(4)} k_d(\Delta)\right| \leq L(d). \tag{C1}$$
[d]. Start by taking the Lagrange formulation of Taylor's theorem: for h > 0 there exists ξ_+ ∈ [∆, ∆ + h] such that
$$k_d(\Delta + h) = k_d(\Delta) + \partial k_d(\Delta)\, h + \frac{\partial^{2} k_d(\Delta)}{2}\, h^{2} + \frac{\partial^{(3)} k_d(\Delta)}{3!}\, h^{3} + \frac{\partial^{(4)} k_d(\xi_+)}{4!}\, h^{4}. \tag{C4}$$
Next, consider the same expansion for k_d(∆ − h), which will result in a different ξ_− ∈ [∆ − h, ∆], and allows us to use the following trick

σ_p^2 = E_p[∥ω∥^2], we merely need to expand E_p[∥ω∥^2] as a sum, and then substitute the formulas for the moments, to arrive at σ
1, and Lemma 3 shows the relation between the Euclidean inner product of the encoded real vectors and the Hilbert-Schmidt inner product of the encoding quantum states.