Faster variational quantum algorithms with quantum kernel-based surrogate models

We present a new optimization method for small-to-intermediate scale variational algorithms on noisy near-term quantum processors which uses a Gaussian process surrogate model equipped with a classically-evaluated quantum kernel. Variational algorithms are typically optimized using gradient-based approaches; however, these are difficult to implement on current noisy devices, requiring large numbers of objective function evaluations. Our scheme shifts this computational burden onto the classical optimizer component of these hybrid algorithms, greatly reducing the number of queries to the quantum processor. We focus on the variational quantum eigensolver (VQE) algorithm and demonstrate numerically that such surrogate models are particularly well suited to the algorithm's objective function. Next, we apply these models to both noiseless and noisy VQE simulations and show that they exhibit better performance than widely-used classical kernels in terms of final accuracy and convergence speed. Compared to the typically-used stochastic gradient-descent approach for VQAs, our quantum kernel-based approach is found to consistently achieve significantly higher accuracy while requiring less than an order of magnitude fewer quantum circuit evaluations. We analyse the performance of the quantum kernel-based models in terms of the kernels' induced feature spaces and explicitly construct their feature maps. Finally, we describe a scheme for approximating the best-performing quantum kernel using a classically-efficient tensor network representation of its input state, and so provide a pathway for scaling these methods to larger systems.


I. INTRODUCTION
Quantum computation (QC) promises an alternative computational framework for solving an array of classically intractable problems. Developments in quantum error correction and the engineering of low-noise qubits have allowed recent experiments to push into the domain of fault-tolerant single- and two-qubit operation [1][2][3][4][5]. These results present us with a roadmap of how scalable fault-tolerant universal quantum computation may be achieved; however, the exact timescale for this remains uncertain [6][7][8]. Until the threshold for fault tolerance is reached, the kinds of algorithms that can be run on-device with appreciable fidelity are limited to circuits of low depth on relatively few qubits.
These noisy intermediate-scale quantum (NISQ [6]) algorithms provide us with a useful set of benchmarking tools for quantifying the performance of prototype quantum computers while being an interesting field of study in their own right [9]. They include algorithms specifically designed to showcase near-term forms of quantum advantage [10][11][12]. While experimental demonstrations of these "quantum supremacy" tasks are an important step forward in demonstrating the potential of quantum computers, they are not designed to solve useful problems and the extent of their advantage over classical methods is an area of ongoing study [13][14][15]. More modest proposals for interesting NISQ applications include variational quantum algorithms (VQAs) and the often-related field of quantum machine learning (QML) [9,16,17].
QML attempts to generalize the methods and results of classical machine learning to a quantum setting. Quantum kernel methods are a promising NISQ-friendly sub-field of QML focused on the generalization of classical kernel-based algorithms. Generally speaking, kernel methods are a broad class of machine learning techniques which use kernel functions to quantify the similarity between data [18,19]. A kernel function is a positive (semi-)definite function which quantifies the similarity between its two inputs; it is equivalent to an inner product between these inputs mapped into a higher-dimensional feature space. Kernel methods can be applied both to supervised learning tasks and to surrogate model-based optimization; quantum analogues of classical kernel methods have primarily been applied to the former, particularly classification tasks [20][21][22][23], while applications to the latter have been largely unstudied in the literature [24].
VQAs are hybrid quantum-classical algorithms which couple the limited computational capabilities of NISQ devices with classical computational resources in an attempt to solve useful tasks [16]. Typically, a quantum state is prepared with a parameterized ansatz circuit. The parameters of the ansatz are adjusted by a classical optimizer to minimize a cost function calculated from measurements of the state. The ease of implementing low-depth ansatzes and VQAs' resilience against certain types of coherent noise [25] make these algorithms a popular choice for demonstrations on near-term devices [26][27][28].
While it has been suggested that scaled-up VQAs could provide a route to quantum advantage in the near term [16,29,30], large hurdles exist to achieving this. In particular, barren plateaus can emerge in the optimization landscape due to device noise [31] or the use of overly expressive many-qubit ansatz circuits [32][33][34]. These make the optimization in VQAs exponentially expensive, preventing them from being scaled to large system sizes. Any direct quantum advantage from large-scale VQE would likely require highly problem-specific ansatzes to avoid these barren plateaus, and designing such ansatz circuits is an area of ongoing research [35,36]. Despite these hurdles to scalability, VQAs have emerged as a popular tool for benchmarking current noisy devices.
There are several key factors to consider when implementing a VQA on a NISQ device. The ansatz circuit must be capable of representing a state which solves the task at hand (to a good approximation) but should not be so expressive that training becomes infeasible. Device noise must also be considered when choosing the ansatz's depth but is typically difficult to fully characterize. Large amounts of noise can deform a VQA's cost function to the extent that the obtained solution does not solve the original task while compounding trainability issues [31]. This imposes limitations on the depth of ansatz circuits which can be used and has prompted the development of a plethora of noise mitigation strategies [37][38][39][40][41][42]. Finally, a classical optimization strategy must be chosen which makes the best use of available data while taking into consideration factors such as noise and its effect on sampling costs.
In this work we combine concepts from both VQAs and quantum kernel theory to demonstrate how a surrogate model based on classically-evaluated quantum kernels allows noisy VQAs to be solved quickly and with high accuracy. We focus on the variational quantum eigensolver (VQE) algorithm, in which the cost function is the energy expectation value of the ansatz's output state with respect to a Hamiltonian of interest. Our Bayesian strategy only uses quantum circuit executions to query the energy expectation value for ansatz parameters, and not the gradients of this quantity. These points of interest are chosen through classical optimization of a surrogate model built from the observed energy values. By avoiding on-device estimation of energy gradients and employing an easily-optimized surrogate model, our strategy requires fewer quantum circuit executions per point queried than gradient-based approaches while also converging in far fewer iterations, making it much more sample-efficient in terms of total quantum circuit executions. Being a global optimization strategy, it is also resilient against local minima. Because our surrogate model is built with a classically-evaluated quantum kernel, which itself is based on the ansatz circuit, our strategy is able to effectively leverage both explicit knowledge about the ansatz circuit and implicit information about the on-device noise processes contained in the observed energy values.
As the optimization strategy outlined here uses direct classical simulations of quantum kernel functions, its applicability is limited to relatively narrow and shallow ansatz circuits (up to ∼ 20 qubits) for which classical simulations are feasible. Even at these scales, performing VQE on-device in order to physically prepare an approximation to a system's ground state is still a highly non-trivial problem due to device noise and the long queues often required for device access. To extend our strategy's applicability, we also outline a scheme to produce a classically tractable approximation to the most suitable quantum kernel for VQE and so provide a framework for applying the optimization strategy to larger systems.

II. VQE ON NISQ DEVICES
The VQE algorithm is a near-term quantum algorithm that prepares approximations of the ground state and ground-state energy of a Hamiltonian and has a wide range of potential applications [26,43]. This is done by adjusting the parameters of an ansatz circuit to minimize the measured energy of its output state. For a Hamiltonian H and an ansatz state |ψ(θ)⟩ = U(θ)|0⟩, or ρ(θ) = |ψ(θ)⟩⟨ψ(θ)|, which is produced by a parameterized unitary evolution U(θ) with a vector of parameters θ, the goal of VQE is to find θ_opt such that

E(θ_opt) = Tr{Hρ(θ_opt)} = min_θ Tr{Hρ(θ)},    (1)

where Tr is the trace operation. By the variational principle, the ansatz state that minimizes the energy is an approximation to the ground state and its energy expectation value is an upper bound to the ground state energy.

The presence of noise on current quantum processors provides a justification for why on-device small-system VQE is still an interesting and important task. On-device VQE gives not only an approximation to the desired ground state energy but also an ansatz circuit that prepares a close approximation to the ground state on the noisy device. Obtaining such circuits with classical methods would first require accurate characterisation of the exact noise processes present during the device's operation, i.e. quantum process tomography, which for more than a few qubits is an incredibly resource-intensive task [44]. Being able to physically prepare a close approximation to the ground state of a Hamiltonian is a useful first step for other quantum computing tasks and for benchmarking the device [45,46].

Gradient descent-based optimization strategies are ubiquitous in noiseless functional minimization due to their favourable convergence guarantees and are frequently used in VQAs [16,35,[47][48][49][50][51]]. The gradient of the energy function with respect to each ansatz parameter can be estimated directly either through finite-difference approaches or using parameter shift rules [52].
Both approaches typically require sampling the cost function at two points per parameter (or more if a parameter appears in multiple gates), meaning these parameter-wise gradients are relatively quantum-resource-intensive. Current devices are sufficiently noisy that only relatively shallow circuits on few qubits produce outputs not dominated by noise. At these scales ansatz-derived barren plateaus are less of a concern; however, noise-induced flattening of the cost function can still increase the number of samples needed to make accurate gradient estimates [31].
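As a concrete illustration of the parameter-shift rule (a toy model, not the circuit used in this work), consider a one-qubit ansatz |ψ(θ)⟩ = RY(θ)|0⟩ measured against H = Z, for which E(θ) = cos θ. For a gate generated by a Pauli operator, the gradient is given exactly by two shifted energy evaluations:

```python
import numpy as np

# Exact statevector simulation of a one-qubit ansatz |psi(theta)> = RY(theta)|0>
# measured against H = Z, so E(theta) = cos(theta).  Illustrative toy model only.
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def ry(theta):
    return np.array([[np.cos(theta / 2), -np.sin(theta / 2)],
                     [np.sin(theta / 2),  np.cos(theta / 2)]], dtype=complex)

def energy(theta):
    psi = ry(theta) @ np.array([1, 0], dtype=complex)
    return np.real(psi.conj() @ Z @ psi)

def parameter_shift_grad(theta):
    # For a gate e^{-i theta P/2} with Pauli generator P, the exact gradient is
    # dE/dtheta = [E(theta + pi/2) - E(theta - pi/2)] / 2.
    return (energy(theta + np.pi / 2) - energy(theta - np.pi / 2)) / 2

theta = 0.7
print(parameter_shift_grad(theta), -np.sin(theta))  # the two values agree
```

Note that even in this one-parameter example the gradient costs two energy evaluations; for p parameters the cost grows to 2p evaluations per optimization step, which is the sampling burden discussed above.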
To avoid the large cost of estimating noisy parameter-wise gradients, more sample-efficient and noise-resilient schemes such as simultaneous perturbation stochastic approximation (SPSA) have been employed in VQAs [26,47,53], at the cost of slower convergence. SPSA allows approximate gradient descent to be performed more efficiently on noisy objective functions because its update rule requires evaluations at just two points, regardless of how many parameters are being optimized. However, SPSA is often slow to converge and can become stuck in local minima, requiring multiple restarts; this means a large number of energy evaluations are needed to solve even relatively small VQE problems. Other optimization strategies avoid gradients altogether and include heuristic schemes [43,54,55] and surrogate model-based approaches [28,54].
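The two-evaluation SPSA update rule can be sketched as follows. This is a generic textbook implementation on a toy noisy objective; the gain-sequence constants are illustrative choices, not values from this work:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star = np.array([0.5, -1.0, 2.0])

def noisy_objective(theta):
    # Toy stand-in for a sampled VQE energy: a quadratic bowl plus shot noise.
    return np.sum((theta - theta_star) ** 2) + 0.01 * rng.normal()

def spsa_minimize(f, theta0, iterations=2000, a=0.1, c=0.1,
                  alpha=0.602, gamma=0.101):
    # Standard SPSA gain sequences; note only TWO objective evaluations per
    # iteration, regardless of the number of parameters.
    theta = theta0.copy()
    for k in range(iterations):
        ak = a / (k + 1) ** alpha
        ck = c / (k + 1) ** gamma
        delta = rng.choice([-1.0, 1.0], size=theta.size)  # Rademacher perturbation
        g_hat = (f(theta + ck * delta) - f(theta - ck * delta)) / (2 * ck) * delta
        theta -= ak * g_hat
    return theta

theta_opt = spsa_minimize(noisy_objective, np.zeros(3))
print(theta_opt)  # close to theta_star = [0.5, -1.0, 2.0]
```

The trade-off discussed above is visible here: each iteration is cheap, but thousands of iterations (and hence objective evaluations) are needed even for this three-parameter toy problem.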

III. SURROGATE MODELS AND QUANTUM KERNEL FUNCTIONS
An alternative approach to gradient descent-based VQE is to build a surrogate model for the energy function based on previously observed evaluations. This surrogate should, ideally, be differentiable such that it can be optimized more effectively than Ẽ(θ), the noise-corrupted energy function. This approach forms the core of Bayesian optimization (BO) [56], in which a surrogate model is trained on observed data and then optimized to choose the most promising location to query the cost function next. Previous work has shown how Bayesian VQE combined with a novel information-sharing approach can drastically improve the convergence times of a set of related VQE problems [28].
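The BO loop just described (fit a surrogate to observed data, optimize an acquisition function, query the cost function there, repeat) can be sketched on a toy 1-D objective. The RBF kernel, lower-confidence-bound acquisition, and all constants below are illustrative assumptions, not the choices made in this work:

```python
import numpy as np

rng = np.random.default_rng(1)

def objective(x):
    # Toy 1-D stand-in for a noisy VQE energy landscape.
    return np.sin(3 * x) + 0.5 * np.cos(7 * x) + 0.01 * rng.normal()

def rbf_kernel(a, b, length=0.3):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length**2)

def gp_posterior(x_train, y_train, x_query, noise=1e-4):
    # Standard GP posterior mean and variance.
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_star = rbf_kernel(x_query, x_train)
    mean = k_star @ np.linalg.solve(K, y_train)
    var = 1.0 - np.einsum('ij,ji->i', k_star, np.linalg.solve(K, k_star.T))
    return mean, np.maximum(var, 0.0)

# BO loop: fit the surrogate, minimize a lower confidence bound, query there.
x_obs = rng.uniform(-np.pi, np.pi, size=5)
y_obs = np.array([objective(x) for x in x_obs])
grid = np.linspace(-np.pi, np.pi, 400)
for _ in range(20):
    mean, var = gp_posterior(x_obs, y_obs, grid)
    lcb = mean - 2.0 * np.sqrt(var)          # acquisition: exploit + explore
    x_next = grid[np.argmin(lcb)]
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))

print(x_obs[np.argmin(y_obs)], y_obs.min())
```

Only the (cheap) surrogate is optimized at each step; the (expensive) objective is queried once per iteration, which is the source of the sample savings discussed in this work.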
Typically, a Gaussian process (GP) surrogate model is used for BO and is primarily defined in terms of a kernel function. A kernel function, k ∶ X × X → R, is a symmetric positive (semi-)definite function of two arguments in an input space X such that k(x, x′) quantifies the similarity between x and x′. Kernels can also be understood as an inner product in a reproducing kernel Hilbert space (RKHS) such that k(x, x′) = φ^T(x′)φ(x), where φ ∶ X → F is a feature map which takes a point x in the input space to a point φ(x) in the feature space, F, induced by the kernel (a vector in the RKHS) [18]. We give an overview of Gaussian process surrogates and Bayesian optimization in sections VIII A and VIII B. For a GP model to make accurate predictions about a function's value at an unseen point, its kernel function should properly quantify the similarity between this point and the points where the function value has already been observed. Ideally, the objective function being modelled should be a linear function in the feature space induced by the kernel. If this is true, the representer theorem [57] from classical kernel theory shows that the minimizer of any regularized empirical risk functional (the model that is "optimal" at describing a set of training data) is given by a weighted sum of kernel evaluations with the points in the training data.
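The kernel-as-feature-space-inner-product identity k(x, x′) = φ^T(x′)φ(x) can be checked numerically for a kernel with a small explicit feature map; for example (our illustration, not from this work) the polynomial kernel (x ⋅ x′ + 1)² on R²:

```python
import numpy as np

# For k(x, x') = (x . x' + 1)^2 on R^2 an explicit feature map is
# phi(x) = (x1^2, x2^2, sqrt(2) x1 x2, sqrt(2) x1, sqrt(2) x2, 1),
# and k(x, x') = phi(x').T phi(x) holds exactly.
def poly_kernel(x, xp):
    return (x @ xp + 1.0) ** 2

def phi(x):
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

x = np.array([0.3, -1.2])
xp = np.array([1.1, 0.4])
print(poly_kernel(x, xp), phi(xp) @ phi(x))  # identical values
```

The state and unitary kernels constructed in section III A have exactly this structure, with v(θ) playing the role of the (transformed) feature vector.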
In general the exact functional form of the cost function is unknown, making it difficult to know which kernel function to use. To combat this, flexible kernel functions equipped with hyperparameters are often used. Many of these have the universal approximation property [58], meaning an arbitrary well-behaved function at a point x * can be approximated by a finite weighted sum of kernel evaluations between x * and an appropriately sized "training set" of other points. To ensure that a GP model equipped with one of these flexible, agnostic kernels yields good predictive accuracy from the available training data the kernel's hyperparameters are typically chosen by maximising the marginal likelihood of the training data. Optimization of a kernel's hyperparameters is often the most computationally expensive stage of fitting a GP model to observed data (see section VIII A). To date, surrogate model-based approaches to VQAs have made use of these general-purpose hyperparameterized classical kernel functions [28,59], leaving open the question of whether more appropriate kernel functions exist for VQAs.
Schuld [60] has demonstrated that a very general class of quantum models are linear models in the feature space induced by a quantum kernel. A quantum model is described as a quantum circuit consisting of an encoding unitary followed by a (parameterized) measurement. If a VQA's cost function has a linear dependence on the measurement outcomes of a quantum circuit then the results in [60] suggest that a quantum kernel based on the same circuit could be used to produce a powerful surrogate model for the VQA's cost function.

A. Quantum kernel functions
The two quantum kernels that we will consider in this work are the "state kernel", which is equivalent to the fidelity between quantum states, and the "unitary kernel", which quantifies the similarity between two unitary operations in terms of their Hilbert-Schmidt inner product. For an ansatz circuit |ψ(θ)⟩ = U(θ)|0⟩, with an n-qubit input state |0⟩ = |0⟩^⊗n, the state kernel evaluated between gate angles θ and θ′ is defined as:

k_s(θ, θ′) ∶= |⟨ψ(θ′)|ψ(θ)⟩|²,    (2)

or equivalently by k_s(θ, θ′) ∶= Tr{ρ(θ′)ρ(θ)} (assuming a pure quantum state). As we discuss in section III B and explicitly show in appendix A, the energy function is linear in the feature space induced by the state kernel provided this kernel is based on the same circuit used to calculate E(θ). As a result the state kernel is expected to be effective when building general surrogate models for E(θ).
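As a sketch of how the state kernel can be evaluated classically, the following computes k_s(θ, θ′) = |⟨ψ(θ′)|ψ(θ)⟩|² by exact statevector simulation of a toy two-qubit RY + CX ansatz (a small stand-in for the circuit used in this work):

```python
import numpy as np

# Classical statevector evaluation of the state (fidelity) kernel for a toy
# two-qubit ansatz: RY rotations on each qubit followed by a CX gate.
CX = np.array([[1, 0, 0, 0], [0, 1, 0, 0],
               [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex)

def ry(t):
    return np.array([[np.cos(t / 2), -np.sin(t / 2)],
                     [np.sin(t / 2),  np.cos(t / 2)]], dtype=complex)

def ansatz_state(theta):
    psi0 = np.zeros(4, dtype=complex)
    psi0[0] = 1.0  # |00>
    return CX @ np.kron(ry(theta[0]), ry(theta[1])) @ psi0

def state_kernel(theta, theta_p):
    # k_s(theta, theta') = |<psi(theta')|psi(theta)>|^2
    return np.abs(np.vdot(ansatz_state(theta_p), ansatz_state(theta))) ** 2

t1, t2 = np.array([0.2, 1.3]), np.array([-0.7, 0.4])
print(state_kernel(t1, t1))  # 1.0: a state has unit fidelity with itself
print(state_kernel(t1, t2))
```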
The unitary kernel also induces a feature space in which E(θ) is a linear function. For an ansatz imparting a d × d-dimensional unitary U(θ) the unitary kernel is defined as

k_u(θ, θ′) ∶= |Tr{U†(θ′)U(θ)}|²/d²,    (3)

which is the (normalized) modulus-squared Hilbert-Schmidt inner product between the unitaries U(θ) and U(θ′). We show in appendices B and C that while E(θ) is linear in the feature space induced by this kernel, this space typically has a much larger dimension than that of the state kernel and so more data is required to build accurate regression models using it.
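The unitary kernel can be evaluated classically in the same way; here for the same toy two-qubit RY + CX ansatz (d = 4), again as an illustration rather than the circuit from this work:

```python
import numpy as np

# Classical evaluation of the unitary kernel
# k_u(theta, theta') = |Tr{U(theta')^dag U(theta)}|^2 / d^2.
CX = np.array([[1, 0, 0, 0], [0, 1, 0, 0],
               [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex)

def ry(t):
    return np.array([[np.cos(t / 2), -np.sin(t / 2)],
                     [np.sin(t / 2),  np.cos(t / 2)]], dtype=complex)

def ansatz_unitary(theta):
    return CX @ np.kron(ry(theta[0]), ry(theta[1]))

def unitary_kernel(theta, theta_p):
    U, Up = ansatz_unitary(theta), ansatz_unitary(theta_p)
    d = U.shape[0]
    return np.abs(np.trace(Up.conj().T @ U)) ** 2 / d**2

t1, t2 = np.array([0.2, 1.3]), np.array([-0.7, 0.4])
print(unitary_kernel(t1, t1))  # 1.0 for identical unitaries
print(unitary_kernel(t1, t2))
```

Unlike the state kernel, this quantity compares the full unitaries rather than their action on the fixed input |0⟩, which is the intuition behind its larger induced feature space.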
We will now explicitly construct the feature maps of these two kernels for an ansatz circuit |ψ(θ)⟩ = U(θ)|0⟩. Note that while a kernel function and corresponding reproducing kernel Hilbert space are unique, the feature map is not unique and is only defined up to an isometry (which preserves the inner product and thereby the kernel). We consider n-qubit p-parameter ansatzes formed of an initial state |0⟩ = |0⟩^⊗n acted upon by p parameterized Pauli rotations, with n-qubit Pauli operators {P_1, …, P_p}, which are interleaved with p + 1 arbitrary fixed unitaries {R_1, …, R_{p+1}}. The corresponding unitary U(θ) is given by:

U(θ) = R_{p+1} e^{−iθ_p P_p/2} R_p ⋯ R_2 e^{−iθ_1 P_1/2} R_1.    (4)

"Hardware efficient" ansatzes [26] admit this description and are formed of individual parameterized Pauli rotations (PPRs) with low-weight Pauli operators; these ansatzes are common in NISQ quantum computing as they often map efficiently onto a device's native gate set [16]. Ansatzes of the form given by (4) are extremely flexible, being able to implement arbitrary parameterized unitaries provided one uses suitably many PPRs, a sufficiently sophisticated encoding of the angles θ, and a careful choice of the {R_i} (although these may require exponentially deep circuits [61]). We show in appendix A that the state kernel's feature space can be constructed in terms of an ansatz-dependent 3^p-element vector of operators on H ⊗ H* (where H is the qubits' Hilbert space), s, given by

s = [A_p^{(0)}, A_p^{(1)}, A_p^{(2)}]^T (R_p ⊗ R_p*) ⊗_K ⋯ ⊗_K [A_1^{(0)}, A_1^{(1)}, A_1^{(2)}]^T (R_1 ⊗ R_1*),    (5)

where the operators A_j^{(0)}, A_j^{(1)}, A_j^{(2)} arise from the decomposition of each vectorized PPR into linearly independent Fourier components,

e^{−iθ_j P_j/2} ⊗ (e^{−iθ_j P_j/2})* = A_j^{(0)} + cos(θ_j) A_j^{(1)} + sin(θ_j) A_j^{(2)},

with A_j^{(0)} = (I ⊗ I* + P_j ⊗ P_j*)/2, A_j^{(1)} = (I ⊗ I* − P_j ⊗ P_j*)/2, and A_j^{(2)} = (i/2)(I ⊗ P_j* − P_j ⊗ I*), and an angle-dependent vector v(θ), given by

v(θ) = (1, cos θ_p, sin θ_p)^T ⊗_K ⋯ ⊗_K (1, cos θ_1, sin θ_1)^T.    (6)

Here we have made a distinction between the tensor products ⊗ between the physical and conjugated Hilbert spaces of the qubits and Kronecker products ⊗_K which are used to construct the kernels' feature maps. The ⊗_K act as tensor products between the sub-vectors in s and v, imposing matrix multiplication in the former case and ordinary multiplication in the latter. Note that the elements of v(θ) are linearly independent Fourier components and so cannot be decomposed further.
The state kernel is then

k_s(θ, θ′) = v^T(θ′) S v(θ),    (7)

where (S)_ij = ⟪ρ_0|s_i† s_j|ρ_0⟫ is a Hermitian matrix of inner products between the (unnormalized) states s_i|ρ_0⟫ and s_j|ρ_0⟫ (s_i and s_j are the i-th and j-th components of s, respectively), ρ_0 = |0⟩⟨0|, and |X⟫ denotes the vectorization of X so that |ρ_0⟫ = |0⟩ ⊗ |0*⟩ [62]. As S is a Hermitian matrix of inner products it is a positive-semidefinite Gram matrix and admits a Cholesky decomposition S = Q†Q (where Q is an upper-triangular matrix). This means that the state kernel can be written as the inner product of two 3^p-element vectors,

k_s(θ, θ′) = (Qv(θ′))†(Qv(θ)),    (8)

identifying the feature map φ_s(θ) = Qv(θ). The construction of the feature map for the unitary kernel is similar and involves the same two vectors v(θ) (6) and s (5). The unitary kernel is given by

k_u(θ, θ′) = v^T(θ′) T v(θ),    (9)

where (T)_ij = Tr{s_i† s_j}/4^n is a Hermitian Gram matrix of Hilbert-Schmidt inner products between the operators s_i and s_j. It too is positive-semidefinite, allowing it to be written in Cholesky form T = B†B (with B upper-triangular), meaning that the unitary kernel is

k_u(θ, θ′) = (Bv(θ′))†(Bv(θ)),    (10)

with feature map φ_u(θ) = Bv(θ). So far we have assumed that all the gate angles used in the PPRs are independent. In more sophisticated ansatzes whose gate angles are linear combinations of the ansatz parameters, one can apply linear transformations to v(θ) such that it contains a set of independent Fourier components and so apply a similar analysis. Provided the couplings between gate parameters are linear (for example if multiple parameterized gates share the same parameter), this will act to reduce the dimension of the induced feature space.

B. VQE in a kernel setting
For Bayesian VQE to be effective, the kernel used must result in a GP model which can well-approximate Ẽ(θ), the noisy energy function. An interesting question this poses is whether quantum kernels are better suited to this task than typical classical kernels. To justify that these kernels are well suited, we can analyse the noiseless energy landscape E(θ) for a Hamiltonian H and an ansatz unitary of the form given in (4). In appendix A we show that this is

E(θ) = Σ_{i=1}^{3^p} h_i v_i(θ) = h^T v(θ),    (11)

where h_i = ⟪R_{p+1}HR†_{p+1}|s_i|ρ_0⟫ and s is given in (5). Encouragingly, we see that the angle-dependent part of the quantum kernels' feature spaces, v(θ), makes an appearance, mirroring the decomposition of the energy function in [51] in terms of a weighted sum of Fourier components. For a suitable set of energy observations, e.g. at the points {(θ_1, ⋯, θ_p) ∈ {0, π/2, π}^×p} or {(θ_1, ⋯, θ_p) ∈ {−2π/3, 0, 2π/3}^×p}, it is straightforward to see how the components of h (the weights of the Fourier components in E(θ)) can be directly inferred (the latter set of points giving exactly the result in [51]). As the feature spaces of both the state and unitary kernels are related to v(θ) by linear transformations, the energy is a linear function in both these spaces. More explicitly, E(θ) = h^T Q^{−1} φ_s(θ) = h^T B^{−1} φ_u(θ), where φ_s(θ) = Qv(θ) and φ_u(θ) = Bv(θ) are the feature maps constructed in (8) and (10), respectively (Q and B are assumed to be invertible).
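The grid-evaluation idea can be made concrete for a single-PPR toy model (our illustration; the Hamiltonian and ansatz here are not those of this work). With one parameter, E(θ) = h_0 + h_1 cos θ + h_2 sin θ, so three evaluations at {0, π/2, π} determine h exactly:

```python
import numpy as np

# Toy model: |psi(theta)> = RY(theta)|0>, H = Z + 0.5 X, for which
# E(theta) = cos(theta) + 0.5 sin(theta), i.e. a degree-1 Fourier series.
H = np.array([[1, 0.5], [0.5, -1]], dtype=complex)

def energy(theta):
    psi = np.array([np.cos(theta / 2), np.sin(theta / 2)], dtype=complex)
    return np.real(psi.conj() @ H @ psi)

# Three evaluations at {0, pi/2, pi} fix the Fourier weights h = (h0, h1, h2):
# E(0) = h0 + h1, E(pi/2) = h0 + h2, E(pi) = h0 - h1.
e0, e1, e2 = energy(0.0), energy(np.pi / 2), energy(np.pi)
h0 = (e0 + e2) / 2
h1 = (e0 - e2) / 2
h2 = e1 - h0

def model(theta):
    return h0 + h1 * np.cos(theta) + h2 * np.sin(theta)

theta = 2.1
print(model(theta), energy(theta))  # identical: the model is exact everywhere
```

For p parameters the same logic applies component-wise over the 3^p-point grids quoted above, which is why the energy is fully determined by finitely many evaluations.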
Given a set of training data (pairs of ansatz parameters and corresponding noiseless energy evaluations), the representer theorem [57] guarantees that the minimizer of any regularized empirical loss functional is given by a finite weighted sum of kernel evaluations at the training points (the ansatz parameters whose energy values have been observed). For an l2-regularized least-squares loss the minimizer is kernel ridge regression [63], which is mathematically equivalent to the posterior mean of a Gaussian process (21). Because these kernels have finite-dimensional feature spaces, kernel ridge regression or GP models using them should be able to achieve perfect global prediction accuracy (i.e. perfectly predict E(θ) for all θ) if the size of the training data set is greater than or equal to the feature space dimension (assuming the data is linearly independent in the feature space).
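This finite-dimensional perfect-prediction property can be checked on a toy model (our illustration, not the paper's simulations): a kernel built from the explicit three-dimensional feature space spanned by {1, cos θ, sin θ} fits any energy function that is linear in that space essentially exactly once N_t ≥ 3:

```python
import numpy as np

rng = np.random.default_rng(2)

# Kernel whose feature space is span{1, cos(theta), sin(theta)} (dimension 3).
def v(theta):
    return np.array([1.0, np.cos(theta), np.sin(theta)])

def kernel(a, b):
    return np.array([[v(x) @ v(y) for y in b] for x in a])

# A target that is linear in that feature space (toy single-RY energy).
def energy(theta):
    return np.cos(theta) + 0.5 * np.sin(theta)

# Kernel ridge regression with N_t = 5 >= 3 training points.
theta_train = rng.uniform(-np.pi, np.pi, size=5)
y_train = energy(theta_train)
K = kernel(theta_train, theta_train) + 1e-8 * np.eye(5)  # small regularizer
alpha = np.linalg.solve(K, y_train)

theta_test = rng.uniform(-np.pi, np.pi, size=50)
pred = kernel(theta_test, theta_train) @ alpha
print(np.max(np.abs(pred - energy(theta_test))))  # ~0 up to the regularizer
```

Here the prediction is a weighted sum of kernel evaluations against the training points, exactly as the representer theorem prescribes.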
We note that linearity of E(θ) in these kernels' induced feature spaces is achieved without any hyperparameters. This removes the need for any maximum-likelihood hyperparameter optimization. It also means that the state or unitary kernel between any two observed points (used to construct the Gram matrix K) only needs to be evaluated a single time.

Figure 1. The ansatz circuit used in this work (circuit diagram omitted). This non-periodic ansatz yields a real wavefunction as its output state. The single-qubit gate RY(θ_j) imparts a unitary RY(θ_j) = e^{−iσ_y θ_j/2}. As the ansatz has 4 layers of CX gates on alternating pairs of adjacent qubits we describe this circuit as being depth 4.

C. Gaussian processes with quantum kernels
While both quantum kernels look promising for VQE, one might suspect the state kernel to be more suitable as VQE directly concerns the energy of the state produced by the ansatz, rather than its unitary.
To evaluate the effectiveness of these quantum kernels versus typical classical kernels in kernel-based VQE regression, we first consider the problem of building a GP model for observations of the noiseless E(θ). The Hamiltonian we use throughout is the anti-ferromagnetic 1D transverse-field Ising model with a longitudinal field, given by:

H = J Σ_{⟨i,j⟩} Z_i Z_j + h_x Σ_i X_i + h_z Σ_i Z_i,    (12)

where J is a coupling strength, h_x is the transverse field strength, h_z is the longitudinal field strength, X_i and Z_i are the Pauli X and Z operators acting respectively on the i-th qubit, Z_i Z_j denotes simultaneous Pauli Z operators on the i-th and j-th qubits, and the first sum is taken over nearest-neighbour pairs in a 1D spin chain with periodic boundary conditions. Throughout we use J = h_z = 0.5 and h_x = −0.5. As this is a real-valued Hamiltonian, its eigenstates can also be taken to be real. This means that for VQE we can limit our ansatz circuit to only yielding real wavefunctions. The ansatz we use is shown in Fig. 1 and is designed to be efficiently implementable on a quantum processor with nearest-neighbour connectivity between qubits (which helpfully mirrors the nearest-neighbour terms in the Hamiltonian). It consists of parameterized rotations RY(θ_j) = e^{−iσ_y θ_j/2} and fixed controlled-X gates to build up entanglement.

Let G[µ(⋅), k(⋅, ⋅, α)] be a Gaussian process model with covariance/kernel function k, kernel hyperparameters α, and mean function µ. Given sets of observed training and validation energies, y_t = {E(θ_t1), …, E(θ_tN_t)} and y_v = {E(θ_v1), …, E(θ_vN_v)}, evaluated at parameter sets X_t and X_v, we define the validation score

R²_v = 1 − [Σ_i (E(θ_vi) − ŷ_G(θ_vi))²] / [Σ_i (E(θ_vi) − E[y_v])²],    (13)

where ŷ_G(θ) = E[y*|θ, X_t, y_t] is the prediction of model G at point θ trained with data (X_t, y_t) (which, if appropriate, includes hyperparameter optimization). A validation score of R²_v = 0 indicates that the GP model is performing as well (with respect to the L2 loss) as a model that always predicts E[y_v], while a score of R²_v = 1 shows perfect prediction across the unseen validation set.

Fig. 2 shows the improvement in validation score as the size of the training data is increased for GP regression models equipped with different quantum and typical classical kernels. The models are presented with noiseless evaluations of E(θ) for the Hamiltonian given in (12) (J = h_z = 0.5 and h_x = −0.5). The training and validation sets (X_t and X_v) are drawn uniformly at random from the interval [−π, π] (for each of the 16 parameters). Both the median R²_v over 100 repeats (solid line) and the inter-quartile range of the data are shown. The quantum kernel and energy evaluations were performed noiselessly on a classical computer. In all cases, a small diagonal offset was added to the Gram matrices used in GP model prediction, K → K + 10^{−10}I (equivalent to σ²_n = 10^{−10} in (21)), to ensure numerical stability.

We see that a Gaussian process model equipped with the state kernel greatly outperforms the other kernels and reaches the optimal validation score of R²_v = 1 once the size of the training set reaches N_t = 136. The validation score of the unitary kernel improves much more slowly, and for N_t = 136 achieves an R²_v < 0.1. However, this is still much higher than the scores for the various classical kernels, which perform poorly at this task. These classical kernels yield validation scores R²_v < 0 and show negligible signs of improvement as the training data set grows.

Figure 2. Validation score R²_v for GP regression models with different kernels learning the energy landscape E(θ) (4-qubit, 16-parameter global GP regression) for the ansatz given in Fig. 1 and Hamiltonian in (12) with J = h_z = 0.5 and h_x = −0.5. The horizontal axis is the size of the training data set N_t; the vertical axis is the validation score (defined in (13)). Shown are the median validation scores over 100 repeats (solid lines/dots) and the inter-quartile range of this data (filled). The training and validation points (X_t and X_v) are chosen uniformly at random (for each of the 16 parameters) in the range [−π, π] and the corresponding energies y_t and y_v are evaluated noiselessly on a classical computer. Maximum-likelihood hyperparameter optimization with respect to the training data was used to optimize any kernel hyperparameters.
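The validation score used above is the standard coefficient of determination; a minimal implementation makes its two reference points (R²_v = 1 for perfect prediction, R²_v = 0 for the constant E[y_v] predictor) explicit:

```python
import numpy as np

# Validation score (13): one minus the model's summed squared error on the
# validation set, normalized by that of the constant mean-of-y_v predictor.
def validation_score(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y_v = np.array([0.1, -0.4, 0.8, 0.3])
print(validation_score(y_v, y_v))                     # 1.0: perfect prediction
print(validation_score(y_v, np.full(4, y_v.mean())))  # 0.0: mean predictor
```

A model worse than the constant mean predictor gives R²_v < 0, which is what the classical kernels exhibit in the global regression task above.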

It is clear, and perhaps unsurprising, that a state kernel built from the same circuit used for a VQE problem provides a much more accurate similarity measure between data points than the other kernels we consider. This is because the energy function is linear in the finite-dimensional feature space induced by the state kernel (as discussed in section III B). While E(θ) is also linear in the unitary kernel's feature space, we show in section VI that this kernel's induced feature space is typically quadratically larger than that of the state kernel. To build a globally accurate GP model, the observations must span a significant portion of this space, which explains why k_s vastly outperforms k_u at this task. The classical kernels we use are universal but do not appear particularly well suited to the complicated, periodic, and highly oscillatory E(θ). As this is a 16-parameter problem, the density of a couple of hundred observations in the input parameter space is far lower than what is required for these classical kernels to make good use of their universality.
For Bayesian optimization we are not always interested in creating an accurate global (i.e. θ i ∈ [−π, π]) regression model as in Fig. 2. While good global accuracy is helpful, in Bayesian optimization only promising regions of parameter space (those that look close to the desired extremum) are explored in detail. On smaller parameter scales the energy landscape will be less dramatically oscillatory and the density of observed data will be higher meaning that the classical kernels may have a better chance of accurately interpolating between observations.
To test this we examine the validation score for GP models when performing local regression of a VQE energy landscape, restricted to a smaller region of the parameter space. In this case, the training set X_t and validation set X_v are generated by first picking a common anchor point θ^(0) whose elements are chosen uniformly at random in the range [−π, π]. The circuit parameters for X_t and X_v are then obtained by sampling parameter vectors θ with elements drawn uniformly at random from the interval [θ_i^(0) − π/s, θ_i^(0) + π/s], where s is a scale reduction factor. Figure 3 illustrates this process, showing the log prediction error log_10(1 − R²_v) for a repeat of the simulations in Fig. 2 on reduced parameter scales with scale factors s = 10 (3a) and s = 100 (3b). We see that when required to make only local predictions, GP models using the unitary and classical kernels show significant improvement, although those using the state kernel are consistently the most accurate. This is to be expected as observations are much denser in the input space. Denser observations allow the universal classical kernel-based models to more effectively interpolate between observations. They also reduce the portion of the state and unitary kernels' induced feature spaces which the observations must span for k_s- and k_u-based GP models to perform well.
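The local sampling scheme just described is straightforward to implement; a short sketch (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)

# Local sampling: pick a random anchor point theta0 in [-pi, pi]^p, then draw
# each parameter uniformly from a window of half-width pi/s around the anchor,
# where s is the scale reduction factor (s = 1 recovers global sampling).
def sample_local(p, n_points, s, rng):
    theta0 = rng.uniform(-np.pi, np.pi, size=p)
    offsets = rng.uniform(-np.pi / s, np.pi / s, size=(n_points, p))
    return theta0, theta0 + offsets

theta0, X_t = sample_local(p=16, n_points=100, s=10, rng=rng)
print(np.max(np.abs(X_t - theta0)))  # always below pi/10 for s = 10
```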
The state kernel's advantage is most pronounced for s = 10, which is to be expected as this is closer to global regression. The unitary, RQ, and RBF kernels and the Matern kernel with ν = 5/2 all perform similarly well, but the Matern kernel with ν = 3/2 struggles when presented with large amounts of data (particularly for s = 10). This is likely because a GP equipped with a Matern kernel is ⌈ν⌉ − 1 times differentiable (once differentiable for ν = 3/2 and twice for ν = 5/2) [63], while those using the unitary, RBF, and RQ kernels are infinitely differentiable. As the noiseless energy function (a finite weighted sum of Fourier components, see III B) is smooth and infinitely differentiable, this may explain why the singly differentiable ν = 3/2 Matern kernel performs worse than the other, more-times differentiable kernels.

Figure 3. Log prediction error log_10(1 − R²_v) for GP regression models with different kernels learning the energy landscape E(θ) over reduced scales (4-qubit 16-parameter local GP regression) for the same ansatz and Hamiltonian as in Figure 2. a.) 1/10 scale (s = 10). b.) 1/100 scale (s = 100). Horizontal axes give the size of the training data set N_t; vertical axes give the logarithm of the prediction error log_10(1 − R²_v). Shown are the median over 100 repeats (solid lines/dots) and the inter-quartile range of the data (filled). For each repeat, the GP models are trained and their predictions validated using energy evaluations at ansatz parameters sampled around a randomly chosen anchor point θ^(0). To generate the training and validation points (X_t and X_v), each gate angle θ_i is sampled uniformly at random with θ_i ∈ [θ_i^(0) − π/s, θ_i^(0) + π/s], where s is a scale reduction parameter. The corresponding energies y_t and y_v are evaluated noiselessly on a classical computer. Maximum likelihood hyperparameter optimization with respect to the training data was used to optimize any kernel hyperparameters.

A. Limitations of on-device quantum kernels
For it to be useful, the predictions from a surrogate model should be significantly easier to compute and optimize than the cost function. Given m observations, making a prediction from a Gaussian process surrogate (see section VIII A), as is done when optimizing a GP-based surrogate model in BO, costs m kernel evaluations, while fitting the model requires a Gram matrix of O(m²) kernel evaluations to be calculated.
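The asymmetry between fitting and prediction costs can be sketched with a minimal NumPy Gaussian process; the RBF kernel and toy objective below are illustrative stand-ins for the kernels and energy landscape discussed above:

```python
import numpy as np

def rbf(x1, x2, ell=1.0):
    # Classical RBF kernel, used here as a stand-in for any kernel function.
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * ell ** 2))

def gp_fit(X, y, kernel, sigma_n=1e-6):
    """Fitting: O(m^2) kernel evaluations to build the Gram matrix K."""
    m = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
    K += sigma_n ** 2 * np.eye(m)          # noise/jitter on the diagonal
    alpha = np.linalg.solve(K, y)          # cached for later predictions
    return alpha

def gp_predict(x_star, X, alpha, kernel):
    """Prediction: only m kernel evaluations against the stored data."""
    k_vec = np.array([kernel(x_star, xi) for xi in X])
    return k_vec @ alpha                   # GP posterior mean

rng = np.random.default_rng(0)
X = rng.uniform(-np.pi, np.pi, size=(20, 2))
y = np.sin(X[:, 0]) * np.cos(X[:, 1])      # toy stand-in for E(theta)
alpha = gp_fit(X, y, rbf)
pred = gp_predict(X[0], X, alpha, rbf)     # prediction at a training point
assert abs(pred - y[0]) < 1e-3
```

With a near-zero noise hyperparameter the model interpolates its training data, so a prediction at a training point recovers the observed value.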
This raises a potential issue with quantum kernel-based surrogate models, namely that both fitting the surrogate model and using it to make predictions requires repeated executions of a large number of quantum circuits. Unless the number of data points is smaller than the number of circuits required to measure the cost/energy function one could simply measure and optimize the cost directly, avoiding any inaccuracies of the approximate surrogate model. Accordingly, quantum kernels evaluated on-device are unlikely to be useful for producing the kinds of easy-to-optimize surrogate models needed for Bayesian optimization.
It is also worth noting that unless the ansatz used is highly structured, both the state and unitary kernels evaluated for two randomly chosen sets of gate angles will shrink exponentially with the number of qubits used [64]. While this is less of an issue for small NISQ-scale VQE problems, it means that for problems on many qubits one would require exponentially-many shots to accurately resolve these exponentially-small kernel evaluations. This is closely related to the barren-plateau problem and further reinforces our assertion that quantum kernels evaluated on-device are unlikely to be useful for producing globally-accurate surrogate models for VQE.
Device noise also greatly complicates the evaluation of quantum kernels on-device. Although using a kernel based on the noisy operation of a quantum processor may be useful in quantifying the similarity between noisy energy evaluations, such kernel functions may require large numbers of samples to evaluate accurately. This would greatly complicate any surrogate model optimization and means that the usual requirements of a kernel, for example positivity, would not necessarily be satisfied. It would also be difficult to ensure that the errors present when evaluating the kernel correspond to those encountered when estimating the cost/energy. For example, one typical implementation of the state kernel requires a circuit of twice the depth used to estimate the energy [65] and so involves more noise and decoherence than would be seen in the cost function. Alternatively, one can implement the kernel using a circuit of the same depth as the cost function but over two sets of qubits [66], which will likely have different noise characteristics.
While Gaussian processes naturally accommodate noisy observations it is normally assumed that the kernel function can be computed exactly. Circuit noise and finite sampling errors would mean an on-device quantum kernel evaluation is affected by statistical fluctuations. This would then affect the two key objects in GP modelling; the positive-definite Gram matrix K of pair-wise kernel evaluations for the observed points and the vector k of kernel evaluations between a point of interest and the observed points (see section VIII A for a full discussion of these). Randomness in these objects can be detrimental in two ways. Firstly, if the Gram matrix K is composed of noisy kernel evaluations then it may not be positive definite and numerical instabilities can occur when calculating its inverse (needed to make predictions). This can be partially alleviated by increasing the noise strength hyperparameter σ n (see section VIII A) to ensure an invertible and positive-definite K.
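A minimal numerical illustration of this regularization, with synthetic entry-wise noise standing in for finite-shot and hardware errors:

```python
import numpy as np

rng = np.random.default_rng(1)

# Exact Gram matrix of an RBF kernel: positive semi-definite by
# construction, but nearly singular for closely spaced points.
x = rng.uniform(0.0, 1.0, size=15)
K_exact = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)

# Simulated "on-device" estimate: symmetric statistical noise on each
# entry, standing in for finite-shot and hardware errors.
noise = rng.normal(scale=0.05, size=K_exact.shape)
K_noisy = K_exact + (noise + noise.T) / 2

# The noisy estimate typically acquires negative eigenvalues, so it is no
# longer a valid Gram matrix. Increasing the noise hyperparameter sigma_n
# shifts the whole spectrum upwards, restoring positive definiteness.
min_eig = np.linalg.eigvalsh(K_noisy).min()
sigma_n2 = max(0.0, -min_eig) + 1e-8
K_fixed = K_noisy + sigma_n2 * np.eye(len(x))
assert np.linalg.eigvalsh(K_fixed).min() > 0.0
```

The price of this fix is a smoother, less confident model: the diagonal shift acts like additional observation noise on every data point.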
Another more serious issue comes when attempting to minimize a surrogate model built using a noisy kernel as part of a Bayesian optimization loop. Gradient descent methods are the standard approach for this; however, they require repeated accurate evaluations of ∇_x k, the gradient of kernel evaluations between the query point and the training data. While noise-resilient methods such as SPSA could be used for this optimization, these could be applied directly to the objective function, bypassing the need for a surrogate model. A possible exception would be if the surrogate model is considerably cheaper to evaluate than the cost function, which would be true if the number of observations is smaller than the number of circuits required to evaluate the cost.
Due to these complications, we do not expect significant practical advantages to VQE from Gaussian processes which use device-evaluated quantum kernels. Instead we evaluate all the quantum kernels using a classical computer, which can be done tractably provided the number of qubits is relatively small and the depth of the quantum circuits involved is relatively low.
For many-qubit VQE problems, in which most kernel evaluations are exponentially suppressed, one could still attempt a highly localized form of Bayesian VQE with classically-evaluated quantum kernels. In such scenarios, the Gram matrix K calculated for a set of randomly chosen initial points will be exponentially close to the identity. Similarly, kernel evaluations between a new point of interest θ* and these initially chosen points will be close to zero unless θ* is in the immediate vicinity of an initial point. Bayesian VQE using a GP with such an exponentially-decaying kernel will be highly localised around whichever initial point gives the lowest energy. This means that at each step in the BO, the maximum of the acquisition function (see section VIII B) will stay close to the most promising initial point (where the kernel is non-negligible). This optimization strategy would still differ from completely local methods like gradient descent in that all previous observations (however close to each other they may be) are used to decide the next query point, rather than just the most recent point seen. However, for this to work one would still need to find a useful starting point for the optimization (i.e. not in a barren plateau) and have a method for evaluating the kernel. The kernel could either be estimated on-device, which would likely be more costly than optimizing the energy directly, or evaluated classically, which would be intractable unless the ansatz has some simple exploitable structure or admits a simplifying approximation (we discuss this in section V).

B. Noiseless Bayesian VQE
We have seen that the state kernel provides a significant advantage over typical classical kernels when performing GP regression of a circuit's energy landscape, and we now apply these results to VQE using Bayesian optimization. Bayesian optimization is a gradient-free strategy for optimizing noisy, expensive-to-evaluate objective functions [56]. As current cloud-based NISQ computers are in high demand, the problem of variationally minimizing the noisy Ẽ(θ) is an apt use-case for Bayesian optimization. We describe Bayesian optimization in more detail in section VIII B.
We will quantify the performance of the optimization in terms of the best seen energy error E(y, θ_opt), which we define as the fractional difference between the lowest energy seen at the current stage in the optimization and the minimum achievable noiseless energy for the ansatz. For a set of observed energy values y and an optimal set of gate angles θ_opt, as defined in (1), this is given by

E(y, θ_opt) = [min(y) − E(θ_opt)] / |E(θ_opt)|.    (14)

The optimal gate angles θ_opt for the ansatz shown in Figure 1 were found using 10,000 attempts of direct gradient-based minimization, giving a minimum energy of E(θ_opt) = −2.762194. Figure 4 shows the results of noiseless VQE simulations using Bayesian optimization with different kernels. The ansatz circuit used is shown in Fig. 1 and the Hamiltonian is given in (12) with J = h_z = 0.5 and h_x = −0.5. At the start of each simulation, 25 initial points were drawn uniformly at random and their energies evaluated to form an initial training data set. The Expected Improvement [67] acquisition function was used with an exploration hyperparameter ξ = 0.01; this value is widely used in EI-based BO [68, 69] and was found to yield the best BO performance across the classical kernels considered. For each kernel, 100 repeats of the optimization were performed. Figure 4a shows the median (solid lines), mean (dashed lines), and inter-quartile range (filled) of the energy error. Figure 4b gives histograms and Gaussian kernel density estimates (with bandwidth 0.15 [70]) of the final best seen energy errors (after 80 new points have been requested).

Figure 4. Performance of 16-parameter noiseless Bayesian VQE (antiferromagnetic TFIM with longitudinal field) using different kernels. The Hamiltonian is given in (12) with J = h_z = 0.5 and h_x = −0.5, while the ansatz used here is the 16-parameter 4-qubit ansatz in Figure 1. We use the Expected Improvement acquisition function with ξ = 0.01 and maximum likelihood hyperparameter optimization. All kernels were equipped with a signal variance hyperparameter σ², i.e. k(θ, θ′) → σ²k(θ, θ′). a.) Vertical axis is the best seen energy error (14), horizontal axis is the number of data points seen. The median (solid lines), mean (dashed lines), and inter-quartile ranges (filled) are shown for data aggregated over 100 repeats of the optimization. The first 25 points are chosen at random to initialize the optimization (indicated by the black vertical dashed line). b.) Horizontal axis is the log final best seen energy error log_10 E(y, θ_opt) after 80 points have been queried. Vertical axis is a Gaussian kernel density estimate (bandwidth 0.15) of the distribution of log_10 E(y, θ_opt) (solid line and filled), estimated from the 100 repeats. Also shown are histograms of the final best seen energy error for the 100 repeats (solid bars).
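The Expected Improvement acquisition function used in these simulations can be written down compactly. The sketch below is for minimization, assuming the GP posterior supplies a mean mu and standard deviation sigma at a candidate point; the exploration hyperparameter xi plays the role of ξ above:

```python
import math

def expected_improvement(mu, sigma, y_best, xi=0.01):
    """EI for minimization: the expected amount by which an observation at
    a point with GP posterior mean mu and std-dev sigma improves on
    y_best - xi. Larger xi favours exploration over exploitation."""
    if sigma < 1e-12:
        return 0.0                      # no uncertainty, no expected gain
    z = (y_best - xi - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (y_best - xi - mu) * cdf + sigma * pdf

# A point predicted well below the current best is preferred over one
# predicted at the current best with the same uncertainty.
ei_low = expected_improvement(mu=-2.5, sigma=0.1, y_best=-2.0)
ei_flat = expected_improvement(mu=-2.0, sigma=0.1, y_best=-2.0)
assert ei_low > ei_flat > 0.0
```

At each BO step the next query point is chosen by maximizing this function over the parameter space, which requires only surrogate-model evaluations and no quantum circuit executions.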
The state kernel outperforms the other kernels at this VQE task, both in terms of the final energy found and in the speed at which it converges. It frequently reaches an energy error of less than 10^−3.5 ≈ 0.03% of the minimum achievable with this ansatz. Its final achieved energies form clusters around a few values, the most noticeable being one at errors between 10^−4 and 10^−3.5 and one that is close to, but still lower than, the majority of energies observed with the unitary and classical kernels. The unitary kernel also performs well, quickly and consistently converging to an error of approximately 4%, and exhibits the second lowest mean and median errors overall. However, the data is strongly clustered around this error, implying that achieving a more accurate final energy may be difficult. The classical kernels vary considerably in their performance, generally having much worse mean errors than the quantum kernels. The final energies achieved by the classical kernels form two clusters: one of relatively successful runs with errors ∼ 6% and one with errors between 30 − 60%. The errors in this latter cluster are similar to those seen in the randomly selected initialization data, implying that these optimization runs failed to achieve any significant reduction in the energy. This suggests that while GP models using these classical kernels sometimes lead to relatively successful VQE, they often fail immediately, becoming stuck in an exploitative phase where points close to those in the initial training data are repeatedly queried. It is therefore likely that, as well as its advantages in overall performance, Bayesian VQE using the state or unitary kernel can be more resilient against initialization failures than when using the classical kernels we have considered.
For comparison, Figure 5 shows a repeat of this noiseless VQE using the SPSA optimization scheme. We used the Qiskit implementation [71], which closely follows the initial proposal in [72]. We see from the experimental traces in 5a and the histograms in 5b that the effectiveness of this optimization strategy can vary, but the average final energies typically have errors in line with those obtained for BO using the unitary kernel, at around 4%. These are also comparable to the best experimental runs for BO using the various classical kernels. The data is strongly clustered around this energy error and only a handful of SPSA runs manage to compete with the low energies frequently achieved with state kernel-based BO. The SPSA runs which cluster around 4% generally require several hundred energy evaluations to achieve this level of accuracy, whereas only around 100 evaluations are needed for the various kernel-based models. The two SPSA runs with the lowest mean final energies had mean final energy errors on the order of 0.1%, which is comparable to the lowest energies achieved with state kernel-based BO. However, these were only seen after 1,500 total energy evaluations had been made. This level of final accuracy with SPSA appears to be extremely rare and requires greater than an order of magnitude more energy evaluations than is needed with the state kernel, highlighting the effectiveness and economy of our optimization strategy.

Figure 5. Performance of 16-parameter noiseless VQE using SPSA. The Hamiltonian is given in (12) with J = h_z = 0.5 and h_x = −0.5. The ansatz used is the 16-parameter 4-qubit ansatz in Figure 1. We used the Qiskit SPSA implementation [71, 72] in which 50 initial energy evaluations are used to calibrate the optimizer. The initial ansatz parameters for each run were chosen uniformly at random. a.) Vertical axis is the energy error (14), horizontal axis is the number of energy evaluations at the current point in the optimization (including those used to estimate the gradient). Data for 1,000 repeats are shown. b.) Horizontal axis is the logarithm of the mean energy error for the last 25 energy evaluations in each optimization run. Vertical axis is a Gaussian kernel density estimate (bandwidth 0.1) of the distribution of this data (solid line and filled), estimated from the 1,000 repeats. Also shown are histograms of the data for the 1,000 repeats (solid bars).
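For reference, the core SPSA update is simple to state: each iteration estimates the full gradient from only two objective evaluations by perturbing all parameters simultaneously along a random sign vector. The sketch below follows Spall's standard gain schedules but is not the Qiskit implementation used in our simulations; the toy quadratic objective is an illustrative stand-in for E(θ):

```python
import numpy as np

def spsa_minimize(f, theta0, n_iter=200, a=0.2, c=0.1, seed=0):
    """Minimal SPSA sketch: two objective evaluations per iteration,
    regardless of the number of parameters."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for k in range(1, n_iter + 1):
        ak = a / k ** 0.602                 # standard SPSA gain schedules
        ck = c / k ** 0.101
        delta = rng.choice([-1.0, 1.0], size=theta.shape)
        # Simultaneous-perturbation gradient estimate from two evaluations.
        g_hat = (f(theta + ck * delta) - f(theta - ck * delta)) / (2 * ck * delta)
        theta = theta - ak * g_hat
    return theta

# Toy quadratic stand-in for the (noiseless) VQE energy landscape.
f = lambda t: np.sum((t - 1.0) ** 2)
theta = spsa_minimize(f, np.zeros(4))
assert f(theta) < 0.1   # converges towards the minimum at t = 1
```

This two-evaluation gradient estimate is what makes SPSA attractive on hardware, but, as Figure 5 shows, many such iterations are still needed before the energy error becomes competitive.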

C. Noisy Bayesian VQE
The question remains whether a GP surrogate using a noiseless classically-evaluated quantum kernel is useful for Bayesian VQE of noisy quantum circuits. The noisy energy function is unlikely to be a linear function in the feature spaces induced by the noiseless state and unitary kernels. However, provided the device noise is not too great, these noiseless classical evaluations should provide a good approximation to the correct feature space for describing the noisy energy. By using a quantum kernel function based on the ansatz circuit we are able to leverage our prior knowledge of the (noiseless) ansatz circuit, whereas standard GP surrogates can only attempt to do this through hyperparameter optimization. Because the observed energy values provide us with information about the noise on the device, we are also able to take this noise into account implicitly. Figure 6 shows the results of a repeat of the Bayesian VQE simulations illustrated in Figure 4 with noisy energy evaluations. These were performed using Qiskit's noisy quantum circuit simulation framework [71] and a noise model derived from the errors present on ibmq quito in October 2021 [73]. The noise includes contributions from gate errors, state preparation and measurement errors (for which Qiskit's readout error mitigation was used), and finite circuit shots (10,000 per circuit per evaluation). The quantum kernel evaluations (where used) were simulated exactly using a tensor network quantum circuit implementation based on the Quimb Python package [74]. As in Figure 4, Figure 6a shows the median (solid lines), mean (dashed lines), and inter-quartile range (filled) of the best seen energy error: the difference between the lowest seen energy and the minimum possible for the ansatz. Figure 6b shows histograms and kernel density estimates (with bandwidth 0.15) of the final best seen energy errors (after 80 new points are queried).
The noise in the simulations makes it unlikely that the energy can reach the noiseless minimum value. Accordingly, we also show the mean energy error obtained from 10,000 repeated noisy evaluations of the energy Ẽ(θ_opt), taken at the noiseless optimum gate angles θ_opt. While θ_opt may differ slightly from the gate angles that yield the true noisy optimum, these evaluations serve as an indicator of the best performance to be expected from a noisy VQE implementation.
We again see that the Gaussian process models based on the state kernel greatly outperform those using the other kernels. The state kernel frequently manages to achieve energies lower than the mean Ẽ(θ_opt) (an error of ∼ 2%) while also showing the fastest convergence. The unitary kernel also performs well, consistently giving energies lower than or at least as low as those found using the classical kernels (around 6%), but with a significantly lower variation in its performance.

Figure 6. Performance of 16-parameter noisy Bayesian VQE (antiferromagnetic TFIM with longitudinal field) using different kernels. The Hamiltonian is given in (12) with J = h_z = 0.5 and h_x = −0.5, while the ansatz used here is the 16-parameter 4-qubit ansatz in Figure 1. The noise model used was derived from the errors present on ibmq quito in October 2021 [73]. We use the Expected Improvement acquisition function with ξ = 0.01 and maximum likelihood hyperparameter optimization. All kernels were equipped with a signal variance hyperparameter σ² and noise hyperparameter σ²_n, i.e. k(θ_i, θ_j) → σ²k(θ_i, θ_j) + σ²_n δ_ij. a.) Vertical axis is the best seen energy error (14), horizontal axis is the number of data points seen. The median (solid lines), mean (dashed lines), and inter-quartile ranges (filled) are shown for data aggregated over 100 repeats of the optimization. The first 25 points are chosen at random to initialize the optimization (indicated by the black vertical dashed line). b.) Horizontal axis is the log final best seen energy error log_10 E(y, θ_opt) after 80 points have been queried. Vertical axis is a Gaussian kernel density estimate (bandwidth 0.15) of the distribution of log_10 E(y, θ_opt) (solid line and filled), estimated from the 100 repeats. Also shown are histograms of the final best seen energy error for the 100 repeats (solid bars). The black horizontal dashed line in a.) and vertical dashed line in b.) show the mean of 10,000 evaluations of Ẽ(θ_opt), the noisy energy at the noiseless optimal ansatz parameters.
When dealing with noisy evaluations, VQE using the classical kernels shows qualitatively similar performance to the noiseless case (although with different final energy errors). Almost all the simulations performed with the RBF kernel, and approximately half of those with the other classical kernels, give final errors in the range 30 − 60%. As in the noiseless case, these errors are similar to those seen for the randomly-selected initialization points, implying that these runs failed almost immediately. The remainder of the classical kernel simulations achieve a much better performance which is close to the ∼ 6% seen with the unitary kernel.

V. CLASSICAL SIMULATION OF QUANTUM KERNELS
We have demonstrated that Bayesian VQE is significantly more effective when using quantum kernel-based GP models than when using classical kernels, both in terms of final energy accuracy and reliability. However, our simulations have concerned circuits of sufficiently few qubits and low depth that classical simulation of the quantum kernels is tractable. Once the number of qubits and/or the depth of the variational ansatz becomes too large, a GP surrogate model built on the full quantum kernel would be too computationally expensive to be practically used. However, if the number of gates to be optimized is not too great, in some instances classical evaluation of quantum kernels can remain tractable even for large numbers of qubits.
One way to ensure this is to only perform optimization on a subset of gates at any given time. Suppose we have an ansatz of the form

U(θ) = U_C(θ_C) U_B(θ_B) U_A(θ_A),

where U_C, U_B, and U_A do not commute and any constant gates which commute with U_B have been included in U_A or U_C. If, at some stage in the VQE, we fix θ_A and θ_C and only optimize over θ_B, then the unitary kernel is given by

k_u(θ_B, θ′_B) = |Tr[U_B(θ′_B)† U_B(θ_B)]|² / 4^n.

Both the gates in the past (U_A) and future (U_C) causal light-cones of U_B cancel out due to the cyclic property of the trace, meaning that the classical computational cost of evaluating k_u(θ_B, θ′_B) only depends on the complexity of U_B. However, as we have seen in the previous sections, while the unitary kernel does provide some advantages over the typical classical kernels we have considered, it usually requires many more observed energy evaluations than the state kernel to produce a similarly accurate surrogate model. The state kernel for this block-wise parametrization is given by

k_s(θ_B, θ′_B) = |⟨0| U_A(θ_A)† U_B(θ′_B)† U_B(θ_B) U_A(θ_A) |0⟩|².

For this kernel, the gates in the past causal light-cone of U_B (those in U_A) must still be retained. Equivalently, the kernel can be written as

k_s(θ_B, θ′_B) = |⟨ψ_A| U_B(θ′_B)† U_B(θ_B) |ψ_A⟩|²,

where |ψ_A⟩ = U_A(θ_A)|0⟩ is the input state for an equivalent state kernel only containing U_B. If U_A is a deep circuit across many qubits such that |ψ_A⟩ cannot be represented or manipulated in a classically efficient manner, then this could make classical evaluation impractical and so prevent the kernel's use in a classical GP model. Block- or layer-wise VQE as we describe above has received significant attention in the literature as a potential solution to the barren-plateau issues that prevent large-scale VQE [51, 75, 76]. If we assume that VQE can be performed effectively by selecting small blocks or layers of gates (U_B) to optimize at a time, then one could attempt to find a classically tractable approximation to |ψ_A⟩ in order to produce an approximate state kernel.
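The light-cone cancellation in the unitary kernel can be checked numerically: with θ_A and θ_C held fixed, the trace overlap of the full circuits equals that of the U_B blocks alone. The sketch below uses Haar-random matrices for U_A and U_C and a diagonal RZ layer as a stand-in for U_B, with the |Tr[·]|²/4^n normalization assumed:

```python
import numpy as np

rng = np.random.default_rng(2)

def haar_unitary(dim):
    # Random unitary via QR decomposition of a complex Gaussian matrix.
    z = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))

def rz(t):
    return np.diag([np.exp(-1j * t / 2), np.exp(1j * t / 2)])

def U_B(thetas):
    # Parameterized block: a layer of single-qubit RZ rotations.
    return np.kron(rz(thetas[0]), rz(thetas[1]))

dim = 4  # two qubits
U_A, U_C = haar_unitary(dim), haar_unitary(dim)  # fixed surrounding blocks

tB1, tB2 = np.array([0.7, -0.3]), np.array([1.9, 0.4])
U1 = U_C @ U_B(tB1) @ U_A
U2 = U_C @ U_B(tB2) @ U_A

# Full-circuit unitary kernel vs. the U_B-only kernel.
full = abs(np.trace(U2.conj().T @ U1)) ** 2 / dim ** 2
reduced = abs(np.trace(U_B(tB2).conj().T @ U_B(tB1))) ** 2 / dim ** 2
assert np.isclose(full, reduced)  # U_A and U_C cancel under the trace
```

The agreement holds for any choice of U_A and U_C, since U_C†U_C collapses to the identity and U_A cycles around the trace.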
In section VIII C we describe a scheme that approximates |ψ A ⟩ to a matrix product state (MPS).
Matrix product states are a classically-efficient representation for certain types of multi-partite quantum state [77, 78]. They first saw widespread use in condensed matter physics as the underlying ansatz for the powerful density matrix renormalization group (DMRG) algorithm (and other closely related algorithms) for 1-dimensional quantum lattices [79] and are an example of the wider class of tensor network states [80]. A matrix product state on n sites is defined in terms of a collection of n rank-3 tensors, each with a single outgoing physical index (of dimension equal to that of the subsystem at the site) and one or two (depending on the boundary conditions) virtual indices. The state is given as a contraction over pairs of virtual indices for adjacent sites, forming bonds between them and leaving only the physical indices. The amount of entanglement present in the state, and the complexity of representing and manipulating the MPS, depend on the size of the bond indices.
Our scheme approximates the input state |ψ_A⟩ by starting with an MPS representation of the state |0⟩ (with all bond dimensions initially equal to 1) and contracting each constant gate from U_A into this MPS. Single-qubit gates can be contracted into an MPS without changing its bond dimensions. To apply a two-qubit gate to neighbouring sites/qubits in an MPS, one first contracts the gate unitary into the tensors at the two involved sites. This yields a new two-site tensor which is then broken back down into two single-site tensors using a singular value decomposition (SVD). These new MPS tensors are connected by a bond with dimension at most a factor of 4 larger than the original bond between the sites. To ensure the bond dimensions do not grow exponentially as more gates are applied, the size of each new bond is capped at a fixed χ_max by retaining only the largest χ_max singular values of the two-site tensor (and the resulting state is re-normalized). Discarding singular values in this way is likely to reduce the fidelity between the truncated state and the state without truncation, degrading the accuracy of the overall approximation. To mitigate this decay in accuracy, the truncated tensors are optimized to maximize the fidelity between the states with and without truncation. After all gates have been applied in this way we arrive at a new input state |ψ_A^(χmax)⟩, an MPS with maximal bond dimension χ_max, which we then use to obtain an approximation to k_s given by

k̃_s(θ_B, θ′_B) = |⟨ψ_A^(χmax)| U_B(θ′_B)† U_B(θ_B) |ψ_A^(χmax)⟩|².

If U_B is only finitely entangling (being either low width or low depth), it can be contracted into |ψ_A^(χmax)⟩ with only a finite multiplicative increase in the MPS's bond dimension. This means that if χ_max is kept relatively small we can evaluate the approximate state kernel at a cost that scales at most as O(nχ³_max d), where d = 2 is the individual physical dimension of the n qubits [81].
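The core step of this scheme, contracting a two-qubit gate into an MPS and truncating via SVD, can be sketched as follows. This minimal version omits the post-truncation fidelity re-optimization described above:

```python
import numpy as np

def apply_two_qubit_gate(A, B, gate, chi_max):
    """Contract a two-qubit gate into adjacent MPS tensors A (chiL, 2, chi)
    and B (chi, 2, chiR), then split back with a truncated SVD, keeping at
    most chi_max singular values and re-normalizing the state."""
    chiL = A.shape[0]
    chiR = B.shape[2]
    # Merge the two sites into one tensor and apply the gate on the
    # combined 4-dimensional physical index.
    theta = np.einsum('aib,bjc->aijc', A, B).reshape(chiL, 4, chiR)
    theta = np.einsum('pq,aqc->apc', gate, theta)
    # Split back: group (left bond, site 1) vs. (site 2, right bond).
    theta = theta.reshape(chiL, 2, 2, chiR).reshape(chiL * 2, 2 * chiR)
    U, S, Vh = np.linalg.svd(theta, full_matrices=False)
    k = min(chi_max, len(S))
    S = S[:k] / np.linalg.norm(S[:k])      # truncate and re-normalize
    A_new = U[:, :k].reshape(chiL, 2, k)
    B_new = (S[:, None] * Vh[:k]).reshape(k, 2, chiR)
    return A_new, B_new

# Example: entangle |00> into a Bell state; with chi_max >= 2 this is exact.
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]])
gate = CNOT @ np.kron(H, np.eye(2))

A0 = np.zeros((1, 2, 1)); A0[0, 0, 0] = 1.0
B0 = np.zeros((1, 2, 1)); B0[0, 0, 0] = 1.0
A, B = apply_two_qubit_gate(A0, B0, gate, chi_max=4)
psi = np.einsum('aib,bjc->ij', A, B).ravel()
assert np.allclose(np.abs(psi) ** 2, [0.5, 0, 0, 0.5])  # Bell state
```

Setting chi_max=1 in this example would discard one of the two equal singular values, which is exactly the fidelity-reducing truncation the text describes.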
While our proposed scheme approximates the input state to an MPS, this is by no means the only classically-efficient representation one could use to approximate |ψ_A⟩. For simplicity we only consider nearest-neighbour entangling gates; two-qubit gates between non-neighbouring qubits can also be contracted into an MPS by means of additional SWAP gates, at the cost of large increases in the bond dimensions. Depending on the situation and the structure of U_A, one could instead attempt to use one of the numerous other classically-efficient tensor network states. For example, a more complicated 2- (or higher-) dimensional entangling structure may motivate the use of projected entangled pair states (PEPS) or tree tensor networks [82].
By approximating |ψ_A⟩ gate-by-gate with an upper limit on the retained bond dimensions, our scheme avoids explicitly calculating the potentially intractable |ψ_A⟩. However, truncating the bond dimension inevitably leads to a reduction in the fidelity between |ψ_A⟩ and |ψ_A^(χmax)⟩. Instead, one could choose a classically-efficient parameterization of the input state |ψ_A⟩ as a kernel hyperparameter to optimize. This would allow an input state to be chosen with reference to the observed data, rather than being the result of a series of approximations. However, optimizing even a handful of kernel hyperparameters can be extremely expensive. As we discuss in section VIII A, optimizing the marginal likelihood with respect to any hyperparameter involves repeated calculation of both the Gram matrix K, its inverse, and its derivative (see (24)). Unless the representation of |ψ_A⟩ is extremely (hyper)parameter efficient, this would likely make such a scheme impractical.

Figure 7. Validation scores R²_v and log Bayes factors log B for GP regression models with approximated state kernels of various bond dimensions, learning the energy landscape E(θ) for the Hamiltonian in (12) with J = h_z = 0.5 and h_x = −0.5 (6-qubit 10-parameter GP regression with a deep-circuit input state). The ansatz had the same structure as that in Figure 1 but involved six qubits and a circuit depth of 20. Only the last two layers of RY gates were varied (a U_B with 10 parameters in total) and the remainder of the circuit (U_A) was initialized with uniform random gate angles and held constant. A classically-efficient approximation to the state kernel was produced by approximating the state |ψ_A⟩ = U_A|0⟩ before the parameterized U_B to a matrix product state with maximal bond dimension χ_max. Results are also shown for simulations using the full un-approximated state kernel. The training and validation points are chosen uniformly at random (for each of the 10 parameters) in the range [−π, π] and the corresponding energies y_t and y_v are evaluated noiselessly on a classical computer. a.) Horizontal axis is the size of the training data set N_t, vertical axis is the validation score (defined in (13)). b.) Horizontal axis is the size of the training data set N_t, vertical axis is the log Bayes factor (defined in (17)) versus the full un-approximated state kernel for the different approximated kernels. In both plots we show the median over 100 repeats (solid lines/dots) and the inter-quartile range of the data (filled).

A. Gaussian process regression with an MPS-approximated state kernel
To test our approximation strategy we compare the predictive accuracy of a GP model built with approximated state kernels of various χ_max to the accuracy when using the full state kernel. We again consider the Hamiltonian in (12) with J = h_z = 0.5 and h_x = −0.5. To ensure that the full state kernel is tractable for large numbers of observations, we use a brickwork ansatz with the same structure and open boundary conditions as in Figure 1 but with 6 qubits and 20 layers of CX gates on alternating pairs of adjacent qubits, separated by RY gates (a total of 50 CX gates and 106 RY gates). To engineer a situation in which a low-χ_max MPS approximation is unlikely to be accurate, we choose the parameterized portion of the ansatz, U_B, as the last two layers of RY gates (the 10 RY gates surrounding the final CX layer) and pick a fixed set of random gate angles for the remaining gates (which form U_A). For each simulation we generate two sets of random uniform gate angles X_t and X_v and their corresponding noiseless energies y_t and y_v as training and validation data. The size of the training data set is varied while the size of the validation set is fixed at N_v = 100. Figure 7a shows the validation scores achieved by Gaussian process surrogates equipped with MPS-approximated state kernels of different χ_max, compared to one equipped with the full state kernel. The fidelities of the MPS approximations to |ψ_A⟩ for different maximum bond dimensions are shown in Table I. To help assess the relative suitability of the different GP models in explaining their training data, Figure 7b shows the log Bayes factor between the GP models with different approximated kernels and a GP model using the full state kernel. The Bayes factor between two statistical models A and B, B(X, y, A, B) = p(y|X, A)/p(y|X, B) [83], is defined as the ratio of the marginal likelihoods of a given set of training data under the two models.
Its logarithm is log B(X, y, A, B) = M_A(X, y) − M_B(X, y), where M_A(X, y) and M_B(X, y) are the log marginal likelihoods of the data (X, y) with the models A and B respectively (which for the GP models we use is given by (23)). If the models A and B are assigned equal prior probabilities then the Bayes factor is equivalent to the ratio of the posterior probabilities of the two models p(A|y, X)/p(B|y, X). This gives the Bayes factor a useful interpretation as how much more plausible one model is at explaining the data than the other [83]. A log Bayes factor log B(X, y, A, B) > 1 generally implies that model A is more strongly supported by the data than model B [83].
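For concreteness, this model comparison can be sketched in a few lines of numpy. The snippet below is a minimal illustration (zero mean functions, function names of our own choosing), not the implementation used in our simulations:

```python
import numpy as np

def log_marginal_likelihood(K, y, jitter=1e-10):
    """Log marginal likelihood of observations y under a zero-mean GP
    with Gram matrix K (the quantity M(X, y) of Eq. (23))."""
    m = len(y)
    # Cholesky factorization: log|K| = 2 * sum(log diag(L))
    L = np.linalg.cholesky(K + jitter * np.eye(m))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * m * np.log(2 * np.pi))

def log_bayes_factor(K_A, K_B, y):
    """log B(X, y, A, B) = M_A(X, y) - M_B(X, y); values > 0 favour model A."""
    return log_marginal_likelihood(K_A, y) - log_marginal_likelihood(K_B, y)
```

Given two candidate kernels, one evaluates their Gram matrices on the same training inputs and compares the resulting log marginal likelihoods.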
We see that MPS approximations to |ψ_A⟩ with a higher χ_max have a larger fidelity with |ψ_A⟩, and their GP models show increasingly similar performance to the GP which uses the full state kernel. Interestingly, the GP models that use approximated state kernels with lower χ_max often have a higher validation score and a Bayes factor (relative to the full state kernel) ≫ 1, provided the size of the training data set is small. When dealing with small numbers of observations these models have both a higher predictive accuracy for unseen energies and are better supported by their training data than the model using k_s. This may be because the induced feature spaces of these kernels have a smaller dimension than that of the full state kernel, being based on a restricted subspace of the n-qubit Hilbert space. As a result, less data is needed to span an appreciable portion of the feature space and so make accurate predictions.
For all the approximated kernels there comes a point where the validation score begins to decrease as more training data is presented. This happens when the size of the training data set becomes close to the dimension of the approximated kernel's feature space. As the energy function will not be completely linear in the approximated kernel's feature space, it becomes increasingly difficult to reconcile these different observations well with a linear model. This feature space saturation will also occur for the GP based on the full state kernel when the validation score reaches R²_v = 1 and the data spans the whole feature space. However, this is not an issue because the energy function is truly linear in the full state kernel's feature space and so energy observations can always be fully reconciled by a linear model in this space. Linear dependence will occur when the number of observations exceeds the feature space dimension, at which point the Gram matrix K also becomes rank deficient; even before this point small eigenvalues will be present in K, causing the GP model's predictions to become numerically unstable.
Typically a small regularizing offset, e.g. K → K + 10^{-10} I, is added to Gram matrices to ensure they are invertible, effectively adding additional dimensions to the feature vectors. Provided the offset is small, the regularization has little effect on models with well-suited kernels, like the full state kernel, as the objective function has minimal dependence on the additional components of the feature vectors introduced by the regularization. It can however help smooth out apparent inconsistencies encountered when using a kernel whose feature space is improperly aligned with the objective function, by increasing the small eigenvalues present in K and reducing their disproportionate contribution to the predictions.
Choosing the value for this offset is key to ensuring accurate and numerically stable predictions; it must be sufficiently large to damp the contributions of small eigenvalues of K but not so large that it dominates the model. We see in Figure 7b that when feature space saturation occurs, degrading the validation score, the likelihood of the training data also drops precipitously. As this can be calculated from the training data alone, we can use it as a metric for calibrating the diagonal offset to avoid numerical instabilities. We do this by treating the weighting of the diagonal offset as a kernel hyperparameter σ²_n, such that K → K + σ²_n I, and optimizing this to maximize the marginal likelihood of the observations. Adding this hyperparameter is equivalent to assuming that we are dealing with observations corrupted by additive Gaussian noise with zero mean and variance σ²_n; discrepancies between the observations and the assumption that they are drawn from a linear function in the kernel's feature space are effectively treated as an additional source of noise. When dealing with real noisy data, adding this noise term to avoid feature space saturation issues introduces negligible additional cost as the same term can be used to simultaneously deal with the observation noise. Figure 8a shows the results of a repeat of the simulations in Figure 7a when the approximated kernels are equipped with this additional noise term. For comparison with the classical kernels used in our other simulations, the plot in Figure 8b shows the validation scores obtained for GP models using these classical kernels as well as the unitary kernel. Because the U_B used in these simulations only has a depth of 1 (a single layer of entangling gates) the unitary kernel can be calculated classically at low cost and so provides a benchmark against which the approximated state kernels should be compared.
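As a rough illustration of this calibration, one can pick the diagonal offset by maximizing the marginal likelihood over a grid of candidate σ²_n values. The sketch below assumes a fixed Gram matrix K0 and a zero mean function, with function names of our own choosing; our simulations instead use gradient-based maximum likelihood optimization:

```python
import numpy as np

def lml(K0, sigma2, sigma_n2, y):
    """Log marginal likelihood with K = sigma2 * K0 + sigma_n2 * I (zero mean)."""
    m = len(y)
    K = sigma2 * K0 + sigma_n2 * np.eye(m)
    _, logdet = np.linalg.slogdet(K)
    return (-0.5 * y @ np.linalg.solve(K, y)
            - 0.5 * logdet
            - 0.5 * m * np.log(2 * np.pi))

def calibrate_noise(K0, y, sigma2=1.0, grid=np.logspace(-12, 0, 49)):
    """Pick the diagonal offset sigma_n^2 maximizing the marginal likelihood."""
    scores = [lml(K0, sigma2, s, y) for s in grid]
    return grid[int(np.argmax(scores))]
```

When K0 is rank deficient (a saturated, low-dimensional feature space), the likelihood at tiny offsets collapses and the selected σ²_n settles near the scale of the residual noise.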
When given an additional noise term, the approximated state kernels no longer suffer from an eventual drop in predictive accuracy due to feature space saturation. Instead, their performance continues to improve as more data is provided until the validation score eventually begins to saturate. For χ_max > 1 the approximated state kernels all yield better performance than any of the classical kernels and the unitary kernel. Although the unitary kernel fares better than the classical kernels, its performance increases slowly with the number of training data points due to it having a much larger feature space dimension than any of the (approximated) state kernels. Again we see that, for small amounts of data, the state kernel approximations with low χ_max have higher validation scores than the full state kernel, particularly χ_max = 2 for small N_t and χ_max = 4 for intermediate N_t. This reinforces the idea that having a simpler model (with a smaller feature space) may provide accuracy benefits in the small-data regime as well as being easier to calculate. This raises the possibility that one could adaptively switch between kernels of varying complexity as more data is obtained, using comparative tools like the Bayes factor to decide which is most suitable at the current optimization stage.

Figure 8. Validation scores R²_v for 6-qubit, 10-parameter GP regression (deep circuit input state) based on approximated (with added noise terms) and full quantum kernels and classical kernels, learning the energy landscape E(θ) for the Hamiltonian in (12) with J = h_z = 0.5 and h_x = −0.5. The ansatz and parameterization used was the same as in Figure 7. The approximated state kernels were the same as in Figure 7 but were given an extra noise term k̃_s(θ_i, θ_j) → k̃_s(θ_i, θ_j) + σ²_n δ_ij to avoid kernel saturation issues. The training and validation points are chosen uniformly at random (for each of the 10 parameters) in the range [−π, π] and the corresponding energies y_t and y_v are evaluated noiselessly on a classical computer. In both plots the horizontal axis is the size of the training data set N_t and the vertical axis is the validation score (defined in (13)). a.) Results for approximated state kernels with added noise terms and the full un-approximated state kernel. b.) Results for the full unitary kernel and various classical kernels. In both plots we show the median over 100 repeats (solid lines/dots) and the inter-quartile range of the data (filled). Maximum likelihood hyperparameter optimization with respect to the training data was used to optimize any kernel hyperparameters (including the strength σ²_n of the noise terms added to approximated state kernels).

VI. ANALYSIS OF QUANTUM KERNEL FEATURE SPACES
Many of our numerical results can be understood by analysing the dimensions of the feature spaces induced by the state and unitary kernels. From (8), it is immediately apparent that the state kernel's feature space F_s is finite-dimensional, unlike the classical kernels we have considered, which have infinite-dimensional feature spaces [63]. This has the consequence that this kernel is not universal, meaning that it cannot be used to approximate an unknown (well-behaved) function to an arbitrary degree of accuracy [57]. The feature map sends an input vector of p gate angles to a vector with at most 3^p elements (which are linear sums of various Fourier components). Using the identity (A ⊗ C^T)|B⟫ = |ABC⟫ one can identify the set of (unnormalized) states {s_i|ρ_0⟫} whose inner products form S as a set of vectorized Hermitian operators on H. As there can be at most d² linearly independent Hermitian operators in a d-dimensional Hilbert space, this puts an additional upper bound on the state kernel's feature space dimension. The maximal dimension for an n-qubit, p-parameter state kernel's feature space is therefore min(3^p, 4^n). In practice, the dimension of F_s may be lower than this due to redundancies or constraints on the set of Pauli rotations used in the ansatz (which reduce the span of the set of vectorized operators {s_i|ρ_0⟫} and so the rank of S), or correlations between parameters (which would reduce the number of linearly independent components of v(θ)).
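The dimension bound above is easy to evaluate numerically. The sketch below (function names ours) computes the Hilbert-space bound min(3^p, 4^n) and the smallest number of PPRs for which 3^p exceeds 4^n; it implements only this simple bound, not the tighter linear-independence constraints derived in Appendix B:

```python
import math

def max_state_kernel_dim(n, p):
    """Absolute upper bound min(3^p, 4^n) on dim F_s for an n-qubit,
    p-parameter state kernel (ignoring ansatz redundancies)."""
    return min(3 ** p, 4 ** n)

def ppr_upper_bound(n):
    """Smallest p with 3^p >= 4^n, i.e. ceil(2n / log2(3)) ~ ceil(1.262 n)."""
    return math.ceil(2 * n / math.log2(3))
```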
Assuming no ansatz redundancies, this implies that for an n-qubit ansatz we can hit the upper bound for dim F_s imposed by the dimension of H (i.e. 3^p > 4^n) using ⌈(2/log₂3)n⌉ ≈ ⌈1.262n⌉ PPRs. However, in Appendix B we show that in some cases the feature space dimension is also limited by constraints on the linear independence of the elements of s|ρ⟫, meaning that more than ⌈(2/log₂3)n⌉ parameterized rotations are required. There we derive a recursion relation for the maximal dimension d_s^(n,p) of F_s for an n-qubit circuit with arbitrary fixed unitaries and p parameterized Pauli rotations, where for completeness we define the base case d_s^(n,0) = 1 for all n. By finding the p for which the (p+1)-th rotation first hits the linear independence bound, i.e. when 3^(p+1) > 4^n/2 + 3^p, we find that the total number of parameterized rotations required to obtain a maximal-dimension feature space is equal to the upper bound ⌈(2/log₂3)n⌉ if ⌈(2n−1)/log₂3⌉ = ⌈(2/log₂3)n⌉ − 1, or ⌈(2/log₂3)n⌉ + 1 if ⌈(2n−1)/log₂3⌉ = ⌈(2/log₂3)n⌉. The results shown in Figure 2 give evidence of additional ansatz-imposed constraints on the state kernel's feature space dimension. With 16 parameters in the ansatz and 4 qubits, the absolute upper bound on the feature space dimension, and thereby the number of training data points needed for R²_v = 1, should be 4^4 = 256; however, this instead occurs with just 136 observations. The ansatz used for these simulations (shown in Fig. 1) creates wavefunctions with real-valued amplitudes (in the computational basis), restricting the state space it can explore. The states it can produce have density matrices with components containing only even numbers of Y Pauli operators (and are generated by Pauli rotations with Ys acting on odd numbers of qubits) [36]. By counting the number of real n-qubit Pauli operators we can obtain a tighter upper bound for the maximal feature space dimension of real ansatz circuits.
As there are 3^(n−2j) combinations of {I, X, Z} for each of the (n choose 2j) placements of 2j Ys, this upper bound is given by ∑_{j=0}^{⌊n/2⌋} 3^(n−2j) (n choose 2j), which for n = 4 gives a dimension of 136. This was confirmed by generating Gram matrices with this kernel for more than 136 uniformly randomly chosen points and calculating their rank (and so the dimension of the kernel's feature space), again yielding a dimension of 136.
We also note that I ⊗ I ± P ⊗ P* is a projection onto the ±1 eigenspace of the Pauli operator P ⊗ P*, while I ⊗ iP* − iP ⊗ I = I ⊗ iP* (I ⊗ I − P ⊗ P*) is a projection onto the −1 eigenspace of P ⊗ P* followed by a (Clifford) unitary I ⊗ iP*. This means that if the initial state is a stabilizer state (such as the usual computational basis state |0...0⟩) and all R_1, ..., R_p are Clifford operations, the elements of s|ρ_0⟫ and the overlaps in S can be calculated efficiently on a classical computer as a 2n-qubit stabilizer simulation [84,85]. An obvious limitation however is that the feature space dimension scales exponentially in the number of parameterized rotations. This is not at all surprising as an initial stabilizer state with only a few non-Clifford rotations applied can be simulated "efficiently" (i.e. at cost polynomial in the number of qubits n, but exponential in the number of non-Clifford gates) on a classical computer [86].
The feature map for the unitary kernel, φ_u, is qualitatively similar to φ_s, mapping input vectors to finite-dimensional feature-space vectors whose elements are sums of Fourier components. As v(θ) makes a reappearance here we again have an upper bound on dim F_u of 3^p for a p-PPR ansatz. We show in appendix C that there is a further upper bound, due to the finite dimensionality of the n qubits' Hilbert space, of 4^(2n) − 2(4^n − 1), meaning that the overall maximal feature space dimension for the unitary kernel with n qubits and p PPRs is min(3^p, 4^(2n) − 2(4^n − 1)).

Broadly speaking, the global predictive accuracy of a kernel regression model depends on whether the objective function is linear in the kernel's feature space and, for finite-dimensional kernels, on the fraction of the kernel's feature space spanned by the observed data [63]. If the (assumed noiseless) data fully spans the kernel feature space then any new point can be written as a linear combination of the feature-mapped observations, meaning a GP model based on the kernel will be perfectly accurate. For both the state and unitary kernels the linearity condition holds (in the absence of noise); the key difference between them is in the scaling of their feature spaces. In most practical applications an ansatz circuit will have a number of parameterized gates polynomial in the number of qubits, p ∼ poly(n); (18) suggests that this would typically lead to an F_s of maximal dimension 4^n (although this may be reduced by ansatz redundancies or specific circuit structure). To build a globally accurate (high predictive accuracy for all θ) state kernel-based regression model for the energy of an arbitrary n-qubit Hamiltonian we would need O(4^n) training data. In contrast, the unitary kernel's maximal feature space dimension scales like O(4^(2n)) and so one would require roughly quadratically more data to achieve the same global accuracy.
This is less of an issue when considering local regression, which is equally important in Bayesian optimization, where the data only needs to span a subspace of the feature space local to the point of interest. This can be seen particularly in Figure 3b, in which the unitary kernel and classical kernels compete relatively well with the state kernel in highly localized GP regression.

VII. CONCLUSION
The framework we have presented here allows on-device VQE of small systems to be performed with remarkably few energy evaluations. Our use of a Gaussian process surrogate model equipped with a classically-evaluated quantum kernel allows us to avoid many of the difficulties faced when implementing gradient-based optimization on NISQ devices. The two quantum kernels we have considered are based on the similarity (in terms of the fidelity) between two parameterized quantum states and two unitary operations. We have demonstrated that these kernels can be used to build very powerful Gaussian process surrogate models which are manifestly well-suited for regressing the cost function in VQE. For this regression task, these quantum kernel-based surrogate models exhibit significantly better predictive accuracy than many widely-used classical kernels. The advantage in predictive accuracy is particularly acute for the state fidelity-based kernel. Using this kernel one can build accurate regression models with far fewer samples than is needed by the unitary kernel or the classical kernels we have considered.
The advantage brought by a state kernel-based GP surrogate holds both on local and global parameter scales. Accuracy on these scales determines the effectiveness of the exploitative and exploratory phases of Bayesian optimization. Therefore, these results suggest that a state kernel-based GP surrogate can allow for fast Bayesian VQE with few energy evaluations. Due to the large number of kernel evaluations needed to build and optimize a quantum kernel-based GP surrogate, it is impractical to evaluate these kernels on-device. Instead we propose that VQE problems at scales where quantum kernel evaluation remains classically tractable can be solved quickly and with high accuracy using classically-evaluated quantum kernels. This allows one to make use of prior knowledge of the ansatz circuit while implicitly obtaining and making use of knowledge about a device's noise processes through the observed energy values. Through numerical experiments we have verified that this Bayesian approach to VQE is effective with both noiseless and realistic noisy energy observations.
The suitability of the state kernel for VQE stems from its induced feature space. We have explicitly constructed feature maps for the state and unitary kernels and demonstrated that the energy function in VQE is linear in both feature spaces. The dimension of the state kernel is roughly quadratically smaller than that of the unitary kernel (for typical ansatz circuits) and we believe this is the origin of the state kernel's advantage over the unitary kernel in GP regression and Bayesian optimization of VQE.
Finally, while we have shown our method for VQE is highly effective on small numbers of qubits, classical evaluation of the quantum kernels for more qubits quickly becomes intractable. By letting only individual blocks/layers of parameterized gates vary at any one time one can ensure that the unitary kernel remains tractable even for large numbers of qubits. For the state kernel, which is more useful for VQE, the input state to the varied gates must also be considered when evaluating the kernel, which may be infeasible. To remedy this we presented a scheme for approximating the state kernel so that it may be evaluated classically for large numbers of qubits. We ensure that the computational cost of evaluating this approximated state kernel remains bounded by approximating the input state with a matrix product state of limited bond dimension. Our simulations suggest that these approximated state kernels exhibit similar performance to the full state kernel in global GP regression provided the maximum allowed bond dimension is sufficiently large. Our results also suggest that GP models using an approximated state kernel with a lower maximum bond dimension (and so a smaller feature space dimension) can outperform the full state kernel in situations where few energy observations have been made.

ACKNOWLEDGMENTS
A. Gaussian process regression
The kernel-based surrogate model used in our Bayesian VQE simulations is a Gaussian process. A Gaussian process (GP) is a collection of random variables with the property that any finite subset has a joint multivariate normal distribution [63]. It is defined in terms of a covariance (kernel) function k(·, ·) and a mean function µ(·). Given a vector of m observed values y = (y_1, ..., y_m)^T of some unknown function at points X = (x_1, ..., x_m)^T (expressed as an m × p matrix, where p is the input space dimension), the joint distribution with a new point y_* at location x_* is [63]

(y, y_*)^T ∼ N( (µ, µ(x_*))^T, [[K, k], [k^T, k(x_*, x_*)]] ),

where µ is an m-element mean vector with µ_i = µ(x_i), K is an m × m Gram matrix of pair-wise kernel evaluations between the observed inputs, K_ij = k(x_i, x_j), and k is an m-element vector of kernel evaluations between the new point x_* and the observed inputs X, k_i = k(x_*, x_i). If noise is present in the observations this is usually approximated to be normally distributed with zero mean. One can show analytically that for observations with normally distributed noise the kernel should be altered to k(x_i, x_j) → k(x_i, x_j) + σ²_n δ_ij, where σ²_n is a noise variance hyperparameter and the indices of δ_ij correspond to indices in the training data, meaning that K → K + σ²_n I but k and k(x_*, x_*) are unchanged. Marginalising over the observed data, the resulting posterior distribution is itself a normal distribution with a posterior mean (a prediction) E[y_*|x_*, X, y] = ŷ(x_*) given by

ŷ(x_*) = µ(x_*) + k^T K^{-1} (y − µ),   (21)

and a posterior variance

V[y_*|x_*, X, y] = k(x_*, x_*) − k^T K^{-1} k.   (22)

The performance of GP regression depends on the suitability of the kernel function in describing the unknown function. Unless one has significant prior knowledge of the problem at hand, choosing a suitable kernel is often difficult. As a result, problem-agnostic approaches to GP regression often make use of flexible kernels with internal hyperparameters which are varied to fit observed data.
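The posterior mean and variance amount to a few linear solves. The following minimal numpy sketch (with an illustrative RBF kernel and function names of our own choosing) computes them at a single test point:

```python
import numpy as np

def rbf(x1, x2, l=1.0):
    """Simple RBF kernel for scalar inputs (illustrative choice only)."""
    return np.exp(-0.5 * (x1 - x2) ** 2 / l ** 2)

def gp_posterior(K, k, k_ss, y, mu=0.0, mu_s=0.0):
    """Posterior mean (Eq. (21)) and variance (Eq. (22)) at one test point x*.
    K: m x m Gram matrix, k: length-m vector k(x*, x_i), k_ss: k(x*, x*)."""
    alpha = np.linalg.solve(K, y - mu)
    mean = mu_s + k @ alpha
    var = k_ss - k @ np.linalg.solve(K, k)
    return mean, max(float(var), 0.0)  # clamp tiny negative round-off
```

With a noiseless, well-conditioned Gram matrix the posterior mean interpolates the training data exactly and the posterior variance vanishes at observed points.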
A common approach is to find hyperparameter values that maximise the marginal likelihood of the observed data (often via the log-likelihood) [63]. Thanks to the normality of the GP's distributions, the marginal log-likelihood M(X, y) = log p(y|X, α) can be calculated analytically for a kernel with hyperparameter α as

M(X, y) = −(1/2)(y − µ)^T K^{-1} (y − µ) − (1/2) log|K| − (m/2) log 2π.   (23)

The gradient of this quantity with respect to a kernel hyperparameter α is given by

∂M/∂α = (1/2)(y − µ)^T K^{-1} (∂K/∂α) K^{-1} (y − µ) − (1/2) Tr{K^{-1} (∂K/∂α)},   (24)

and so this quantity can be maximised through gradient ascent at a cost primarily dictated by the calculation of K_{σ_n}^{-1} = (K + σ²_n I)^{-1} if the kernel evaluations and their gradients are inexpensive to calculate. The posterior variance (22) depends only on kernel evaluations and not the observations y. If the kernel is fixed, a re-scaling of the training data of the problem y → αy will leave the posterior variance unchanged whereas it should be a factor of α² larger. To allow the posterior variance to scale with the size of the observations an external "signal variance" hyperparameter σ² is often added to the kernel, k(x, x′) → σ² k(x, x′). We see from (21) and (22) that this does not change the posterior mean prediction (up to a re-scaling of σ²_n) but the posterior variance is scaled by σ². If the base kernel has no hyperparameters (which is true for the quantum kernels we consider) and the observed data is assumed to be noiseless then it is simple to show the marginal likelihood of the training data is maximised when σ²_noiseless = (y − µ)^T K^{-1} (y − µ)/m (where K is the Gram matrix for the kernel without σ² and m is the number of observations); this often serves as a good starting point for maximum likelihood optimization even if noise is present.
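The closed-form signal variance is straightforward to verify numerically; the sketch below (function name ours, zero mean assumed) computes σ²_noiseless and can be checked against the marginal likelihood directly:

```python
import numpy as np

def mle_signal_variance(K0, y, mu=0.0):
    """Closed-form maximizer sigma^2 = (y - mu)^T K0^{-1} (y - mu) / m of the
    noiseless marginal likelihood (23) for a scaled kernel k = sigma^2 * k0."""
    r = np.asarray(y) - mu
    return float(r @ np.linalg.solve(K0, r)) / len(r)
```

Setting dM/dσ² = 0 for K = σ²K_0 gives exactly this value, so evaluating M at nearby σ² values should never improve on it.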
The mean function µ is often set to a constant value, most commonly µ(x) = 0 for all x, as it is primarily the covariance function which defines a GP's properties; in most cases µ only has a significant impact at points that are far away (with respect to the kernel) from any observed data. However (as shown in appendix VI) for many ansatz circuits the expected value (taken over the gate angles) of the noiseless energy E θ [E(θ)] is given by Tr{H} meaning that one should set µ(x) = Tr{H}. Alternatively, one can remove this diagonal offset from the Hamiltonian by using H ′ = H − Tr{H}I and set µ = 0. We suggest that this should be done generally as calculating or estimating the average energy for a complicated ansatz will be difficult, especially if device noise is present. Removing this offset is good practice in general as a large diagonal offset in H can lead to improper conclusions on the effectiveness of an optimization scheme.
In our simulations we build the GP models using two quantum kernels (described in section III) and a set of classical kernel functions. These classical kernels are as follows: Matern kernels with smoothness hyperparameter ν = 5/2 and ν = 3/2, the radial basis function (RBF) kernel, and the rational quadratic (RQ) kernel. Their kernel functions are

k_{ν=5/2}(x, x′) = (1 + √(5d(x, x′))/l + 5d(x, x′)/(3l²)) exp(−√(5d(x, x′))/l),
k_{ν=3/2}(x, x′) = (1 + √(3d(x, x′))/l) exp(−√(3d(x, x′))/l),
k_{RBF}(x, x′) = exp(−d(x, x′)/(2l²)),
k_{RQ}(x, x′) = (1 + d(x, x′)/(2αl²))^{−α},

where d(x, x′) is the squared Euclidean distance between points x and x′, l are length scale hyperparameters, and α is a scale mixture hyperparameter for the RQ kernel. These classical kernels require optimization of internal hyperparameters (a length scale l and, for the RQ kernel, a mixing parameter α) to ensure that the kernel's infinite-dimensional feature space most succinctly describes the objective function. By contrast, the state and unitary kernels have finite-dimensional feature spaces in which the (noiseless) energy function is manifestly linear, which means no hyperparameter optimization is required for a GP model to make sensible predictions about the energy. Hyperparameter optimization is often the most computationally intensive part of Bayesian optimization as it requires repeated calculation of the Gram matrix K (see (23) and (24)) and its inverse. If we have a kernel with no hyperparameters other than those for the signal variance and noise, k(x_i, x_j) = σ² k_0(x_i, x_j) + σ²_n δ_ij, where k_0 is fixed, then the Gram matrix for the fixed part of the kernel, K_0, only needs to be calculated once. This simplifies the hyperparameter optimization as the total Gram matrix K = σ² K_0 + σ²_n I can be calculated easily for any σ² and σ²_n from K_0. Additionally, the eigenvalues/vectors of the total K can be found analytically in terms of those for K_0, simplifying calculation of the inverse Gram matrices needed to both optimize the hyperparameters and make predictions from the model.
This ultimately means that the state and unitary kernels only need to be evaluated once between any two data points within a training data set (up to a re-scaling by σ² and a noise offset), in contrast to the classical kernels, whose Gram matrices must be entirely recalculated whenever an internal hyperparameter is adjusted.
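The eigendecomposition trick can be sketched as follows: since σ²K_0 + σ²_n I shares K_0's eigenvectors, with eigenvalues σ²λ_i + σ²_n, solves and log-determinants for any (σ², σ²_n) reuse a single decomposition. This is a minimal numpy illustration with function names of our own choosing:

```python
import numpy as np

def precompute(K0):
    """One-off eigendecomposition of the fixed Gram matrix K0 (symmetric)."""
    lam, Q = np.linalg.eigh(K0)
    return lam, Q

def solve_K(lam, Q, sigma2, sigma_n2, b):
    """Solve (sigma^2 K0 + sigma_n^2 I) x = b via the cached eigensystem:
    the total Gram matrix has eigenvalues sigma^2 lam_i + sigma_n^2."""
    return Q @ ((Q.T @ b) / (sigma2 * lam + sigma_n2))

def logdet_K(lam, sigma2, sigma_n2):
    """log|sigma^2 K0 + sigma_n^2 I| from the cached eigenvalues."""
    return float(np.sum(np.log(sigma2 * lam + sigma_n2)))
```

After the O(m³) decomposition, each hyperparameter trial costs only O(m²) for solves and O(m) for the log-determinant.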

B. Bayesian optimization
In Bayesian optimization one attempts to minimize an unknown, expensive-to-evaluate objective function by assuming it is randomly drawn from some family of functions [67]. A prior distribution is chosen to encode any initial beliefs about this function, usually that it is sampled from a Gaussian process. From observations of the function's values and the prior, a posterior distribution over the chosen family of functions is constructed (the posterior distribution of the GP), quantifying how likely it is that a given function could have produced the observations. An acquisition function is then calculated from the posterior distribution and is maximised/minimised to determine the next point to query. Once the new point has been queried, the posterior distribution is reconstructed with this new data and the process is repeated until convergence to the objective function's global minimum.
The acquisition function quantifies how "promising" querying a given unseen point appears for minimizing the objective function. These functions are often designed to balance the exploitation of regions of the parameter space that have already shown good objective function values with the exploration of areas in the parameter space where observations are sparse. By only querying parameter points that appear likely to yield a significant improvement in the objective function, Bayesian optimization can allow global minima to be found with remarkably few iterations. In our simulations we will use the Expected Improvement acquisition function [67] given (for minimization) by EI(x) = E[max(y_best − y_* + ξ, 0)|x, X, y], where the expectation is taken over the surrogate model's posterior distribution at x, y_best = min{y} is the lowest observation seen so far, and ξ is an exploration hyperparameter. Using the analytically tractable GP posterior distribution (i.e. normally distributed with mean/variance given by (21) and (22)) we can derive an explicit formula for this [67]:

EI(x) = (y_best − ŷ(x) + ξ) Φ(Z) + σ(x) ϕ(Z),   Z = (y_best − ŷ(x) + ξ)/σ(x),

where σ(x) is the posterior standard deviation at x, and Φ and ϕ are the cumulative distribution and probability density functions for the standard normal distribution respectively. Expected improvement-based optimization can sometimes be overly greedy and exploitative and so a hyperparameter ξ has been added to control the ratio of exploration to exploitation. The ξ parameter allows more positive observations (up to where ŷ(x) = y_best + ξ) to count as an improvement, meaning the acquisition function encourages exploration further away from the point where y_best was observed. In our simulations we used ξ = 0.01 and optimized the surrogate model using the L-BFGS-B gradient-descent algorithm [87].

Figure 9. Illustration of the scheme used to produce an MPS approximation to the state kernel. We assume the ansatz circuit has a limited number of parameterized gates, as would be used in layer/block-wise VQE. a.) The portions of the ansatz circuit which are relevant to the state kernel are separated into a constant part U_A and a parameterized part U_B. b.) The ansatz circuit is re-expressed in a tensor network representation. c.) The input state to U_A is converted to an MPS with bond dimension χ = 1, into which the gates are sequentially contracted (using the contraction and truncation process illustrated in Figure 10).
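The closed-form expected improvement is simple to evaluate; the sketch below (our function name, with the standard normal evaluated via the error function) implements it for a Gaussian posterior with mean ŷ(x) and variance σ²(x):

```python
import math

def expected_improvement(mean, var, y_best, xi=0.01):
    """Closed-form Expected Improvement for minimization,
    EI = E[max(y_best - y + xi, 0)] with y ~ N(mean, var):
    EI = delta * Phi(Z) + sigma * phi(Z), Z = delta / sigma,
    where delta = y_best - mean + xi."""
    sigma = math.sqrt(max(var, 1e-18))  # guard against a vanishing variance
    delta = y_best - mean + xi
    z = delta / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    return delta * Phi + sigma * phi
```

EI is always non-negative, grows with the posterior uncertainty, and vanishes for points confidently predicted to be far worse than y_best.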
While it is possible to run Bayesian optimization starting with an observation at a single randomly chosen parameter point, more typically a collection of initial points is used to ensure an initial explorative phase. While there are many designs for these points [88], we use the simplest: random uniform sampling across the parameter space.

C. MPS approximation to the state kernel
Our approximation scheme for the state kernel relies on the parameterized circuit being of the form |ψ(θ_B)⟩ = U_C(θ_C) U_B(θ_B) U_A(θ_A) |0⟩, where θ_A and θ_C are fixed, so that the state kernel can be written in the form (15). We also require that U_B is simple enough that its action on an MPS of relatively low bond dimension can be simulated classically. If this is the case then, by approximating |ψ_A⟩ with an MPS |ψ_A^(χmax)⟩, we can achieve a classically tractable approximation to the state kernel. Figure 9 illustrates our approximation scheme. We begin (9a) with a quantum circuit composed of a constant part (U_A) and a variational part (U_B) whose parameters are to be optimized. For simplicity we omit any U_C that lies in the future light-cone of the variational part of the circuit as this has no effect on the state kernel. The circuit is then converted into a tensor-network representation, which can be done at a cost proportional to the number of gates (an example is shown in 9b). The initial state |0⟩ is then converted to a matrix product state with bond dimension χ = 1 and each gate in the circuit is contracted into the MPS to produce |ψ_A^(χmax)⟩ (shown in 9c).

Figure 10 illustrates the process of contracting a gate G into an MPS |ψ_MPS⟩ (with physical dimension d). As we only consider single- and two-qubit gates (outlined in green and yellow respectively) only two MPS tensors with shared bond dimension χ are shown (outlined in blue). Vertical open bonds indicate connections to the rest of the MPS. In 10a a single-qubit unitary is contracted directly into the MPS tensor at the corresponding site/qubit, altering this tensor but leaving the bond dimensions unchanged. Figure 10b shows the contraction of a two-qubit unitary which acts on adjacent sites. The two-qubit unitary is fully contracted into the MPS tensors, yielding a joint tensor for the two sites. Singular value decomposition (SVD) is used to split this joint tensor into two new MPS tensors with a new bond dimension χ′. The size of χ′ depends on the specifics of the unitary and the state to which it is applied but takes a maximum value of d²χ. Figure 10c shows this diagrammatically by giving an alternative scheme for contracting the two-qubit gate. Here the unitary is first decomposed using SVD into two tensors connected by a bond of dimension d². Each resulting tensor is contracted into its connected MPS tensor, yielding an MPS with two bonds between the new MPS tensors (of size χ and d²). These bonds are combined to form a single bond of dimension d²χ. Two-qubit gates on non-neighbouring sites can be contracted using the same scheme by applying a chain of SWAP gates (as described in [89]) but this can significantly increase the bond dimensions. Applying many two-qubit gates will lead to exponential growth in the MPS's bond dimensions. To avoid this we truncate χ′ by only retaining the χ_max largest singular values obtained from the SVD in the final stage of 10b, i.e. χ′ = min(χ_svd, χ_max). Following truncation the MPS is re-normalized, yielding an approximation |ψ′_MPS⟩ to the transformed state G|ψ_MPS⟩ with a maximum bond dimension of χ_max. If significant singular values of the joint tensor are discarded, the fidelity between |ψ′_MPS⟩ and the target state G|ψ_MPS⟩ will be degraded. We partially mitigate this by maximizing the fidelity |⟨ψ′_MPS|G|ψ_MPS⟩|² between these two states, optimizing over |ψ′_MPS⟩. This procedure can be greatly simplified by only optimizing over the two MPS tensors that are altered when producing |ψ′_MPS⟩ (shown as yellow dots with a blue outline in Figures 10b and 10d). By first contracting over the tensors which are not optimized (creating the tensor shown with red fill in Figure 10d) this optimization can be performed extremely efficiently, involving contractions over just 4 tensors.

Figure 10. Illustration of the process for contracting single- and two-qubit gates into a matrix product state. The initial matrix product state is shown as a series of connected MPS tensors (blue circles with dark blue outlines), one at each qubit/site. Only relevant sites are shown in each diagram. Relevant bond dimensions are shown; χ denotes an internal bond while d denotes d-dimensional physical bonds (d = 2 for qubits). a.) A single-qubit gate is applied by contracting its unitary (shown as a green dot) directly into the MPS tensor of the corresponding qubit, leaving the MPS structure and bond dimensions unchanged. b.) A two-qubit gate on adjacent qubits is applied by contracting its unitary (yellow oblong with dark yellow borders) into the MPS tensors of the corresponding qubits and splitting the result with SVD. The bond dimension is capped to a value χ′ = min(χ_svd, χ_max) by retaining only the χ_max largest singular values, after which the state is renormalized. c.) An alternative two-qubit gate application scheme demonstrating that the maximal value of χ′ is d²χ. d.) Following truncation, the fidelity between the produced MPS state and the non-truncated state (the modulus squared of the tensor network shown) is optimized to improve accuracy. This optimization is simplified by only varying the tensors altered when applying the gate and by pre-contracting all other tensors to form the red tensor shown.
These initial contractions can also be further simplified by first canonicalizing |ψ_MPS⟩ around the two qubits acted upon by the unitary [82] (provided the MPS has open boundaries to allow this). Once all gates in the circuit have been contracted into the MPS using the scheme we have outlined (with the necessary truncation and fidelity optimization), the resulting MPS |ψ_A^(χmax)⟩ is used as the input state for the approximated state kernel given in (15).
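As a concrete illustration, the core two-site gate application with SVD truncation (Fig. 10b) and the site-by-site overlap contraction needed to evaluate the approximated kernel can be sketched in a few lines of numpy. This is our own minimal sketch under simplifying assumptions, not the authors' implementation: it omits canonicalization, the fidelity-optimization step, and SWAP handling for non-adjacent qubits, and the function names are illustrative.

```python
import numpy as np

def apply_two_qubit_gate(A, B, G, chi_max):
    """Contract a two-qubit gate G (d^2 x d^2) into adjacent MPS tensors
    A (chiL, d, chi) and B (chi, d, chiR), split the joint tensor with an
    SVD, and keep at most chi_max singular values (cf. Fig. 10b)."""
    chiL, d, _ = A.shape
    chiR = B.shape[2]
    theta = np.einsum('lic,cjr->lijr', A, B)          # joint two-site tensor
    Gt = G.reshape(d, d, d, d)                        # Gt[a, b, i, j]
    theta = np.einsum('abij,lijr->labr', Gt, theta)   # apply the gate
    U, S, Vh = np.linalg.svd(theta.reshape(chiL * d, d * chiR),
                             full_matrices=False)
    chi_new = min(chi_max, S.size)
    S = S[:chi_new] / np.linalg.norm(S[:chi_new])     # truncate + renormalize
    A_new = U[:, :chi_new].reshape(chiL, d, chi_new)
    B_new = (S[:, None] * Vh[:chi_new]).reshape(chi_new, d, chiR)
    return A_new, B_new

def mps_overlap(bra, ket):
    """<bra|ket> for two MPS (lists of (chiL, d, chiR) tensors),
    contracted left to right with a (bra bond, ket bond) environment."""
    E = np.ones((1, 1), dtype=complex)
    for Ab, Ak in zip(bra, ket):
        E = np.einsum('ab,aic,bid->cd', E, Ab.conj(), Ak)
    return E[0, 0]

# Example: apply a random two-qubit unitary to the product state |00>
rng = np.random.default_rng(0)
A = np.zeros((1, 2, 1), dtype=complex); A[0, 0, 0] = 1.0
B = np.zeros((1, 2, 1), dtype=complex); B[0, 0, 0] = 1.0
G, _ = np.linalg.qr(rng.standard_normal((4, 4))
                    + 1j * rng.standard_normal((4, 4)))
A2, B2 = apply_two_qubit_gate(A, B, G, chi_max=4)
```

When chi_max is large enough that no singular values are discarded, the two new tensors reproduce G|ψ⟩ exactly; the overlap contraction costs O(dχ³) per site, which is what keeps the approximated state kernel classically cheap at modest bond dimensions.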

(A11)
We can also use (A11) to calculate the mean value of the energy function E_θ[E(θ)] with respect to the gate angles, which is required to properly set the Gaussian process' mean. Because E(θ) is a linear function of v(θ), we have E_θ[E(θ)] = h · E_θ[v(θ)]. The vector v(θ) = v^(1) ⊗ ... ⊗ v^(k) is a Kronecker product of k subvectors v^(j) = (1, sin(θ_j), cos(θ_j))^T, each of which depends only on a single gate angle (we assume all gate angles are independent). This means that E_θ[v(θ)] = E_θ[v^(1)] ⊗ ... ⊗ E_θ[v^(k)] = (1, 0, 0)^T ⊗ ... ⊗ (1, 0, 0)^T, since sin(θ_j) and cos(θ_j) each average to zero over uniformly distributed angles. So only the first element of E_θ[v(θ)] is nonzero, greatly simplifying the expected energy to E_θ[E(θ)] = h_0.

One way to interpret (1/2)(R_q ⊗ R*_q + P_q R_q ⊗ P*_q R*_q) is as a projector onto the positive eigenspace of the Pauli operator P_q ⊗ P*_q. Alternatively it can be seen as taking the part of its unvectorized input (following the application of R_q) that commutes with P_q. If we decompose an input ρ′ = R_q ρ R†_q into parts (sums of Pauli operators) that commute and anticommute with P_q, i.e. ρ′ = c_q + a_q with [P_q, c_q] = 0 and {P_q, a_q} = 0, then (1/2)(R_q ⊗ R*_q + P_q R_q ⊗ P*_q R*_q)|ρ⟫ = (1/2)|R_q ρ R†_q + P_q R_q ρ R†_q P_q⟫ = |c_q⟫. Applying each of these projectors in this way we see that h_0 = Tr{R†_{p+1} H R_{p+1} c_{1,...,k}}, where c_{1,...,k} is what remains of ρ_0 once each projector has been applied in turn (applying each R_q and removing the part of the result that anticommutes with P_q). In general calculating this is complicated, as it depends on the fixed unitaries {R_q} present in the ansatz; however, we suggest that for most ansatzes of interest the remainder will be c_{1,...,k} = I/2^n, as this factor is present in all density matrices (ensuring Tr{ρ} = 1) and commutes with any P_q regardless of the fixed unitaries applied. This leads to our suggested approximation to the mean energy of E_θ[E(θ)] = h_0 ≈ Tr{R†_{p+1} H R_{p+1} I/2^n} = Tr{H}/2^n. It takes only a few anticommuting parameterized Pauli rotations on each qubit (e.g. two sequential non-commuting single-qubit PPRs on each qubit) for this approximation to be exact.
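The averaging argument above is easy to check numerically. In the sketch below (our own illustration; the coefficient vector h is an arbitrary stand-in for the ansatz-dependent coefficients of (A11), and the function names are hypothetical), a Monte Carlo average of E(θ) over uniformly sampled gate angles converges to the constant component h_0:

```python
import numpy as np

rng = np.random.default_rng(2)
k = 3                                 # number of parameterized gate angles
h = rng.standard_normal(3**k)         # arbitrary stand-in coefficients

def v(theta):
    """v(theta) = v^(1) x ... x v^(k) with v^(j) = (1, sin t_j, cos t_j)^T."""
    out = np.array([1.0])
    for t in theta:
        out = np.kron(out, np.array([1.0, np.sin(t), np.cos(t)]))
    return out

def energy(theta):
    return h @ v(theta)

# sin and cos average to zero over a full period, so only h[0] survives
mean_E = np.mean([energy(rng.uniform(0.0, 2 * np.pi, k))
                  for _ in range(100_000)])
```

With 10^5 samples the estimate mean_E agrees with h[0] to within the Monte Carlo error, consistent with E_θ[E(θ)] = h_0.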