Quantum Kernel Evaluation via Hong-Ou-Mandel Interference

One of the fastest growing areas of interest in quantum computing is its use within machine learning methods, in particular through the application of quantum kernels. Despite this large interest, there exist very few proposals for relevant physical platforms to evaluate quantum kernels. In this article, we propose and simulate a protocol capable of evaluating quantum kernels using Hong-Ou-Mandel (HOM) interference, an experimental technique that is widely accessible to optics researchers. Our proposal utilises the orthogonal temporal modes of a single photon, allowing one to encode multi-dimensional feature vectors. As a result, by interfering two photons and using the detected coincidence counts, we can perform a direct kernel measurement and binary classification. This physical platform confers an exponential quantum advantage that has also been described theoretically in other works. We present a complete description of this method and perform a numerical experiment to demonstrate a sample application for binary classification of classical data.


I. INTRODUCTION
In recent years, there has been growing interest in the applications of machine learning to the physical sciences [1]. One particular area that has received considerable attention is quantum machine learning [1-20]. In quantum machine learning, evaluating a quantum kernel is analogous to computing a classical kernel in classical machine learning. Both methods aim to measure the similarity or distance between data points in high-dimensional feature spaces without explicitly transforming the data. In classical machine learning, kernel methods use kernel functions to calculate inner products between data points in the transformed space. In quantum machine learning, classical data is encoded into quantum states, and the quantum kernel computes inner products between these quantum states to quantify their similarity or distance in the quantum feature space. Encoding classical data into a quantum system is equivalent to embedding the data into a quantum feature space [6, 9, 12-14, 21, 22], and measurement of the quantum system is equivalent to evaluating the kernel.
This connection between classical and quantum kernels leverages the computational advantages of quantum computing for efficient inner product calculations, providing a powerful tool for quantum machine learning algorithms and applications in data analysis and pattern recognition.
Despite much success in the theoretical investigation of quantum kernels, there have been only a few experimental demonstrations. A seminal approach using nuclear spins [23] encodes specific feature vectors into two unitary operators, which are consecutively applied to an initial state before the system is measured along its magnetic moment, providing kernel evaluation. Several applications using quantum optics have also been demonstrated. Two studies have encoded features into the dual-rail encoding of multiple photons, first demonstrated in [24] and subsequently in [25]. The entanglement between dual-rail encoded photons can be used to exploit the quantum advantage inherent in quantum kernels, namely the exponential speedup in their evaluation [26]. A final example involves using the spectral modes of ultrafast radiofrequency pulses to classify and train labeled datasets [27]. The advantage of this methodology is that it makes use of the photon's spectral modes, decomposing these into an orthogonal eigenbasis for encoding higher-dimensional feature vectors. The latter results highlight the utility of photonics in quantum information. In fact, optically encoded quantum information appears in many machine learning examples [28-33]. The popularity of photonic quantum information stems from the photon's versatility, given its numerous and highly controllable degrees of freedom [34]. Despite the numerous encoding proposals that hinge on these increased degrees of freedom, the currently proposed physical platforms are limited to qubit-based models. To this end, we propose a new method of evaluating quantum kernels using Hong-Ou-Mandel interference. As we will show, this method not only makes use of the entanglement necessary for the enhancement in quantum kernels [24, 26], but also utilizes the higher-dimensional spectrum of single photons, which was exploited in [27].

II. KERNEL METHOD MACHINE LEARNING
We will begin by providing a brief pedagogical overview of kernel methods; for completeness see Ref. [35]. To summarize kernel methods succinctly, one starts by encoding data into some higher-dimensional feature space where the data can be easier to classify. However, what makes this approach so useful is that it does not need to explicitly perform evaluations of the data in the feature space, but rather can be carried out using the kernel function that is defined on the domain of the original input data; this is commonly known as the kernel trick.
To understand this more precisely, let us describe a simple machine learning classification task. Suppose we have a data set Y (say, of N images) as a list of input vectors and labels, Y ≡ {(⃗x_1, y_1), (⃗x_2, y_2), ..., (⃗x_N, y_N)}. Here y_i is the label (for example, cat or dog) and ⃗x_i ∈ X is the input data (for example, the pixel colour values or some other set of features). The goal of our classification task is ultimately to classify an unknown data set Y′, that is, a set of data with unknown labels. One approach is to learn the complex relationship between inputs ⃗x_i and labels y_i, such that when the algorithm is presented with an unknown data set Y′, it will correctly classify this data. This is typically the approach of deep neural networks, which employ a vast number of tunable parameters and non-linear functions to effectively replicate this relationship.
Kernel methods, on the other hand, embed the data X in a higher-dimensional feature space F via the feature map ϕ: X → F. In the feature space, the data is then classified by a suitable choice of distance measure along a decision boundary; for example, the inner product, which we will consider going forward. This concept is demonstrated visually in Fig. (1). In this sense, a kernel k maps two inputs ⃗x and ⃗x′, both of which are from the input space X, to a distance measure in ℂ, such that k: X × X → ℂ [35]. Moreover, the feature map ϕ is related to the kernel via the inner product of feature vectors,

$$k(\vec{x}, \vec{x}') = \langle \phi(\vec{x}), \phi(\vec{x}') \rangle_{\mathcal{F}},$$

which has an associated Gram matrix K, with entries K_ij = k(⃗x_i, ⃗x_j), that is positive semi-definite. If the Gram matrix satisfies the condition

$$\sum_{i,j=1}^{M} c_i\, c_j^{*}\, k(\vec{x}_i, \vec{x}_j) \geq 0 \qquad (3)$$

for any c_1 ... c_M ∈ ℂ, an associated feature mapping is guaranteed to exist and the function can be considered a kernel [35]. With all this specified, one can then classify a data point ⃗x in the feature space F according to some decision boundary b that separates the classes, as depicted in Fig. (1), through a decision function f(x) whose sign, positive or negative, indicates the binary classification.
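As a hedged illustration of the kernel trick and of the Gram matrix condition in Eq. (3), the following Python sketch uses a homogeneous degree-two polynomial kernel on two-dimensional inputs. The function names, the random test points, and the restriction to real coefficients c_i are assumptions made for this example only; they are not taken from the original work.

```python
import math
import random


def inner(u, v):
    """Real Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def feature_map(x):
    """Explicit map into F for the degree-two kernel (illustrative choice)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def kernel(x, y):
    """Kernel trick: k(x, y) = (x . y)^2, evaluated on the input space only."""
    return inner(x, y) ** 2


def gram_matrix(points):
    """Gram matrix K_ij = k(x_i, x_j)."""
    return [[kernel(p, q) for q in points] for p in points]


def quadratic_form(K, c):
    """sum_ij c_i c_j K_ij for real coefficients (assumed real for simplicity)."""
    n = len(c)
    return sum(c[i] * c[j] * K[i][j] for i in range(n) for j in range(n))


rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(6)]
K = gram_matrix(points)

# The kernel agrees with the explicit inner product in feature space.
max_gap = max(
    abs(kernel(p, q) - inner(feature_map(p), feature_map(q)))
    for p in points
    for q in points
)

# Random coefficient vectors never produce a (significantly) negative form.
min_form = min(
    quadratic_form(K, [rng.gauss(0, 1) for _ in points]) for _ in range(200)
)
print(max_gap, min_form)
```

The first check shows that the kernel can be evaluated without ever constructing ϕ(⃗x); the second is a spot check, not a proof, of positive semi-definiteness.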
Given that a kernel can be evaluated as the inner product between two feature vectors, there is a natural extension to quantum feature spaces [13]. Suppose that we again have some input data ⃗x, which is then encoded into a quantum state |Φ(⃗x)⟩. The quantum kernel will correspond to the overlap of this state with another,

$$k(\vec{x}, \vec{x}') = |\langle \Phi(\vec{x}) | \Phi(\vec{x}') \rangle|^{2}. \qquad (5)$$

Furthermore, given that all the kernel outcomes can be mapped to probabilities, any inner product calculated using quantum states in this way will automatically satisfy the Gram matrix condition outlined in Eq. (3). This has provided ample motivation for the development of kernel-based quantum machine learning.
Evaluating the kernel requires that one can directly measure the overlap between two quantum states, which either requires quantum state tomography or other methods using quantum computers where one can directly parameterize the feature map ϕ.

III. TEMPORAL ENCODING AND HONG-OU-MANDEL INTERFERENCE
Here we will outline a proposal for quantum kernel evaluation using the temporal encoding of single-photon Fock states and Hong-Ou-Mandel interference. Creating higher-dimensional quantum optical states can be achieved in many ways, as detailed in the recent reviews on this topic [34, 36-39]. Experimentally, however, not all methods of encoding information in optics are made equal. For example, a natural orthonormal basis would be multi-photon states; however, these can be difficult to measure and require photon-number-resolving detectors [40-43].
A more common methodology that circumvents this difficulty is to encode into multiple photons in different paths, while simultaneously exploiting the two polarisation modes, thus creating numerous dual-rail encoded qubits, which to date have been used to evaluate quantum kernels [24]. However, the physical scaling and technological challenges associated with numerous dual-rail qubits can prevent higher-dimensional feature spaces from being reached in practice.
We propose to sidestep these complications through the use of temporally/frequency encoded photons, a method which has received considerably less attention in the quantum kernel community [44, 45]. A continuous-mode single-photon state can be described as the coherent superposition of many spectral modes ω,

$$|1_{\Psi}\rangle = \int d\omega\, \Psi(\omega)\, \hat{a}^{\dagger}(\omega)\, |0\rangle,$$

where Ψ(ω) is the spectral density function that weights each mode, and â†(ω) are the creation operators associated with each ω. Furthermore, we will make the quantum optics approximation whereby we assume that the spectral spread is much smaller than the carrier frequency, Δω ≪ ω_c [46]. In this limit, the Fourier transform of the slowly varying spectral envelope corresponds to the temporal wave-packet, thus yielding the description of the single-photon state in the time domain,

$$|1_{\Psi}\rangle = \int dt\, \Psi(t)\, \hat{a}^{\dagger}(t)\, |0\rangle.$$

Here the temporal wave-packet satisfies the normalisation condition ∫dt |Ψ(t)|² = 1 and the bosonic field operators satisfy the commutation relation

$$[\hat{a}(t), \hat{a}^{\dagger}(t')] = \delta(t - t'). \qquad (8)$$

Feature vectors can be encoded into the temporal modes of single photons provided a set of orthogonal temporal modes is chosen,

$$\Psi(t) = \sum_{n} \alpha_n(\vec{x})\, u_n(t),$$

where α_n(⃗x) denotes a unit weight of each mode, encoding the information from the feature vector, and {u_n(t)} is a set of orthonormal temporal mode functions [45]. In our work, we will take {u_n(t)} to be the set of orthogonal Hermite-Gaussian (HG) modes, noting that this choice is arbitrary and could be replaced by any numerically orthogonal set of single-variable functions. Each HG mode is the product of a Gaussian envelope and H_n, the nth Hermite polynomial, and the set satisfies the orthogonality relation

$$\int dt\, u_n^{*}(t)\, u_m(t) = \delta_{nm}. \qquad (12)$$

FIG. 2. (Top) A depiction of HOM interference, where two identical photons incident on a beam splitter simultaneously will interfere, causing them to bunch into pairs. This effect can be measured using the coincidence counts (CC) of the two respective detectors. (Below) The HOM dip as a function of the relative time-delay dt of the photons. As the time-delay between the two vanishes, the probability of measuring a CC also vanishes.
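The orthonormality of the HG modes can be checked numerically. The sketch below is a minimal illustration that assumes a unit pulse width (time measured in units of the wave-packet width), the standard normalisation of the Hermite-Gaussian functions, and a simple Riemann sum for the integral; the helper names are hypothetical and are not part of the original work.

```python
import math


def hermite(n, x):
    """Physicists' Hermite polynomial H_n(x) via H_{k+1} = 2x H_k - 2k H_{k-1}."""
    h_prev, h = 1.0, 2.0 * x
    if n == 0:
        return h_prev
    for k in range(1, n):
        h_prev, h = h, 2.0 * x * h - 2.0 * k * h_prev
    return h


def hg_mode(n, t):
    """Normalised Hermite-Gaussian mode u_n(t) with unit width (assumed)."""
    norm = 1.0 / math.sqrt((2.0 ** n) * math.factorial(n) * math.sqrt(math.pi))
    return norm * hermite(n, t) * math.exp(-0.5 * t * t)


def mode_overlap(n, m, t_max=12.0, steps=2400):
    """Riemann-sum estimate of the integral of u_n(t) u_m(t) over time."""
    dt = 2.0 * t_max / steps
    total = 0.0
    for k in range(steps + 1):
        t = -t_max + k * dt
        total += hg_mode(n, t) * hg_mode(m, t)
    return total * dt


# Overlap matrix of the first four modes; it should be close to the identity.
overlaps = [[mode_overlap(n, m) for m in range(4)] for n in range(4)]
for row in overlaps:
    print(["%.6f" % value for value in row])
```

Any other orthonormal set could be substituted for `hg_mode` without changing the rest of the check.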
Furthermore, we will define the unit weight vector α_n simply as

$$\alpha_n(\vec{x}) \propto w_n\, \phi_n(\vec{x}), \qquad (13)$$

where ϕ_n(⃗x) ∈ ℂ is the nth element of the feature vector of the input data ⃗x, and w_n is a free weight, commonly added in kernel methods to allow optimisation of the resulting kernel. Moreover, the coefficients are normalized such that Σ_n |α_n|² = 1. We are thus in a position to define the proposed single-photon encoding for an input vector ⃗x as

$$|\Phi(\vec{x}, t)\rangle = \sum_{n} \alpha_n(\vec{x}) \int dt\, u_n(t)\, \hat{a}^{\dagger}(t)\, |0\rangle. \qquad (14)$$

In our proposed encoding scheme, the single photon corresponds to the information carrier, the HG temporal modes correspond to the orthogonal basis of the feature space, and the coefficients encode the data into this basis. Now suppose we have two encoded feature vectors |Φ(⃗x, t)⟩ and |Φ(⃗x′, t)⟩ and we would like to measure the quantum kernel Eq. (5).
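The encoding can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the amplitudes are taken as the normalised products w_n ϕ_n(⃗x), the modes are assumed perfectly orthonormal so that the kernel reduces to a sum over mode amplitudes, and the names `encode` and `quantum_kernel` are hypothetical.

```python
import math


def encode(features, weights):
    """Mode amplitudes alpha_n proportional to w_n * phi_n(x), normalised to 1."""
    raw = [w * f for w, f in zip(weights, features)]
    norm = math.sqrt(sum(abs(a) ** 2 for a in raw))
    if norm == 0.0:
        raise ValueError("weighted feature vector must be non-zero")
    return [a / norm for a in raw]


def quantum_kernel(alpha, beta):
    """|sum_n conj(alpha_n) beta_n|^2, valid for orthonormal temporal modes."""
    amplitude = sum(complex(a).conjugate() * b for a, b in zip(alpha, beta))
    return abs(amplitude) ** 2


# Two orthogonal feature vectors with equal weights.
alpha_x = encode([3.0, 4.0], [1.0, 1.0])
alpha_y = encode([4.0, -3.0], [1.0, 1.0])
print(quantum_kernel(alpha_x, alpha_x), quantum_kernel(alpha_x, alpha_y))

# The free weights reshape the kernel: suppressing the second mode makes
# the same two inputs indistinguishable.
beta_x = encode([3.0, 4.0], [1.0, 0.0])
beta_y = encode([4.0, -3.0], [1.0, 0.0])
print(quantum_kernel(beta_x, beta_y))
```

The second half shows why the weights w_n are useful as trainable parameters: they change which inputs the kernel regards as similar.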
We can do this by interfering the two photons with each other on a 50:50 beam-splitter (BS). This interference is intrinsically quantum mechanical and leads to 'bunching', where both photons exit the same port and are detected together. This is known as Hong-Ou-Mandel (HOM) interference [47]. The probability of detecting the two photons together at either detector, P(2, 0) or P(0, 2), is determined by the overlap of the two wave functions, Eq. (15), which can be readily evaluated using the commutation relation Eq. (8) and the orthogonality of the HG modes Eq. (12) as

$$|\langle \Phi(\vec{x}, t) | \Phi(\vec{x}', t) \rangle|^{2} = \Big| \sum_{n} \alpha_n^{*}(\vec{x})\, \alpha_n(\vec{x}') \Big|^{2},$$

which provides a direct evaluation of the quantum kernel. This quantum kernel can be observed directly by measuring the HOM dip, whereby two temporally synchronized detectors measure correlated photon detection events, otherwise known as coincidence counts (CC) [47, 48]. This is visualized in Fig. (2). The normalized CC (divided by the total number of counts) is equal to the probability of measuring a photon event at both detectors simultaneously and is therefore equivalent to

$$\mathrm{CC}(\vec{x}, \vec{x}') = 1 - |\langle \Phi(\vec{x}, t) | \Phi(\vec{x}', t) \rangle|^{2}. \qquad (17)$$

This provides a clear experimental evaluation of the quantum kernel, where the HOM interference quantifies the similarity between feature vectors ⃗x and ⃗x′.

Finally, there is an immediate parallel between our physical proposal and a theoretical proposal with a quantum advantage, discovered in [26] and tested in [24]. Firstly, both proposals address the same task, namely supervised cluster assignment. In the interest of brevity, we will use the abbreviated notation of Eq. (6), where |Φ(⃗x′, t)⟩ → |1_Φ′⟩ describes the encoded features in orthogonal temporal modes. We therefore begin with two photons, |1_Φ⟩ and |1_Φ′⟩, encoded into different paths. The two photons are then interfered on a single beam splitter as described above, leading to the state of Eq. (19), in which the notation |2_Φ′,Φ⟩ indicates that the two photons are in the same path mode. As we have shown above, the probability of measuring two photons in either port is given by Eq. (15), and the probability of measuring one photon in each port (a coincidence count) is given by Eq. (17). This proposed scheme makes use of the same quantum resources required in [26], whereby a larger distance between Φ(⃗x, t) and Φ(⃗x′, t) is represented in the entanglement of the state given by Eq. (19). Rather than relying on an empty ancilla as an indicator function, we make use of both photons simultaneously as information carriers. By utilizing entanglement as a resource in the same manner as [26], we conclude that our proposed physical platform offers the same quantum advantage over its classical counterparts.
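To make the link between the HOM dip and the kernel concrete, the following Python sketch scans the relative delay between two temporally encoded photons and evaluates the normalised coincidence count in the convention of Eq. (17), CC = 1 − |overlap|². It is a minimal illustration, assuming unit-width Hermite-Gaussian modes, a lossless 50:50 beam splitter, ideal detectors, and a Riemann-sum integral; all function names are hypothetical.

```python
import math


def hermite(n, x):
    """Physicists' Hermite polynomial H_n(x) by upward recurrence."""
    h_prev, h = 1.0, 2.0 * x
    if n == 0:
        return h_prev
    for k in range(1, n):
        h_prev, h = h, 2.0 * x * h - 2.0 * k * h_prev
    return h


def hg_mode(n, t):
    """Normalised Hermite-Gaussian mode u_n(t) with unit width (assumed)."""
    norm = 1.0 / math.sqrt((2.0 ** n) * math.factorial(n) * math.sqrt(math.pi))
    return norm * hermite(n, t) * math.exp(-0.5 * t * t)


def wavepacket(alpha, t):
    """Temporal wave-packet: sum_n alpha_n u_n(t)."""
    return sum(a * hg_mode(n, t) for n, a in enumerate(alpha))


def coincidence(alpha, beta, delay, t_max=14.0, steps=1400):
    """Normalised CC = 1 - |overlap|^2 with the second photon delayed."""
    dt = 2.0 * t_max / steps
    amplitude = 0.0
    for k in range(steps + 1):
        t = -t_max + k * dt
        amplitude += complex(wavepacket(alpha, t)).conjugate() * wavepacket(beta, t - delay)
    return 1.0 - abs(amplitude * dt) ** 2


identical = [1.0, 0.0]   # photon in mode u_0
orthogonal = [0.0, 1.0]  # photon in mode u_1

# Delay scan for two identical photons: a dip centred on zero delay.
delays = [-6.0 + 0.5 * k for k in range(25)]
dip = [coincidence(identical, identical, d) for d in delays]

# Orthogonal encodings at zero delay: no dip, the kernel is zero.
cc_orthogonal_zero = coincidence(identical, orthogonal, 0.0)
print(min(dip), cc_orthogonal_zero)
```

For two identical Gaussian photons the scan reproduces the familiar shape 1 − exp(−τ²/2) in these units, while orthogonal encodings give a coincidence count of one at zero delay.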

IV. APPLICATION EXAMPLE: NUMERICAL EXPERIMENT
We have so far described how one can use the CC of temporally encoded photons to evaluate a quantum kernel. Now we will demonstrate an example application for a binary classification task using generated data. The model for machine learning and classification in this case is a maximum mean discrepancy (MMD) model via kernel mean embedding (KME), with training of the model performed via stochastic gradient descent (SGD). The applications of this physical platform are not limited to any aspect of this model; it can be used for any machine learning algorithm that requires kernel evaluation.

A. Kernel Implementation Model: Maximum Mean Discrepancy
This subsection outlines the use of MMD for kernel machine learning and specifically the implementation for our proposed physical platform. Suppose we have two classifications, characterized by the probability distributions over the input data P(⃗x) and Q(⃗x) respectively. Then, given a kernel function k: X × X → ℂ with associated mapping ϕ: X → F, we can define the group mean µ_P over the classification group P(X) as

$$\mu_P = \frac{1}{|X_P|} \sum_{\vec{x} \in X_P} \phi(\vec{x}),$$

where X_P indicates that we are only averaging over feature vectors of the P class. Our encoding scheme described above permits the immediate definition of the quantum feature mean [14],

$$|\mu_P\rangle = \frac{1}{N_P} \sum_{\vec{x} \in X_P} |\Phi(\vec{x}, t)\rangle, \qquad (21)$$

where N_P is a normalisation constant. Therefore, we can create a quantum feature mean by simply averaging over all the state vectors corresponding to a quantum feature map.
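As a hedged sketch of the quantum feature mean, the Python snippet below sums the mode-amplitude vectors of each class and renormalises the result, then evaluates the coincidence count between the two means in the convention CC = 1 − |⟨µ_P|µ_Q⟩|². The toy amplitude vectors and the helper names are assumptions made purely for illustration.

```python
import math


def normalise(vec):
    """Scale a vector of amplitudes to unit norm."""
    norm = math.sqrt(sum(abs(a) ** 2 for a in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise the zero vector")
    return [a / norm for a in vec]


def feature_mean_state(states):
    """Quantum feature mean: sum of the class state vectors, renormalised."""
    dim = len(states[0])
    summed = [sum(state[n] for state in states) for n in range(dim)]
    return normalise(summed)


def overlap_sq(alpha, beta):
    """|<alpha|beta>|^2 for amplitude vectors in the same orthonormal basis."""
    amplitude = sum(complex(a).conjugate() * b for a, b in zip(alpha, beta))
    return abs(amplitude) ** 2


# Toy encoded states for two classes (hypothetical data).
class_p = [normalise(v) for v in ([1.0, 0.2, 0.0], [0.9, 0.3, 0.1])]
class_q = [normalise(v) for v in ([0.1, 0.2, 1.0], [0.0, 0.3, 0.9])]

mu_p = feature_mean_state(class_p)
mu_q = feature_mean_state(class_q)

# Normalised coincidence count between the two mean-encoded photons.
cc_pq = 1.0 - overlap_sq(mu_p, mu_q)
print(cc_pq)
```

A value of `cc_pq` close to one indicates nearly orthogonal class means, which is the regime the training procedure aims for.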
We now define the maximum mean discrepancy (MMD), a measure of the difference between any two distributions P and Q, as the absolute difference between their KMEs µ_P and µ_Q. What's more, if we now combine our quantum feature map using our single-photon encoding Eq. (14) with our quantum feature mean Eq. (21), then we can evaluate the quantum MMD using our HOM interferometer, where one photon encodes the mean µ_P and the other µ_Q. The result is expressed through the coincidence count CC(P, Q) = 1 − |⟨µ_P|µ_Q⟩|² given by Eq. (17). We can therefore measure the MMD using a single evaluation of a HOM interferometer and the kernel function associated with the feature mapping ϕ. Moreover, using the free weights w_n in Eq. (13), one can optimize the MMD, yielding an optimal class separation in the feature space F. Crucially, the MMD model provides a cost function to be optimized in a given feature space, allowing for an implicit separating hyperplane between classification groups. Once optimized, the mapping ϕ could be used to classify an unseen data point ⃗x using a quantum HOM classifier, a class function built from the coincidence counts of the encoded photon against the feature means: a positive evaluation of this class function will place ⃗x in class Q, and a negative evaluation will place it in class P. If the feature map has been optimized such that the means are almost orthogonal (⟨µ_P|µ_Q⟩ ∼ 0), then classification can be accurately approximated with a single feature mean. If, for example, ⃗x belongs to P, then its overlap with µ_P will be high and its overlap with µ_Q low; thus we would expect to see a HOM dip in the former, and not in the latter. To minimise the use of resources, we note that one does not need to compare ⃗x to both µ_P and µ_Q, but rather just to one, say µ_Q, provided that the dip has been calibrated to MMD(P, Q). In this setting, one could set the decision boundary as given in Eq. (27). In a high-efficiency experiment (low loss and noise, high quantum efficiency) with highly orthogonal encoding (⟨µ_P|µ_Q⟩ ∼ 0), this classification of an unknown ⃗x could be measured with very few single-photon measurements, as the expected coincidence count would be highly correlated or anti-correlated with the single comparison point µ_Q. This ensures that very few photons are required to perform this classification, thus minimising the experimental overhead.

The MMD model for kernel implementation defines the criteria used to determine the ability of the model and the process of classification, but does not specify the optimisation algorithm used to train the model. For this example, we choose to implement an SGD optimisation algorithm; however, one is free to choose whatever algorithm they see fit for this purpose.
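The HOM classifier described above can be sketched as follows. Because the explicit class function is not reproduced here, the snippet assumes one plausible form that matches the stated sign convention, f(⃗x) = CC(⃗x, µ_P) − CC(⃗x, µ_Q), with a positive value assigning class Q. The finite-shot estimator, the toy vectors, and every function name are likewise assumptions for illustration, not the authors' implementation.

```python
import math
import random


def normalise(vec):
    """Scale a vector of amplitudes to unit norm."""
    norm = math.sqrt(sum(abs(a) ** 2 for a in vec))
    return [a / norm for a in vec]


def coincidence_probability(alpha, beta):
    """Normalised CC = 1 - |<alpha|beta>|^2 at zero delay."""
    amplitude = sum(complex(a).conjugate() * b for a, b in zip(alpha, beta))
    return 1.0 - abs(amplitude) ** 2


def class_function(alpha_x, mu_p, mu_q):
    """Assumed form: CC(x, P) - CC(x, Q); positive -> Q, negative -> P."""
    return coincidence_probability(alpha_x, mu_p) - coincidence_probability(alpha_x, mu_q)


def classify(alpha_x, mu_p, mu_q):
    """Return the class label implied by the sign of the class function."""
    return "Q" if class_function(alpha_x, mu_p, mu_q) > 0.0 else "P"


def estimate_coincidence(probability, shots, rng):
    """Finite-shot estimate: fraction of photon pairs giving a coincidence."""
    clicks = sum(1 for _ in range(shots) if rng.random() < probability)
    return clicks / shots


# Nearly orthogonal class means and one unseen point (hypothetical values).
mu_p = normalise([1.0, 0.1, 0.0])
mu_q = normalise([0.0, 0.1, 1.0])
x_new = normalise([0.9, 0.2, 0.1])

label = classify(x_new, mu_p, mu_q)

# Single-mean variant: compare against mu_Q only, using a few photon pairs.
rng = random.Random(1)
cc_est = estimate_coincidence(coincidence_probability(x_new, mu_q), 20, rng)
print(label, cc_est)
```

When the means are nearly orthogonal, the exact coincidence probability against a single mean is close to zero or one, which is why a small number of shots can suffice in the single-mean variant.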
The parameters to be optimised are the free weights, given by the vector ⃗w, which were introduced in Eq. (13). We use stochastic gradient descent (SGD) with the MMD employed as the cost function to be optimized. This ensures that the weights are optimized with respect to the means, which are then most likely to be maximally orthogonal in the feature Hilbert space. Moreover, we use SGD rather than ordinary gradient descent to ensure the optimisation does not halt in a local optimum. As such, the weights are updated at each iteration i according to the difference equation Eq. (28), where the subscript i indicates the iteration (not to be confused with the element n in Eq. (13)) and ∇_⃗w MMD(P, Q) is the numerical derivative of the cost function, the MMD evaluation given by Eq. (22), with respect to the weights, evaluated at ⃗w_i. L_i is the learning rate at the ith iteration, and ⃗ϵ_i is a stochastic random variable added at each time step.
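The training loop can be illustrated with a short Python sketch. Several assumptions are made and should be read as such: the update is taken to be ⃗w_{i+1} = ⃗w_i + L ∇cost + ⃗ϵ_i (an ascent step, since the cost is maximised), the learning rate is held constant, the noise is Gaussian with a width decaying as 1/(1 + i), and the coincidence count between the class means, 1 − |⟨µ_P|µ_Q⟩|², stands in for the MMD cost. The toy data and all names are hypothetical.

```python
import math
import random


def normalise(vec):
    """Scale a real vector to unit norm."""
    norm = math.sqrt(sum(a * a for a in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise the zero vector")
    return [a / norm for a in vec]


def mean_state(features, weights):
    """Encode each sample as normalised w_n * phi_n, then average and renormalise."""
    states = [normalise([w * f for w, f in zip(weights, x)]) for x in features]
    return normalise([sum(s[n] for s in states) for n in range(len(weights))])


def cost(weights, class_p, class_q):
    """Stand-in cost: CC(P, Q) = 1 - <mu_P|mu_Q>^2 (real features assumed)."""
    mu_p = mean_state(class_p, weights)
    mu_q = mean_state(class_q, weights)
    overlap = sum(a * b for a, b in zip(mu_p, mu_q))
    return 1.0 - overlap ** 2


def numerical_gradient(weights, class_p, class_q, h=1e-5):
    """Central-difference gradient of the cost with respect to the weights."""
    grad = []
    for n in range(len(weights)):
        up, down = list(weights), list(weights)
        up[n] += h
        down[n] -= h
        grad.append((cost(up, class_p, class_q) - cost(down, class_p, class_q)) / (2.0 * h))
    return grad


def train(class_p, class_q, weights, steps=200, rate=0.1, noise=0.02, seed=0):
    """Noisy gradient ascent on the cost (assumed form of the update rule)."""
    rng = random.Random(seed)
    w = list(weights)
    for i in range(steps):
        grad = numerical_gradient(w, class_p, class_q)
        sigma = noise / (1.0 + i)  # decaying noise amplitude (assumption)
        w = [wn + rate * g + rng.gauss(0.0, sigma) for wn, g in zip(w, grad)]
    return w


# Toy feature vectors: a shared first component hides the class difference.
class_p = [(1.0, 0.5, 0.1), (1.0, 0.6, 0.0)]
class_q = [(1.0, -0.5, 0.1), (1.0, -0.6, 0.0)]

w0 = [1.0, 1.0, 1.0]
initial_cost = cost(w0, class_p, class_q)
w_final = train(class_p, class_q, w0)
final_cost = cost(w_final, class_p, class_q)
print(initial_cost, final_cost)
```

In this toy problem the optimiser learns to down-weight the shared first feature, which drives the two class means towards orthogonality.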

C. Application and Results
To simulate the expected results of this model, a numerical experiment is performed. We initially generate a two-dimensional (two features, F1 and F2) training and test data set using scikit-learn's make_blobs function, with parameters chosen such that the data is separable in a polynomial feature space but not linearly separable [49]. The use of two features is for visual demonstration purposes, and the model is applicable to higher-dimensional data. Here the data corresponds to two classes, blue P and red Q. The training data set is depicted in Fig. (3). From the training set, the group mean of each classification is determined by first mapping each data vector to the feature space given by the polynomial kernel of degree two, Eq. (29), and then taking average values within each classification group. After this transformation, we can also add the trainable weights w_n introduced in Eq. (13), and then normalize the result to obtain suitable coefficients for the quantum encoding Eq. (14). The weights are then optimised using the SGD algorithm outlined in Eq. (28).
For the interested reader, a more detailed explanation of the training process and associated hyper-parameters can be found in Appendix A.
The optimisation process maximises the MMD value of the two feature-mean encoded photons µ_P and µ_Q, which maximises the probability of a HOM interference measurement registering a coincidence count. This process is demonstrated in Fig. (4a), where we plot the HOM dip between the two means µ_P and µ_Q calculated before and after the weights have been optimized. At the center of the dip, where the time-delay between the photons is zero (dt = 0), the means are initially very similar (⟨µ_P|µ_Q⟩ ∼ 1), resulting in a pronounced HOM dip, but after training they are almost orthogonal (⟨µ_P|µ_Q⟩ ∼ 0) and the dip vanishes. Moreover, when the training is complete, we can use our encoded photons to classify unseen data accurately via Eq. (26), which is depicted in the violin plots in Fig. (4b) and also shown in the before and after training confusion matrices, Fig. (4c) and Fig. (4d). The confusion matrix is a machine learning tool that shows the percentage of correct and incorrect classifications. The vertical axis denotes the classification calculated by the HOM classifier, and the horizontal axis denotes the true classification. We clearly see that after training we are able to classify the data precisely using our HOM quantum classifier.

V. DISCUSSION
In this article we proposed a practical experimental methodology for evaluating quantum kernels using HOM interference of temporally encoded photons. We showed that the kernel is directly related to the coincidence counts measured via two photodetectors. We further showed that by using a quantum KME, one could also perform MMD for classification, which we demonstrated using a very simple example. Given that we have optimized the MMD between the two classes, it is reasonable to assume that the number of experimental runs required to classify the data is minimized, since quantum classification via quantum kernel methods necessarily requires estimating the probability given by Eq. (5). Finally, this concept offers another experimentally feasible methodology for evaluating quantum kernels in the growing application of photonic quantum machine learning.

FIG. 1. A visual representation of kernel machine learning between two classes: blue circles and orange squares. (Left) A depiction of the initial feature space of unprocessed data, contained in the two-dimensional feature space with features x1 and x2. (Right) After transforming to the feature space F via ϕ, the data can clearly be linearly separated into two classes (for illustrative purposes, we have limited this to a two-dimensional transform, but this can be relaxed in general). In this depiction, the average of each class can be computed (coloured stars) and used as a discrepancy measure via MMD.

FIG. 3. Scatter plot of the example data showing two classes (red) and (blue) which are clearly distinguishable, but have overlapping means and no hyperplane that separates them in two dimensions. Here F1 and F2 are two features and the mean data point for each classification group is denoted by the symbol (⋆). Mapping the means to feature space using the transformation in Eq. (29) is insufficient to induce the needed hyperplane; separation only occurs in the feature space after optimisation of the free weight parameters.

FIG. 4. a) A depiction of the HOM dip created by evaluating the two means µP and µQ before training (solid black line), showing a dip indicating significant overlap, and after training (black dashed line), showing no dip, indicating a large separation of the two means in the feature Hilbert space. b) Two violin plots showing the classification of unseen data before and after training using the quantum HOM classifier in Eq. (26). The widths of the violin plots are arbitrary but are proportional to the probability density of the classified distributions. After training, the number of misclassifications by the HOM classifier is low. c) and d) The confusion matrix before and after training, where the diagonal (off-diagonal) elements correspond to correct (incorrect) classifications.

FIG. 5. A plot of the average cost function (MMD value) over 1000 trials as a function of the epoch. The initial parameters are randomly chosen, hence the non-zero initial overlap. For the first few iterations there is very little change, before the algorithm begins to maximise the cost function.