Qudit machine learning

We present a comprehensive investigation into the learning capabilities of a simple d-level system (qudit). Our study is specialized for classification tasks on real-world datasets, specifically the Iris, breast cancer, and MNIST datasets. We explore various learning models in the metric learning framework, along with different encoding strategies. In particular, we employ data re-uploading techniques and maximally orthogonal states to accommodate input data within low-dimensional systems. Our findings reveal the optimal strategies and indicate that, when the dimension of the input features and the number of classes are not significantly larger than the qudit dimension, our results compare favorably against the best classical models. This holds true even for small quantum systems, with dimensions d < 5, and for algorithms with few layers (L = 1, 2). However, for high-dimensional data such as MNIST, we adopt a hybrid approach involving dimensional reduction through a convolutional neural network. In this context, we observe that small quantum systems often act as bottlenecks, resulting in lower accuracy compared to their classical counterparts.


Introduction
Machine learning (ML) has been integrated into our daily routines. This broad adoption is closely linked to advancements in hardware capabilities. Furthermore, the extensive utilization of machine learning has brought attention to the energy costs of computation, emphasizing the importance of exploring alternative computing paradigms [1]. A case in point is the emergence of Physical Neural Networks (PNNs): dedicated physical systems, such as nonlinear optical or mechanical resonators, that are manipulated to produce desired computational outputs [2,3].
In this context, a quantum processor is nothing more than a system implementing quantum dynamics that depend on several physical parameters, which can be modified to perform different tasks or minimize a cost function; this aligns with the natural paradigm in machine learning and PNNs. Numerous architectures and models have already been implemented in the lab [4][5][6][7]. The interest in using quantum processors for ML tasks lies in the hope that some problems could be solved faster [8] on a quantum computer and/or that it could be more energy-efficient, thus reducing the environmental impact of machine learning [9]. However, whether these expectations are met for a wide range of problems remains unclear [10][11][12][13][14]. Nonetheless, it is worth investigating to either confirm or dismiss them.
In this work, we aim to test the learning capabilities of a quantum system. For this purpose, we consider classification tasks as examples and use a simple d-level system (a qudit) in which transitions can be induced [15][16][17][18][19][20], and the population of the levels after this external manipulation can be measured [21]. This minimal system will allow us to compare the universality, expressiveness, and performance of a d-dimensional unitary with standard classical algorithms. Additionally, we will explore different strategies depending on the data dimension D_x, the qudit dimension d, and the number of classes K to be classified, leveraging the simplicity of the quantum system considered here.

Our Work in Context
The use of quantum processors for machine learning tasks is a rapidly growing field. It is important to note that this brief overview may not cover all relevant works in the field.
However, there are seminal works that align more closely with the philosophy of our current research. These studies investigate the capabilities of the smallest quantum systems, i.e., two-level systems, and test their performance. They also introduce the concept of data re-uploading [37], which we will discuss in this work as well. Reference [38] demonstrated that a two-level quantum system with data re-uploading can serve as a universal approximator for any function. This is achieved by leveraging results from quantum Fourier analysis, which have recently been generalized to the qudit scenario [39]. Moreover, the generalization of data re-uploading to qudits has been discussed in [40], where the qudit dimension is fixed to the number of classes.
Our aim here is to test the learning capabilities of a single quantum system with d levels (qudit), conducting an extensive study that employs various types of learning, encoding, and measurement algorithms.

Main Results and Organization
In this work, we take a broader perspective on learning with qudits within the metric learning paradigm [41], discussing both implicit and explicit approaches [42], each with its own advantages and drawbacks. We explore different types of encodings, such as data re-uploading and the use of maximally orthogonal states, to accommodate the dimension of the data set within few-level quantum machines.
Regarding the results, we discuss various prototypical datasets with unique characteristics that allow us to explore the efficiency of the different methods discussed. Additionally, we test the potential of hybrid classical-quantum models suitable for cases with high-dimensional data sets.
For data sets whose dimension and number of classes are not significantly larger than the dimension of the qudit, we provide results that compare favorably to the best classical models. In such cases, a few-layer protocol with a small quantum system of dimension d ∼ 3-5 suffices. However, for higher-dimensional data sets, hybrid models perform acceptably, but the small dimension of the quantum system cannot outperform advanced classical techniques like convolutional neural networks.
The rest of the manuscript is organized as follows: in section 2 we present our theoretical backbone, covering the control of our physical system, the optimisation method, and how data is encoded. In section 3 we test these tools using different datasets and different models according to the needs of each problem, as well as discussing the computational complexity. In section 4 we check what effect the main sources of noise have on our physical system in both the training and testing phases. We leave the conclusions and final remarks for section 5.

Learning with qudits
Our focus is on classifiers based on quantum variational algorithms (QVAs) for addressing supervised learning problems. In the supervised learning scenario, the data set consists of input vectors containing the feature values and their corresponding labels, {(x⃗_i, y_i)}_{i=1}^{N}, with dim(x⃗_i) = D_x and y_i = 1, ..., K. We are assuming that the data can be classified into K classes. Besides, the data is split into training (N_training) and testing (N_test) sets. However, for certain problems it is convenient to further split the training data into a validation set (N_validation), used to assess that no overfitting is happening, and a proper training set (the one with which the parameters of the model are optimized).
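As a minimal illustration of this bookkeeping, the sketch below divides a labelled data set into the three subsets discussed above. The function name and the random shuffling are our own choices, not part of the protocol:

```python
import numpy as np

def split_dataset(X, y, n_train, n_val=0, seed=0):
    """Shuffle and split a labelled data set into training/validation/test subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    tr = idx[:n_train]
    va = idx[n_train:n_train + n_val]
    te = idx[n_train + n_val:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

# Toy example: 150 points with D_x = 4 features, 30 reserved for training.
X = np.random.rand(150, 4)
y = np.random.randint(0, 3, size=150)
train, val, test = split_dataset(X, y, n_train=30)
```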
Typically, QVAs can be divided into three steps: i) encoding the data onto a wave function, and ii) applying a parametrized unitary to this state. These two steps can be encapsulated in a single operation that depends on the input data and the parameters φ⃗, namely,

|ψ(x⃗; φ⃗)⟩ = Û(x⃗; φ⃗) |0⟩ .    (1)

The final part of the algorithm, iii), involves performing a measurement. This process undergoes an iterative procedure, as illustrated in Figure 1. By optimizing specific parameters (φ⃗), our objective is to minimize a predetermined objective function that depends on the output o_i, often referred to as the cost/loss function, which quantifies the level of misclassification of our model [33,34,36].

Figure 1: Variational optimization process on a d-level system. Classical devices are used to perform the optimization task and to control the operations to implement on the physical system (the qudit). After coherently controlling the system, we perform measurements that allow us to extract the information needed to compute the objective function and iterate the optimization process.

Qudit unitaries
This formulation is quite general, and specific details regarding the unitary operation need to be considered, especially in relation to hardware constraints. In this context, we consider a d-level system with a fixed (time-independent) Hamiltonian H_0, whose eigenstates are |l⟩, with l = 0, ..., d − 1. Furthermore, we assume that a series of two-level rotations can be applied. These rotations serve as the fundamental operations achievable using conventional Electronic Paramagnetic Resonance (EPR) techniques. Importantly, these operations are universal for a single qudit, enabling the generation of any wave function within the d-dimensional Hilbert space. In the interaction picture, utilizing a simple monochromatic microwave pulse to coherently control the system, these rotations are given by [20,43]

R_{k,l}(θ, ϕ) = exp[−i (θ/2) G_{k,l}(ϕ)] ,  with  G_{k,l}(ϕ) = cos(ϕ) σ^x_{k,l} + sin(ϕ) σ^y_{k,l} ,    (2)

where σ^{x,y}_{k,l} denote the Pauli matrices built from the eigenstates |k⟩ and |l⟩, i.e., σ^x_{k,l} = |k⟩⟨l| + h.c. and σ^y_{k,l} = i|k⟩⟨l| + h.c. Each rotation between levels k and l in our system is characterized by two parameters to be tuned: θ and ϕ. Consequently, a layer of rotations involving all accessible levels within our qudit will encompass 2(d − 1) parameters, assuming ladder-like coupling between adjacent levels. It is important to note that these two parameters do not carry the same significance in the resulting state they produce. One parameter, θ, governs the population transfer between the two states, while the other, ϕ, controls the relative phase they acquire. This distinction becomes relevant when encoding the data, as we will explore later.

The complete ansatz entails applying L layers of these rotations, where the rotation angles store both the classical data, x⃗, and the parameters to be optimized, φ⃗. Therefore, the most general qudit unitaries that we will consider throughout this work are

Û(x⃗; φ⃗) = ∏_{i=1}^{L} ∏_{k=0}^{d−2} R_{k,k+1}(θ_i^{(k)}, ϕ_i^{(k)}) .    (4)

We emphasize that the rotation angles θ_i and ϕ_i are general functions of both x⃗ and φ⃗. This is discussed in the next subsection.
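The two-level rotations described above can be prototyped numerically. The sketch below is our own; it assumes the rotation takes the form exp(−i(θ/2) G(ϕ)) with G(ϕ) = cos(ϕ) σ^x_{k,l} + sin(ϕ) σ^y_{k,l}, as described in the text:

```python
import numpy as np

def rotation(d, k, l, theta, phi):
    """Two-level rotation between levels |k> and |l> of a d-level system:
    R = exp(-i theta/2 G(phi)), G(phi) = cos(phi) sigma^x + sin(phi) sigma^y."""
    sx = np.zeros((d, d), complex)
    sx[k, l] = sx[l, k] = 1.0            # sigma^x = |k><l| + h.c.
    sy = np.zeros((d, d), complex)
    sy[k, l] = 1j
    sy[l, k] = -1j                        # sigma^y = i|k><l| + h.c.
    G = np.cos(phi) * sx + np.sin(phi) * sy
    w, V = np.linalg.eigh(G)              # G is Hermitian: exponentiate via its spectrum
    return V @ np.diag(np.exp(-0.5j * theta * w)) @ V.conj().T
```

Chaining d − 1 such rotations between adjacent levels yields one layer with 2(d − 1) tunable angles; a θ = π pulse fully transfers population between the two addressed levels, whatever ϕ.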

Metric learning
In our study, we focus on metric learning, specifically its quantum version [41,42]. Metric learning is well suited for classifiers, particularly in image classification and recognition. This type of learning aims to map the data points onto a feature space equipped with a metric such that distances between points belonging to the same class are minimized, while points belonging to different classes are maximally separated.
In the quantum realm, the unitary Û(x⃗; φ⃗) in Eq. (1) maximizes the distances between states belonging to different classes while minimizing them for states within the same class. In this work, we will consider that the data corresponds to K classes. The algorithm aims to learn how to classify the input vector into one of these classes. The literature [42] discusses two main approaches: implicit and explicit algorithms.

Implicit approach
In the implicit approach, the training data defines K ensembles, one for each class. Using the encoded input vectors as given by Eq. (1), the K ensembles can be defined as follows:

ρ_k = (1/N_k) Σ_{i | y_i = k} |ψ(x⃗_i; φ⃗)⟩⟨ψ(x⃗_i; φ⃗)| .    (5)

Here, N_k represents the number of training points belonging to class k.
Besides, the cost function to be minimized reads [41,42]

C_imp = Σ_{k ≠ k′} Tr(ρ_k ρ_{k′}) − Σ_k Tr(ρ_k²) .    (6)

Thus, this approach aims to maximize the purity of the ensemble of states belonging to the same class while moving it away from the ensembles of the other classes by minimizing the traces of the crossed ensembles.
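A numerical sketch of the implicit construction follows. The structure (cross-class overlaps penalized, class purities rewarded) is taken from the description above; the exact normalization of the cost is our assumption:

```python
import numpy as np

def class_ensembles(states, labels, K):
    """rho_k: average density matrix of the encoded training states of class k (Eq. (5))."""
    d = states.shape[1]
    rhos = np.zeros((K, d, d), complex)
    for k in range(K):
        psi = states[labels == k]
        rhos[k] = np.einsum('ni,nj->ij', psi, psi.conj()) / len(psi)
    return rhos

def implicit_cost(rhos):
    """Cross-class overlaps minus class purities (normalization is our assumption)."""
    K = len(rhos)
    cross = sum(np.trace(rhos[k] @ rhos[q]).real
                for k in range(K) for q in range(K) if k != q)
    purity = sum(np.trace(r @ r).real for r in rhos)
    return cross - purity

# Two perfectly separated pure classes reach the minimum of this cost.
states = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=complex)
labels = np.array([0, 0, 1, 1])
cost = implicit_cost(class_ensembles(states, labels, K=2))
```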

Explicit approach
In the explicit approach, the algorithm utilizes predefined reference states (referred to as centers), one for each class, denoted as {|ψ^R_k⟩}_{k=1}^{K}. Introducing the density matrices

σ_k = |ψ^R_k⟩⟨ψ^R_k| ,    (7)

the cost function in this case is given by

C_exp = 1 − (1/N) Σ_{i=1}^{N} ⟨ψ(x⃗_i; φ⃗)| σ_{y_i} |ψ(x⃗_i; φ⃗)⟩ .    (8)

Here, the algorithm aims to minimize the distance between the training data and the reference states. It is worth noting that the last term in the above equation is nothing but the fidelity, F = |⟨ψ^R_{y_i}|ψ(x⃗_i; φ⃗)⟩|². The performance of the algorithm depends on the selection of the centers. Ideally, they should be as separated as possible. In terms of quantum states, orthogonal states are maximally separated. Therefore, it is natural to choose ⟨ψ^R_k|ψ^R_{k′}⟩ = δ_{k,k′}. However, it is possible that d < K. In such cases, we argue below that maximally orthogonal states (MOS) can be employed.
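A corresponding sketch of the explicit cost, written directly in terms of the fidelities with the class centers (function and variable names are ours):

```python
import numpy as np

def explicit_cost(states, labels, centers):
    """1 minus the mean fidelity |<psi^R_{y_i}|psi(x_i)>|^2 between each encoded
    state and the reference state (center) of its class."""
    overlaps = np.einsum('ni,ni->n', centers[labels].conj(), states)
    return 1.0 - np.mean(np.abs(overlaps) ** 2)

# Orthonormal-basis centers for a qutrit (d = K = 3); a perfect classifier
# maps each point onto its class center, giving zero cost.
centers = np.eye(3, dtype=complex)
states = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=complex)
labels = np.array([0, 1, 2])
cost = explicit_cost(states, labels, centers)
```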

Geometric interpretation
In the next section, we test and compare both approaches on actual data sets. Before that, we find it convenient to provide a geometric way of relating them. The following theorem connects both approaches:

Theorem 1. Consider two reference states that, without loss of generality, are denoted as σ_{k=1,2} in Eq. (7). Then, minimizing the explicit-approach cost function defined in (8) for those classes maximizes the trace distance between the ensembles ρ_{k=1,2} used in the implicit approach [given in Eq. (5)] if and only if the trace distance between the reference states, ||σ_1 − σ_2||, is maximal.

Proof. As discussed in section 2.2.2, minimizing (8) makes each ensemble ϵ-close in trace distance to its center, ||ρ_k − σ_k|| ≤ 2√ϵ [44]. On the other hand, if we want to maximise ||ρ_1 − ρ_2||, we can relate this distance to the above through the triangle inequality:

||ρ_1 − ρ_2|| ≥ ||σ_1 − σ_2|| − ||ρ_1 − σ_1|| − ||ρ_2 − σ_2|| .

Then, if we maximize F(ρ_k, σ_k) and F(ρ_{k′}, σ_{k′}), so that each ||ρ − σ|| ≤ 2√ϵ, we arrive at

||ρ_k − ρ_{k′}|| ≥ ||σ_k − σ_{k′}|| − 4√ϵ .

Therefore, in order to maximize ||ρ_k − ρ_{k′}||, we have to maximize the distance between the centers, ||σ_k − σ_{k′}||.

This theorem justifies choosing as centers states that maximize their mutual distance. These form a set of maximally orthogonal states (MOS): a set of K states whose minimal pairwise distance is maximal. With this definition, we can use numerical methods to obtain MOS for any configuration (d, K) of qudit dimension, d, and number of states, K. We leave the details on this for appendix A and [45].
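The quantities entering the theorem are easy to check numerically. The sketch below (helper functions are our own) computes trace distances and verifies the triangle-inequality bound used in the proof on a random pair of density matrices:

```python
import numpy as np

def trace_distance(a, b):
    """||a - b||_1: sum of absolute eigenvalues of the (Hermitian) difference."""
    return np.abs(np.linalg.eigvalsh(a - b)).sum()

# Orthogonal pure states are maximally separated: ||sigma_1 - sigma_2||_1 = 2.
s1 = np.diag([1.0, 0.0]).astype(complex)
s2 = np.diag([0.0, 1.0]).astype(complex)
dist = trace_distance(s1, s2)

# Check ||rho_1 - rho_2|| >= ||sigma_1 - sigma_2|| - ||rho_1 - sigma_1|| - ||rho_2 - sigma_2||
rng = np.random.default_rng(1)
def random_density(d):
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = A @ A.conj().T
    return rho / np.trace(rho)

r1, r2 = random_density(2), random_density(2)
lower = dist - trace_distance(r1, s1) - trace_distance(r2, s2)
holds = trace_distance(r1, r2) >= lower - 1e-12
```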

Encoding strategies
The main difference between our approach and others is that we do not apply independent ansätze for the encoding [46]. In the algorithms discussed here, the model learns a metric, so the encoding itself already maximizes the distances between different classes. Therefore, only a single ansatz is considered, where the classical data and the parameters to be optimised are tied together [37,41,42]. Thus, taking into account the unitary transformation in Eq. (4), we need to discuss the chosen function to map the input vectors and variational parameters to rotation angles, i.e.,

g : (x⃗, φ⃗) → (θ⃗, ϕ⃗) .    (11)

This, in turn, gives rise to the embedding through our ansatz, Eq. (4).
In this work, we choose a function g similar to those used in classical neural networks. Specifically, our optimization parameters, φ⃗, are divided into a weight matrix, ω, and a bias vector, b⃗, such that g : x⃗ ↦ x⃗′ = ω · x⃗ + b⃗. This transformed vector, denoted as x⃗′, is then used to determine the rotation angles θ⃗ and ϕ⃗. Several comments are relevant here. First, the dimension of x⃗′, denoted as D_x′, does not have to be equal to the dimension of x⃗, i.e., D_x′ ≠ D_x. Even though there are 2(d − 1) parameters in a layer of our ansatz (as shown in Eq. (4)), we can handle a higher dimension by employing sublayers [37]: we simply split the vector x⃗′ into tuples of dimension 2(d − 1). It is important to differentiate between using sublayers to accommodate larger data dimensions and using layers in the ansatz to increase the number of optimized parameters. The former allows us to handle larger data dimensions, while the latter adds non-linearity to the model. Second, the assignment of rotation angles (θ⃗, ϕ⃗) from the transformed data x⃗′ is arbitrary and will impact the model's performance, as discussed below.
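The sublayer bookkeeping can be sketched as follows; zero-padding the last tuple when 2(d − 1) does not divide D_x′ is our own choice, not prescribed by the text:

```python
import numpy as np

def encode(x, w, b, d):
    """Affine map x' = w @ x + b, then split x' into sublayers of 2(d-1) angles
    each (zero-padding the last sublayer is our assumption)."""
    xp = w @ x + b
    width = 2 * (d - 1)
    pad = (-len(xp)) % width
    xp = np.pad(xp, (0, pad))
    return xp.reshape(-1, width)  # one row of angles per sublayer

# D_x = 10 features on a qutrit (d = 3): 4 angles per sublayer -> 3 sublayers.
x = np.arange(10.0)
w = np.eye(10)
b = np.zeros(10)
sublayers = encode(x, w, b, d=3)
```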
To assess the effectiveness of different encodings, we can study the decision boundaries drawn by our ansatz with a concrete example [46]. We take a three-level system (a qutrit, d = 3) for classifying data into three classes (K = 3). The reference states are the lowest-energy states from the orthonormal basis of the qutrit: |k⟩ for class k, with k = 0, 1, 2. Suppose the data dimension is D_x = 4. We start with the following transformation: the matrix ω is chosen to be diagonal, x′_i = ω_ii x_i + b_i, and the components of x⃗′ are assigned directly to the four rotation angles. Since it is a qutrit, we have two available transitions and, thus, four rotation angles in our ansatz. Applying this transformation to the ground state |0⟩ yields

|ψ⟩ = c_0 |0⟩ + c_1 |1⟩ + c_2 |2⟩ ,    (13)

where the coefficients c_k are as follows: c_0 = cos(θ_0/2), c_1 = −i sin(θ_0/2) cos(θ_1/2) e^{iϕ_0}, and c_2 = − sin(θ_0/2) sin(θ_1/2) e^{i(ϕ_0 + ϕ_1)}. From this expression, we can observe that not all the parameters play a role in the decision boundaries. For example, when computing the overlaps with the reference states |k⟩ as in Eq. (8), the relative phases introduced by the angles ϕ do not have any impact. Furthermore, since each angle in this case depends only on a single component of x⃗, our model is trained with only half of the information provided by each data point.
We can address this issue in various ways. The simplest approach is to initialize the qutrit in the state |+⟩ = (1/√d) Σ_k |k⟩, so the coefficients c_k in Eq. (13) become a more intricate combination of rotation angles, and the projection onto the reference states no longer cancels out the relative phases. Alternatively, we could change the basis in which the reference states are tailored. For instance, instead of choosing the eigenstates of σ^z, {|0⟩, |1⟩}, for the orthonormal qubit case, we could use the eigenstates of σ^x, {|+⟩, |−⟩}. In practice, the outcome would be the same.
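One can verify numerically that, starting from |0⟩, the populations (and hence the overlaps with the basis reference states) are insensitive to the ϕ angles. Using the coefficients c_k given above:

```python
import numpy as np

def qutrit_amplitudes(theta0, theta1, phi0, phi1):
    """Coefficients c_k of the qutrit state produced from |0> by the two rotations."""
    c0 = np.cos(theta0 / 2)
    c1 = -1j * np.sin(theta0 / 2) * np.cos(theta1 / 2) * np.exp(1j * phi0)
    c2 = -np.sin(theta0 / 2) * np.sin(theta1 / 2) * np.exp(1j * (phi0 + phi1))
    return np.array([c0, c1, c2])

# The populations |<k|psi>|^2 ignore the phi angles entirely:
p_a = np.abs(qutrit_amplitudes(0.7, 1.1, 0.0, 0.0)) ** 2
p_b = np.abs(qutrit_amplitudes(0.7, 1.1, 2.0, -1.3)) ** 2
```

Both population vectors coincide, confirming that half of the encoded components never reach the decision boundaries in this configuration.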
Another option is to apply a different transformation g such that g : R^{D_x} → R^{2(d−1)}. We can consider ω as a general (D_x′, D_x)-matrix with D_x′ = 2(d − 1), ensuring that the output always fits into one sublayer. However, each x′_i is now a function of all the components of x⃗: x′_i = Σ_j ω_ij x_j + b_i. Again, if the initial state were the ground state, |0⟩, we would still face the issue of utilizing only half of the parameters when assigning the angles as before. However, initializing in the |+⟩ state addresses this problem. In Table 1, we summarize the three main choices for the function g that will be used in section 3.2 in our simulations.

Encodings

Table 1: Different encodings that will be used in Section 3.2.

  g_1 : x′_i = ω_ii x_i + b_i , initial state |0⟩
  g_2 : x′_i = ω_ii x_i + b_i , initial state |+⟩
  g_3 : x′_i = Σ_j ω_ij x_j + b_i , initial state |+⟩
It is important to note that when multiple layers are involved in the ansatz, either to accommodate data with higher dimensions or to introduce non-linearity through more optimization parameters, or when the reference states are non-orthogonal, the resulting wave function will generally exhibit sufficient complexity for all parameters to influence the decision boundaries.
Having seen different ways of choosing the function that maps our classical data into the control parameters of our physical system, we can also consider other ways of defining the control itself. For example, an interesting scenario arises when the rotations defined in Eq. (2) are not defined on two levels of the orthonormal basis of the qudit, but rather on a series of maximally orthogonal states (MOS). This becomes relevant when the number of classes exceeds the available levels. Let {|φ_k⟩}_{k=1}^{K} denote the set of K maximally orthogonal states obtained using our genetic algorithm [cf. App. A]. We redefine our ansatz as the set of pulses between these "virtual" states. Consequently, the Pauli matrices in the G(ϕ) term of Eq. (2) will now take the form σ^x_{k,l} = |φ_k⟩⟨φ_l| + h.c. and σ^y_{k,l} = i|φ_k⟩⟨φ_l| + h.c. With this definition, we gain access to K − 1 operations that can be decomposed into a series of pulses between the eigenstates of the physical system, to be implemented on a real platform [20,43]. By operating between these maximally orthogonal states, it is reasonable to expect that the generated states belonging to different classes are brought closer to regions that are far from each other, potentially "saving work" for the optimizer.
Lastly, we will consider a scenario where D_x ≫ d, making it impractical to apply successive sublayers to encode the original dimension. In such cases, we will explore various classical dimensionality reduction techniques, such as Principal Component Analysis (PCA) or Convolutional Neural Networks (CNNs), to assess their effectiveness in combination with our model.

Results
After establishing the theoretical foundation of learning with qudits, we proceed to test our algorithms on real-world datasets. We have selected three distinct datasets for this purpose: the Iris dataset [47,48], the Breast Cancer Wisconsin (Diagnostic) dataset [47], and the MNIST database of handwritten digits [49], as described in section 3.1. Each dataset possesses unique characteristics that enable us to explore the efficacy of our tools in diverse scenarios.
In section 3.2, we examine various encoding strategies (ansätze) to determine the approach that yields the best results. Subsequently, in section 3.3, we employ the selected encoding strategy to compare the performance of different cost functions (implicit or explicit method) in various classification problems. In section 3.4, we address the specific case of datasets with dimensions significantly larger than that of the qudit, exemplified by the MNIST digits dataset. Here, we study a hybrid algorithm that combines classical dimension reduction techniques with a quantum processing unit (our qudit). Finally, in section 3.5, we discuss the resources needed to experimentally implement our classifier.

Datasets
■ Iris [47,48] Fisher's database of different species of the Iris plant is one of the most (if not the most) used datasets for classification problems. It contains 3 different species (classes) with 50 data points each. Each data point is composed of 4 features: the length and width of both the sepal and the petal of each plant. It is a perfect dataset for an initial benchmark due to its small size, the low dimensionality of the data, and the small number of classes. Here, we used a training set of N_training = 30 data points, 10 for each class, and the rest is left for the test set, N_test = 120.
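A possible way to reproduce this split with scikit-learn; the stratified random selection of 10 points per class is our own choice:

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)          # 150 points, 4 features, 3 classes
rng = np.random.default_rng(0)
train_idx = np.concatenate([rng.choice(np.where(y == k)[0], 10, replace=False)
                            for k in range(3)])   # 10 training points per class
test_idx = np.setdiff1d(np.arange(len(y)), train_idx)
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
```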
■ Breast Cancer Wisconsin (Diagnostic) [47] This dataset contains 569 instances of 10 features computed from cell nuclei of breast masses. It has only two classes: malignant or benign. This is an example of a dataset that comprises a large enough number of features and instances (while still manageable by a single qudit) and defines a binary classification task. We used N_training = 113 instances for the training set (around 20% of the total), and the rest, N_test = 456, is used for testing.
■ MNIST Digits [49] This last dataset is also a very well-known example of image classification. It comprises 60,000 training images and 10,000 testing images of handwritten digits. It has 10 classes, one per digit (from 0 to 9). The dataset was provided by the Python package PyTorch, and each gray-scale image is a 28 × 28 matrix. It serves as the final test: a task with a reasonably large number of classes (for a single qudit) and a huge data dimension, which is unfeasible to classify by brute force. Using the model described in section 3.4, we used 300 images per digit for training and 600 for testing. Here, since the problem involves a more sophisticated technique, we further split the training set into N_training = 240 and N_validation = 60, as discussed in Section 2. The validation set is used along the training procedure to compute both the loss function and the accuracy, ensuring that no overfitting takes place.

Testing encoding strategies
We commence by analyzing the various encoding techniques discussed in Section 2.3 (refer to Table 1) to determine whether any particular technique outperforms the others. To gain insights, we leverage the Iris dataset. The comparisons are conducted using the explicit method, where the centers are predetermined as described in Section 2.2.2. The chosen performance metric is the test accuracy, defined as

Test Accuracy = (Number of Correct Predictions) / N_test .    (14)

Figure 2 illustrates the investigation conducted using various ansatz designs. The upper plot presents a comparison of the impact on test accuracy when employing non-orthogonal states for generating superpositions, as opposed to utilizing states from the system's orthonormal basis. To achieve this, we consider a qubit with the encoding function g_1 from Table 1 and a single optimization layer. We select three non-orthogonal states, {|φ_k⟩}_{k=0}^{2}, to accommodate the Iris dataset's dimensionality, D_x = 4, within a single sublayer. To evaluate the significance of these states, we start with a non-maximally orthogonal set and gradually increase their orthogonality until reaching the solution provided by the MOS. In the top plot of Figure 2, these points correspond to the circles. As anticipated, greater orthogonality among these three states leads to better coverage of the Hilbert space with our pulses, resulting in increased expressibility and test accuracy.
We compare this kind of ansatz with pulsing over the original set of orthonormal states and adding sublayers to accommodate the complete data dimensionality. This result is marked with a black square in the figure and proves to be more efficient for this particular task.
Furthermore, we compare these findings with the performance of a qutrit (d = 3) under identical circumstances. The qutrit serves as an ideal unit of information for this problem: it has two transitions, enabling the embedding of the complete data dimensionality within our ansatz without necessitating sublayers. Surprisingly, the performance of the qutrit falls short compared to that achieved using a qubit with sublayers. As we will see, this is due to the choice of the encoding function, g_1, which, as demonstrated in Section 2.3, fails to fully exploit the information provided by each data point.
Figure 2: Top: for the red round points, a non-orthogonal set of states is used as basis. The black square represents the results when the original orthonormal basis is used, and the salmon square marks the result for a qutrit (also with the orthonormal basis). Bottom: comparison between different embedding functions of the data (how the original data is transformed into the parameters of the pulses). From darker to lighter colors: g_1, g_2 and g_3 from Table 1.

Having determined that the ansatz involving the states of the orthonormal basis of the system (the original definition in Eq. (2)) is the most efficient in terms of the type of operations between levels, we now investigate, using this ansatz, which encoding function yields the most beneficial impact on the training process. To accomplish this, we compare the three embeddings, g_i with i = 1, 2, 3, presented in Table 1, for different qudit dimensions d. Once again, we utilize the Iris dataset, employing only one optimization layer and the explicit method.
The lower plot presents violin plots illustrating the distribution of test accuracies acquired from 100 distinct and independent initial conditions in the training process. For the darkest color (g_1), it is evident that there is negligible statistical variance when d > 2. With the sole exception of a single point at d = 4, the test accuracy consistently remains at 95.83%. As detailed in Section 2.3, this encoding strategy is suboptimal because it overlooks half of the original data, x⃗, treating it as relative phases in the generated states. These phases have no influence when projecting onto the reference states of the orthonormal basis. Although this encoding appears to yield good results for this specific dataset, it may not do so in other scenarios where the unutilized features play a crucial role. Hence, it is more appropriate to consider the other two alternatives, namely g_2 and g_3 (represented by medium and light colors in the plot, respectively). It is noteworthy that, despite employing a larger number of parameters in the optimization process, the median performance of g_3 falls short of that achieved with g_2. Consequently, based on these findings, we deduce that the most suitable option among these alternatives is to encode our data using g_2: x′_i = ω_ii x_i + b_i, initializing our qudit in the |+⟩ state. We adopt this configuration for all subsequent simulations.

Quantum metric learning
We now proceed to compare the performance of the explicit and implicit methods discussed in section 2.2. As concluded at the end of section 3.2, the encoding used for the subsequent simulations with the explicit method is g_2 from Table 1. However, for the implicit method, we adhere to g_1; in other words, we initialize the state to |0⟩. The rationale behind this decision is that, since there are no fixed centers in this case, there is no issue with unused relative phases. Thus, for the sake of simplicity, we find it more convenient to initialize the system in the ground state. It is worth noting that simulations were also conducted by initializing the system in the |+⟩ state, yielding similar results.
Figure 3 illustrates the results obtained for the Breast Cancer dataset (top plot) and the Iris dataset (bottom plot). The darker-colored violin plot represents the outcomes obtained with the implicit method, while the lighter color represents the explicit method. Each distribution in the plot is derived from 100 different and independent initial conditions, and a single optimization layer is employed. It is evident that, although the implicit method generally appears to achieve better results, both methods demonstrate good performance. In fact, these results are comparable to those achievable with the best classical classifiers for these datasets [47]. A comparison between classical models and the findings of this study is presented in Figure 4.
It is worth emphasizing that the dimensionality of the qudit does not appear to be a critical factor, except in the case of the Iris dataset, where d > 2 seems to yield slightly better results compared to d = 2. This suggests that in datasets where the number of classes does not allow for exploiting the capacity of the MOS (K ≈ d), the size of the Hilbert space does not play a significant role.
Furthermore, we can understand that the implicit method is capable of finding slightly superior solutions compared to the explicit method because it has the flexibility to discover the centers that best fit the data structure. This characteristic becomes evident when we evaluate the clustering achieved by the implicit method through the minimization of Eq. (6) [50]. This is illustrated by the results obtained by both methods using the test set of the Iris dataset, considering a single qubit and a single layer. The reference states are indicated by arrows, and the color plot represents the overlap between them. While the states obtained using the explicit method are perfectly homogeneous, we observe that the states generated by the implicit method, although very similar, are not completely homogeneous. This, along with the difference in the definition of each cost function, is reflected in the distribution of points on the Bloch sphere. In the explicit case, the points are almost perfectly aligned on the circumference connecting the three reference states, whereas in the implicit case, there is a greater dispersion on the surface. On the right side, we plot the overlap between these reference states. Although not perfectly, the implicit method manages to find states that are practically homogeneous and maximally orthogonal in this case.

High-dimensional data
As a final case study, we selected the MNIST dataset to demonstrate the challenge of dealing with data whose dimension is high compared to that of the qudit. Applying the data re-uploading technique used in the previous cases is impracticable due to the dataset's large dimensionality, D_x ≫ d.
To address this issue, we need to employ a dimensionality reduction technique that is compatible with the data re-uploading process of our qudit. One commonly used technique is Principal Component Analysis (PCA) [51]. PCA transforms the original data into a new set of uncorrelated variables of lower dimensionality. These new variables, known as components, capture the directions with the highest variance in the data. The first few components are more relevant than the latter ones, aiming to compress the relevant information from the original dataset. A challenge with applying PCA in this case is that the images in the dataset need to be "flattened" from a two-dimensional format into a one-dimensional vector. This flattening process eliminates the spatial correlations that are crucial for distinguishing between different digits (e.g., distinguishing a 6 from an 8). In Appendix B, we present a study demonstrating that, even with PCA, it is necessary to retain a dimension comparable to that of the original problem to avoid losing valuable information for classification purposes.
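A minimal sketch of this flattening-plus-PCA pipeline, here on random stand-in images rather than actual MNIST data:

```python
import numpy as np
from sklearn.decomposition import PCA

images = np.random.rand(100, 28, 28)      # stand-in for MNIST gray-scale images
flat = images.reshape(len(images), -1)    # flattening discards the 2D spatial structure
pca = PCA(n_components=4)                 # e.g., as many components as qudit angles
reduced = pca.fit_transform(flat)
```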
Given this challenge, we turn our attention to another technique that has shown remarkable effectiveness in image classification and processing: convolutional neural networks (CNNs) [52]. CNNs consist of two-dimensional processing layers designed to capture spatial correlations present in images. These initial layers extract information from arbitrary matrix blocks rather than individual pixels. This information is then processed by linear layers that assign probabilities to each class. In classical machine learning, a final layer with 10 neurons provides the probabilities for assigning each digit to the image. Our proposed model follows the philosophy of hybrid variational algorithms, combining the optimization capacity of classical computers with the state synthesis capacity of quantum computers. Thus, we design a hybrid convolutional network (HCNN), illustrated in Fig. 6, where the first part consists of classical convolutional and linear layers that preprocess the data and reduce its dimensionality. The processed data is then fed into the quantum layer, which acts as our qudit classifier.
For the design of this hybrid network, we utilized the Python library specialized in neural networks, PyTorch [53]. Our quantum layer is a customized layer integrated into the global optimization process, and the output dimension of the last classical layer is set to 2(d − 1). Thus, the CNN compresses the original data to variables of dimension 2(d − 1), accommodating the qudit dimension, namely g : R^{D_x} → R^{2(d−1)}. We remark that the last layer in the CNN implements the transformation g_3 in Table 1. We strive to keep the structure of the CNN as simple as possible to avoid the classical part overshadowing the quantum part in the classification process. Consequently, we choose a CNN with two convolutional layers. The first layer has one input channel (as the images are grayscale) and 10 output channels, while the second layer has 10 input channels and 20 output channels, with a kernel size of (5,5).

Dropout is applied to mitigate overfitting, and the network concludes with two linear layers: the first with an input dimension of 320 and an output dimension of 50, and the final layer feeds into the quantum layer.
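As a sketch of this classical preprocessing stage, the layer sizes stated above (1→10 and 10→20 channels with 5×5 kernels, dropout, and linear layers 320→50→2(d−1)) can be assembled in PyTorch as follows. The activation functions and pooling/dropout placement are our assumptions; the text only fixes the layer dimensions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassicalHead(nn.Module):
    """Classical preprocessing CNN sketched from the description above."""

    def __init__(self, d):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)    # grayscale -> 10 channels
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)   # 10 -> 20 channels
        self.drop = nn.Dropout2d()                      # mitigates overfitting
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 2 * (d - 1))           # feeds the quantum layer

    def forward(self, x):
        # 28x28 -> conv5 -> 24 -> pool -> 12 -> conv5 -> 8 -> pool -> 4
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.drop(self.conv2(x))), 2)
        x = x.flatten(1)                                # 20 * 4 * 4 = 320 features
        x = F.relu(self.fc1(x))
        return self.fc2(x)                              # dimension 2(d - 1)
```

With 28×28 MNIST inputs the spatial sizes work out to 24→12→8→4, so the flattened feature vector indeed has 20·4·4 = 320 entries, matching the first linear layer.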
Figure 7 presents the results of our study. The top plot explores the behavior of the hybrid network when the number of digits to be classified aligns with the number of levels of the qudit, denoted as d, as a function of the number of layers, L, in the ansatz. The reference states correspond to the orthonormal basis states of the system.
The lower plot illustrates the performance of our classifier when confronted with the complete dataset of 10 digits, as a function of both the qudit dimension and the number of layers in the ansatz. In all scenarios, the hybrid CNN optimizes both its classical and quantum layers simultaneously. This allows us to assess the impact of the quantum layer on the overall classification process. For all these simulations, the explicit method with fixed centers was employed.
The black line, in all scenarios of Fig. 7, represents the accuracy obtained using the exact same classical preprocessing CNN utilized in conjunction with the quantum layer in the hybrid model. However, in this case, the quantum layer is replaced by a linear classical layer with N outputs, where N is the number of digits being classified in each scenario. The loss function employed is the cross-entropy loss available in the PyTorch package. Additionally, the shaded grey region covering the black line indicates the dispersion in results obtained from 10 different instances of the algorithm, each with different initial conditions for the parameters. The shaded area represents the maximum and minimum validation accuracy values obtained during the training process for the validation set, while the solid line represents the mean value. Unfortunately, replicating these statistics for the hybrid model was not feasible due to the computationally intensive training process it requires.
The study reveals that when N = d, the hybrid model achieves 100% accuracy for 2 and 3 digits, surpassing the mean accuracy of the classical network for a sufficient number of layers in all cases. However, when considering the full dataset, the quantum layer appears to act as a bottleneck, resulting in a lower accuracy compared to the classical counterpart. Nevertheless, as more nonlinearity (i.e., layers) is incorporated into the quantum ansatz, the hybrid model progressively approaches the performance of the classical results. In Figure 8 we further visualise how adding layers (non-linearity) favours the separation between clusters of points on the Bloch sphere for the case of a qubit.
To conclude, Table 2 shows the results of the test accuracy obtained for N_test = 600 images/digit. We see that, although in Figure 7 it seemed that the qudit was achieving better results than the classical network alone, on the test data set it fails to surpass it for the number of levels and layers that we were able to explore.

Computational complexity
Throughout the different examples we have been discussing the differences in terms of performance between the explicit and implicit methods introduced in section 2.2. Now, we want to discuss the challenge that each method poses for a possible experimental implementation. The implicit method involves a higher training load. The absence of fixed centres means a larger number of terms to evaluate, essentially the cross terms of the type Tr(ρ_i ρ_j). These can be evaluated with different techniques like the SWAP test [54] or the inversion test [41,42]. In any case, this is an overhead in terms of processing compared to the explicit case.
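For pure states, the cross term Tr(ρ_i ρ_j) that the SWAP or inversion test estimates reduces to the squared overlap |⟨ψ_i|ψ_j⟩|², as the following minimal numpy check illustrates (the example states are arbitrary):

```python
import numpy as np

def overlap(psi_i, psi_j):
    """Tr(rho_i rho_j): the quantity a SWAP test estimates on hardware."""
    rho_i = np.outer(psi_i, psi_i.conj())
    rho_j = np.outer(psi_j, psi_j.conj())
    return np.trace(rho_i @ rho_j).real

psi_0 = np.array([1, 0], dtype=complex)                  # |0>
psi_plus = np.array([1, 1], dtype=complex) / np.sqrt(2)  # |+>
print(overlap(psi_0, psi_plus))  # 0.5, i.e. |<0|+>|^2
```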
Another advantage of the explicit approach is that no full state tomography is needed in order to evaluate the cost function. Given that our ansatz is universal, i.e. that it is able to generate any wavefunction [20,43], and that we know the exact coefficients of each reference state, we can relate both to design a unitary operation that implements the desired reference state. Let us take a general reference state |ψ_R⟩ = Σ_j c_j e^{iβ_j} |j⟩, with c_j, β_j ∈ R and β_0 = 0. Then, the unitary operation U_R that transforms the ground state into |ψ_R⟩ has phases ϕ_k = −β_{k+1} − (k+1)π/2 − Σ_{l<k} ϕ_l and angles θ_k = 2 arccos(c_k/P_k), where P_k is a factor equal to 1 for k = 0 and P_k = Π_{l<k} sin(θ_l/2) otherwise. To compute the cost function as defined in Eq. (8), we can apply U_R† after the ansatz operation, U(⃗x; ⃗φ), and, just by measuring the population of the ground state, obtain the fidelity between the reference and the produced state, F = |⟨0|U_R† U(⃗x; ⃗φ)|0⟩|². This can be related to Eq. (8) through Eq. (5) by noting that Tr(ρ_k |ψ_{R_k}⟩⟨ψ_{R_k}|) = F_k, so we can reformulate L_E as in Eq. (18).
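The angle construction can be checked numerically: inverting it reproduces the coefficients via c_k = cos(θ_k/2) · Π_{l<k} sin(θ_l/2). The sketch below assumes real, non-negative coefficients for simplicity (the phases ϕ_k only enter through the β_j):

```python
import numpy as np

def preparation_angles(c):
    """Rotation angles theta_k = 2 arccos(c_k / P_k), with
    P_0 = 1 and P_k = prod_{l<k} sin(theta_l / 2)."""
    thetas = []
    P = 1.0
    for k in range(len(c) - 1):                     # d - 1 angles for a d-level state
        theta = 2 * np.arccos(np.clip(c[k] / P, -1.0, 1.0))
        thetas.append(theta)
        P *= np.sin(theta / 2)
    return np.array(thetas)

def amplitudes_from_angles(thetas):
    """Invert the construction: c_k = cos(theta_k/2) * prod_{l<k} sin(theta_l/2)."""
    c = []
    P = 1.0
    for theta in thetas:
        c.append(np.cos(theta / 2) * P)
        P *= np.sin(theta / 2)
    c.append(P)                                     # last amplitude is the remainder
    return np.array(c)

c = np.array([0.5, 0.5, 0.5, 0.5])                  # a normalized d = 4 reference state
thetas = preparation_angles(c)
print(np.allclose(amplitudes_from_angles(thetas), c))  # True
```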
Just by measuring the population of the state |0⟩ (for which N_shots repetitions will be needed, depending on the experimental platform and its read-out noise) for each data point, we can obtain the value of the loss function in N_shots · N_training measurements. Therefore, in the training phase, we will need to run the circuit N_shots · N_training · N_eval times, where N_eval is the number of evaluations of the circuit required by the classical optimizer to achieve convergence. Typically, N_eval = N_epochs · N_ge, where N_epochs stands for the number of epochs that the optimizer searches for the solution and N_ge is the number of evaluations per epoch that the classical algorithm needs to estimate the gradient. Finally, in the test phase, we will need to run the circuit up to N_shots · N_test · K times, as each test point needs to be compared with each of the reference states in order to assign it the most probable class.
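This bookkeeping is a simple product of counts; the numbers plugged in below are purely illustrative, not values from our simulations:

```python
def circuit_runs(n_shots, n_training, n_test, n_epochs, n_ge, n_classes):
    """Total circuit executions for training and testing, as counted above."""
    n_eval = n_epochs * n_ge                     # optimizer evaluations: N_epochs * N_ge
    training_runs = n_shots * n_training * n_eval
    test_runs = n_shots * n_test * n_classes     # compare with every reference state
    return training_runs, test_runs

train_runs, test_runs = circuit_runs(n_shots=1000, n_training=100, n_test=50,
                                     n_epochs=30, n_ge=2, n_classes=3)
print(train_runs, test_runs)  # 6000000 150000
```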

Dissipation and decoherence
Finally, we investigate how our model behaves in the presence of both decay and decoherence.
To achieve this, we formulate a Lindblad-like master equation that realistically models a quantum system. The dynamics is given by the master equation (19) [55,56]. The first term on the right-hand side represents the unitary part, generated by the Hamiltonian (20). Without noise, this term induces the unitary evolution described by Eq. (2). Physically, this Hamiltonian arises from a driven system, H = Σ_j ϵ_j |j⟩⟨j| + Ω_R cos(ω_d t + ϕ) Σ_{k,l} (|k⟩⟨l| + h.c.) δ_{k,l+1}. Here, ϵ_j represents the energy of the eigenstate |j⟩, Ω_R is the strength of the perturbation (Rabi frequency), and the δ_{k,l+1} term signifies coupling between neighboring levels in the qudit. Moving to the interaction picture and employing the Rotating Wave Approximation (RWA) when the driving frequency ω_d = ω_{j,j+1} ≡ ϵ_{j+1} − ϵ_j, we arrive at the Hamiltonian (20). The driving frequency, ω_d, determines the addressed transition. The duration of the pulse, t_p, corresponds to the rotation angle θ via the relation Ω_R t_p = θ, and the perturbation phase, ϕ, perfectly matches the phase in the G(ϕ) term (see Eq. (3)).
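The lab-frame Hamiltonian above is straightforward to assemble numerically; the sketch below uses placeholder energies and drive parameters, not the experimental values quoted later:

```python
import numpy as np

def driven_qudit_hamiltonian(energies, omega_R, omega_d, phi, t):
    """H = sum_j e_j |j><j| + Omega_R cos(w_d t + phi) sum_k (|k+1><k| + h.c.),
    i.e. a drive coupling only neighboring qudit levels."""
    d = len(energies)
    H = np.diag(energies).astype(complex)
    drive = np.zeros((d, d), dtype=complex)
    for k in range(d - 1):
        drive[k + 1, k] = 1.0      # |k+1><k|
        drive[k, k + 1] = 1.0      # hermitian conjugate
    return H + omega_R * np.cos(omega_d * t + phi) * drive

# Placeholder qutrit: slightly anharmonic ladder, weak drive
H = driven_qudit_hamiltonian(np.array([0.0, 1.0, 2.1]), 0.05, 1.0, 0.3, t=1.7)
print(np.allclose(H, H.conj().T))  # True: the Hamiltonian is Hermitian
```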
The second term on the right-hand side of the master equation (19) accounts for dissipation and decoherence effects. Specifically, the dissipation operators D_{T_i} are associated with the characteristic times T_i (relaxation and decoherence). Each operation of our ansatz corresponds to the dynamics governed by Eq. (19). The parameters to be optimized are encoded in the time length of the evolution, t_p, as well as in the driving phase, ϕ. These, in turn, are functions of the original data points, ⃗x, which are re-scaled according to the functions described in Section 2.3. Once the state of the system has been generated, ρ(⃗x; ⃗φ), we perform the measurement scheme described in Section 3.5 so that, in the end, we compute the fidelity between the final state ρ(t) and the ground state, F(ρ(t), |0⟩). That is, for each point belonging to N_training, once the sequence of pulses corresponding to the ansatz, U(⃗x, ⃗φ), has been performed, we apply the sequence corresponding to the unitary that implements the reference state assigned to that point, U_R†. By doing so, the result of measuring the population of |0⟩ gives us the value for computing L_E in Eq. (18). In the training phase we do this only once per data point, since there is only one reference state to compute the fidelity with. For the testing phase, we instead measure the overlap between the generated state and all the reference states, to be able to decide with which one it has the largest overlap. Both the fidelity and the preparation of the initial state are assumed to be ideal. That is, the fidelity is computed numerically (we do not perform a stochastic measurement procedure) and the initial state is defined to be a pure state, ρ(t = t_0) = |+⟩⟨+|. A scheme of this procedure is sketched in the pseudo-code of Algorithm 1.
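The classification loop of Algorithm 1 can be paraphrased in a few lines. For brevity, this sketch replaces the Lindblad solve of Eq. (19) with ideal unitary pulses on a qubit, and the pulse parametrization is a generic rotation rather than the exact G(ϕ) of Eq. (3):

```python
import numpy as np

def u_rotation(theta, phi):
    """Single-qubit pulse U(theta, phi): placeholder for one ansatz operation."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * np.exp(-1j * phi) * s],
                     [-1j * np.exp(1j * phi) * s, c]])

def classify(pulse_params, reference_states):
    """Prepare the state for one data point, then assign the class with the
    largest fidelity |<psi_R_k | psi>|^2, as in the testing loop above."""
    psi = np.array([1.0, 0.0], dtype=complex)        # start in the ground state |0>
    for theta, phi in pulse_params:                  # sequence of pulses (the ansatz)
        psi = u_rotation(theta, phi) @ psi
    fidelities = [abs(np.vdot(ref, psi)) ** 2 for ref in reference_states]
    return int(np.argmax(fidelities)), fidelities

refs = [np.array([1, 0], dtype=complex), np.array([0, 1], dtype=complex)]
label, fids = classify([(np.pi, 0.0)], refs)         # a pi pulse drives |0> -> |1>
print(label)  # 1
```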
To evaluate the performance of the aforementioned model, we employ the Iris dataset. Figure 9 depicts the results for both a qubit and a qutrit as a function of the system's decoherence time, denoted as T_2, which constitutes the main limitation in practical implementations. We choose parameter values in accordance with typical experimental settings [17,18,57-59]: ω_ij/2π ∼ 3 GHz, Ω_R/2π = 10 MHz, T_1 = 100 ms, and T_2 ∈ [100 ns, 100 µs]. Furthermore, we explore the impact of adding multiple layers to the model and compare the medians of the resulting test accuracies. For classical optimization, we employ the SPSA optimizer [60], which utilizes only two circuit evaluations per iteration (N_ge = 2), regardless of the dimensionality of the parameter space being optimized. The maximum number of iterations is set at 30 (N_epochs = 30, yielding N_eval = 60). We conduct 50 distinct "experiments" (runs) to gather statistics. The initial conditions for each "experiment" are derived from the best parameters obtained in the preceding run if there was an enhancement in test accuracy compared to its predecessor; otherwise, they are randomly reinitialized. The analysis reveals that by extending the decoherence time, the results quickly converge to those attained in ideal simulations. Notably, with T_2 values of 400 ns for the qubit and 1.5 µs for the qutrit, we achieve maximum accuracies of 0.975 and 0.941, respectively. Additional numerical values are tabulated in Table 3. Hence, the performance of the qudit remains robust even when subjected to realistic noise levels.
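SPSA's defining feature, a gradient estimate built from exactly two loss evaluations per iteration (N_ge = 2) whatever the parameter dimension, can be sketched as follows. The gain values and the test function are illustrative, not the settings used in our simulations:

```python
import numpy as np

def spsa_minimize(loss, theta0, a=0.1, c=0.1, n_iter=100, seed=0):
    """Minimize `loss` with SPSA: each iteration uses exactly two
    loss evaluations, regardless of the parameter dimension."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for _ in range(n_iter):
        delta = rng.choice([-1.0, 1.0], size=theta.shape)   # random +/-1 perturbation
        # Simultaneous perturbation: one forward and one backward evaluation
        g_hat = (loss(theta + c * delta) - loss(theta - c * delta)) / (2 * c) * delta
        theta -= a * g_hat
    return theta

# Toy problem: minimize a quadratic bowl
theta = spsa_minimize(lambda t: np.sum(t ** 2), [1.0, -1.0, 0.5])
print(np.sum(theta ** 2))
```

Note that dividing by each component of delta is the same as multiplying, since every entry is ±1; this is what makes the two-evaluation estimate dimension-independent.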

We can understand the reason behind the failure of the model when the noise is increased (Ω_R · T_2 ≈ 1). In this scenario, the noise terms of the quantum master equation, the second term on the right-hand side of Eq. (19), become predominant. These dissipators drive ρ towards its steady state, which, in our case, is |0⟩. As a result, the test accuracy remains around 1/3. Moving to longer T_2 values, the unitary driven part becomes capable of reaching the other reference states (prior to their relaxation), thereby enhancing the accuracy.

Conclusions
Challenging the performance of quantum processors in machine learning is an intriguing and timely pursuit. Simple systems not only lend themselves to easier experimental implementation but also facilitate a more comprehensive theoretical understanding of their performance. It is with this motivation that we embarked on our investigation in this paper. In particular, we focus on a quantum system with d levels, in which transitions between these levels can be triggered through external fields. By optimizing these transitions, we demonstrate that qudits can effectively learn to classify realistic datasets.
We explore both implicit and explicit metric learning paradigms, along with various encoding strategies. Our comprehensive study leads us to the following conclusions:
• Various problems can derive advantages from distinct encoding strategies. This observation aligns with our discussion in section 3.2.
In Table 1, we presented a range of functions, and it became evident that the explicit method benefits from the g_2 encoding, whereas the implicit method finds better compatibility with the g_1 encoding. Moreover, when considering the hybrid model introduced in section 3.4, it becomes apparent that the g_3 encoding aligns more effectively with the overall structure of the model.
• Within the metric learning framework, we have explored both explicit and implicit methods in depth, substantiating our exploration with a geometric rationale that interlinks the two approaches. Notably, both methodologies exhibit remarkable performance, even rivaling some of the most adept classical algorithms. It is worth noting, however, that of the two, the explicit method emerges as the one most amenable to immediate experimental implementations.
• These statements find validation through simulations that incorporate prevalent noise sources found in today's physical devices.
• Being able to generate any set of MOS thanks to a genetic algorithm developed for this purpose [45], we outline in section 4 a step-by-step algorithm that has demonstrated robust adaptability, and we are confident in its suitability for diverse experimental implementations that align with the discussed requisites.
• To conclude, it is worth highlighting that the studies conducted encompass real-world datasets, underscoring the adaptability of the tools discussed here. When confronted with datasets of substantial size like MNIST, surpassing the capacity of the physical unit, we devised a hybrid algorithm that yields satisfactory performance. Nevertheless, it remains evident that physical units with limited levels, such as the ones that we have access to simulate, serve as bottlenecks, unable to surpass the efficacy of standard classical methods.

A Genetic algorithm
A genetic algorithm is an optimization method inspired by Darwinian natural selection. It involves an evolving population of potential solutions to an optimization problem over successive generations. During each generation, the most adept individuals within the population are chosen. Through processes of mating, mutation, and survival, they give rise to the subsequent generation. This cycle, sketched in Fig. 10 (a), is repeated until a point of convergence is achieved. The mating process aims to probabilistically amalgamate favorable traits from one individual with those of another, resulting in an overall improved individual. Mutation, on the other hand, introduces randomness, permitting an exploratory journey through the solution space. Survival introduces a deterministic aspect to the algorithm. By allowing the fittest individuals to persist, a consistent progression towards enhanced solutions is ensured, without being hindered by random setbacks. The challenge of generating maximally orthogonal states constitutes a suitable application of a genetic algorithm. This is due to the straightforward encoding of sets of states as individuals, after which the mating process can be implemented as a recombination of states from the two parent sets. In our particular setting, for a given qudit dimension d, a population, P_{d,K} = {Φ^p_{d,K}}_{p=1}^N, will be a collection of sets of K potentially maximally orthogonal states, {|ψ_k⟩}_{k=1}^K (the individuals), with N the size of the population. We seek to maximise the
fitness, which is defined in our case as the negative of the energy in Definition 2.1. Mating is implemented as a random recombination of the states of two individuals in such a way that the geometric structure is preserved in the process. In addition to the usual mechanisms of a genetic algorithm, we also apply local optimization to the parents of each generation.
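A minimal elitist version of this loop, with the fitness taken as the negative summed pairwise overlap as a stand-in for the energy of Definition 2.1, might look as follows; the population size, mutation strength, and generation count are illustrative, and the local-optimization step is omitted:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_state(d):
    v = rng.normal(size=d) + 1j * rng.normal(size=d)
    return v / np.linalg.norm(v)

def fitness(individual):
    """Negative summed pairwise overlap: larger means closer to orthogonality."""
    K = len(individual)
    return -sum(abs(np.vdot(individual[i], individual[j])) ** 2
                for i in range(K) for j in range(i + 1, K))

def mate(a, b):
    """Recombination: each state of the child is drawn from one parent set."""
    return [a[k] if rng.random() < 0.5 else b[k] for k in range(len(a))]

def mutate(individual, strength=0.1):
    """Perturb every state and renormalize: random exploration of the space."""
    out = []
    for v in individual:
        w = v + strength * (rng.normal(size=v.size) + 1j * rng.normal(size=v.size))
        out.append(w / np.linalg.norm(w))
    return out

def evolve(d=2, K=4, pop_size=20, generations=50):
    pop = [[random_state(d) for _ in range(K)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)        # fittest individuals first
        survivors = pop[:pop_size // 2]            # elitist survival
        children = [mutate(mate(survivors[rng.integers(len(survivors))],
                                survivors[rng.integers(len(survivors))]))
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
print(-fitness(best))  # energy of the best set found
```

For K = 4 states in d = 2, the tight-frame bound on the summed overlaps gives an energy of at least K²/d − K = 4 over ordered pairs, i.e. 2 over unordered pairs, so the returned set can be judged against that limit.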
Throughout the main text this algorithm has been used to define the MOS in the case of the MNIST digits dataset, for example. Specifically, in Fig. 10 (b) we show the difference in the results produced by the algorithm between 4 MOS in a qubit and a qutrit. These, in the case of the qubit, are further represented on the Bloch sphere in Fig. 8 of the main text.

B PCA performance for image classification
Here we deal with the hand-written digits dataset from the Python package sklearn. In Fig. 11 we compare the performance of the classifier in both training and testing. We also compare the performance when we only try to classify d digits, where d is the dimension of the qudit, and when we try to do so for the full dataset. In the first scenario, the best performance is achieved for the qutrit, as it has to deal with a smaller parameter-space dimension. However, for the d = 5 qudit we see that the fidelities are higher than in the 10-digit case (panel (b)). In the latter, we can see that there is no big difference between the two qudit sizes. The compression of the original data with the PCA technique was different for each size. The final dimension of the data worked with in the algorithm is 2(d − 1), i.e. 4 for the qutrit and 8 for the d = 5 qudit. These low dimensions compared to the original may be the cause of the poor performance of our classifier. That is why in Table 4 we study the dependence of such performance on the PCA dimension. Our test subject is a qutrit trying to classify 5 digits with just one layer.

Figure 2 :
Figure 2: Comparison between ansätze in the explicit framework. Top: Comparison between different ansatz structures (design of the basic pulses). For the red rounded points, a non-orthogonal set of states is used as basis. The black square point represents the results when the original orthonormal basis is used, and the last salmon square point marks the result for a qutrit (also with the orthonormal basis). Bottom: Comparison between different embedding functions of the data (how the original data is transformed into the parameters of the pulses). From darker to lighter colors: g_1, g_2 and g_3 from Table 1.

Figure 3 :
Figure 3: Comparison between frameworks for different datasets. Top: Results for the Breast Cancer Wisconsin data set. Bottom: Results for the Iris data set. Darker colors represent the results for the implicit method and lighter colors represent the explicit method.

Figure 4 :
Figure 4: Performance comparison between the best classical models and our best results. The results for the classical models were extracted from [47], and the bars for the single-qudit results cover the distribution of medians and maximum test accuracies obtained for all qudit dimensions in Figure 3.

Figure 5 :
Figure 5: Different mappings produced by each method in the Iris data set. On the left side of the figure we represent each wavefunction produced by the circuit after training. The color of each point represents the class to which it belongs. The same holds for the arrows (reference states). On the right side we plot the overlap between these reference states. Although not perfectly, the implicit method manages to find states that are practically homogeneous and maximally orthogonal in this case.

Figure 6 :
Figure 6: Hybrid Convolutional Neural Network (HCNN). In this model, we propose the use of Convolutional Neural Networks to act as a dimension-reduction technique that preserves the spatial correlations present in an image. We exemplify the d-level system with its possible implementation in a molecular qudit [20].

Figure 7 :
Figure 7: Results obtained for the HCNN model. Top: Results for the partial uses of the data set. We use the same number of digits as levels available in the qudit. Bottom: Results for the full data set (10 digits). Line colors indicate the number of layers used (top image) and the number of levels in the qudit (bottom image). In both cases we plot the validation accuracy, as it is the quantity we keep track of across all the epochs.

Figure 8 :
Figure 8: Mapping for the MNIST data set with 2 digits (plots (a) and (b)) and with 4 digits (plots (c) and (d)) using the explicit method. Left-side plots are the mappings obtained for L = 1 and right-side plots the ones obtained for L = 5.

Figure 9 :
Figure 9: Median of the distribution of test accuracies obtained with the noisy model for different qudit dimensions and numbers of layers. Top: Qubit results. Adding more layers helps to slightly increase the median value, except for some points. Bottom: Qutrit results. Here adding more layers does not provide any benefit in general (it obstructs optimizer convergence) and we can see that the median values are higher than in the qubit case.

Figure 10 :
Figure 10: Genetic algorithm to determine maximally orthogonal states whatever the dimension of the qudit might be. (a) Schematic of a generation of the algorithm. In this representation we show the probability of a crossover occurring as ρ_c and the probability of a mutation occurring as ρ_m. (b) Example of the determination of 4 maximally orthogonal states in two systems. The colour indicates the intensity of the overlap between each pair of states. As expected, in the qutrit (d = 3), having a higher dimension than the qubit, it is easier to accommodate the same number of states while maintaining a smaller overlap.

Figure 11 :
Figure 11: Fidelity obtained for both testing and training with a qutrit and with a d = 5 qudit for different numbers of layers. In (a) we check the performance for the same number of digits as levels in the qudit and in (b) for the full 10-digit dataset. SCF and WCF stand for Simple Cost Function and Weighted Cost Function, respectively. They refer to the cost functions used in Ref. [37]. The Simple one is analogous to Eq. (8) in the main text. The Weighted one was considered for the sake of completeness. The size of the marker indicates the number of layers employed, from 1 (smallest) to 5 (biggest).

Table 2 :
Test accuracy obtained from the best models. Each best model is defined as the one with the highest validation accuracy throughout the training.

Table 3 :
Test accuracies for the Iris dataset using the noisy model. The maximum test accuracy values shown correspond to the minimum T_2 value for each respective row.

Table 4 :
Fidelities obtained with a one layer qutrit ansatz classifying 5 digits as a function of the PCA's dimension.