A hybrid quantum–classical neural network for learning transferable visual representation

State-of-the-art quantum machine learning (QML) algorithms fail to offer practical advantages over their notoriously powerful classical counterparts, due to the limited learning capabilities of QML algorithms, the constrained computational resources available on today's noisy intermediate-scale quantum (NISQ) devices, and the empirically designed circuit ansatz for QML models. In this work, we address these challenges by proposing a hybrid quantum–classical neural network, which we call QCLIP, for Quantum Contrastive Language-Image Pre-Training. Rather than training a supervised QML model to predict human annotations, QCLIP focuses on more practical transferable visual representation learning, where the developed model can be generalized to work on unseen downstream datasets. QCLIP is implemented by using classical neural networks (CaNNs) to generate low-dimensional data feature embeddings, followed by quantum neural networks (QuNNs) to adapt and generalize the learned representation in the quantum Hilbert space. Experimental results show that the hybrid QCLIP model can be efficiently trained for representation learning. We evaluate the representation transfer capability of QCLIP against the classical Contrastive Language-Image Pre-Training (CLIP) model on various datasets. Simulation results and real-device results on the NISQ IBM_Auckland quantum computer both show that the proposed QCLIP model outperforms the classical CLIP model in all test cases. As the field of QML on NISQ devices is continually evolving, we anticipate that this work will serve as a valuable foundation for future research and advancements in this promising area.


Introduction
The recent phenomenal investment in and rapid development of quantum computing hardware have ushered in the noisy intermediate-scale quantum (NISQ) [1] era, where quantum machines are expected to support 50 to 100 qubits (quantum bits) and around 10^3 quantum operations within the coherence time of the physical qubits. In table 1, we summarize the key features of two state-of-the-art quantum computers: IonQ Forte [2], launched in 2022, and IBM Heron [3], slated for 2023. As the table shows, NISQ computers suffer from errors due to imperfect qubit control and external interference. Current error rates on NISQ devices greatly exceed the 10^−15 error rate required for many quantum algorithms [4][5][6][7][8][9][10][11][12] to achieve computational advantages. Although fault-tolerant quantum computers are theoretically feasible by incorporating quantum error-correction protocols [13][14][15], their practical implementation with millions of physical qubits may take decades of research.
a wide range of applications including material discovery [33,34], medical health [35,36], and financial services [37,38]. Despite demonstrated advantages, state-of-the-art QML models have yet to solve practical problems due to the limited learning capabilities of QML algorithms, the constrained computational resources available on NISQ computers, and the empirically designed circuit ansatzes for QML models. First, most QML algorithms [27][28][29] focus on supervised classification by training models to predict class labels on test data that is generated from the same distribution as the training data. However, sufficient labeled training data for real-world tasks is usually unavailable [39,40] or prohibitively expensive [41] to obtain. Moreover, representations learned from supervised QML are restricted to a set of 'golden labels', which greatly limits the generalization and transferability of the developed models on datasets that are generated from different distributions [42]. Second, NISQ computers suffer from limitations in terms of qubit number and coherence time. The input size for real-world datasets is normally millions of tensors with millions of entries each; however, current NISQ devices can only work with small-scale toy benchmarks with input sizes of 2 × 2 or 4 × 4 [43][44][45][46]. How to achieve quantum advantages in practical-scale problems with NISQ devices is therefore of great research significance. Third, QML models are typically implemented as parameterized quantum circuits [43][44][45][46][47][48][49][50] consisting of a classical-to-quantum data encoder and repeated layers of a variational quantum circuit (VQC). The circuit architectures for the data encoder and the VQC ansatz are currently empirically designed or simply randomly assigned.

Our contributions
In this work, we address the aforementioned challenges by proposing a hybrid quantum-classical neural network architecture for learning transferable visual representation, which we call QCLIP, for Quantum Contrastive Language-Image Pre-training. Our main contributions can be summarized as follows:

Learning transferable visual representation
Supervised representation learning methods [42,[53][54][55][56][57][58][59] suffer from the prohibitively expensive cost of preparing labeled data and from poor representation transferability to unseen downstream datasets. Therefore, learning transferable visual representations has been proposed and has become a long-standing core problem in machine learning. Given a source domain D_S with a source task T_S and a target domain D_T with a target task T_T, the goal of transferable visual representation learning is to improve the target function f_T(·) by reusing the representation learned from D_S and T_S, where D_S ≠ D_T or T_S ≠ T_T. Recent works [52,[60][61][62][63][64][65][66][67][68][69][70][71][72] encourage models to extract underlying explanatory factors hidden in the image by using unlabeled data in an unsupervised fashion, rather than just predicting human annotations. Given the virtually unlimited free raw data available on the Internet, this produces a model with better performance and, most importantly, the learned perception enables flexible representation transfer to unseen downstream datasets. Among prior work, the CLIP method [52] has demonstrated state-of-the-art visual representation transfer performance. CLIP collects over 400 M (image, text) pairs and trains an image encoder and a text encoder jointly with a task-agnostic contrastive loss [66,67]. It is worth mentioning that the text descriptions are often referred to as 'prompts' and that their design is critical to CLIP performance. Once the training is complete, the quality of the visual representations learned by CLIP can be evaluated via different methods [73], including (1) zero-shot inference, which directly generalizes the learned CLIP model to an unseen dataset; (2) one-shot (or few-shot) prompt learning, which trains a lightweight prompt adapter neural network [74][75][76] using one (or a few) training samples per class from the target dataset; and (3) linear probing, which connects the pre-trained image encoder with a linear classifier [52,66,67] fully trained on a sufficiently large number of training samples from the target domain. In general, zero-shot inference and linear probing respectively set the lower and upper bounds on model transferability, while one-shot (or few-shot) prompt learning achieves intermediate performance because it considers a more practical scenario where the target dataset is neither completely inaccessible nor fully accessible.

QuNNs
As illustrated in figure 2, a standard QuNN begins with a classical-to-quantum encoder E(x) that encodes a classical input vector x into an N_Q-qubit quantum state |x⟩ [77]:
|x⟩ = E(x)|0⟩^{⊗N_Q} = ⊗_{i=0}^{N_Q−1} R(x_i)|0⟩, (1)
where R denotes one-qubit gates {RX, RY, RZ} or their combinations, commonly referred to as angle encoding. Note that in this work, we exclude the amplitude encoding method due to its high O(2^{N_Q}) circuit depth, which makes a QuNN more error-prone [44]. Instead, we focus on angle encoding, which uses N_Q qubits and a constant-depth quantum circuit to encode an N_Q-dimensional classical input. The generated |x⟩ state is often referred to as a quantum input feature map and is manipulated by a subsequent VQC U(θ):
|y(θ)⟩ = U(θ)|x⟩, U(θ) = U_{L_U}(θ_{L_U}) ⋯ U_1(θ_1),
where U(θ) is implemented as a concatenation of a VQC ansatz in L_U repeated layers, and θ_k is a set of trainable variables for the k-th layer. As illustrated in figure 2, VQC ansatzes used in mainstream QML models [43][44][45][46] are normally constructed from single-qubit rotation gates followed by two-qubit entanglement gates. The final output results are obtained by quantum state measurement, M, that maps the output quantum state |y(θ)⟩ to a classical vector y(θ):
y(θ) = M(|y(θ)⟩).
By default, qubits are measured in the z-basis for implementation simplicity. Globally, the full QuNN can be written as
QuNN(x, θ) = M(U(θ)E(x)|0⟩^{⊗N_Q}).
A QuNN model is evaluated by a pre-defined loss function L(·) and iteratively trained to obtain optimal parameters via hybrid quantum-classical gradient descent [78]:
Update rule: θ ← θ − η∇_θ L(QuNN(x, θ)),
where η denotes the learning rate.
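The QuNN pipeline described above (angle encoding, a layered VQC, z-basis measurement, and a gradient update) can be illustrated with a small pure-Python state-vector simulation. This is only a sketch under several assumptions that are not taken from this work: R is chosen as RY, each VQC layer is one RY rotation per qubit followed by a chain of CNOT gates, the loss is simply the z-expectation of the first qubit, and gradients are obtained with the parameter-shift rule. All function names (`qunn`, `shift_grad`, `train_step`) are hypothetical.

```python
import math

def ry(theta):
    """Matrix of the single-qubit rotation RY(theta)."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -s], [s, c]]

def apply_1q(state, gate, q, n):
    """Apply a 2x2 gate to qubit q of an n-qubit state (qubit 0 is the leftmost bit)."""
    out, bit = list(state), 1 << (n - 1 - q)
    for i in range(len(state)):
        if not i & bit:
            a, b = state[i], state[i | bit]
            out[i] = gate[0][0] * a + gate[0][1] * b
            out[i | bit] = gate[1][0] * a + gate[1][1] * b
    return out

def apply_cnot(state, c, t, n):
    """Apply CNOT (control c, target t) by swapping the affected amplitudes."""
    out, cb, tb = list(state), 1 << (n - 1 - c), 1 << (n - 1 - t)
    for i in range(len(state)):
        if i & cb and not i & tb:
            out[i], out[i | tb] = state[i | tb], state[i]
    return out

def qunn(x, thetas):
    """Return y(theta): the z-basis expectation value of every qubit."""
    n = len(x)
    state = [1.0] + [0.0] * (2 ** n - 1)      # |0...0>
    for q in range(n):                        # encoder E(x): angle encoding (assumed RY)
        state = apply_1q(state, ry(x[q]), q, n)
    for layer in thetas:                      # U(theta): one entry of thetas per VQC layer
        for q in range(n):
            state = apply_1q(state, ry(layer[q]), q, n)
        for q in range(n - 1):
            state = apply_cnot(state, q, q + 1, n)
    return [sum((-1 if i & (1 << (n - 1 - q)) else 1) * abs(a) ** 2
                for i, a in enumerate(state)) for q in range(n)]

def shift_grad(x, thetas, out=0):
    """Parameter-shift gradient of the out-th expectation value w.r.t. every theta."""
    grads = [[0.0] * len(layer) for layer in thetas]
    for k in range(len(thetas)):
        for j in range(len(thetas[k])):
            plus = [list(layer) for layer in thetas]
            minus = [list(layer) for layer in thetas]
            plus[k][j] += math.pi / 2
            minus[k][j] -= math.pi / 2
            grads[k][j] = 0.5 * (qunn(x, plus)[out] - qunn(x, minus)[out])
    return grads

def train_step(x, thetas, lr=0.1):
    """One gradient-descent update, theta <- theta - lr * grad, for the toy loss y_0."""
    grads = shift_grad(x, thetas)
    return [[t - lr * g for t, g in zip(layer, gl)] for layer, gl in zip(thetas, grads)]
```

Calling `train_step` repeatedly on a fixed input lowers the toy loss `qunn(x, thetas)[0]`. On hardware, each call to `qunn` would be replaced by circuit executions, which is why the parameter-shift rule (two circuit evaluations per parameter) is a common choice in hybrid training.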

Theoretical Insights
While QML theory is continually evolving and still in its nascent stages, this work provides insights on the optimal quantum encoder and variational circuit ansatz designs (see appendix C) based on the current state of QML theory research. However, it is important to note that the field currently lacks a standardized consensus. As a result, the discussions presented may be subject to changes or even controversies as our understanding of QML progresses.

Method
In this section, we present the details of the proposed hybrid quantum-classical neural network architecture. In section 3.1, we describe the general QCLIP framework and introduce QCLIP representation transfer for zero-shot inference, one-shot (or few-shot) quantum prompt learning, and fully supervised linear probing. In section 3.2, we present the implementation of the QuNNs used in QCLIP. Finally, in section 3.3, we discuss the training approach of QCLIP.

The QCLIP framework
The core idea of QCLIP is to learn image representations by contrasting them with the text prompts of the images, the same as in classical CLIP [52]. QCLIP is inspired by recent research advances in quantum-enhanced feature learning [32], which exploits the quantum mechanical principles of superposition, entanglement, and interference. Instead of using purely QuNNs on small datasets as in [32], the proposed QCLIP architecture combines CaNNs and QuNNs in one framework. QCLIP can thus leverage CaNNs for large dataset preprocessing while utilizing QuNNs for quantum-enhanced feature adaptation and generalization.

QCLIP overview
As shown in figure 3(a), each high-dimensional input (image, text) pair (x_i, x_t) is first processed by CaNNs to generate a compact low-dimensional data embedding in the classical feature space, and then QuNNs are utilized to further adapt the embeddings in an exponentially large quantum Hilbert space. Taking the hybrid image encoder network as an example, it utilizes a classical ViT-B/32 model [52] to produce a low-dimensional classical image embedding vector x^c_i and then uses a QuNN, QuNN_i(x^c_i, θ_i), to map x^c_i to the quantum state space. A classical image embedding I is eventually generated via quantum measurements in the z-basis. Similarly, the hybrid text encoder is implemented as a classical 12-layer, 512-wide text Transformer model with 8 attention heads [79] followed by a QuNN, QuNN_t(x^c_t, θ_t), to generate a text embedding vector T. Note that I and T share a common dimensionality, specifically N_Q, which corresponds to the number of qubits utilized in the QuNNs. At training time, QCLIP is optimized to predict the correct pairings of a batch (with a batch size B) of (I_k, T_j) (0 ⩽ k, j < B) pairs using symmetric cross-entropy loss. The ViT-B/32 and text Transformer models are selected as classical feature extractors because they have demonstrated the best performance in classical CLIP models [52].

QCLIP representation transfer
We evaluate the transfer capability of learned QCLIP visual representations using all mainstream evaluation methods introduced in section 2.1. Below we describe the detailed configuration for each method.
Zero-shot inference assumes no access to the target dataset at all. Assuming the downstream dataset has N class names, we reuse the pre-trained QCLIP and compute the text embeddings, {T_1, T_2, ..., T_N}, for each target class name, denoted as x in figure 3(b). A test image is processed by the image encoder to generate a feature embedding, I. The similarity between I and {T_1, T_2, ..., T_N} is then calculated and normalized into a probability distribution via a softmax function. We identify the most probable (image, text) pair as the output prediction. Prior work [52] shows that the transferability of the classical CLIP model is greatly impacted by the input text that describes the image and finds that using a text template improves performance. We follow the same text template engineering and ensembling schemes as in [52].
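The zero-shot prediction step described above (similarity, softmax, most probable class) can be sketched in a few lines of standard-library Python. The choice of cosine similarity is an assumption borrowed from common CLIP practice; the text above only states that a similarity is computed. The function names and the toy embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_predict(image_emb, text_embs):
    """Return (index of the most probable class, class probabilities)."""
    probs = softmax([cosine(image_emb, t) for t in text_embs])
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

# Toy example with N = 3 classes and 2-dimensional embeddings.
image = [1.0, 0.0]
texts = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
prediction, probabilities = zero_shot_predict(image, texts)
```

In QCLIP the vectors `image` and `texts` would be the measured N_Q-dimensional embeddings I and {T_1, ..., T_N}; here they are placeholders.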
One-shot (or few-shot) prompt learning targets a more practical scenario where one (or a few) training samples per class from the target dataset are available at test time. Various prompt learning algorithms [74][75][76] have recently been proposed to alter the functionality of a pre-trained model across domains. However, none of these schemes can be directly applied to QuNNs. In this work, we introduce a quantum prompt learning algorithm.
As denoted by y in figure 3(b), we design a domain prompt adapter, QuNN_p(I, θ_p), which is implemented as a parameterized QuNN. At training time, the quantum prompt adapter takes the image vector I as input and generates a prompt T using one (or a few) unlabeled images x_i from the target training dataset. T has the same width as the text embedding vectors and is added to all the original class embeddings to generate an adapted set of text pairing embeddings, denoted as {T^p_1, T^p_2, ..., T^p_N}. At test time, we utilize the domain-adapted text embeddings {T^p_1, T^p_2, ..., T^p_N} instead of the general QCLIP text embeddings {T_1, T_2, ..., T_N} to compute the similarity between the input image and the predicted classes.
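The adaptation step itself is a vector addition: the prompt produced by the adapter is added to every class embedding. A minimal sketch follows, in which `prompt` stands in for the adapter output; the adapter QuNN is not modelled, and the function name is hypothetical.

```python
def adapt_text_embeddings(text_embs, prompt):
    """Add the domain prompt to every class text embedding: T^p_n = T_n + T."""
    assert all(len(emb) == len(prompt) for emb in text_embs)
    return [[t + p for t, p in zip(emb, prompt)] for emb in text_embs]

# Toy example: two classes, 2-dimensional embeddings, placeholder prompt.
adapted = adapt_text_embeddings([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
```

Because the same prompt is added to all classes, the adapted embeddings differ from the originals only through their position relative to the image embedding, which is what changes the similarity scores at test time.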
Linear probing assumes full access to the target training dataset. We adopt the established linear evaluation protocol [52,66,67] to test the visual representation transfer of QCLIP: we freeze the QCLIP image encoder and train only a linear classification layer on the output of the encoder network. The linear classifier is implemented as a logistic regression model and fully trained on the target datasets for 1000 iterations. We then apply the whole network, consisting of the QCLIP image encoder and the linear classifier head, to the test data and report the classification accuracy.
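As an illustration of the linear evaluation protocol, the following sketch fits a logistic regression classifier on fixed feature vectors, which play the role of the frozen encoder outputs. It is restricted to two classes and uses plain batch gradient descent for brevity; the learning rate, the toy features, and the function names are assumptions, and only the 1000-iteration budget comes from the text above.

```python
import math

def train_linear_probe(features, labels, iters=1000, lr=0.5):
    """Fit binary logistic regression on frozen features with batch gradient descent."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(iters):
        grad_w, grad_b = [0.0] * dim, 0.0
        for f, y in zip(features, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-z))          # predicted probability of class 1
            for d in range(dim):
                grad_w[d] += (p - y) * f[d]
            grad_b += p - y
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, f):
    """Predict class 1 if the logit is positive, else class 0."""
    return int(sum(wi * fi for wi, fi in zip(w, f)) + b > 0)

# Placeholder "encoder outputs": only the classifier weights are trained.
feats = [[1.0, 0.1], [0.9, -0.2], [-1.0, 0.0], [-0.8, 0.3]]
labs = [1, 1, 0, 0]
weights, bias = train_linear_probe(feats, labs)
```

The encoder never appears in the training loop, which mirrors the protocol: the quality of the frozen representation alone determines how well a linear classifier can separate the classes.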

QCLIP implementation on NISQ computers
The classical ViT-B/32 and text Transformer respectively map the original data pair to a 512-dimensional image/text feature vector [52], which is considered a compact classical encoding of the input. Ideally, the CaNNs would pass these 512-dimensional vectors to the QuNNs for further processing; however, the NISQ computers available now only have 50 to 100 qubits. Therefore, we follow common practice [80,81] by inserting a 512-to-N_C fully-connected layer between the classical and quantum layers to compress the initial feature vectors to an N_C-dimensional vector that can be effectively encoded in a practically available N_Q-qubit quantum system. The relationship between N_C and N_Q is determined by the classical-to-quantum encoding method. To investigate the impact of the compressed feature dimension on the final performance, we conducted a study of the accuracy achieved by QCLIP with different N_C, as reported in figure A1 in appendix A. The experimental results demonstrate that increasing N_C leads to improved accuracy and transferability of the QCLIP model.
In conclusion, with a fixed number of qubits N_Q on a quantum computer, the encoder should support as large an N_C as possible, allowing for a more accurate input representation by preserving a greater amount of information from the classical input data. The default angle encoding, which uses N_Q qubits, can only encode N_Q features, motivating the development of a denser encoder to accommodate a larger N_C in this work. Furthermore, the performance improvement with increasing N_C also indicates that advancements in technology and the availability of more qubits will lead to improvements in the implementation scale of QCLIP and its corresponding performance and transferability.

QuNNs
QuNNs used in QML models are currently empirically designed. In this work, we investigate various widely used encoding methods and VQC circuit ansatzes. Based on the performance evaluation, we identify the optimal QuNN circuits for each quantum component in the proposed QCLIP framework. We provide full lists of the candidate quantum encoding methods and VQC ansatzes studied in this work in appendices A and B, respectively.

Quantum image and text encoders
Figure 4 shows the QuNN circuits used in the text and image encoder networks. In this example, we consider a QuNN with only four qubits for simplicity. The number of qubits as well as the number of VQC layers (i.e. L_U) in a generic QCLIP model can be adjusted to fit the problem of interest.
The classical-to-quantum encoder is essential for ensuring QML model accuracy, as it extracts and encodes relevant features from classical data into a quantum format, enabling subsequent processing in the quantum domain. However, the limited number of qubits in current quantum computers presents challenges in effectively embedding classical data, particularly for high-dimensional input datasets. In this work, we follow the generalized dense angle encoding [77] and present a dense classical-to-quantum encoder consisting of a layer of RY gates followed by a layer of U1 gates, as shown in figure 4. Given a classical N_C-dimensional input vector x = (x_0, x_1, ..., x_{N_C−1}), a quantum input feature map is generated by applying the encoding circuits to the ground quantum state |0⟩^{⊗N_Q} of an N_Q-qubit system, where N_C = 2N_Q. This defines the encoder E(x), which applies an RY gate and then a U1 gate to each qubit (see the detailed mathematical derivation in appendix A). In contrast to the conventional encoding method represented by equation (1), which uses N_Q qubits to represent N_Q features, QCLIP leverages the relative phase degree of freedom along with the angles to embed 2× more features using the same number of qubits. On top of this dense encoding method, we also explored data re-uploading [82] and variational encoding [83] to improve QuNN performance. Experimental results (see appendix C) show that these two methods achieve negligible accuracy improvement in a QCLIP model, which contradicts previous conclusions from QML models [82,83] implemented purely by QuNNs. We attribute this mainly to the fact that these two methods primarily provide nonlinearity to a linear QuNN, while in a QCLIP model, nonlinearity is already sufficiently provided by the earlier CaNNs in the framework. Considering the significant implementation and training overhead introduced by data re-uploading and variational encoding, we do not recommend using these two methods in QCLIP.
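To make the 2× density concrete, the sketch below builds the product state produced by an RY layer followed by a U1 layer, using only the Python standard library. For one qubit, U1(b)RY(a)|0⟩ = cos(a/2)|0⟩ + e^{ib} sin(a/2)|1⟩, so one feature sets the rotation angle and a second sets the relative phase. How the N_C features are paired with qubits is an assumption here (features 2q and 2q+1 go to qubit q); the pairing used in this work is given in appendix A and is not reproduced. The function names are hypothetical.

```python
import cmath
import math

def dense_qubit(a, b):
    """Single-qubit state U1(b) RY(a) |0> = cos(a/2)|0> + e^{ib} sin(a/2)|1>."""
    return [complex(math.cos(a / 2)), cmath.exp(1j * b) * math.sin(a / 2)]

def dense_encode(x):
    """Encode 2*N_Q features into an N_Q-qubit product state with 2**N_Q amplitudes."""
    assert len(x) % 2 == 0, "dense encoding consumes two features per qubit"
    state = [1 + 0j]
    for q in range(len(x) // 2):
        # Assumed pairing: x[2q] is the RY angle, x[2q + 1] is the U1 phase.
        qubit = dense_qubit(x[2 * q], x[2 * q + 1])
        state = [amp * c for amp in state for c in qubit]   # tensor product
    return state

# N_C = 16 features fit on N_Q = 8 qubits.
state = dense_encode([0.1 * k for k in range(16)])
```

Note that a z-basis measurement directly after this encoder would be insensitive to the phase features, since they only affect the relative phase; they become observable once the subsequent VQC layers mix the basis states.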
We run experiments with different VQC ansatzes (see details in appendix B) in the QCLIP architecture. The results (see appendix C) show that a VQC using parameterized two-qubit CRX(θ) gates leads to a significant accuracy improvement compared to a baseline VQC with fixed two-qubit CNOT gates, demonstrating that adaptive and flexible entanglement performs better for a QML algorithm than fixed maximal entanglement, which is consistent with the conclusions for supervised QuNN models [43][44][45]. However, we find that further increasing the flexibility by replacing CRX(θ) gates with U3(θ,ϕ,λ) and CROT(ϕ,θ,ω) gates introduces significant hardware overhead and training complexity with no noticeable performance improvement. Therefore, we present the VQC circuit implemented with two-qubit CRX(θ) gates in figure 4.
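The difference between trainable and fixed entanglement can be seen on two qubits. The sketch below applies CRX(θ) to the state (|0⟩ + |1⟩)/√2 ⊗ |0⟩ and reports the concurrence of the result, a standard entanglement measure for pure two-qubit states. The choice of input state and of concurrence as the measure are illustrative assumptions, not part of the experiments reported here.

```python
import math

def crx(theta):
    """Matrix of CRX(theta) in the basis |control target> = |00>, |01>, |10>, |11>."""
    c, s = math.cos(theta / 2), -1j * math.sin(theta / 2)
    return [[1, 0, 0, 0],
            [0, 1, 0, 0],
            [0, 0, c, s],
            [0, 0, s, c]]

def apply(matrix, state):
    """Matrix-vector product."""
    return [sum(m * a for m, a in zip(row, state)) for row in matrix]

def concurrence(state):
    """Concurrence 2|a00*a11 - a01*a10| of a pure two-qubit state."""
    a00, a01, a10, a11 = state
    return 2 * abs(a00 * a11 - a01 * a10)

# Control in (|0> + |1>)/sqrt(2), target in |0>.
plus_zero = [1 / math.sqrt(2), 0, 1 / math.sqrt(2), 0]
for theta in (0.0, math.pi / 2, math.pi):
    entangled = apply(crx(theta), plus_zero)
    print(f"theta = {theta:.3f}, concurrence = {concurrence(entangled):.3f}")
```

For this input the concurrence equals |sin(θ/2)|, so θ = 0 leaves the qubits unentangled and θ = π yields maximal entanglement, the same amount a CNOT would generate on this state. A trainable θ therefore lets the optimizer choose any entanglement strength in between.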

Quantum prompt adapter neural network
The quantum prompt adapter QuNN_p(I, θ_p) takes an image vector I as input and generates a domain-adapted text vector T. In designing the QuNN_p encoder, we chose the default angle encoding over the dense encoder for two main reasons. First, the output dimensionality of QuNN_p must match that of the input vector I to ensure seamless integration with the subsequent components. Second, expanding the dimensionality of I by 2× and using the dense encoder is possible but impractical: I is already a compact representation learned by QuNN_i, and increasing its dimensionality would not provide significant benefits. Moreover, it could introduce unnecessary complexity without improving overall performance. Through experiments, we identified the optimal circuit structure shown in figure 5, consisting of a single layer of RX gates in the encoder and a VQC circuit employing two-qubit CRX(θ) gates for qubit entanglement, following previous work [43].

Training of QCLIP
To determine the parameters in the 512-to-N_C compression layer and the N_Q-qubit QuNNs used in the image and text encoders, we train the QCLIP model using CC3M [84] as a proxy dataset. The training goal is to predict which text as a whole is paired with which image. Specifically, given a batch of B input (image, text) pairs, QCLIP obtains B image embedding vectors and B text embedding vectors. We call (I_k, T_j) a positive pair if k = j and a negative pair if k ≠ j. We define a function that calculates the loss using all these possible pairs and minimize this function via stochastic gradient descent. Intuitively, if information can be successfully passed forward and backward in the hybrid architecture of QCLIP, the measured distance between representations for positive pairs will decrease, while the distance between representations for negative pairs will increase.

Loss function
We consider two widely used loss functions, namely, normalized temperature-scaled contrastive loss [66,67] and symmetric cross-entropy loss [52,85]. We optimize the loss over similarity scores. Experimental results show that symmetric cross-entropy loss outperforms contrastive loss for the training of QCLIP. We provide the pseudocode of cross-entropy loss based QCLIP training in algorithm 1. We also provide details of the contrastive loss in appendix D for comparison.

Input:
1. Batch size: B
2. Label: [1, 2, ..., B]
3. Cross-entropy loss: CE(l, p) = −Σ_i l_i log(p_i), where l_i is the truth label and p_i is the Softmax probability for the i-th class.
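With these inputs, a symmetric cross-entropy loss over a batch can be sketched as follows. This is an illustration in the style of the classical CLIP objective rather than a transcription of algorithm 1: the similarity is assumed to be a plain dot product without temperature scaling, labels are 0-based, and the function names are hypothetical.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]

def symmetric_cross_entropy(image_embs, text_embs):
    """Average of image-to-text and text-to-image cross-entropy over a batch.

    The k-th image and the k-th text form the positive pair, so the truth
    label of row k (and of column k) is k.
    """
    B = len(image_embs)
    # logits[k][j] = similarity between image k and text j (assumed dot product)
    logits = [[sum(a * b for a, b in zip(img, txt)) for txt in text_embs]
              for img in image_embs]
    loss_images = -sum(log_softmax(logits[k])[k] for k in range(B)) / B
    columns = [[logits[k][j] for k in range(B)] for j in range(B)]
    loss_texts = -sum(log_softmax(columns[j])[j] for j in range(B)) / B
    return 0.5 * (loss_images + loss_texts)

# Correctly paired embeddings give a lower loss than swapped ones.
aligned = symmetric_cross_entropy([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_cross_entropy([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is symmetric because each image must pick out its text among the B candidates and each text must pick out its image; uninformative embeddings give a loss of log(B).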

Training method
We implement the classical ViT-B/32 and text Transformer models in PyTorch [86] and the QuNNs using PennyLane [87]. We use a mini-batch size of 128 and train the model for 75 iterations. We use the Adam optimizer and set the learning rate to 0.001. Among all the training hyperparameters, the initialization of parameters in the QuNNs emerges as the most critical factor influencing the final performance of a QCLIP model. This is primarily due to the challenge of gradients that vanish exponentially with the quantum circuit depth and qubit number. For a deeper understanding, interested readers can refer to the theoretical discussion on the effect of parameter initialization on the trainability and performance of QML models provided in [88]. In this work, we study both uniform initialization and Gaussian initialization in QCLIP, as detailed in appendix E. Inspired by classical Xavier initialization [89], we utilize the information of the QuNN structure in the Gaussian initialization by sampling from N(0, σ^2), where σ = 1/√N_Q. Experimental results show that the Gaussian initialization demonstrates better performance in terms of accuracy, training stability, and convergence.
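The two initialization schemes can be sketched as follows. The Gaussian variant uses σ = 1/√N_Q as stated above. The sampling range of the uniform variant, [0, 2π), is an assumption, since the exact configuration is deferred to appendix E, and so is the layout of one trainable angle per qubit per layer. The function names are hypothetical.

```python
import math
import random

def init_gaussian(n_qubits, n_layers, seed=0):
    """Sample VQC angles from N(0, sigma^2) with sigma = 1 / sqrt(N_Q)."""
    rng = random.Random(seed)
    sigma = 1.0 / math.sqrt(n_qubits)
    return [[rng.gauss(0.0, sigma) for _ in range(n_qubits)] for _ in range(n_layers)]

def init_uniform(n_qubits, n_layers, seed=0):
    """Sample VQC angles uniformly from [0, 2*pi) (assumed range)."""
    rng = random.Random(seed)
    return [[rng.uniform(0.0, 2 * math.pi) for _ in range(n_qubits)]
            for _ in range(n_layers)]

# N_Q = 8 qubits and L_U = 2 layers, the configuration used in the experiments.
theta_gauss = init_gaussian(8, 2)
theta_uniform = init_uniform(8, 2)
```

With N_Q = 8, the Gaussian scheme draws angles with a standard deviation of about 0.35 rad, keeping the initial circuit close to the identity, whereas uniformly drawn angles spread over the full rotation range.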

Results and analysis
In this section, we evaluate the effectiveness of the proposed QCLIP model. We follow the general QCLIP architecture and implement a practical design by setting N_C, N_Q, and L_U to 16, 8, and 2, respectively. We run numerical simulations and report results on representation learning in section 4.1. To compare QCLIP with classical CLIP, we create a baseline model by implementing classical CLIP in PyTorch. We follow the training approaches used in the original work [52,81], with the only difference being the insertion of a 512-to-N_C fully-connected layer in the image/text encoder. This modification is made to ensure a fair and equal comparison between the QCLIP and classical CLIP models. In section 4.2, we evaluate the representation transferability of QCLIP and show that QCLIP outperforms the classical CLIP model on various datasets. Section 4.3 provides exploration results for different training configurations. We also implement a proof-of-concept QCLIP on the NISQ IBM_Auckland quantum computer and report its performance results in section 4.4.

Results on QCLIP representation learning
We first verify whether the proposed hybrid QCLIP model can be successfully trained for representation learning. To this end, we train the QCLIP model using CC3M [84] as a proxy dataset for 70 batches and record the training loss after each batch in figure 6(a). Results show that the loss decreases from 6.075 to 4.152 over the course of training, indicating that our model is able to learn. It is notable that the training time for QCLIP is significantly less than that of classical CLIP models, which would typically take several hundred epochs [52]. By comparison, QCLIP is more compute-efficient, which allows us to reach higher overall performance within a limited computing budget.
We further quantitatively study the representation learning ability of QCLIP. We adopt the widely used Hilbert-Schmidt distance as the evaluation metric and report results on several key distances, following the approach taken in related work [52,81]. Figures 6(b)-(d) respectively record the distance between positive and negative pairs (denoted as Distance), the similarity within positive pairs (denoted as Positive Similarity), and the dissimilarity between negative pairs (denoted as Negative Dissimilarity). Throughout the training process, we observe that the measured similarity and dissimilarity undergo the expected changes, indicating successful information propagation both forward and backward in the hybrid architecture of QCLIP. These quantitative results affirm that quantum components can effectively combine with classical resources to achieve meaningful and nontrivial representation learning tasks.

Results on QCLIP representation transfer
QCLIP is pre-trained to predict whether an image and a text prompt are paired together in a source dataset. This capability is then reused to perform zero-shot inference, one-shot prompt learning, and linear probing, to study the representation transfer ability on downstream datasets. To demonstrate the robustness of QCLIP on various datasets with wide distributions, we evaluate QCLIP on four different target datasets: MNIST [90], Cifar10 [51], OxfordPet [91], and Food101 [92].
In table 2, we summarize the performance of QCLIP on each task and highlight the accuracy improvement (denoted as ∆) provided by QCLIP compared to the classical baselines. The quantitative results show that QCLIP is robust on all tested datasets and outperforms classical CLIP on all tasks. While supervised linear probing exhibits the upper bound on model transferability, QCLIP has the lowest performance improvement over CLIP on this task. Notably, one-shot prompt learning benefits the most from QCLIP, with a performance improvement of up to +17.21% on the Food101 dataset. We further increase the shot number from one to ten for both classical CLIP and QCLIP and report the few-shot performance in figure 7. The performance of few-shot prompt learning shows negligible improvement when the shot number increases from one to ten, indicating that the accuracy of the small domain prompt generators saturates rapidly with very few (i.e. one per class) training samples.

Results on different training configurations
As discussed in section 3.3, the QCLIP performance is greatly impacted by the pre-defined loss function and the parameter initialization. Since linear probing represents an upper bound of QCLIP representation transferability, here we use it as a proxy task to explore the impact of different types of loss functions and parameter initialization methods.
Figure 8 compares the QCLIP model accuracy on linear probing when using normalized (Gaussian) initialization (denoted as Q Norm) and uniform initialization (denoted as Q Uniform). Results show that Q Uniform performs better in the first several training runs, while Q Norm provides a better final accuracy (8.2% higher than Q Uniform). These results are consistent with the observation reported in a previous work [93]. Therefore, normalized initialization is adopted in QCLIP training.
Figure 9 reports the performance on linear probing for classical CLIP and QCLIP when using contrastive loss and cross-entropy loss, respectively. In general, cross-entropy loss improves the performance of both classical and quantum models. For classical CLIP training, the cross-entropy loss (denoted as C CrossEntropy) provides a 2.2% accuracy improvement compared to the contrastive loss (denoted as C Contrastive). For QCLIP, a significant 9.6% accuracy improvement is achieved when replacing the contrastive loss (denoted as Q Contrastive) with the cross-entropy loss (denoted as Q CrossEntropy). Recent work on quantum self-supervised learning [81] directly employs the contrastive loss function for QuNN training, whereas in this work we identify the cross-entropy loss function as the better option and use it for QCLIP training.

Results on NISQ devices
In addition to the numerical simulation results reported in previous sections, we also implement a proof-of-concept QCLIP on real NISQ devices and report its performance to demonstrate the effectiveness of QCLIP. We use the IBM_Auckland quantum computer, which is a 27-qubit device with error rates of 0.022%, 1.164%, and 1.110% for 1Q-Gate, 2Q-Gate, and SPAM, respectively. Compared with the state-of-the-art devices reported in table 1, IBM_Auckland is a more practical NISQ device that is publicly available to average users. We adopt the pre-trained QCLIP model and implement it on IBM_Auckland using only 8 qubits.
We perform zero-shot inference and one-shot prompt learning on real devices and report the results in figures 10 and 11, respectively. Note that we exclude the fully fine-tuned linear probing on real devices due to its long training latency. In general, the performance of QCLIP on real devices decreases due to noisy qubits and imperfect control and measurement. Specifically, the QCLIP accuracy on zero-shot inference drops from 46.4% to 44.4% (i.e. the final accuracy for Real Q Zero-Shot in figure 10), while the performance on one-shot prompt learning decreases from 55.6% to 49.6% (i.e. the final accuracy for Real Q One-Shot in figure 11). However, the classical CLIP model only achieves accuracies of 39.0% and 46.2% for zero-shot inference and one-shot prompt learning, respectively. Therefore, a quantum advantage (up to 5.4%) in representation transferability is still preserved in the real-device results.

Conclusion
Current QML models mainly focused on supervised classification tasks using down-sampled input data with a very small scale, i.e. labeled images with a 4 × 4 or even 2 × 2 size.Such models failed to solve practical problems and show limited generalization and transferability to unseen downstream datasets.In this work, we propose to advance the flagship CLIP method by proposing QCLIP, a quantum CLIP framework, to improve the performance of QML algorithms on transfer representation learning tasks.The key idea is to leverage the quantum-enhanced transferability and generalization only efficiently accessible on quantum computers.However, current quantum computers are all NISQ devices, which can only support 50 ∼ 100 qubits and a limited number of quantum gate operations.In order to leverage the limited NISQ resources to perform meaningful tasks, QCLIP combines quantum computing resources with classical computing power in a hybrid quantum-classical fashion, where CaNNs are used to generate low-dimensional input embeddings in the classical feature space, and QuNNs are employed to enhance the model generalization in the quantum Hilbert space.We survey the mainstream QuNN implementation and study how different encoding methods, variational circuit ansatzes, and training configurations affect the final performance of the QCLIP model.We present a dense encoding method in this work, and also identify the optimal quantum circuit for each quantum component in QCLIP.
We implement a small-scale QCLIP and demonstrate that the proposed hybrid quantum-classical neural network can be successfully trained for representation learning. We evaluate the transfer representation learning capability of QCLIP against the classical CLIP model using different datasets. Experimental results on numerical simulation and on the NISQ IBM_Auckland quantum computer both show that the QCLIP model outperforms the classical CLIP model in all test cases.

classical CLIP with N_C = 16 trained on the same dataset, indicating the improved representation learning capability enhanced by QuNNs. The proposed dense encoding provides an average of 5.4% accuracy improvement compared to the baseline angle encoding. Moreover, increasing the width of the QuNN (i.e. N_Q) improves the QCLIP accuracy, demonstrating the scalability of our approach.
We also explored data re-uploading [82] and variational encoding [83], two recently proposed encoding techniques for improving QuNN performance. The key idea of data re-uploading is to repeatedly apply the classical-to-quantum encoder, E(x), before each parameterized VQC ansatz, U_k(θ_k). Variational encoding introduces trainable parameters into the classical-to-quantum encoder by defining a variational encoder function, E(x•θ), where the parameter set θ is pre-trained to produce faithful quantum representations in which data from different clusters are separated. We refer interested readers to [82, 83] for a more detailed explanation and demonstration.

outperforms single-qubit rotation-based angle encoding, likely due to the two-layer RY-U1 gate in dense encoding, enabling access to frequency spectra with two frequencies, in contrast to angle encoding's single frequency. However, it is important to consider that increasing encoding density also leads to higher training complexity. Considering the problem set used in this work, we find the two-layer RY-U1 dense encoding to be the optimal choice for our QML models.

Theoretical analysis of VQC circuit ansatzes [43] primarily explores the impact of circuit entanglement capacity on the expressivity of QML models. As of now, there is no universally agreed-upon optimal VQC design, and VQC circuits are typically designed empirically. However, there is a common consensus that adaptive and trainable entanglement capabilities can be more beneficial for QML algorithms than the fixed, maximized entanglement provided by fixed CNOT gates.
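The two encoding ideas discussed above can be illustrated with a small single-qubit state-vector simulation. This is a minimal sketch and not the circuits used in this work: the choice of RY(x) as the encoder E(x), the RY-RZ form of U_k(θ_k), the order in which two features are assigned to RY and U1, and all function names are assumptions made for illustration.

```python
import numpy as np


def ry(angle):
    """Single-qubit rotation about the Y axis."""
    c, s = np.cos(angle / 2), np.sin(angle / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)


def rz(angle):
    """Single-qubit rotation about the Z axis."""
    return np.array([[np.exp(-1j * angle / 2), 0],
                     [0, np.exp(1j * angle / 2)]], dtype=complex)


def u1(angle):
    """Phase gate diag(1, e^{i*angle})."""
    return np.array([[1, 0], [0, np.exp(1j * angle)]], dtype=complex)


def dense_encode(x1, x2):
    """Two features on one qubit: RY(x1) followed by U1(x2) (assumed feature order)."""
    return u1(x2) @ ry(x1)


def reuploading_state(x, thetas):
    """Data re-uploading: apply E(x) before every variational block U_k(theta_k).

    E(x) = RY(x) and U_k = RZ(theta_k[1]) RY(theta_k[0]) are illustrative choices.
    """
    state = np.array([1, 0], dtype=complex)  # start in |0>
    for theta in thetas:
        state = ry(x) @ state                        # encoder E(x), re-applied each layer
        state = rz(theta[1]) @ ry(theta[0]) @ state  # trainable block U_k(theta_k)
    return state


def expect_z(state):
    """Expectation value of Pauli-Z for a single-qubit state."""
    return float(abs(state[0]) ** 2 - abs(state[1]) ** 2)


state = reuploading_state(0.3, [(0.1, 0.2), (0.4, 0.5)])
print("<Z> after two re-uploading layers:", expect_z(state))
print("dense-encoded state:", dense_encode(np.pi / 2, np.pi) @ np.array([1, 0]))
```

In the dense-encoding function, the second feature enters only through the relative phase, which is why one qubit can carry two features instead of one.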
The final training loss is defined as the weighted sum of the above two losses. For batch training, the averaged loss is calculated using equation (D.3), where λ ∈ [0, 1] is a scaling hyperparameter.
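As a rough illustration of a batch-averaged weighted sum, the sketch below combines two per-sample loss vectors with a scaling factor λ. The convex form λ·a + (1 − λ)·b is an assumption, as are the function and argument names; the exact form of equation (D.3) and the definitions of the two losses are not reproduced here.

```python
import numpy as np


def combined_loss(loss_a, loss_b, lam):
    """Batch-averaged weighted sum of two per-sample losses.

    Assumes the convex combination lam * loss_a + (1 - lam) * loss_b,
    which is an illustrative choice and may differ from equation (D.3).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    loss_a = np.asarray(loss_a, dtype=float)
    loss_b = np.asarray(loss_b, dtype=float)
    per_sample = lam * loss_a + (1.0 - lam) * loss_b
    return float(per_sample.mean())  # average over the batch


# Toy batch of two samples with made-up loss values.
print(combined_loss([1.0, 3.0], [2.0, 4.0], lam=0.5))
```

Under this assumed form, λ = 1 keeps only the first loss and λ = 0 keeps only the second.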

Figure 1. Given a pre-trained model such as classical CLIP or QCLIP, we can transfer the learned visual representation via (a) zero-shot inference, (b) one-shot (or few-shot) prompt learning, or (c) fully fine-tuned linear probing. We use Cifar10 [51] as the downstream dataset and report the test accuracy of classical CLIP, QCLIP, and the accuracy improvement of QCLIP over CLIP (denoted as ∆) in the table.

Figure 2. A standard quantum neural network.

Figure 3. The overview of the proposed QCLIP framework. (a) QCLIP jointly trains an image encoder and a text encoder, which are both implemented as hybrid classical-QuNNs. For simplicity, we only show a QuNN with a single-layer VQC in this example, highlighted by a red rectangle. (b) At test time, the learned QCLIP model can be used for (1) zero-shot inference, (2) one-shot quantum prompt learning, or linear probing, which is omitted in this figure.

Figure 4. The proposed QuNN circuit for quantum image and text encoders. Note that the shaded gates are variational gates with trainable parameters.

Figure 5. The proposed QuNN circuit for the quantum prompt adapter. Note that the shaded gates are variational gates with trainable parameters.

Figure 6. The process of QCLIP on representation learning, with the line indicating the mean and the shaded area representing the deviation.

Figure 10. QCLIP performance on zero-shot inference using the IBM_Auckland quantum computer.

Figure 11. QCLIP performance on one-shot prompt learning using the IBM_Auckland quantum computer.

Table 1. A summary of two state-of-the-art quantum computers (1Q-Gate: one-qubit gate; 2Q-Gate: two-qubit gate; SPAM: state preparation and measurement).

• A novel QML framework for learning transferable visual representation. Instead of training a supervised QML model for predicting human annotations, we advance the flagship Contrastive Language-Image Pre-Training (CLIP) method [52] by proposing QCLIP, a quantum CLIP framework, which enjoys quantum-enhanced transferability and generalization only efficiently accessible on quantum computers. QCLIP combines limited NISQ resources and classical computing power to perform meaningful tasks, where CaNNs are used to generate low-dimensional data embeddings in classical feature space, while quantum neural networks (QuNNs) are exploited to enhance the model generalization in an exponentially large quantum Hilbert space (section 3.1).
• Quantum encoding methods and QuNN circuit ansatzes specialized for transferable visual representation learning. We investigate various encoding methods and circuit ansatzes in the proposed QCLIP framework and identify the optimal candidate circuit ansatz for each quantum component (section 3.2). We implement QCLIP on NISQ devices and carefully study how different training configurations affect final model performance. We provide a detailed training procedure for QCLIP (section 3.3).
• High-performance visual representation transfer on NISQ devices. We demonstrate that the hybrid QCLIP model can be successfully trained for representation learning (section 4.1). We evaluate the representation transferability of QCLIP using all mainstream methods, including zero-shot inference, one-shot prompt learning, and linear probing, and show that QCLIP outperforms the classical CLIP model on various datasets (section 4.2). The experimental setup and numerical results are briefly summarized in figure 1. We also provide experimental results on different training configurations (section 4.3) and on the NISQ IBM_Auckland quantum computer (section 4.4). Our results show that the proposed QCLIP model outperforms the classical CLIP model in all test cases.
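Of the transfer methods listed above, zero-shot inference is the simplest to sketch. In CLIP-style models, an image is assigned to the class whose text-prompt embedding has the highest cosine similarity with the image embedding. The snippet below shows only that final step on plain vectors; the function name and toy embeddings are hypothetical, and it makes no claim about how QCLIP produces its embeddings.

```python
import numpy as np


def zero_shot_predict(image_emb, text_embs):
    """Return the index of the best-matching class prompt and all cosine similarities.

    image_emb: 1-D embedding of one image.
    text_embs: 2-D array with one row per class prompt embedding.
    """
    image_emb = np.asarray(image_emb, dtype=float)
    text_embs = np.asarray(text_embs, dtype=float)
    # L2-normalize so that dot products become cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = text_embs @ image_emb
    return int(np.argmax(sims)), sims


# Toy 2-D embeddings for three hypothetical class prompts.
label, sims = zero_shot_predict([1.0, 0.1], [[0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]])
print("predicted class index:", label)
```

No downstream training is involved in this step, which is what distinguishes it from one-shot prompt learning and linear probing.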

Table 2. Performance comparison between QCLIP and CLIP on representation transfer.
Figure 7. QCLIP performance results on few-shot learning.

Table C1. QCLIP performance on one-shot prompt learning using different encoding methods and VQC circuit ansatzes.