Parallel Quantum Computation Approach for Quantum Deep Learning and Classical-Quantum Models

Interest in quantum computing and artificial intelligence has grown steadily in recent years, driven by the recognition that a computer is a physical system that can exploit quantum mechanics to solve problems faster, more efficiently, and more accurately. We propose to probe this potential through an architecture in which different quantum models are computed in parallel. In this work, we present encouraging results showing how Quantum Processing Units can be used analogously to Graphics Processing Units to accelerate algorithms and improve the performance of machine learning models, demonstrated through three experiments. The first experiment reproduces a parity function, showing that the convergence of a given quantum model improves significantly when it is computed in parallel. In the second and third experiments, we implemented an image classification problem by training quantum neural networks and using pre-trained models, comparing their performance with the same experiments carried out with parallel quantum computation. We obtained very similar accuracies, close to 100%, while significantly improving the execution time, approximately 15 times faster in the best case. We also propose, as a proof of concept, an alternative approach to emotion recognition problems using optimization algorithms, and show how execution times can benefit from parallel quantum computation. To this end, we use tools such as the cross-platform software library PennyLane and Amazon Web Services, accessing high-end simulators through Amazon Braket and the IBM Quantum Experience.


Introduction
Parallel computing is used nowadays to significantly improve the performance of many systems, models, and algorithms, thanks to its capacity to complete large and complex tasks in a reasonable time [1][2][3][4]. In recent years, there have been outstanding achievements showing how Quantum Computing (QC) can solve very complex problems that are impossible for a Classical Computer (CC) to solve in a suitable time [5]. In addition, there are ways to exploit these capabilities to enhance classical applications and to perform specific tasks at a pace that a CC cannot keep up with; this is called quantum advantage and is one of the leading research interests in the field today. Thus, there is plenty of hope in how QC can improve existing technologies, one highlight being the case of machine learning [6].
Quantum machine learning is a field of study that investigates the interaction of concepts from quantum computation and machine learning [7,8]. It has seen great achievements in the last few years, finding a place in implementations such as classification [9][10][11] and in applications across many fields, from image and emotion recognition to cybersecurity systems [12][13][14][15][16].
The limits of what machines can learn have always been defined by the computer hardware we run our algorithms on. For example, parallel Graphics Processing Unit (GPU) clusters enable modern-day deep learning with neural networks.
Quantum machine learning extends the hardware pool for machine learning with an entirely new type of computing device: the quantum computer. The comparison with GPUs makes sense, since both are alternatives for speeding up computationally expensive tasks. However, the scalability of the problems that can be addressed is quite different. To the best of our knowledge, this is the first study carried out to unveil the potential of parallel quantum computation on Noisy Intermediate-Scale Quantum (NISQ) [17,18] hardware for near-term applications such as quantum deep learning.

Framework and Materials
For the development of this work and the presentation of results, we mainly use the PennyLane framework, a cross-platform Python library for differentiable programming of quantum computers that allows us to build and optimize hybrid computations [19]. We can perform these executions on real quantum hardware or on a classical simulator; both options are available today on platforms from IBM and Amazon, among others.
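To make this concrete, the following minimal sketch (our own illustrative example, not code from the experiments) builds a differentiable quantum node on a local simulator and optimizes its parameters with one of PennyLane's built-in optimizers; swapping the device string for an IBM or Braket backend leaves the rest of the code unchanged:

```python
import pennylane as qml
from pennylane import numpy as np

# Local state-vector simulator with two qubits
dev = qml.device("default.qubit", wires=2)

@qml.qnode(dev)
def circuit(params):
    qml.RY(params[0], wires=0)          # parameterized rotation
    qml.CNOT(wires=[0, 1])              # entangling gate
    return qml.expval(qml.PauliZ(1))    # expectation value to minimize

params = np.array([0.1], requires_grad=True)
opt = qml.GradientDescentOptimizer(stepsize=0.4)
for _ in range(50):
    params = opt.step(circuit, params)  # gradient computed via device executions
```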

Parallel Quantum Computation Architecture
Quantum computation is known for its potential to improve performance in the execution of tasks thanks to a physical phenomenon called quantum parallelism, which consists of the capability of quantum systems to perform many evaluations simultaneously thanks to the property of superposition [20]. However, for this implementation, we extend this concept to parallel quantum computation. The difference is that, regardless of superposition, the evaluation of circuits takes place in parallel on two or more Quantum Processing Units (QPUs), or on a simulator that supports distributed parallel execution of circuit batches, such as the SV1 simulator by Amazon Braket, the one mainly used in this work. Theoretically, there is also an analogy between parallel computing and parallel quantum computing, proposed in [21], which defines the quantum parallel complexity class QNC in Equation (1), the quantum version of the parallel complexity class NC. It states that, to design a shallow parallel circuit for a given quantum operator, we want to be able to use additional quantum bits (qubits) or "ancillae" as workspace in the computation, equivalent to additional processors in a parallel quantum computer. Also, as presented in [22], it is possible to emulate parallel quantum computation to process signals in parallel, mainly by encoding them into a superposition state of two signals, ψ and ϕ (2).
For our implementation, we used an architecture that systematically demonstrates the advantage of parallel quantum computation on different quantum processors, across various models and data types.
The circuit in Figure 1 shows the generic quantum classifier mainly used in the experiments of this work. It consists of a minimum of two layers of single-qubit rotations with CNOT gates as entanglers (3), followed by a measurement layer (4) that varies depending on the experiment, based on the model presented in [23].

Figure 1. General representation of the circuit ansatz used in each experiment. There are four qubits and two layers in the minimum configuration and 24 qubits in the maximum, with at least one measurement in the Z basis of the Bloch sphere on one qubit. Note that this is a general representation, which may vary depending on the experiment.
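A minimal sketch of this ansatz in PennyLane is shown below, using the four-qubit, two-layer minimum configuration; the exact input encoding, qubit count, and measured wires vary per experiment:

```python
import pennylane as qml

n_qubits = 4                      # minimum configuration of Figure 1
dev = qml.device("default.qubit", wires=n_qubits)

def layer(w):
    # Single-qubit rotations: three Euler angles per qubit
    for i in range(n_qubits):
        qml.Rot(w[i, 0], w[i, 1], w[i, 2], wires=i)
    # Ring of CNOT gates acting as entanglers
    for i in range(n_qubits):
        qml.CNOT(wires=[i, (i + 1) % n_qubits])

@qml.qnode(dev)
def classifier(weights):
    for w in weights:                      # weights: (n_layers, n_qubits, 3)
        layer(w)
    return qml.expval(qml.PauliZ(0))       # Z-basis measurement on one qubit
```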
As stated above, the parallel execution of the circuit shown in Figure 1 occurs in two ways. First, a given circuit runs asynchronously on many devices, where each device can be a simulator or an actual QPU, allowing the collection of shots on the circuit to be more efficient. Second, the same circuit is executed multiple times in parallel, in batches, on one device. The SV1 simulator of Amazon Braket allows us to execute up to 20 circuits in parallel. We can harness this capability during circuit training, which requires a large number of circuit variations to be executed. Calculating the gradient involves multiple device executions: for each trainable parameter, we must typically run the circuit on the device more than once, and practical applications involve many trainable parameters. The result is a vast number of device executions for each optimization step.
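As an illustration, the sketch below shows how such a parallel-capable SV1 device can be instantiated through the PennyLane-Braket plugin; the S3 bucket name is a placeholder, and the option names follow the plugin's `braket.aws.qubit` device as we understand it:

```python
import pennylane as qml

# Public ARN of the Amazon Braket SV1 state-vector simulator
SV1_ARN = "arn:aws:braket:::device/quantum-simulator/amazon/sv1"

dev = qml.device(
    "braket.aws.qubit",
    device_arn=SV1_ARN,
    wires=4,
    shots=1000,
    s3_destination_folder=("my-results-bucket", "sv1"),  # placeholder bucket/prefix
    parallel=True,     # execute circuit batches (e.g., all gradient terms) in parallel
    max_parallel=20,   # SV1 allows up to 20 concurrent circuits
)
```

With `parallel=True`, the batch of circuits generated by each gradient evaluation is dispatched concurrently rather than sequentially, which is where the training speedups reported below come from.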
This approach allows us to compare each execution and its respective result across three experiments, in which we see the ability of parallel quantum computation to improve quantum deep learning algorithms. These experiments were carried out with the same parameters on every device to prevent bias in the simulations. Furthermore, we used 1000 shots per cost-function evaluation in every experiment. In the following sections, results labeled "unparalleled" refer to sequential execution of the circuits in each training process.
It is important to emphasize that the devices used in the experiments were all simulators. This means that we have not considered the influence of noise on the models, because we wanted to keep each experiment and each device as comparable as possible and to avoid drawing conclusions from a single experience rather than from the general behavior. The parameters of this case study are the same for each experiment, but be aware that noise is an external factor that influences the behavior and performance of current quantum devices.

Experiment 1: Parity function
In this experiment, we demonstrate that our variational quantum classifier can reproduce the parity function (5) more efficiently through parallel execution on two SV1 simulators by Amazon Braket during the training process.
f : x ∈ {0, 1}^⊗n → y = 1 if x contains an odd number of ones, 0 otherwise. (5)

This optimization example demonstrates how to encode binary inputs into the initial state of the variational circuit, which is simply a computational basis state. We can see in Figures 2 and 3 that, for the parallel execution, the convergence of the model begins to improve from approximately the 6th iteration. For the cost function in Figure 2, we observed a significant improvement for the parallel execution, reaching convergence in approximately 10 iterations, a considerable advantage over the 25 iterations of the unparalleled implementation of the same model. For the accuracy in Figure 3, although the model behaves similarly, there is a difference of 5 iterations between the parallel and unparalleled executions by the time each of them reaches maximum convergence. As we can see in Table 1, the parallel execution achieved convergence in 1.5 times fewer iterations in terms of accuracy. This result is relevant because it shows how parallel execution can improve the efficiency and performance of quantum models. Furthermore, if we scale this experiment to a problem of higher complexity, that advantage could become very significant.

Figure 3. Accuracy during the training process of experiment 1.
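A sketch of the variational classifier used in this experiment, assuming basis-state encoding of the bit strings, a square loss, and PennyLane's StronglyEntanglingLayers template as the variational ansatz (the optimizer and step size are illustrative choices):

```python
import pennylane as qml
from pennylane import numpy as np

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits, shots=1000)

@qml.qnode(dev)
def circuit(weights, x):
    qml.BasisState(x, wires=range(n_qubits))          # encode the binary input
    qml.StronglyEntanglingLayers(weights, wires=range(n_qubits))
    return qml.expval(qml.PauliZ(0))

# All 4-bit strings; parity labels mapped to {+1, -1} to match <Z>
X = np.array([[int(b) for b in f"{i:04b}"] for i in range(16)])
Y = np.array([1 - 2 * (int(sum(x)) % 2) for x in X])  # odd number of ones -> -1

def cost(weights):
    loss = 0.0
    for x, y in zip(X, Y):
        loss = loss + (y - circuit(weights, x)) ** 2  # square loss
    return loss / len(X)

weights = np.array(0.01 * np.random.randn(2, n_qubits, 3), requires_grad=True)
opt = qml.GradientDescentOptimizer(stepsize=0.5)
for _ in range(25):
    weights = opt.step(cost, weights)
```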

Experiment 2: Hymenoptera Dataset Classification
In this experiment, we used the scheme presented in [13], where we numerically trained and tested the model, a quantum model inspired by the official PyTorch tutorial on classical transfer learning, using PennyLane with the PyTorch interface [19]. In this case, the parallel execution occurs on two devices. In Figure 4 we can see a preview of the data, which consists of high-resolution images with two classes, ants and bees. The data was encoded into quantum states via local embedding: all qubits are first initialized in a balanced superposition of up and down states, and then rotated according to the input parameters.

Figure 5 shows the evolution of the loss landscape for each model through the training process. First, we trained the variational parameters of the model for 30 epochs over the training dataset with a batch size of four and an initial learning rate of η = 0.0004, which was successively reduced by a factor of 0.1 every ten epochs [13]. Then, after each epoch, the model was validated against the test dataset, obtaining a maximum accuracy of 0.9477. We can also see in Figure 5 that the loss landscape does not vary much between the unparalleled and the parallel execution; the reason is that the model's hyperparameters were optimized successfully in both cases, with a small loss peak at the end of the training process for the parallel execution. However, Table 2 shows that the significant difference lies in the execution time, where we observed a remarkable speedup of 7.52 times.
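A condensed sketch of this "dressed quantum circuit" with PennyLane's PyTorch layer wrapper is shown below; the entangler template, layer count, and activation are illustrative choices, while the 512-dimensional pre-net input matches the ResNet18 feature vector used in [13]:

```python
import torch
import pennylane as qml

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev, interface="torch")
def quantum_net(inputs, weights):
    # Local embedding: balanced superposition, then rotations by the inputs
    for i in range(n_qubits):
        qml.Hadamard(wires=i)
    qml.AngleEmbedding(inputs, wires=range(n_qubits), rotation="Y")
    qml.BasicEntanglerLayers(weights, wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(i)) for i in range(n_qubits)]

qlayer = qml.qnn.TorchLayer(quantum_net, weight_shapes={"weights": (6, n_qubits)})

# Dressed quantum circuit: classical pre-net -> quantum layer -> classical post-net
model = torch.nn.Sequential(
    torch.nn.Linear(512, n_qubits),   # maps ResNet18 features to rotation angles
    torch.nn.Tanh(),                  # bounds the angles
    qlayer,
    torch.nn.Linear(n_qubits, 2),     # two-class output (ants vs. bees)
)
```

In the full experiment, a block of this form replaces the final fully connected layer of a frozen, pre-trained ResNet18, so only the pre-net, quantum weights, and post-net are trained.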

Experiment 3: CIFAR Transfer Learning Classification
In this experiment, we use the pre-trained model and implementation presented in [13] and evaluate its performance with the same architecture as Section 3.2, but with a different dataset: the standard CIFAR-10 restricted to two classes, cats and dogs, as shown in Figure 6. The data was encoded into quantum states in the same way as in the previous section, via local embedding: all qubits are first initialized in a balanced superposition of up and down states, and then rotated according to the input parameters, determined by the matrix of the image previously processed for use in the model.
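As an illustration, the two-class restriction can be obtained with torchvision as sketched below (paths, batch size, and the ImageNet-style normalization for a pre-trained backbone are our own placeholder choices):

```python
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(224),                       # match the pre-trained network input
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],   # standard ImageNet statistics
                         [0.229, 0.224, 0.225]),
])

test_set = datasets.CIFAR10(root="./data", train=False,
                            download=True, transform=transform)

# CIFAR-10 label indices: cat = 3, dog = 5; keep only these two classes
keep = [i for i, y in enumerate(test_set.targets) if y in (3, 5)]
loader = torch.utils.data.DataLoader(
    torch.utils.data.Subset(test_set, keep), batch_size=4)
```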
As we can see in Table 3, the results are nearly identical, as expected, except for the execution time, where we observed a speedup of approximately 15.17 times. Also, in Figure 7 we see the loss landscape of the model, which was the same for both implementations since the model is pre-trained. However, we expect that, similarly to the experiment in Section 3.2, there will be variations in the loss of the model, since parallel computation allows better performance as hyperparameter optimization is done more efficiently.

Proof of Concept: Graph Optimization
In this proof of concept, we evaluate the feasibility of using the implementation presented in [14] with a parallel architecture, to fully exploit the capabilities of quantum computing in graph optimization as a near-term application. The implementation presented in [14] consists of a quantum classifier for facial expression recognition using graphs. Each graph was encoded in a quantum state by loading the elements of its adjacency vector into the amplitudes of the quantum state, where |G⟩ is the quantum state constructed in Equation (6) and G is the graph.
In Figure 8 we can see an example of a human face whose landmarks have been recognized and drawn as points; these act as nodes in the graph associated with each class. As this is not an experimental reproduction, we scale the problem down for this proof of concept: we take 20 nodes, which represent both the qubits of the architecture described in Figure 1 and the landmark points of the main facial feature, and build the graph from them. In this case, since we do not classify the expression of the face, the graph is arbitrary; it serves the purpose of this proof of concept, which is to demonstrate the efficiency of parallel circuit evaluation and how it can be exploited for quantum deep learning models.

We address an optimization problem called maximum cut (MaxCut), which consists of finding a partition of the graph's nodes into two sets such that the number of edges with endpoints in different sets is maximized. MaxCut is NP-hard in terms of computational complexity, and we chose it precisely because it is very computationally demanding. We solve it using the Quantum Approximate Optimization Algorithm (QAOA), a widely studied method for solving combinatorial optimization problems on NISQ devices. QAOA begins by associating the optimization problem with a cost Hamiltonian H_C (7) and choosing a mixer Hamiltonian H_M. It proceeds by repeatedly applying multiple layers of the unitaries exp(−iγ_i H_C) and exp(−iα_i H_M) with controllable parameters γ_i and α_i. In our approach, the algorithm only requires us to pass it a graph; this graph, shown in Figure 9, represents the landmark points of the facial feature, and on it we run QAOA to solve MaxCut.

Figure 9. Arbitrary graph of 20 nodes representing the landmark points.

Table 4 shows the summary of results of this proof of concept, where we can see how valuable it would be to fully address graph optimization problems in quantum machine learning applications by running quantum circuits in parallel. We observed a speedup of about 12.8 times for the QAOA in question.

Figure 10. Highlighted edges and nodes resulting from the maximum cut.
In Figure 10 we can see the nodes resulting from the QAOA optimization of the maximum cut, where the violet nodes represent the set S, the blue nodes represent the set T, and the edges shared between the two sets are highlighted in magenta.
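A compact sketch of this QAOA pipeline in PennyLane is shown below. For readability, we use a small random placeholder graph instead of the 20-node landmark graph of Figure 9 (a 20-qubit state-vector simulation runs, but slowly, on commodity hardware); depth, step size, and iteration count are illustrative:

```python
import networkx as nx
import pennylane as qml
from pennylane import numpy as np

graph = nx.gnm_random_graph(6, 9, seed=7)   # placeholder for the landmark graph
n_wires = graph.number_of_nodes()
depth = 2                                   # number of QAOA layers

# Cost and mixer Hamiltonians H_C and H_M for MaxCut on this graph
cost_h, mixer_h = qml.qaoa.maxcut(graph)

def qaoa_layer(gamma, alpha):
    qml.qaoa.cost_layer(gamma, cost_h)      # applies exp(-i * gamma * H_C)
    qml.qaoa.mixer_layer(alpha, mixer_h)    # applies exp(-i * alpha * H_M)

dev = qml.device("default.qubit", wires=n_wires)

@qml.qnode(dev)
def cost_fn(params):
    for w in range(n_wires):
        qml.Hadamard(wires=w)               # start in the uniform superposition
    qml.layer(qaoa_layer, depth, params[0], params[1])
    return qml.expval(cost_h)

params = np.array([[0.5] * depth, [0.5] * depth], requires_grad=True)
opt = qml.GradientDescentOptimizer(stepsize=0.4)
for _ in range(40):
    params = opt.step(cost_fn, params)      # minimizing <H_C> maximizes the cut
```

Sampling the optimized circuit in the computational basis then yields bit strings whose 0/1 values define the two node sets S and T highlighted in Figure 10.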

Conclusions and Future Work
This work presented an implementation of parallel quantum computation for near-term applications. We demonstrated the efficiency of this methodology by conducting three experiments, and we proposed a proof of concept. However, although the results are encouraging, quantum parallelism is still very limited by the hardware available today, which constrains our ability to demonstrate the true potential of quantum computers and how they process information. Still, as shown in this paper, there are viable alternatives to improve the way quantum algorithms work in the rapidly advancing NISQ era. For future work, we aim to fully address the proof of concept introduced in this paper by creating a model that performs graph optimization for quantum machine learning tasks through a parallel quantum architecture. We also aim to introduce noise models into the simulations to see how the results are affected by this parameter, in order to get a closer look at the behavior of quantum models on real hardware.