Full reconstruction of a 14-qubit state within four hours

Full quantum state tomography (FQST) plays a unique role in estimating the state of a quantum system without \emph{a priori} knowledge or assumptions. Unfortunately, since FQST requires informationally (over)complete measurements, both the number of measurement bases and the computational complexity of data processing grow exponentially with the size of the quantum system. A 14-qubit entangled state has already been prepared experimentally in an ion trap, yet the data processing required for FQST of a 14-qubit state has remained far beyond practical reach. In this paper, the computational capability of FQST is pushed forward to reconstruct a 14-qubit state with a run time of only 3.35 hours using the linear regression estimation (LRE) algorithm, even when informationally overcomplete Pauli measurements are employed. The computational complexity of the LRE algorithm is first reduced from $O(10^{19})$ to $O(10^{15})$ for a 14-qubit state by dropping all the zero elements, and the computation is further accelerated by fully exploiting the parallelism of the LRE algorithm with parallel graphics processing unit (GPU) programming. Our result can play an important role in quantum information technologies with large quantum systems.

To reconstruct quantum states about which we have no a priori information, we can resort to informationally (over)complete measurements. Quantum state tomography [11,12] using informationally (over)complete measurements is referred to as full quantum state tomography (FQST) in this paper. As there are $d^2 - 1$ independent parameters characterizing the density matrix of a $d$-dimensional quantum state, FQST needs at least $d^2 - 1$ measurement operators. Note that the dimension $d$ grows exponentially with the size of the quantum system. Thus, both the number of measurements and the computational complexity of data processing in FQST suffer the curse of dimensionality. Moreover, the time for data processing has been found to be much longer than the time required to implement the measurements. As reported in [13][14][15], reconstructing an 8-qubit state using maximum likelihood estimation (MLE) [1,16] took almost a week, while the measurement time was only 10 hours [13]. With the breakthroughs and rapid development of experimental techniques, the size of quantum systems with entanglement or coherence prepared in the laboratory has already grown from 2 qubits [17,18] to 10 qubits in photonic systems (e.g., 8 qubits in [19,20] and 10 qubits in [21]), 12 qubits in NMR [22] and even 14 qubits [23] in ion traps, overwhelming the capability of available full quantum state tomography.
Significant effort has been devoted to improving the performance of quantum state tomography [24][25][26][27][28]. Some methods focus on extracting only partial information of interest. For example, entanglement witnesses [29,30] can detect entanglement with few measurements; direct purity estimation [31] and fidelity estimation [13,32] were used to obtain the purity of the prepared state and its fidelity with the ideal state; permutationally invariant tomography [33] was used to extract information that does not change under permutation. Other approaches perform quantum state tomography with a priori knowledge or assumptions. For example, compressed sensing [34][35][36] can perform quantum state tomography for states with low rank. If a quantum state is a matrix product state, efficient tomography algorithms can be developed [37,38]. However, these methods either extract only partial information or require prior information about the state to be reconstructed.
Several approaches have also been presented to reduce the computational complexity of the reconstruction algorithms in FQST [14,39]. For example, the authors in [14] developed an algorithm that can efficiently reconstruct a 9-qubit state in about five minutes. However, when the size of the quantum system increases by one qubit, the running time increases by a factor of more than ten according to figure 1 in [14], which would result in years of computation time for a 14-qubit state. In [39], a linear regression estimation (LRE) algorithm was proposed with a much lower computational complexity than that of MLE for quantum state tomography [40]. In this paper, we push the data processing capability of FQST to a 14-qubit state using informationally overcomplete Pauli measurements by optimizing the LRE algorithm in [39] and employing parallel programming on a graphics processing unit (GPU).
Owing to their experimental ease and high estimation accuracy, Pauli measurements are the preferred choice in FQST experiments, although they are informationally overcomplete. LRE was demonstrated to be much more efficient than MLE in FQST [39]. However, in order to reconstruct a 14-qubit state efficiently, we need to further optimize the LRE method. The efficiency of FQST in this paper refers solely to the reduced computational complexity of data processing. Our first optimization is based on the fact that, under a proper choice of representation, the representation of the Pauli bases in the algorithm has very few nonzero elements. Furthermore, since the simple LRE algorithm involves only additions and multiplications of vectors and matrices, it is naturally suited to acceleration by parallel programming. For parallel matrix computation, a GPU performs much better than a central processing unit (CPU). Hence, we use GPU parallel programming to realize the LRE algorithm and enhance the FQST capability. Compared with the result in [39], in this paper we optimize the LRE method to reduce its computational complexity and storage requirement, and implement the optimized LRE method using GPU parallel programming. The results show a significant enhancement of the FQST capability.
The rest of the paper is organized as follows. Section 2 provides a brief introduction to the three steps of LRE and analyzes the computational complexity and the storage requirement in the case of informationally complete measurements. In section 3, the computational complexity and storage requirement are discussed for Pauli measurements. In section 4, the first two steps of the LRE algorithm are realized using parallel GPU programming, and the run time with GPU acceleration is compared with that of CPU programming. In section 5, the estimation error of reconstructing a maximally-mixed state is analyzed in terms of the squared Hilbert-Schmidt (HS) distance and the infidelity. Section 6 presents the summary and prospects of this paper.

Linear regression estimation algorithm
In linear regression estimation (LRE) for a $d$-dimensional quantum system, the density matrix and the measurement bases take a vector form once a representation basis set of Hermitian operators is chosen. The whole LRE algorithm consists of three steps: step (i), obtain an estimate of the parameter vector $\Theta$ from the measurement data; step (ii), construct a Hermitian matrix $\hat{\mu}$ satisfying $\mathrm{Tr}\,\hat{\mu} = 1$ from the estimate of $\Theta$; and step (iii), find a physical density matrix $\hat{\rho}$ close to $\hat{\mu}$.
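The structure assumed throughout this section can be sketched as follows; the notation ($\Omega_i$ for the orthonormal representation basis, $\Pi_j$ for the measurement operators, $X$ for the design matrix) is ours, following the construction in [39], and is only a sketch rather than the paper's numbered equations:

```latex
% Sketch of the linear model underlying LRE (our notation, following [39]).
\rho = \sum_{i=0}^{d^{2}-1} \theta_i\,\Omega_i ,
\qquad \theta_i = \operatorname{Tr}(\rho\,\Omega_i),
\qquad \operatorname{Tr}(\Omega_i\,\Omega_j) = \delta_{ij};
\\[4pt]
p_j = \operatorname{Tr}(\Pi_j\,\rho) = \sum_i x_{ji}\,\theta_i ,
\qquad x_{ji} = \operatorname{Tr}(\Pi_j\,\Omega_i),
\qquad\text{i.e.}\quad \mathbb{E}\,[\hat{Y}] = X\,\Theta .
```

Step (i) then amounts to ordinary least squares on this linear model for the measured frequencies $\hat{Y}$.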
In step (i), a least-squares estimate $\hat{\Theta}_{\mathrm{LS}}$, which ignores the positivity constraint on quantum states, is given by the solution $\hat{\Theta}_{\mathrm{LS}} = (X^{\top}X)^{-1}X^{\top}\hat{Y}$ in (3), where $X$ collects the vectorized measurement bases and $\hat{Y}$ the measured frequencies. The computational complexity of this step is $O(d^4)$. For a 14-qubit state, $d = 2^{14}$ and the computational complexity is $\sim 10^{16}$.
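As an illustration only (not the authors' code), the following is a minimal single-qubit sketch of step (i), assuming the normalized Pauli representation basis $\{I,\sigma_x,\sigma_y,\sigma_z\}/\sqrt{2}$ and the least-squares form above; all names in the snippet are ours:

```python
import numpy as np

# Normalized Pauli representation basis {I, sx, sy, sz} / sqrt(2)
I2 = np.eye(2)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
basis = [m / np.sqrt(2) for m in (I2, sx, sy, sz)]

# The six single-qubit Pauli projectors (eigenstates of sx, sy, sz)
projectors = [(I2 + sgn * s) / 2 for s in (sx, sy, sz) for sgn in (+1, -1)]

# Design matrix X: row j holds the coefficients of projector j in the basis
X = np.array([[np.real(np.trace(P @ W)) for W in basis] for P in projectors])

# Simulated outcome frequencies for a true state (here a pure |0> state);
# outcomes are sampled independently per projector, purely for illustration.
rho_true = np.array([[1, 0], [0, 0]], dtype=complex)
rng = np.random.default_rng(0)
N0 = 10_000                        # copies per measurement basis
probs = np.array([np.real(np.trace(P @ rho_true)) for P in projectors])
y_hat = rng.binomial(N0, probs) / N0

# Step (i): least-squares estimate of Theta (no positivity constraint yet)
theta_ls = np.linalg.solve(X.T @ X, X.T @ y_hat)
print("diag(X^T X) =", np.round(np.diag(X.T @ X), 6))   # -> [3, 1, 1, 1]
print("Theta_LS    =", np.round(theta_ls, 3))
```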
In step (ii), based on the solution $\hat{\Theta}_{\mathrm{LS}}$ of (3) in step (i), we obtain a Hermitian matrix $\hat{\mu}$ with $\mathrm{Tr}\,\hat{\mu} = 1$ via (4). The computational complexity of this step is also $O(d^4)$. The state estimate $\hat{\mu}$ obtained in (4) may have negative eigenvalues and thus be nonphysical, owing to the randomness of the finite measurement results. In step (iii), a proper method is needed to pull $\hat{\mu}$ back to a physical state. In this step, we use the fast algorithm proposed in [14], where the physical estimate $\hat{\rho}$ is chosen as the density matrix closest to $\hat{\mu}$ under the matrix 2-norm. According to [14], the computational complexity of this step is $O(d^3)$. It is clear that the computational complexity of LRE is dominated by the first two steps, which is $O(d^4)$; for a 14-qubit state, this is $\sim 10^{16}$. In terms of storage, $O(d^4)$ and $O(d^2)$ bytes are required to store all the measurement bases and the measurement results, respectively, for an informationally complete measurement set. For a 14-qubit state, this amounts to tens of thousands of terabytes, which is beyond the capability of practical applications. In the following, we develop a method to reduce both the computational complexity and the storage requirement.
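The eigenvalue-truncation idea behind step (iii) can be sketched as follows; this is our simplified rendering of the procedure described in [14], not the authors' implementation, and the helper name project_to_physical is ours:

```python
import numpy as np

def project_to_physical(mu):
    """Return a density matrix close to the unit-trace Hermitian matrix mu
    in the Hilbert-Schmidt (2-norm) sense, in the spirit of [14]: zero out
    the most negative eigenvalues and spread the removed weight uniformly
    over the remaining ones."""
    evals, evecs = np.linalg.eigh(mu)            # ascending eigenvalues
    evals, evecs = evals[::-1], evecs[:, ::-1]   # sort descending
    d = len(evals)
    new = np.zeros(d)
    acc = 0.0                                    # accumulated negative weight
    i = d
    while i > 0 and evals[i - 1] + acc / i < 0:
        acc += evals[i - 1]
        i -= 1
    new[:i] = evals[:i] + acc / i
    return (evecs[:, :i] * new[:i]) @ evecs[:, :i].conj().T

# Example: a unit-trace Hermitian matrix with one negative eigenvalue
mu = np.array([[0.6, 0.55], [0.55, 0.4]])
rho = project_to_physical(mu)
print(np.round(np.linalg.eigvalsh(rho), 6))      # non-negative, sums to 1
```

The dominant cost is a single eigendecomposition, consistent with the $O(d^3)$ complexity quoted above.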

Computational complexity and storage with Pauli measurements
Pauli measurements are a good choice for extracting information in quantum state tomography of $n$-qubit systems ($d = 2^n$) because of not only their experimental ease but also their ability to achieve a high level of accuracy. With all possible combinations of Pauli measurements for $n$-qubit systems, the total number of measurement bases is $M = 6^n$. Without optimization, this informationally overcomplete measurement set further increases the computational complexity of step (i) to $\sim 10^{19}$ for a 14-qubit state and also increases the storage requirement from $\sim 10^{16}$ to $\sim 10^{19}$.
The computational complexity and storage requirement can be greatly reduced because many terms in (3) are zero. For a single qubit, under a proper choice of the representation basis, the vector representation $G^{(j)}$ of each Pauli measurement basis has only two nonzero elements, and the term $X^{\top}X$ is calculated to be a diagonal matrix with $X^{\top}X = \mathrm{diag}(3, 1, 1, 1)$. In the $n$-qubit scenario, since the measurement bases are tensor products of those for single qubits, $G^{(j)}$ and $X^{\top}X$ are also tensor products of their single-qubit counterparts. Thus, there are only $2^n$ nonzero elements in $G^{(j)}$, instead of the original $4^n$, and $X^{\top}X$ is diagonal. Dropping all zero elements reduces the computational complexity of step (i) from $O(24^n)$ ($\sim 10^{19}$ for a 14-qubit state) to $O(12^n)$ ($\sim 10^{15}$), and reduces the storage requirement accordingly, since only the nonzero elements and their locations need to be stored.
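These counting arguments can be checked numerically with a small sketch of ours, again assuming the normalized Pauli representation basis:

```python
import numpy as np

I2 = np.eye(2)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
basis = [m / np.sqrt(2) for m in (I2, sx, sy, sz)]
projectors = [(I2 + sgn * s) / 2 for s in (sx, sy, sz) for sgn in (+1, -1)]

# Representation vector of each single-qubit Pauli projector: 2 nonzero entries
X = np.array([[np.real(np.trace(P @ W)) for W in basis] for P in projectors])
print((np.abs(X) > 1e-12).sum(axis=1))       # -> [2 2 2 2 2 2]
print(np.round(X.T @ X, 12))                 # -> diag(3, 1, 1, 1)

# Orders of magnitude for n = 14 qubits
n = 14
print(f"naive step (i) cost  ~ 6^n * 4^n = {6.0**n * 4.0**n:.1e}")   # ~1e19
print(f"sparse step (i) cost ~ 6^n * 2^n = {6.0**n * 2.0**n:.1e}")   # ~1e15
```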

Parallel GPU programming
Thanks to the direct and simple formulas of the LRE algorithm in (3) and (4), only addition and multiplication of matrices or vectors are involved, which makes the algorithm naturally suitable for parallel programming. Graphics processing units possess a powerful capability for parallel computation, and this is exploited to speed up the computation in both step (i) and step (ii). The computer hardware includes a 500 GB hard drive, 16 GB of memory, an i7-4770 CPU (3.5 GHz, 4 cores, 8 threads), and a GTX780 GPU with 2304 CUDA cores and 3 GB of memory.
As analyzed in section 3, step (i) has the dominant computational complexity $O(12^n)$, so GPU parallel programming is employed in this step. According to the locations of their nonzero elements, all the measurement bases can be divided into $3^n$ groups, each containing $2^n$ measurement bases. In the parallel GPU program, one group of measurement bases and their measurement data are first loaded, where the data include a $2^n \times 2^n$ array of nonzero elements, the $2^n$ nonzero locations and the $2^n$ measurement results. For parallelism, each thread is devoted to calculating one element of $\hat{\Theta}_{\mathrm{LS}}$ whose location is among the $2^n$ nonzero locations. Hence, only $2^n$ of the total $4^n$ elements of $\hat{\Theta}_{\mathrm{LS}}$ need to be calculated and updated for one group of data. All the threads are synchronized after updating these $2^n$ elements of $\hat{\Theta}_{\mathrm{LS}}$, to get ready for the computation of the next group of data.
This process continues until all $3^n$ groups of data have been processed. To show the relationship between the speed-up and the number of parallel threads in GPU programming, the reciprocal of the run time of step (i) for a maximally-mixed 12-qubit state is plotted against $m$ in figure 1, where $m$ is the number of streaming multiprocessors (SMs) used. Here, we only consider step (i) for simplicity, because it has the dominant computational complexity. There are 12 SMs in the GTX780 GPU and each SM contains 192 CUDA cores. Since each CUDA core is occupied by one thread, $m$ SMs can execute $192m$ threads in parallel. Figure 1 shows that the speed of step (i), measured by the reciprocal of the time cost, increases almost proportionally with the number of parallel threads, represented by the number of SMs. Hence, all the SMs are used in the GPU program to attain the fastest speed in the remaining simulations.
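The grouped sparse accumulation of step (i) can be illustrated serially as below; this is our own sketch for a small $n$, not the CUDA kernel, and in the GPU version the loop over the $2^n$ nonzero locations of each group is what the threads execute in parallel:

```python
import numpy as np
from itertools import product

# Serial sketch of the grouped sparse accumulation in step (i).
# Assumes the normalized Pauli representation basis; all names are ours.
I2 = np.eye(2)
pauli = {"x": np.array([[0, 1], [1, 0]], dtype=complex),
         "y": np.array([[0, -1j], [1j, 0]], dtype=complex),
         "z": np.array([[1, 0], [0, -1]], dtype=complex)}
axis_label = {"x": 1, "y": 2, "z": 3}          # component labels: 0 = identity

def kron_all(mats):
    out = np.array([[1.0 + 0j]])
    for m in mats:
        out = np.kron(out, m)
    return out

n = 3                                          # small example (the paper uses n = 14)
rho = np.eye(2 ** n) / 2 ** n                  # maximally mixed test state
rng = np.random.default_rng(1)
N_setting = 4000                               # copies per measurement setting

sign_arr = np.array(list(product((1, -1), repeat=n)))   # 2^n projectors per setting
xty = np.zeros(4 ** n)                         # accumulates X^T y_hat (sparse updates)
xtx_diag = np.zeros(4 ** n)                    # diagonal of X^T X

for setting in product("xyz", repeat=n):       # 3^n groups of measurement bases
    projs = [kron_all([(I2 + s * pauli[a]) / 2 for a, s in zip(setting, signs)])
             for signs in sign_arr]
    probs = np.array([np.real(np.trace(P @ rho)) for P in projs])
    y_hat = rng.multinomial(N_setting, probs) / N_setting

    for choice in product((0, 1), repeat=n):   # the 2^n nonzero locations of this group
        loc = 0
        coeff = np.full(2 ** n, 2.0 ** (-n / 2))   # +/- 1/sqrt(2)^n per projector
        for k, c in enumerate(choice):
            loc = 4 * loc + (axis_label[setting[k]] if c else 0)
            if c:
                coeff = coeff * sign_arr[:, k]
        # In the GPU version one thread owns location `loc`; only these 2^n of
        # the 4^n entries are touched for this group of data.
        xty[loc] += coeff @ y_hat
        xtx_diag[loc] += coeff @ coeff

theta_ls = xty / xtx_diag                      # (X^T X)^{-1} X^T y_hat, with X^T X diagonal
print("identity component:", theta_ls[0], "expected:", 2.0 ** (-n / 2))
```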
Using this approach, we now demonstrate the computation time of reconstructing multi-qubit states with our algorithm. GPU programming in step (i) takes 2.78 hours for a 14-qubit state, as depicted in figure 2(a). In comparison, CPU programming already takes almost half an hour for step (i) of an 11-qubit state, and computing larger systems with it would be difficult, as shown in figure 2(a).
In step (ii), the time cost of CPU programming for a 14-qubit state is estimated to be about 10 hours according to the numerical fitting line (blue dashed line) in figure 2(b), which is much longer than the run time of step (i) with GPU programming. Hence, GPU programming is also used to speed up the computation in this step. We note that the nonzero elements of $I_2/\sqrt{2}$ and $\sigma_z$ are at the same locations, and likewise for $\sigma_x$ and $\sigma_y$. Thus, all the $4^n$ basis operators $W_i$, which are constructed in (5) as tensor products of these four matrices, can be divided into $2^n$ groups according to the locations of their nonzero elements. In the parallel GPU program, one group of $2^n$ operators $W_i$ and the corresponding $2^n$ elements of $\hat{\Theta}_{\mathrm{LS}}$ are loaded. Each thread is responsible for calculating one of the corresponding $2^n$ elements of $\hat{\mu}$. All the threads are synchronized after updating these $2^n$ elements of $\hat{\mu}$, to get ready for the next group of data. This is repeated until all $2^n$ groups of data have been processed. With parallel GPU programming, the computation time of step (ii) is reduced to only 0.08 hours, as shown in figure 2(b). In step (iii), GPU programming is not needed, since the time cost of CPU programming with Mathematica is only 0.49 hours for a 14-qubit state (see figure 2(c)), which is already about one fifth of the computation time of step (i) using GPU programming. The total computation time of the whole process of reconstructing a 14-qubit state turns out to be only 3.35 hours, as shown in figure 2(d). From figure 2(d), we see that our optimized LRE algorithm with CPU programming (blue) is already more than 100 times faster than the efficient algorithm in [14] for reconstructing a 9-qubit state. However, according to the numerical fitting line in figure 2(d), reconstructing a 14-qubit state would take more than a month with our optimized algorithm on a CPU. This is a further reason why we resort to GPU programming, which exploits the parallelism of our algorithm. In our numerical simulations, all states are chosen as the maximally-mixed $n$-qubit states only for ease of obtaining the simulated measurement frequencies. Our method is applicable to other states, because the computational complexity and the run time are state independent.
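The nonzero-location grouping used above for step (ii) can be illustrated with a few lines (our sketch): the pattern of a tensor-product operator $W_i$ is fixed by whether each single-qubit factor is diagonal ($I$ or $\sigma_z$) or anti-diagonal ($\sigma_x$ or $\sigma_y$), which yields $2^n$ groups of $2^n$ operators each.

```python
from itertools import product
from collections import defaultdict

n = 3                                   # small example
# Label each single-qubit factor: 'd' if diagonal (I or sz), 'a' if anti-diagonal (sx or sy).
pattern_of = {"I": "d", "sz": "d", "sx": "a", "sy": "a"}

groups = defaultdict(list)
for labels in product(("I", "sx", "sy", "sz"), repeat=n):   # all 4^n operators W_i
    key = tuple(pattern_of[l] for l in labels)              # nonzero-location pattern
    groups[key].append(labels)

print(len(groups), "groups,", {len(g) for g in groups.values()}, "operators per group")
# -> 2^n groups, each containing 2^n of the 4^n operators
```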

Estimation error
In this section, the error between the estimate and the true state is analyzed in terms of the squared Hilbert-Schmidt (HS) distance and the infidelity. The mean squared HS distance between the state estimate $\hat{\mu}$ obtained in step (ii) and the true state $\rho$ is $\mathrm{E}\,\mathrm{Tr}[(\hat{\mu}-\rho)^2]$; its asymptotic value scales as $1/N$, where $N$ is the total number of copies used for all $M$ measurement operators, and depends on the covariance matrix $P$ of the measurement results. Here we have one more factor of $1/d$ than in equation (9) of [39], since every $d$ measurement bases form a complete POVM and can be measured simultaneously. Because these $d$ measurement bases are measured simultaneously, the corresponding off-diagonal elements of $P$ are of order $O(p_i p_j)$ rather than zero. However, these off-diagonal elements are much smaller than the diagonal elements when the quantum system is large, so $P$ can still be approximated as diagonal in the case of simultaneous measurements.
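As a rough, single-qubit illustration of the $1/N$ scaling (ours, not the paper's 14-qubit simulation, and using direct Bloch-vector estimation rather than the full LRE pipeline):

```python
import numpy as np

# Single-qubit Monte Carlo: the mean squared HS distance of the unconstrained
# estimate scales as 1/N for the maximally mixed state. Illustration only.
rng = np.random.default_rng(2)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
rho = np.eye(2) / 2                              # maximally mixed true state

def mean_sq_hs(copies_per_axis, trials=2000):
    errs = []
    for _ in range(trials):
        # Estimate each Bloch component from +/-1 outcomes (p = 1/2 each).
        s_hat = [(2 * rng.binomial(copies_per_axis, 0.5) - copies_per_axis) / copies_per_axis
                 for _ in range(3)]
        mu = (np.eye(2) + s_hat[0] * sx + s_hat[1] * sy + s_hat[2] * sz) / 2
        errs.append(np.real(np.trace((mu - rho) @ (mu - rho))))
    return np.mean(errs)

for n_copies in (100, 1000, 10000):
    print(n_copies, mean_sq_hs(n_copies))        # decreases roughly as 1/N
```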
This mean squared HS distance obviously depends on the unknown state. Here we calculate it for $\rho$ chosen as the maximally-mixed state, i.e., $\rho = I_d/d$. One reason for this choice is that an elegant theoretical result can be derived thanks to the simple form of the maximally-mixed state; the other is that the maximally-mixed state usually gives a larger mean squared HS distance than other states. In this case, we have $P \approx I/d$ to first order of approximation. When Pauli measurements are performed on copies of an $n$-qubit maximally-mixed state, the mean squared HS distance (blue line in figure 3) is approximately $\left(\tfrac{5}{6}\right)^{n}/N_0$, where $N_0 = N/6^n$ is the number of copies per measurement basis. When $\rho$ is the maximally-mixed state, the infidelity is directly related to the squared HS distance by $1 - F \approx \tfrac{d}{4}\,\mathrm{Tr}[(\hat{\mu}-\rho)^2]$. In the large-$N_0$ limit, $\hat{\mu}$ has non-negative eigenvalues and we have $\hat{\rho} = \hat{\mu}$. Thus, the average infidelity is well approximated by $\tfrac{1}{4N_0}\left(\tfrac{5}{3}\right)^{n}$ (green line) in this limit. However, when $N_0$ is small, $\hat{\mu}$ has negative eigenvalues and needs to be pulled back to a physical state $\hat{\rho}$. Hence, as shown in figure 3, the actual infidelity (green diamonds) is much smaller than this prediction (green line) for $N_0 < 2^{10}$ in the 14-qubit scenario.
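A short derivation (ours, not reproduced from the paper) of the relation between infidelity and squared HS distance used above, valid for $\rho = I_d/d$ and $\hat{\mu} = \rho + \Delta$ with $\Delta$ small, traceless and Hermitian, and $\hat{\mu} \ge 0$:

```latex
% Let \delta_i denote the eigenvalues of \Delta, so \sum_i \delta_i = 0 and
% \sum_i \delta_i^2 = \operatorname{Tr}[(\hat\mu-\rho)^2].
F(\rho,\hat\mu)
  = \Bigl(\operatorname{Tr}\sqrt{\sqrt{\rho}\,\hat\mu\,\sqrt{\rho}}\Bigr)^{2}
  = \Bigl(\tfrac{1}{d}\sum_i \sqrt{1+d\,\delta_i}\Bigr)^{2}
  \approx \Bigl(1-\tfrac{d}{8}\sum_i \delta_i^{2}\Bigr)^{2}
  \approx 1-\tfrac{d}{4}\operatorname{Tr}\bigl[(\hat\mu-\rho)^{2}\bigr].
```

Combining this relation with the mean squared HS distance $\left(\tfrac{5}{6}\right)^{n}/N_0$ and $d = 2^n$ reproduces the quoted infidelity $\tfrac{1}{4N_0}\left(\tfrac{5}{3}\right)^{n}$.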

Figure 3. Estimation error for a maximally-mixed 14-qubit state as a function of $N_0$. The total number of copies is $N = 6^{14} N_0$. The numerical results for the squared HS distance between $\hat{\mu}$ and $\rho$ (blue crosses) agree well with the asymptotic mean squared HS distance (dashed blue line). The squared HS distance between $\hat{\rho}$ and $\rho$ (red dots) is smaller than that between $\hat{\mu}$ and $\rho$, which is due to the positivity property [36,39]. When the measured number of copies becomes large, $\hat{\mu}$ is already a physical state and is not changed in step (iii). When $N_0$ exceeds $2^{10}$, the numerical results for the infidelity between $\hat{\rho}$ and $\rho$ (green diamonds) match the dashed green line, which represents the infidelity between $\hat{\rho}$ and $\rho$ in the large-number limit.

Summary and prospect
In this paper, we have fully reconstructed a 14-qubit state with a modest computation time of 3.35 hours using informationally overcomplete Pauli measurements. By a smart choice of the representation basis set in the algorithm, only very few nonzero elements exist and need to be stored and calculated, which reduces the computational complexity by a factor of $2^n$. Furthermore, parallel GPU programming is used to fully exploit the parallelism of the simple LRE algorithm. It is worth pointing out that our method can push the capability of FQST forward to even larger quantum systems. This is because the measurement bases in our analysis and simulations are chosen as Pauli measurements, which are overcomplete. If the number of measurement bases is reduced from $6^n$ to $4^n$, the computational complexity can be reduced from $O(12^n)$ to $O(8^n)$, which can further enhance the capability of FQST via our optimized LRE.