Benchmarking quantum tomography completeness and fidelity with machine learning

We train convolutional neural networks to predict whether or not a set of measurements is informationally complete to uniquely reconstruct any given quantum state with no prior information. In addition, we perform fidelity benchmarking based on this measurement set without explicitly carrying out state tomography. The networks are trained to recognize the fidelity and a reliable measure for informational completeness. By gradually accumulating measurements and data, these trained convolutional networks can efficiently establish a compressive quantum-state characterization scheme by accelerating runtime computation and greatly reducing systematic drifts in experiments. We confirm the potential of this machine-learning approach by presenting experimental results for both spatial-mode and multiphoton systems of large dimensions. These predictions are further shown to improve when the networks are trained with additional bootstrapped training sets from real experimental data. Using a realistic beam-profile displacement error model for Hermite–Gaussian sources, we further demonstrate numerically that the orders-of-magnitude reduction in certification time with trained networks greatly increases the computation yield of a large-scale quantum processor using these sources, before state fidelity deteriorates significantly.

As typical quantum tasks involve pure states, unitary gates, and projective measurements, there also exists a series of compressed-sensing-related proposals [43][44][45][46][47][48][49][50][51] that fully reconstruct low-rank quantum components with few measurement resources. However, these rely on prior knowledge about the rank, which often turns out to be unreliable in practice because of noise. Very recently, compressive-tomography methods that assume no prior information have been developed and applied to the individual low-resource characterization of quantum states, processes, and measurements [52][53][54][55][56]. A crucial ingredient in these methods is informational-completeness certification, which determines whether or not a given measurement set and its corresponding data are informationally complete (IC). This is done by computing a uniqueness measure based on the given measurements. Such a computation can be performed with classical semidefinite programs (SDPs) [57] of (worst-case) polynomial-time complexity.
Like any tomography scheme that invokes rounds of optimization routines, an accumulation of errors can occur in real experiments while running SDPs on-the-fly. As a practically feasible solution, we propose to train an artificial neural network to verify the IC property for a set of quantum measurements and corresponding raw data. We further introduce an auxiliary network to be used concurrently for us to validate the fidelity of the unknown state for the given measurement set without carrying out explicit reconstruction. Once a set of IC measurement data is collected, it takes only one final round of state reconstruction to obtain the unique physical estimator, if so desired.
Network training can be done offline using simulated noisy datasets, and the stored network model can later be retrieved and used in real experiments with statistical noise. More specifically, a convolutional neural-network (CNN) architecture shall be used for training and prediction. Among other kinds of networks that have been widely adopted by the quantum-information community [58][59][60][61][62][63], this is a popular network architecture that is used in image-pattern recognition [64][65][66][67], with boosted support by a recent universality proof [68] that such networks can indeed forecast any continuous function mapping. Both its classical application and quantum analog have also gained traction in quantum-information science [69][70][71][72].
In this work, we train an informational completeness certification net (ICCNet) and a fidelity prediction net (FidNet), each made up of a reasonably deep sequence of convolution and pooling neural layers. Partnered with FidNet for direct fidelity benchmarking without the need for state tomography, ICCNet constitutes the foundational core for deciding whether the given measurement resources are sufficient to uniquely characterize any unknown state in real experimental situations. Neural-network training is versatile in the sense that noise models may be incorporated into the training procedure to improve the predictive power of the networks. After offline training, the network models can reduce the computation time of uniqueness certification by orders of magnitude for large dimensions while the experiments are running. This essentially realizes a drift-proof compressive tomography scheme comprising highly efficient uniqueness-certification and fidelity-benchmarking protocols.
Apart from Monte Carlo simulations, we also use real data obtained from two separate groups of experiments to demonstrate that the resulting trained network models can predict the average behaviors of both the IC property and fidelity very well, despite the presence of errors and experimental imperfections. We also show that performances in predicting both properties can be further boosted when the neural networks are trained with additional bootstrapped experimental datasets. Finally, simulations on a time-dependent error model relevant to Hermite-Gaussian sources are performed as an example to illustrate the effectiveness of our neural-network certification tools in suppressing systematic drifts during quantum computation.

A. Ascertaining informational completeness
The main procedure for certifying whether a generic measurement dataset is sufficient to unambiguously determine an unknown quantum state can be represented as a simple flowchart in Fig. 1(a). Given a positive operator-valued measure (POVM) that models the measurements performed, the corresponding data counts are noisy due to statistical fluctuation arising from finite data samples. Proper data analysis first entails the extraction of physical probabilities from the accumulated data, which can be done with well-established statistical methods, such as those of maximum likelihood (ML) [23,24,[73][74][75] and least squares (LS) [76,77], subject to the physical constraint of density matrices (refer also to Sec. II A).
After obtaining the physical probabilities, one may proceed to evaluate the measurements and find out whether they are IC. More specifically, a uniqueness indicator 0 ≤ s_CVX ≤ 1 can be computed directly from the POVM and data with the help of SDPs; this is the informational completeness certification (ICC). When s_CVX > 0, there is equivalently a convex set of state estimators that are all consistent with the physical probabilities. It can be shown [52] that a unique estimator is obtained from the measured POVM and corresponding data if and only if s_CVX = 0.
Bottom-up resource-efficient quantum-state tomography is thus an iterative program involving rounds of extracting physical probabilities from the measurement data and certifying uniqueness based on these probabilities. At each round, the computed s_CVX is used to decide whether new measurements are needed in the next one. In this manner, the POVM outcomes may be accumulated bottom-up until s_CVX = 0, after which a physical state reconstruction using either the ML or LS scheme is carried out to obtain the unique estimator. The size of the resulting IC POVM is minimized accordingly. We remark that ICC turns out to be the limiting procedure in practical implementation relative to a typical quantum-state reconstruction. This is because an estimation over the space of quantum states can be implemented very efficiently with an iterative scheme, where each step involves a regular gradient update and just one round of convex projection [75]. (The case for quantum processes has also been discussed [78].) On the other hand, satisfying both Born's rule and the quantum positivity constraint in ICC requires a separate iterative procedure just to carry out the correct convex projection onto their intersection [79]. To the authors' knowledge, there exists to date no efficient way to perform projections onto these constraints.
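The gradient-update-plus-projection step mentioned above hinges on projecting a Hermitian matrix onto the set of density matrices. A minimal numpy sketch of that convex projection (eigenvalues projected onto the probability simplex; the function names are ours, not the authors'):

```python
import numpy as np

def project_to_simplex(lam):
    """Project a real vector onto the probability simplex."""
    u = np.sort(lam)[::-1]                     # sort descending
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(u) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1] # last index satisfying the KKT test
    theta = css[rho] / (rho + 1.0)
    return np.maximum(lam - theta, 0.0)

def project_to_density_matrix(H):
    """Closest (Frobenius-norm) density matrix to a Hermitian H:
    rho >= 0 and tr{rho} = 1."""
    H = (H + H.conj().T) / 2                   # enforce Hermiticity
    lam, V = np.linalg.eigh(H)
    lam = project_to_simplex(lam)              # eigenvalues onto the simplex
    return (V * lam) @ V.conj().T
```

For instance, the matrix diag(2, −1) projects to the pure state diag(1, 0).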
By recalling the results in Refs. [52,53], we briefly describe the simple procedures that deterministically verify whether a set of POVM outcomes {Π_j ≥ 0} is IC given their corresponding set of relative-frequency data {ν_j}. The first necessary step is to acquire the physical probabilities from the ν_j. To this end, we consider two popular choices often considered in quantum tomography, namely the ML and LS methods. In ML, we maximize the log-likelihood function log L that best describes the physical scenario over the quantum state space. Since we predominantly discuss von Neumann measurement bases, each basis induces a multinomial distribution, so that log L ∝ Σ_j ν_j log p_j, where the p_j = tr{ρ Π_j} are our sought-after physical probabilities to be optimized over the operator space in which ρ ≥ 0 and tr{ρ} = 1. In LS, which we have adopted to deal with arbitrary projective measurements that do not sum to the identity operator in general, the distance D = Σ_j (ν_j − p_j)² is minimized with respect to the p_j over the space of ρ ≥ 0, this time with the unit-trace constraint relaxed and later reinstated after the minimization is completed.
Using the physical probability estimators p_j obtained through the aforementioned optimization strategies, we now define and fix a randomly generated full-rank state Z and conduct the following two SDPs:

minimize/maximize f = tr{ρ Z} over ρ
subject to: tr{ρ Π_j} = p_j for all j, ρ ≥ 0, tr{ρ} = 1. (2.1)

It is now clear why the SDPs are to be carried out with the physical probabilities p_j instead of the raw data ν_j: the relative frequencies ν_j are statistically noisy and in general do not correspond to a feasible solution set in (2.1). It has been shown that when s_CVX ≡ f_max − f_min is zero, any quantum-state estimator reconstructed from {Π_j} and {ν_j} is unique and equal to the solution of (2.1).

B. Training the ICCNet and FidNet
We propose to tackle the combined problem of physical-probabilities extraction and uniqueness certification by predicting with trained neural networks. We also demonstrate the possibility of performing fidelity evaluation on the reconstruction with such neural networks without explicitly carrying out physical state tomography. To do this, we introduce the ICCNet and FidNet, illustrated in Fig. 1(b), where each possesses a convolutional network architecture that analyzes the given POVM and data by regarding them as images. Such a treatment allows one to train the networks with far fewer trainable parameters to recognize s_CVX and the fidelity F(ρ̂, ρ_targ) = (tr{[√ρ̂ ρ_targ √ρ̂]^(1/2)})² between the state estimator ρ̂ and some target state ρ_targ, as compared with using, for instance, the multilayer perceptron (feed-forward) architecture [80,81] that consists only of fully connected (dense) layers. The purpose of FidNet is to assess the quality of the reconstruction after each measurement set is made. Before the point of informational completeness, the reconstructed state ρ̂ is not unique by definition.

FIG. 1. The physical-probabilities extraction and SDP-based ICC of a resource-efficient quantum-state tomography scheme in (a) may be entirely replaced by the ICCNet and FidNet shown in (b), each of which is a sequence of convolutional blocks (shown here for d = 16 as an example). Each convolutional block typically consists of a convolutional layer (conv), a batch-normalization layer (BN), the relu activation layer, a dropout layer, and a pooling layer (maxpool or avgpool) (more details in Sec. II B). Specific network structures may vary for systems of different dimensions. Numerical values after the convolutional blocks are flattened and activated with the sigmoid function just before the s_CVX computation, and passed through a fully-connected (FC) layer before the fidelity F computation.
Throughout this article, for consistency, ρ shall always be taken as the ML estimator that minimizes the linear function in (2.1). This is only a particular choice used to define FidNet that is chosen as a standard. One may end up with a slightly more conservative FidNet by setting ρ to be the minimum-fidelity estimator with respect to the SDP constraints stated in the last line of (2.1). This would require another round of fidelity minimization in every step of the training-data-generation phase.
For predicting s_CVX and F, both ICCNet and FidNet employ a sequence of layers that manipulate two-dimensional arrays. Two important types of layers responsible for these operations are the convolution layer, consisting of two-dimensional filters that carry out multiplicative convolutions with the layer input numerical array, and the pooling layer, which generally down-samples a layer input array into a smaller output array with a simple numerical-summarizing computation. After each convolution layer, an activation function is applied to introduce the nonlinear characteristics needed for predicting general network output functions.
The convolutional ICCNet and FidNet take on a similar architecture, which consists of convolution, max-pooling, and average-pooling layers. Each convolution layer consists of n_f filters, where each filter is a 3 × 3 array window that slides vertically and horizontally across the layer input arrays with stride 1 in both directions. We design the sequence of convolution layers to have an exponentially increasing n_f with the network depth. The pooling layers are generally responsible for shrinking a layer input array to a smaller layer output array. The actions of all types of layers are summarized in Fig. 2. We insert the default rectified-linear-unit ("relu") activation function, defined as f_relu(x) = max(0, x), after every convolution layer. At the ends of ICCNet and FidNet, the respective output values are computed with the sigmoid activation function given by f_sigmoid(x) = 1/(1 + e^(−x)).
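The individual layer operations described here are easy to reproduce in a few lines of numpy. The following toy sketch (not the trained networks themselves; function names are ours) implements a "valid" 3 × 3 convolution with stride 1, the relu and sigmoid activations, and max/average pooling:

```python
import numpy as np

def conv2d_valid(x, k):
    """2-D 'valid' convolution with stride 1 (cross-correlation, as in CNNs)."""
    n, m = k.shape
    out = np.empty((x.shape[0] - n + 1, x.shape[1] - m + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+n, j:j+m] * k)  # elementwise product + sum
    return out

def pool2d(x, size=2, stride=2, mode="max"):
    """Down-sample with a max- or average-pooling window."""
    H = (x.shape[0] - size) // stride + 1
    W = (x.shape[1] - size) // stride + 1
    red = np.max if mode == "max" else np.mean
    return np.array([[red(x[i*stride:i*stride+size, j*stride:j*stride+size])
                      for j in range(W)] for i in range(H)])

relu = lambda x: np.maximum(x, 0.0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
```

An 8 × 8 input passed through a single 3 × 3 filter yields a 6 × 6 array, which a 2 × 2 pooling window of stride 2 then reduces further.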
Overfitting can be an issue in machine learning, in which case the neural networks are prone to fitting training datasets much better than unseen ones. It is therefore essential to regulate network training by keeping the problem of overfitting in check so that the resulting trained models have high predictive power. This problem often arises when the neural network is deep. The addition of dropout layers, which randomly exclude trainable parameters, has been proven to be an effective method for combating overfitting [82][83][84]. More recently, it has been demonstrated that neural-network training can be further enhanced by adding batch-normalization layers. This was supported not only by the initial observation that the distribution of layer input values is stabilized with batch normalization [85], but also by the even more relevant finding that the gradient landscape of the network loss function (the figure of merit quantifying the difference between the actual output value and that computed by the network) seen by the optimization routine that trains the network becomes smoother [86], making training much more stable.

FIG. 2. Operations carried out by the convolution and pooling layers. A max-pooling layer picks the maximum number from the layer input within a selected window, while an average-pooling layer computes average values over the selected window. In this example, the 8 × 8 input layer is reduced to a 6 × 6 output layer after going through a convolution layer consisting of a single 3 × 3 filter array of trainable parameters that takes stride 1. This output layer becomes the input layer with respect to either the max-pooling or average-pooling layer, each of which consists of a single 2 × 2 filter array of stride 2. The final output layer (rounded off for illustration purposes) is therefore a 4 × 4 numerical array.
All trainable parameters in the relevant neural layers of ICCNet and FidNet are optimized using a variant of stochastic gradient descent known as NAdam [87], where the network gradients are computed in batches of the training data. To prepare ICCNet training input datasets for von Neumann measurements of a fixed number K of bases, as considered in Secs. IV A and IV B, the initial network input X is an m × K(d² + d) matrix that contains m training datasets, each recording the K measured bases and their corresponding relative frequencies ν_jk (0 ≤ j ≤ d − 1, 1 ≤ k ≤ K, Σ_j ν_jk = 1). To encode the measurement bases, we regard all bases as unitary rotations U_k |j⟩⟨j| U_k† of the standard computational basis {|j⟩}, with U_1 = 1. These unitary operators are then logarithmized in order to obtain their Hermitian exponents H_k = −i log U_k (H_1 = 0), from which the diagonal entries and the real and imaginary parts of the upper-triangular matrix elements are extracted. Each row of X is thus a flattened K(d² + d)-dimensional row of real numerical values formatted to encode U_1, U_2, …, U_K, ν_{0,1}, …, ν_{d−1,1}, ν_{0,2}, …, ν_{d−1,2}, …, ν_{0,K}, …, ν_{d−1,K} in this order. This input matrix is then processed into a K(d² + d) × K(d² + d) square training array, with zeros padded to complete the square, which is then fed to ICCNet (see Fig. 3). Similarly, for a fixed set of L projective measurements discussed in Sec. IV C, analogous arguments lead to the necessary L(d² + 1) × L(d² + 1) input square training array. For each dimension, the randomly generated full-rank state Z needed to solve (2.1) is fixed during the training and testing stages.
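The basis-encoding step above can be sketched as follows, assuming the principal matrix logarithm; the helper name encode_basis is ours. Each unitary contributes d² real features (d diagonal entries of H plus the real and imaginary parts of its upper triangle), which together with the d relative frequencies per basis give the K(d² + d) features per training row.

```python
import numpy as np
from scipy.linalg import logm

def encode_basis(U):
    """Encode a d x d unitary U as d^2 real features: H = -i log U is
    Hermitian; keep the d real diagonal entries plus the real and
    imaginary parts of the strict upper triangle."""
    H = -1j * logm(U)
    H = (H + H.conj().T) / 2                  # clean up numerical residue
    iu = np.triu_indices(U.shape[0], k=1)
    return np.concatenate([np.real(np.diag(H)), H[iu].real, H[iu].imag])
```

The identity basis (U_1 = 1) encodes to the all-zero feature vector, matching H_1 = 0.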
On the other hand, training FidNet requires input information not only about the measured bases (or projectors) and their corresponding data, but also about m additional target states, one for each dataset. The corresponding dimensions of the training arrays are [(K + 1)d² + Kd] × [(K + 1)d² + Kd] and [(L + 1)d² + L] × [(L + 1)d² + L] for basis and projective measurements, respectively. For all purely simulation-based figures, including the fidelity predictions for simulated test datasets of d = 16, 32, and 64 shown in Figs. 7 and 8, FidNet training is done with target states defined by the true states that generated the simulated training datasets. On the other hand, for all experimental results in Fig. 9, FidNet training is carried out simultaneously with target states derived from the corresponding true states and states that deviate from them, in order to account for systematic errors more effectively and improve the average prediction accuracy. The list of hyperparameters that define the architectures of ICCNet and FidNet, as well as the technical analyses of network input-data generation and network training, are given in Appendices B and C.
Once an IC set of measurements is performed and assessed with ICCNet and FidNet, the density matrix representing the final state estimator may be obtained using the accelerated projected-gradient algorithm developed in [75]. Alternatively, it is possible to append our networks with additional conditional generative networks to yield the density matrix [62,63].

III. EXPERIMENTS
A. Spatial-mode photonic systems

Apart from evaluating simulation test datasets, we also run the trained ICCNet and FidNet models to benchmark real experimental datasets. In the first group of experiments, we showcase the accuracy of ICCNet and FidNet predictions on experimental data acquired from an attenuated laser source prepared in quantum states projected onto Hermite-Gaussian spatial-mode bases of various dimensions d. With this group of experiments, for the sake of variety, we consider measurement bases obtained from adaptive compressive tomography (ACT). These are eigenbases of the state that minimizes the von Neumann entropy subject to the same SDP constraints as in (2.1). It has been demonstrated that successive measurements of such eigenbases result in a fast convergence of s_CVX [52,53]. An explicit protocol for constructing these bases is given in Appendix A.
The Hilbert space of photonic spatial degrees of freedom is typically discretized using an appropriate basis of transverse modes. To produce high-dimensional quantum states we attenuate an 808-nm diode laser, filter the resulting radiation with a single-mode optical fiber and then adjust the spatial structure of the light field with a spatial light modulator (SLM, see Fig. 4). The holographic approach [88] allows us to transform the incident light into arbitrary transverse modes by controlling the phase pattern on the SLM's display.
We work with Hermite-Gaussian (HG) modes HG_nm(x, y), which are solutions of the paraxial Helmholtz equation in Cartesian coordinates (x, y) and form a complete orthonormal basis. By bounding the sum of beam orders n + m, we restrict the dimension of the generated quantum systems. Since the holograms displayed on the SLM make use of a blazed grating, we place an iris in the middle of the telescope, where the diffraction orders are well separated, in order to select the first diffraction order. Using a second SLM and a single-mode optical fiber, followed by a single-photon counting module, we realize a well-known technique for projective measurements in the spatial-mode space [89]. This also allows us to conveniently implement general ACT basis measurements in arbitrary dimensions.

B. Multiphoton systems
In the second group of experiments, we switch to a different flavor of informational completeness by discussing two-mode photon-number states. In particular, we look at quantum states of up to three photons occupying two optical modes. Such three-photon states were of interest in the study of high-order quantum polarization properties beyond the Stokes vectors [90]. The resulting Hilbert space is effectively 4-dimensional, spanned by the photon-number states |n_H, n_V⟩ with n_H + n_V = 3. Here n_H and n_V denote the numbers of photons in the horizontal and vertical polarization modes, respectively.
To perform tomography on the multiphoton quantum states, expectation values of a set of 16 rank-one projectors are measured. In principle, any set of 16 linearly independent projectors is suitable for a complete characterization of arbitrary 4-dimensional states without ICC. For these experiments, we define each projector Π_j by the ket (b_j†)³|0, 0⟩/√6, where b_j† is a linear combination of a_H† and a_V†, the creation operators of the horizontal and vertical polarization modes [91]. Clearly, Σ_j Π_j ≠ 1 this time, as the projectors are independently measured. Figure 5 depicts the experimental setup used to generate and characterize three-photon states. Four photons are produced through the double-pair emission of a non-collinear spontaneous parametric down-conversion (SPDC) process. The initial state is prepared in |2, 2⟩ by combining two horizontally polarized photons and two vertically polarized photons with a polarizing beam splitter (PBS). To ensure that the photons are indistinguishable in the frequency domain, interference filters of 3-nm bandwidth centered at 780 nm are placed before sending the photons into the PBS. The four photons are then reduced to three photons by detecting a photon at D_1, and the three photons reflected from a partially-polarizing beam splitter (PPBS) are in the state |1, 2⟩⟨1, 2|. The PPBS perfectly reflects vertically polarized photons and reflects 1/3 of horizontally polarized photons. The half-wave plate (HWP) setting of θ_1 = 0° leaves the state unchanged, whereas the setting of θ_1 = 45° transforms the state into |2, 1⟩⟨2, 1|. In addition, the mixed state (|1, 2⟩⟨1, 2| + |2, 1⟩⟨2, 1|)/2 is obtained by incoherently adding the relevant pure states through post-processing. These three-photon states are used to demonstrate the performances of ICCNet and FidNet in Fig. 9(b).
Experimentally [90], the three-photon states were characterized by acquiring the four-fold coincidence counts at D_1, D_2, D_3, and D_4 for 16 rank-one projectors after passing through a PBS and beam splitters (BS). The SU(2) unitary operators U_j that define the projectors

Π_j = (b_j†)³ |0, 0⟩ (1/6) ⟨0, 0| b_j³ (3.1)

are determined by the quarter-wave-plate (QWP) and HWP angles θ_2 and θ_3, through the product of the standard Jones matrices of the two wave plates evaluated at these angles. In our experiments, we consider SU(2) rotations that fairly distribute the single-photon component b_j†|0, 0⟩ on three Bloch-spherical circles parallel to the equatorial plane [90,92], as shown in Fig. 6. The measurement angles that realize these projectors are given in Fig. 5.
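As an illustration of how wave-plate angles generate such rotations, the textbook Jones matrices of a half- and quarter-wave plate can be composed into a unitary. Note that these are the standard forms up to global phases, and the composition order below is our assumption for illustration, not necessarily the exact convention of Eqs. (3.2)-(3.3):

```python
import numpy as np

def hwp(theta):
    """Standard half-wave-plate Jones matrix at fast-axis angle theta."""
    c, s = np.cos(2 * theta), np.sin(2 * theta)
    return np.array([[c, s], [s, -c]], dtype=complex)

def qwp(theta):
    """Standard quarter-wave-plate Jones matrix at angle theta
    (up to a global phase)."""
    c, s = np.cos(2 * theta), np.sin(2 * theta)
    return np.array([[1 + 1j * c, 1j * s], [1j * s, 1 - 1j * c]]) / np.sqrt(2)

def waveplate_rotation(theta2, theta3):
    """Hypothetical composition of QWP(theta2) followed by HWP(theta3)."""
    return hwp(theta3) @ qwp(theta2)
```

Any such product is unitary, so it defines a valid SU(2) rotation of the single-photon component.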

A. Simulations: Neural-network performances
We first present performance graphs of ICCNet and FidNet in Fig. 7, based on two sets of simulations on four-qubit states (d = 16) using random measurement bases generated with the Haar measure for the unitary group (see Appendix A), and bases found using ACT. In each set of simulations, for both cases where statistical noise is either absent or present, we collect simulation data for various numbers K of bases (s_CVX is normalized to 1 at K = 1 by default), each case recording measurements of 5000 randomly generated quantum states of uniformly distributed rank 1 ≤ r ≤ 3. The explicit CNN architecture employed is specified in Sec. II B. The accurate fit between the actual computed values and those predicted by ICCNet and FidNet suggests that faithful neural-network predictions of both the degree of informational completeness and the fidelity are a definite possibility in both noiseless and statistically noisy environments. Sample code for network training and evaluation with four-qubit simulation datasets is available online [93].
In separate simulations on four- (d = 16), five- (d = 32), and six-qubit (d = 64) systems with random Haar measurement bases, numerical evidence presented in Fig. 8 shows that the computation times of s_CVX neural-network predictions can be reduced by about four orders of magnitude relative to ordinary SDP calculations, and this difference grows wider with larger dimensions.

B. Experimental performance with spatial-mode photonic states
For each value of d, we experimentally generated random pure states and constructed their respective ACT measurement bases in order to evaluate the performance of ICCNet and FidNet, which were previously trained with 10000 simulation datasets of random quantum states of uniformly distributed rank 1 ≤ r ≤ 3 and different K values. These simulated training datasets are modeled with statistical noise arising from a multinomial distribution defined by N = 5000 sampling copies per basis, which is close to the experimental average.
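The multinomial training-noise model described here can be sketched in a few lines (the function name is ours; a measured basis is passed as a unitary whose columns are the basis kets):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_frequencies(rho, basis, N=5000):
    """Born-rule probabilities p_j = <j|rho|j> for one measured basis,
    sampled with multinomial noise from N copies; returns relative
    frequencies that sum to 1."""
    p = np.real(np.array([v.conj() @ rho @ v for v in basis.T]))
    p = np.clip(p, 0.0, None)
    p /= p.sum()                      # guard against tiny round-off
    return rng.multinomial(N, p) / N
```

For the maximally mixed state measured in the computational basis, the sampled frequencies scatter around 1/d with standard deviation of order 1/√N.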
Owing to experimental noise, the resulting spatial-mode quantum states are, as a matter of fact, nearly pure but of sufficiently low rank. Figure 9(a) confirms that ICC and fidelity benchmarking with simulation-trained neural-network models are accurate even on real experimental test data. One can observe the relative network-prediction stability of s_CVX in contrast with that of F. This coincides with the expectation that while the fidelity is strongly affected by statistical noise and other imperfections such as systematic errors, the degree of informational completeness is more intimately related to the quantum measurements and the rank of the quantum state, such that noise only perturbs the functional behavior of s_CVX. Regardless, Fig. 9(a) shows that all predictions made by the simulation-trained ICCNet and FidNet models remain roughly within the error margins of the actual computed values.

C. Experimental performance with multiphoton states
For every fixed number L of projectors chosen from the complete set of 16 defined in Sec. III B, simulation datasets of 10000 random d = 4 quantum states of uniformly distributed rank r are fed into both ICCNet and FidNet for training. These datasets are obtained from randomized sequences of the 16 projectors described above. Statistical noise is introduced into the simulation with multinomial distributions defined by N = 500 copies per projective measurement. To test the trained models and acquire the prediction results depicted in Fig. 9(b), we make use of three different sets of 20 experimental runs outside the training datasets, each set corresponding to a different quantum state.

D. Noise training and reduction
Experimental noise due to imperfections and systematic errors is always present in any real dataset. The fluctuating deviations of neural-network-predicted values from the actual ones observed in Fig. 9 arise from the absence of such experimental noise, beyond purely statistical fluctuations, in the simulated datasets used to train ICCNet and FidNet.
When more knowledge about the noisy environment is acquired, data simulation based on this knowledge may be carried out to improve the network predictions in that environment. Here, we show that when some samples of experimental data that are sufficiently representative of the overall noise behavior can be spared for training, it is possible to train ICCNet and FidNet with both statistically noisy simulated datasets and bootstrapped experimental datasets, in order to learn the experimental noise effects approximately well and improve the network predictions. Bootstrapping entails using a given experimental dataset to generate numerous mock datasets with Monte Carlo procedures. More specifically, in the multinomial setting, the column ν_k of relative frequencies for the kth basis possesses a Gaussian distribution of mean p_k and covariance matrix Σ_p^(k) = [diag(p_k) − p_k p_k^T]/N for sufficiently large N, owing to the central limit theorem, where diag( · ) forms a diagonal matrix whose diagonal is defined by the argument. A direct substitution of ν_k for p_k leads to the following simple rule for bootstrapping experimental ACT datasets from the Hermite-Gaussian-mode photonic system: ν'_k = N_{≥0}{ν_k + w_k}, where w_k is a column of random variables collectively distributed according to the Gaussian distribution of zero mean and covariance matrix Σ_ν^(k), which is evaluated with the measured relative frequencies of the particular kth basis, and N is set to 5000, the estimated number of copies per ACT basis considered in Sec. IV B. The operation N_{≥0} takes the absolute value of the argument followed by its sum normalization over 0 ≤ j ≤ d − 1 for the kth ACT basis. Finally, the states that produce the basis relative frequencies used in the bootstrapping procedure are different from the test states used to evaluate the network predictions.
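This Gaussian bootstrapping rule for one measured basis can be sketched as follows (function name ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_basis(nu, N=5000):
    """Resample one basis' relative frequencies nu (summing to 1) by adding
    Gaussian noise with the multinomial covariance [diag(nu) - nu nu^T]/N,
    then restore positivity and normalization (the N_{>=0} map)."""
    cov = (np.diag(nu) - np.outer(nu, nu)) / N
    w = rng.multivariate_normal(np.zeros_like(nu), cov)
    boot = np.abs(nu + w)             # absolute value ...
    return boot / boot.sum()          # ... followed by sum normalization
```

Repeated calls generate as many mock frequency columns per experimental basis as needed for training.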
Owing to the limited set of three-photon states, we adopt a different method to bootstrap the experimental datasets acquired from these states. Since these datasets are obtained from measuring independent projectors, we randomly permute the projectors and their corresponding relative (unnormalized) frequencies in order to generate new measurement sequences as mock datasets. The 16 projectors offer a total of 16! permutations for each state, allowing us to conveniently generate an abundance of bootstrapped training datasets that are clearly different from those used for testing. In a similar manner to the spatial-mode photonic systems, each relative frequency ν_l here is a binomial random variable normalized by the number of copies N used to measure the lth projector. Therefore, bootstrapping these relative frequencies may be carried out with additive Gaussian random variables inasmuch as ν'_l = ν_l + w_l √(ν_l(1 − ν_l)/N), where w_l is a standard Gaussian random variable of zero mean and unit variance, and N = 500 is fixed as the estimated number of copies used to obtain the measured relative frequency for each projector, consistent with Fig. 9(b). Figure 10 shows the enhanced prediction performances of ICCNet and FidNet. To generate this figure, a total of 5000 simulated and 5000 bootstrapped datasets are employed (m = 10000) for each group of experiments to train the networks for every value of K and L. These new plots indicate that the slightly fluctuating neural-network prediction curves on noisy experimental data can be smoothened when bootstrapped information about the noisy environment is incorporated into the training.

FIG. 11. (a) Random single-direction displacements of the beam-profile center show an increasing variance around the zero-displacement origin, as expected of random-walk Wiener processes. Such variance drifts give rise to the fidelity curves with respect to the ideal true states shown in (b) for the noisy true states ρ(t) and in (c) for the reconstructed estimators ρ̂(t), in contrast to the driftless scenarios in (d).
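The per-projector binomial bootstrap can be sketched analogously (function name ours; the clip to [0, 1] simply keeps the resampled frequency physical):

```python
import numpy as np

rng = np.random.default_rng(2)

def bootstrap_projector(nu_l, N=500):
    """Resample one projector's relative frequency via its binomial
    standard deviation: nu' = nu + w * sqrt(nu (1 - nu) / N), w ~ N(0, 1)."""
    w = rng.standard_normal()
    sigma = np.sqrt(nu_l * (1.0 - nu_l) / N)
    return float(np.clip(nu_l + w * sigma, 0.0, 1.0))
```

Combined with random permutations of the 16 projector outcomes, this yields an effectively unlimited supply of mock three-photon datasets.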

E. Suppression of systematic errors
To end this section, we shall now discuss the implications of all presented results, especially the computation performance graphs shown in Fig. 8, as far as real-time experimental systematic errors are concerned [94,95]. As analytical results are unavailable, we resort to numerical analyses on the effects of ICC-computation-time reduction on such errors. For this purpose, we provide an important example of a kind of systematic drift phenomenon that is highly typical in optical fibers that carry spatial-mode photons, such as those of Hermite-Gaussian modes discussed in this article.
Focusing only on the transverse plane relative to the propagation direction of a laser beam, a given Hermite-Gaussian mode function u_m(x) of order m in the spatial x-coordinate is given by [96]

u_m(x) = (2/π)^{1/4} (2^m m! w_0)^{−1/2} H_m(√2 x/w_0) e^{−x²/w_0²} ,   (4.1)

with w_0 being the beam waist, and H_m(x) a degree-m Hermite polynomial in x. Upon using the ket notation, one can express the mode function in the familiar manner inasmuch as ⟨x|m; 0⟩_HG = u_m(x). An ideal fiber would carry spatial-mode photons of a mode function that has a stable center point, which is usually set at the Cartesian origin as in (4.1). In real experiments, however, a main source of time-dependent systematic errors is transversal displacements of u_m(x) away from the origin [97,98], which we may approximately model as random Wiener processes. Such displacements would distort the originally intended true state. Suppose that a basis ket |m⟩_HG ≡ |m; 0⟩_HG is displaced away from the origin by a; it is shown in Appendix D that the resulting displaced ket |m; a⟩_HG = e^{−iaP}|m; 0⟩_HG, where ⟨x|e^{−iaP} = ⟨x − a|, possesses the following transformation:

|m; a⟩_HG = Σ_l |l; 0⟩_HG ⟨l; 0|m; a⟩_HG ,  ⟨l; 0|m; a⟩_HG ∝ (a/w_0)^{n_> − n_<} U(−n_<, 1 + n_> − n_<, a²/w_0²) ,   (4.2)

where n_> = max{m, l}, n_< = min{m, l}, and U(·, ·, ·) is Kummer's confluent hypergeometric function. This result reduces to that in [97] for the m = 0 special case. The complete two-dimensional transverse profile of a Hermite-Gaussian mode in space is therefore described by the mode function u_{m,n}(x, y) ≡ u_m(x) u_n(y), where the corresponding displaced kets |m, n; a⟩_HG ≡ |m; a_1⟩_HG |n; a_2⟩_HG exhibit the transformation

|m, n; a⟩_HG = Σ_{l,l′} |l, l′; 0⟩_HG ⟨l; 0|m; a_1⟩_HG ⟨l′; 0|n; a_2⟩_HG ,   (4.4)

where the coefficients are pair-products of one-dimensional transformation functions as in (4.2). Components of the two-dimensional displacement a = (a_1, a_2) are assumed to be independent.
After a period of time t, the noisy true state

ρ̃ = Σ_{m,n,m′,n′} |m, n; a(t)⟩_HG ρ_{mn,m′n′} ⟨m′, n′; a(t)|_HG

is now defined in the displaced Hermite-Gaussian basis of some displacement a(t). The same type of disturbance applies to the POVM outcomes, since they are implemented digitally using SLMs with the same beams.
In the experiments that gathered the data used in plotting Fig. 9(a), the root-mean-square displacement of the Hermite-Gaussian beam center was measured to be about 5% of the beam waist w_0 after a period of 24 hours. This approximately coincides with a Wiener-process model specified by the random displacement variable a(t) = a(t−1) + b(t−1), which is a cumulative temporal sum of random variables b(t) that are each distributed according to a zero-mean Gaussian distribution of standard deviation σ ≈ w_0/95 when the time coordinate t is in units of an hour. Figure 11 shows the Hermite-Gaussian beam-profile center displacement and fidelity curves in time t for various dimensions d, where d = d_0² is the product of the individual dimensions d_0 of the truncated Hilbert space spanned by the finite set of Hermite-Gaussian basis kets of orders 0 ≤ m, n ≤ d_0 − 1. In actual experiments, there most likely exist other time-dependent sources of errors not modeled here that could worsen the fidelities.
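A minimal simulation of this drift model (with the stated values, σ ≈ w_0/95 per hour over 24 hours) reproduces the ~5% root-mean-square displacement; the trajectory count below is arbitrary:

```python
import numpy as np

w0 = 1.0                       # beam waist (arbitrary units)
sigma = w0 / 95.0              # hourly Gaussian step size
hours, trials = 24, 20000

rng = np.random.default_rng(1)
# a(t) = a(t-1) + b(t-1): cumulative sum of i.i.d. Gaussian steps.
steps = rng.normal(0.0, sigma, size=(trials, hours))
a = np.cumsum(steps, axis=1)

rms_final = np.sqrt(np.mean(a[:, -1] ** 2))
# Wiener scaling predicts rms = sigma * sqrt(t) = w0 * sqrt(24)/95, about 0.05 w0.
print(f"rms displacement after 24 h: {rms_final / w0:.3f} w0")
```

The sqrt(t) growth of the rms displacement is exactly the increasing variance visible in Fig. 11(a).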
To appreciate the significance of systematic-error suppression, we consider a quantum processor that utilizes Hermite-Gaussian beams as a source for producing high-dimensional initial states for quantum computation. Suppose that the processor is running continuously under server conditions, and maintenance is carried out before significant systematic drifts are anticipated. Whenever the processor refreshes after each set of computations is completed, the newly prepared initial state ρ undergoes compressive state tomography to ensure that it is within the expected error margins.
As an instructive example, we consider an eight-qubit processor controlled by a d = 256 Hermite-Gaussian source [99]. Each projector exposure time is about 1.5 seconds (sec), so that the measurement time of K von Neumann bases is t_meas = 1.5 Kd sec = 1920 sec for K = 5. Preliminary calibration indicates that such an exposure period yields N = 1000d = 2.56 × 10^5 state copies per basis. For realistic error modeling, the expressions of t_meas and N in d were calibrated from the actual setup used to collect our experimental data for Figs. 9 and 10. We focus on the preparation of pure states, each of which requires only K = 5 random bases for an IC reconstruction [53]. The average time, estimated over five random sets of K = 5 bases, for an ICC verification (two SDPs) is t_SDP = 4000 sec using the personal computer with hardware specification given in the caption of Fig. 8. After carrying out the basis measurements and ICC, the final state estimator ρ̂ is given by the ML estimator, which takes an average of t_ML ≈ 240 sec to generate using the accelerated projected-gradient algorithm in [75] and is insignificant in comparison to t_SDP. The time for each round of quantum computation depends on the actual application. For simplicity, we assume that each round of quantum computation is executed almost instantly since no classical post-processing is needed. From these specifications, we note that t_SDP ≈ 2 t_meas, and the prefactor grows with d > 256. Therefore, in the case of d = 256, replacing the two SDP algorithms in ICC with a trained ICCNet would shave about 66% off the total computation time. For larger dimensions, if so desired, the ML estimation procedure may be completely replaced with trained conditional generative networks [62,63] to eliminate t_ML. Figure 12 shows the fidelity of the estimator ρ̂ with the generated initial state ρ before every quantum-computation step (N_c of them in total).
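The quoted time budget can be checked directly. The one assumption below is that a trained ICCNet's inference time is negligible, so that it removes t_SDP entirely:

```python
K, d = 5, 256
t_meas = 1.5 * K * d        # 1920 sec of basis measurements
t_sdp = 4000.0              # two SDPs for ICC verification
t_ml = 240.0                # accelerated projected-gradient ML estimation

total_with_sdp = t_meas + t_sdp + t_ml
total_with_net = t_meas + t_ml      # ICCNet inference assumed near-instant
saving = 1.0 - total_with_net / total_with_sdp
print(f"total: {total_with_sdp:.0f} s -> {total_with_net:.0f} s "
      f"({100 * saving:.0f}% saved)")
```

The saving comes out at roughly two-thirds of the total certification time, consistent with the estimate in the text.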
The graphs highlight the adverse effects of systematic errors when initial-state verification takes too long. Hence, compressive tomography performed with trained neural networks provides a better solution to real-time device certification with higher fidelity stability, so that quantum computation can run much more smoothly with a greater N_c output before drift maintenance is applied.

V. CONCLUDING REMARKS
We took advantage of the universality of convolutional networks to train two neural networks that can very efficiently certify a low-measurement-cost quantum-state characterization scheme. These networks respectively benchmark the informational completeness of a given set of measurement outcomes and corresponding data for reconstructing an unknown quantum state, and the resulting fidelity without explicitly carrying out the state reconstruction. Our machine-learning-assisted scheme therefore allows experimentalists to rapidly assess the sufficiency of measurement resources for an unambiguous characterization of arbitrary quantum states and achieve accelerated real-time verification without having to perform any optimization routine during the experiment. This becomes essential for many practical quantum tasks that require fast execution times to avoid noise accumulation and drifts.
An arguably interesting problem would be to minimize the time required to acquire the trained neural networks. This includes both the training time and the generation of adequate training datasets. While the former can now be easily parallelized with graphics processing units, the latter involves two rounds of semidefinite programming per dataset as discussed in Sec. II A, the acceleration of which is still a subject of ongoing research [100].
On the other hand, while classical algorithms for these procedures have worst-case polynomial time complexities in the dimension of the Hilbert space, it is known [101] that quantum algorithms can execute semidefinite programs with polylogarithmic time complexities in the dimension. This immediately reveals the possibility of completely transforming the neural networks employed here, or part thereof, into their quantum counterparts (fused with the training-data processing procedures that use quantum semidefinite programming) that could assimilate into a much larger set of networks for a grander purpose. Practical feasibility in implementing such extended quantum neural networks still remains to be seen.
Note. Nearing the submission of our work, we discovered another very recent preprint [102] that discusses purely the estimation of the fidelity with fully-connected networks. Apart from the clear distinction in architectures, we remark that while the networks in this reference were specifically trained for Pauli measurements, our CNN-based FidNet is compatible with generalized measurement inputs and can be readily used in parallel with ICCNet or any other quantum task that relies on arbitrary measurements. The objectives of both works are hence very different. The next key distinction is network training, which in this reference is based on categorical training that splits the continuous fidelity range into small intervals. As mentioned in the preprint itself, network training can be slow when the intervals are too small. In our current work, FidNet directly computes the fidelity values without such output splitting, and hence training efficiency is not sacrificed for prediction accuracy.

The next kind of measurement is an adaptive set of von Neumann bases that are sequentially inferred directly from previous measurement data. They were meant for establishing an ACT scheme that is highly compressive [52,53]. The logic behind their construction is that low-rank true states are relatively closer to rank-deficient state estimators, and minimum-entropy estimators obtained from non-IC basis sets give a very compressive sequence of eigenbases to quickly reach informational completeness. We state the construction of such a set of K bases {B_1, B_2, . . . , B_K} below, with the first basis B_1 = {|l⟩⟨l|}_{l=0}^{d−1} being the standard computational basis:

Constructing an ACT basis
Beginning with k = 1 and a random computational basis B_1:

1. Measure B_k and collect the relative frequency data.
3. Perform ICC with the physical probabilities and compute s_CVX,k:
   • If s_CVX,k < ε, terminate this ACT scheme, take ρ̂_max ≈ ρ̂_min as the estimator, and report s_CVX,k.
4. Choose an estimator ρ̂_k that minimizes the von Neumann entropy S(ρ̂_k) = −tr{ρ̂_k log ρ̂_k} subject to the positivity and data constraints.
5. Define B_{k+1} to be the eigenbasis of ρ̂_k.
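The measure-estimate-rotate structure of this adaptive loop can be sketched numerically. The sketch below replaces the two SDP-based ingredients (the ICC certification and the minimum-entropy estimator) with a simple least-squares inversion projected onto physical states, so it only illustrates the flow of the scheme, not the actual ACT optimization; all function names are ours:

```python
import numpy as np

rng = np.random.default_rng(3)
d, N = 4, 2000

# Random pure true state (the low-rank regime targeted by ACT).
psi = rng.normal(size=d) + 1j * rng.normal(size=d)
rho_true = np.outer(psi, psi.conj()) / np.vdot(psi, psi).real

def measure(U, rho, N, rng):
    """Simulate N copies measured in the von Neumann basis given by U's columns."""
    p = np.real(np.einsum('ji,jk,ki->i', U.conj(), rho, U))
    p = np.clip(p, 0, None); p /= p.sum()
    return rng.multinomial(N, p) / N

def physical_estimate(projectors, freqs, d):
    """Least-squares inversion, then projection onto positive unit-trace states."""
    A = np.array([P.T.ravel() for P in projectors])
    x, *_ = np.linalg.lstsq(A, np.asarray(freqs, complex), rcond=None)
    rho = x.reshape(d, d)
    rho = (rho + rho.conj().T) / 2          # Hermitize
    w, V = np.linalg.eigh(rho)
    w = np.clip(w, 0, None); w /= w.sum()   # clip negatives, renormalize trace
    return (V * w) @ V.conj().T

projectors, freqs = [], []
U = np.eye(d, dtype=complex)                # B_1: computational basis
for k in range(3):                          # a few adaptive bases
    f = measure(U, rho_true, N, rng)
    projectors += [np.outer(U[:, l], U[:, l].conj()) for l in range(d)]
    freqs += list(f)
    rho_k = physical_estimate(projectors, freqs, d)
    _, V = np.linalg.eigh(rho_k)
    U = V                                   # B_{k+1}: eigenbasis of the estimator
```

Each iteration measures the current basis, re-estimates the state from all accumulated data, and rotates the next basis to the estimator's eigenbasis, mirroring steps 1-5 above.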

Appendix B: Hyperparameters of ICCNet and FidNet
All hyperparameters used in both ICCNet and FidNet are manually optimized so that the validation loss is minimized (see Appendix C). Figure 13 succinctly consolidates all important operational hyperparameter settings adopted for training ICCNet in various dimensions. Networks (a) and (b) have identical architectures with different dropout rates. For larger dimensions such as d = 64, we find that deeper CNNs can yield better training results. In these cases, the residual blocks implemented in network (c) help to avoid the so-called vanishing-gradient problem [67]. Each residual block consists of repeated convolutional blocks (BLK_s(t)) that maintain the array dimensions, sandwiched by a skip connection that adds the input of these repeated convolutional blocks to their output. Another notable difference between Fig. 13(c) and Figs. 13(a) and (b) is the inclusion of an average-pooling layer and one fully-connected layer, which are standard components of residual-based networks. Figure 14 shows the corresponding hyperparameters for FidNet.
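The role of a residual block, namely dimension-preserving convolutions whose input is added back to their output through a skip connection, can be illustrated with a toy numpy version. The 3×3 kernels and single-channel input are our simplifications, not ICCNet's actual hyperparameters:

```python
import numpy as np

def conv2d_same(x, k):
    """Zero-padded 2-D convolution that preserves the array dimensions."""
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def residual_block(x, k1, k2):
    """conv -> ReLU -> conv, then add the input back (skip connection)."""
    h = np.maximum(conv2d_same(x, k1), 0.0)
    return x + conv2d_same(h, k2)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))
y = residual_block(x, rng.normal(size=(3, 3)), rng.normal(size=(3, 3)))
```

Because each block maps an array to one of identical shape, such blocks can be stacked arbitrarily deep, and the additive skip connection keeps a direct gradient path through the stack, which is what mitigates vanishing gradients.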
Appendix C: Explicit network training procedures

Input data preparation
The initial input data matrix X X X for training the ICC-Net comprises the m datasets of K measurement bases, or L projectors, and their corresponding relative frequencies.
These data are reshaped into either an m × K(d² + d) × K(d² + d) or m × L(d² + 1) × L(d² + 1) three-dimensional matrix X to be processed by the convolutional networks.
The input data matrix for training FidNet requires the additional m target states to be assigned to the respective datasets. For d = 16, 32 and 64, FidNet is trained by supplying the true states as the ("right") target states. This is sufficient as only statistical fluctuations exist in the simulation data obtained from finite copies. To benchmark fidelities for real experimental data, it is important that FidNet also recognizes inputs with systematic errors. We numerically show that training FidNet with both the "right" and "wrong" target states, the latter referring to targets differing from the true states for the same datasets, can improve the benchmarking accuracy on average. Combining these datasets gives the resulting reshaped three-dimensional input matrix.

As a demonstration, we take the "wrong" target state ρ_wrong to be a randomly generated operator derived from the true state ρ = U Λ U†, diagonalized with the unitary matrix U and diagonal matrix Λ, according to a prescription in which λ_1 and λ_2 are uniformly distributed in [λ_min, λ_max], diag(·) and Diag(·) are respectively diagonal-element-extracting and diagonal-matrix-transforming operations, N{·} normalizes a column by its element-wise sum, ⊙ denotes the Hadamard product, and wHaar(λ) refers to the weighted Haar unitary function that outputs a random unitary according to the assigned weight λ. By definition, the special case wHaar(0) = 1 holds, and the more general function is given by the following:

Weighted Haar unitary (wHaar) of weight λ

1. Generate a random d × d matrix A with entries i.i.d. according to the standard Gaussian distribution.
2. Compute Q and R from the QR decomposition A = QR.
4. Define R_diag = Diag{R} and L = R_diag ⊘ |R_diag| (⊘ refers to the Hadamard division).
Define
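The numbered steps above follow the standard QR-based recipe for drawing a Haar-random unitary: the phase matrix L makes the QR factorization unique, so that the phase-corrected Q is Haar-distributed. A sketch of these steps without the λ-weighting (which is not fully recoverable here and is omitted as an assumption) reads:

```python
import numpy as np

def haar_unitary(d, rng):
    """Gaussian matrix -> QR decomposition -> phase-corrected Q."""
    # Complex Gaussian entries give a Haar-random *unitary*;
    # real entries would instead give a Haar-random orthogonal matrix.
    A = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    Q, R = np.linalg.qr(A)
    r = np.diagonal(R)
    L = r / np.abs(r)            # Diag{R} Hadamard-divided by its modulus
    return Q * L                 # right-multiply Q by the diagonal phases L

U = haar_unitary(8, np.random.default_rng(42))
```

Without the phase correction, numpy's QR convention would bias the sampled unitaries, which is precisely why the Hadamard-division step appears in the prescription.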
It is clear that ρ_wrong so defined has exactly the same rank as ρ, and at times can be close to ρ. A practical justification for this sort of target state is that, very typically in experiments, although the target states are not exactly ρ due to various noisy imperfections, the actual true states are nevertheless very often almost as rank-deficient as the intended target states, with a rapidly decaying eigenvalue spectrum. Figures 16 and 17 show that fidelity benchmarking is typically optimal when training is performed with both the "right" and "wrong" target states simultaneously. For these experimental data, we find that λ_1 = 0.8 and λ_2 = 1 give rather accurate fidelity benchmarking. For more general noisy situations, we may need to introduce more structured noise models in generating the simulated training datasets.

Training validation
The specifications of every input data matrix are tabulated in Tab. I. Generally speaking, the action of training these convolutional networks is equivalent to carrying out an optimization routine to minimize the "distance", quantified by a so-called loss function, between the predicted output y_pred and the original training output y. In all training procedures, the momentum-based gradient-descent algorithm NAdam [87] is employed for the minimization. The batch size is chosen to strike a compromise between training convergence and gradient-computation accuracy. For ICCNet, we consider the mean absolute error (MAE) as the loss function for training. For FidNet, the mean squared error (MSE) loss is used.
It is important to track the training activities so that overtraining or overfitting does not happen. A very common way to do this is to first split the complete data (X, y) into data for training (X_train, y_train), validation (X_val, y_val) and testing (X_test, y_test). During training, at each epoch (gradient-descent iterative step), as the loss function between y_pred and y_train is reduced, the resulting validation loss, that is, the loss between y_pred and y_val, is also reported. Training is successful when both training and validation losses decay simultaneously with the number of epochs. As an additional precaution, we confirm both the training and validation progress by performing one final prediction with y_test to verify that the test loss is also comparatively small. In our context, the split ratio between training, validation and test datasets is set to 0.8:0.1:0.1.
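The 0.8:0.1:0.1 split can be realized with a single random permutation of the dataset indices; the seed, array shapes and function name below are purely illustrative:

```python
import numpy as np

def train_val_test_split(X, y, rng, ratios=(0.8, 0.1, 0.1)):
    """Shuffle once, then cut the index array at the cumulative ratios."""
    m = len(X)
    idx = rng.permutation(m)
    n_train = int(ratios[0] * m)
    n_val = int(ratios[1] * m)
    tr, va, te = np.split(idx, [n_train, n_train + n_val])
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 6, 6)), rng.normal(size=1000)
train, val, test = train_val_test_split(X, y, rng)
```

Shuffling before cutting ensures that all three subsets are drawn from the same distribution of random-state datasets.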
For all the training datasets reflected in Tab. I, each row of X consists of information about the POVM and relative frequencies for the case of ICCNet (and an additional target state for FidNet) that originate from a randomly generated quantum state of rank r ∈ [1,3]. The distribution of continuous entries in the output y can also affect training efficiency. In the case of the s_CVX output for ICCNet, two groups of values can coexist: one group containing values that are substantially far away from zero and the other containing those that are almost zero (IC). We find that a reversible mapping that takes each output value y ≡ s_CVX to ỹ = −log_10(y)/10, with the conditional definition y < 10^−10 → y = 10^−10 to ensure that ỹ ≤ 1, can improve training efficiencies. All ICCNet architectures shown in Fig. 13 and ICCNet training graphs in Fig. 18 refer to these logarithmized outputs. One can understand this logarithmic training as a switch of training focus to order-of-magnitude estimation for s_CVX, which is an equally relevant outcome since all one really needs to know is whether a given measurement is IC or not.
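The logarithmic output transformation and its inverse are one-liners; the clamp at 10⁻¹⁰ is what caps the mapped value at 1 (function names are ours):

```python
import numpy as np

def to_log_output(s_cvx):
    """Map y = s_CVX to y' = -log10(y)/10, clamping y below at 1e-10."""
    y = np.maximum(np.asarray(s_cvx, dtype=float), 1e-10)
    return -np.log10(y) / 10.0

def from_log_output(y_log):
    """Invert the mapping back to an s_CVX value (for unclamped inputs)."""
    return 10.0 ** (-10.0 * np.asarray(y_log, dtype=float))
```

For example, s_CVX = 10⁻³ maps to 0.3, while any value at or below 10⁻¹⁰ (the IC regime) maps to 1, so the network effectively learns the order of magnitude of s_CVX.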

Training results analyses
After every network training, only the model weights corresponding to the lowest validation loss are saved for later predictions. As sample illustrations, we explicitly show the progress of training and validation losses for d = 16 in Fig. 18. For this dimension, the input datasets of all four data types listed in Tab. I are stacked for training ICCNet and FidNet at one go. In order to verify the test accuracies offered by the trained neural-network models, we also supply Figs. 19 and 20 for K = 4 assorted von Neumann bases. For completeness, we also furnish numerical performance indicators for ICCNet and FidNet in Tabs. II and III to supplement Figs. 7, 9 and 10 in the main text.
As far as the analyses of the training results are concerned, the aforementioned figures and tables are sufficient to verify the training qualities of ICCNet and FidNet. Going by a different route, one may additionally fall back on other more conventional tools to analyze the neural-network-predicted s_CVX values. In theory, the measurements are IC when s_CVX is zero. In practice, however, we assign a small threshold that distinguishes datasets that are IC from those that are not. In Fig. 21, we present the so-called confusion matrix that simplistically quantifies how well the datasets are correctly grouped into the "IC" and "non-IC" classes. The threshold is set at 10^−3. Clearly, the ideal prediction result is such that the off-diagonal elements of the confusion matrix are all zero. Such a perfect binary classification does not exist in realistic machine-learning applications. Instead, upon labeling the IC cases as the "positives", there would be predictions that are false positives (fp) (as opposed to the true positives, tp) or false negatives (fn) (as opposed to the true negatives, tn), giving rise to nonzero, but small, off-diagonal values. We define the concepts of precision prec = tp/(tp + fp) and recall rec = tp/(tp + fn); a predictive ICCNet should generally output values that have high precision and recall. More specifically, there exists the so-called F1 score, or more appropriately the Sørensen-Dice coefficient [104,105], F_1 = 2 tp/(2 tp + fp + fn), defined as the harmonic mean of prec and rec, which summarizes such binary prediction power.
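The three scores follow directly from the raw confusion-matrix counts, e.g.:

```python
def binary_metrics(tp, fp, fn):
    """Precision, recall and the F1 (Sørensen-Dice) score from raw counts."""
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)   # harmonic mean of prec and rec
    return prec, rec, f1

# Illustrative counts (not from our experiments): 80 true positives,
# 20 false positives, 40 false negatives.
prec, rec, f1 = binary_metrics(tp=80, fp=20, fn=40)
```

One can check that 2·prec·rec/(prec + rec) reproduces F_1, confirming the harmonic-mean interpretation.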
Rather than fixing a particular threshold, it is more objective to scan a range of threshold values and parametrically plot the so-called precision-recall (PR) curves as shown in Fig. 22. A natural figure of merit to gauge the binary-classification power is then the "area-under-curve" (AUC) measure for these curves, since a unit area entails the largest possible coverage of prec and rec. This figure of merit also possesses one crucial advantage. If we remember that all training datasets are obtained from random states of uniformly distributed ranks r ∈ [1,3], it is easy to see that the K = 4 datasets, for instance, have much fewer positive cases as compared to negative cases. Such a class imbalance biases the binary-classification analysis, and occurs ubiquitously in our context since informational completeness is rank-sensitive, and thus highly dependent on the state ranks used to generate the training datasets. The AUC for the PR curve is consequently lowered mainly because of this bias rather than a weak binary-classification capability. In such cases, a better option is to investigate the AUC of the so-called receiver operating characteristic (ROC) curve [106], which plots the true positive rate (tpr = tp/total positives) against the false positive rate (fpr = fp/total negatives). For such imbalanced cases, while the PR curve requires a larger K value to recuperate its area, the AUC of the ROC curve remains high for all tested K values (see Fig. 22). A loose, yet intuitive, understanding is that the ROC curve accounts for both tp and fp values evenly, whereas the PR curve focuses only on the tp values, which form the minority class in a class-imbalance situation.
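The ROC AUC can be computed without any plotting: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted half. A pairwise-comparison sketch (classifier scores below are made up):

```python
import numpy as np

def roc_auc(pos_scores, neg_scores):
    """AUC of the ROC curve as P(score_pos > score_neg) + 0.5 * P(tie)."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return np.mean((pos > neg) + 0.5 * (pos == neg))

# Perfectly separated classes give AUC = 1; identical scores give 0.5.
auc_good = roc_auc([0.9, 0.8, 0.7], [0.3, 0.2, 0.1])
auc_chance = roc_auc([0.5, 0.5], [0.5, 0.5])
```

Because this quantity compares every positive against every negative, it is insensitive to the sizes of the two classes, which is why the ROC AUC stays informative under the class imbalance described above.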
Despite the above observations, just like any other single-number criterion that is popularly adopted in statistics and machine learning owing to its computational simplicity, these figures of merit are ad hoc by nature. Therefore, care must be taken in interpreting these measures.

Appendix D: Mode transformation function of a displaced Hermite-Gaussian mode of arbitrary order

We proceed to calculate the overlap integral I_{m,n,a}. It turns out that the contour-integral representation of a degree-n Hermite polynomial in the complex plane is extremely useful for this endeavor, where the contour C is a closed loop encircling the origin. This representation yields a triple integral, which involves another simple Gaussian integral. We first consider the case where m ≤ n and evaluate the first contour integral of I_{m,n,a}. After some simplification of the exponential functions, an application of Cauchy's residue theorem immediately gives an intermediate single-contour integral. We are now ready to finalize the calculation for m ≤ n by evaluating this remaining contour integral, and a second application of the residue theorem yields the closed-form answer. Here, U(·, ·, ·) is the Kummer confluent hypergeometric function. The second line follows from Leibniz's rule of differentiation and a derivative (d/dx) identity valid for l ≤ n.
In order to complete the calculation, we need the expression for I_{m>n,a}. Without repeating the above procedures, we simply note that Eq. (D1) takes an equivalent form via a trivial variable substitution and reordering of terms. This time, we may carry out the two contour integrations in the opposite order, where the displacement is now −a. This equivalence allows us to immediately state the answer by simply interchanging m and n, and replacing a with −a. Note that the phase factor retains the exponent m. Finally, introducing the shorthand notations n_> = max{m, n} and n_< = min{m, n} allows us to write down the compact form.

[Caption fragment from an appendix figure: ... as in Fig. 8 in the main text. Statistical noise from N = 1000 copies per basis has been considered (refer to Tab. I). In this case, FidNet is trained to recognize fidelities with the "right" target states for simplicity, which is valid as no systematic errors are present here. FidNet training stops at K = 6 for d = 64 due to limited GPU resources.]