Neural network decoder for topological color codes with circuit level noise

A quantum computer needs the assistance of a classical algorithm to detect and identify errors that affect encoded quantum information. At this interface of classical and quantum computing, the technique of machine learning has appeared as a way to tailor such an algorithm to the specific error processes of an experiment, without the need for a priori knowledge of the error model. Here, we apply this technique to topological color codes. We demonstrate that a recurrent neural network with long short-term memory cells can be trained to reduce the error rate ϵ_L of the encoded logical qubit to values much below the error rate ϵ_phys of the physical qubits, fitting the expected power law scaling ϵ_L ∝ ϵ_phys^((d+1)/2), with d the code distance. The neural network incorporates the information from 'flag qubits' to avoid a reduction in the effective code distance caused by the circuit. As a test, we apply the neural network decoder to a density-matrix based simulation of a superconducting quantum computer, demonstrating that the logical qubit has a longer life-time than the constituent physical qubits with near-term experimental parameters.


I. INTRODUCTION
In fault-tolerant quantum information processing, a topological code stores the logical qubits nonlocally on a lattice of physical qubits, thereby protecting the data from local sources of noise [1,2]. To ensure that this protection is not spoiled by logical gate operations, these operations should act locally. A gate where the j-th qubit in a code block interacts only with the j-th qubit of another block is called "transversal" [3]. Transversal gates are desirable both because they do not spread an error in one qubit to other qubits in a block, and because they can be implemented efficiently by parallel operations.
Two families of two-dimensional (2D) topological codes have been extensively investigated. The first family, derived from Kitaev's toric code [4], is the surface code on a square lattice [5,6]. It has a favorably high threshold error rate for fault tolerance, but only CNOT, X, and Z gates can be performed transversally [7]. The second family, introduced by Bombin and Martin-Delgado [8,9], is defined on a honeycomb lattice, or more generally on a trivalent face-three-colorable graph, hence the name color code.
While the threshold error rate is smaller than for surface codes [10,11], the advantage of color codes is that they allow for the transversal implementation of the full Clifford group of quantum gates (with the Hadamard, π/4 phase gate, and CNOT gate as generators) [12,13]. This is not yet computationally universal, but it can be rendered universal using gate teleportation [14] and magic state distillation [15]. Moreover, color codes are particularly suitable for topological quantum computation with Majorana qubits, since high-fidelity Clifford gates are accessible by braiding [16,17].
A drawback of color codes is that quantum error correction is more complicated than for surface codes. The identification of errors in a surface code (the "decoding" problem) can be mapped onto a matching problem in a graph [18], for which there exists an efficient solution called the "blossom" algorithm [19]. This graph-theoretic approach does not carry over to color codes, motivating the search for a decoder with performance comparable to the blossom decoder [20-24].
An additional complication of color codes is that the parity checks are prone to "hook" errors, where single-qubit errors on the ancilla qubits propagate to higher-weight errors on data qubits, reducing the effective distance of the code. There exist methods due to Shor [25], Steane [26], and Knill [27] to mitigate this, but these error correction methods come with much overhead because of the need for additional circuitry. An alternative scheme with reduced overhead uses dedicated ancillas ("flag qubits") to signal the hook errors [28-32].
Here we show that a neural network can be trained to fault-tolerantly decode a color code with high efficiency, using only measurable data as input. No a priori knowledge of the error model is required. This machine learning approach has previously been shown to be successful for the family of surface codes [33-37], and applications to color codes are now being investigated [38,39,41]. We adapt the recurrent neural network of Ref. 35 to decode color codes with distances up to 7, fully incorporating the information from flag qubits. A test on a density matrix-based simulator of a superconducting quantum computer [42] shows that the performance of the decoder is close to optimal, and would surpass the quantum memory threshold under realistic experimental conditions.

A. Color code
The color code belongs to the class of stabilizer codes [43], which operate by the following general scheme. We denote by I, X, Y, Z the Pauli matrices on a single qubit and by Π_n = {I, X, Y, Z}^⊗n the Pauli group on n qubits. A set of k logical qubits is encoded as a 2^k-dimensional Hilbert space H_L across n noisy physical qubits (with 2^n-dimensional Hilbert space H_P). The logical Hilbert space is stabilized by the repeated measurement of n − k parity checks S_i ∈ Π_n that generate the stabilizer S(H_L), defined as

S(H_L) = {S ∈ B(H_P) : S|ψ⟩ = |ψ⟩ for all |ψ⟩ ∈ H_L},    (1)

where B(H_P) is the algebra of bounded operators on the physical Hilbert space.
As errors accumulate in the physical hardware, an initial state |ψ_L(t = 0)⟩ may rotate out of H_L. Measurement of the parity bits discretizes this rotation, either projecting |ψ_L(t)⟩ back into H_L, or into an error-detected subspace H_s(t). The syndrome s(t) ∈ Z_2^(n−k) is determined by the measurement of the parity checks:

S_i |ψ_L(t)⟩ = (−1)^(s_i(t)) |ψ_L(t)⟩.

It is the job of a classical decoder to interpret the multiple syndrome cycles and determine a correction that maps H_s(t) → H_L, such that the combined action of error accumulation and correction leaves the system unperturbed.
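As a concrete illustration of syndrome extraction in the stabilizer formalism, the sketch below computes the Z-type syndrome of the smallest (distance-3) color code, which coincides with the Steane code. The parity-check matrix and the qubit ordering are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Z-type parity checks of the distance-3 (Steane) color code, one row per tile.
# This is the standard [[7,1,3]] Hamming-code layout; the exact qubit ordering
# is an illustrative assumption.
H_Z = np.array([[1, 0, 1, 0, 1, 0, 1],
                [0, 1, 1, 0, 0, 1, 1],
                [0, 0, 0, 1, 1, 1, 1]], dtype=np.uint8)

def syndrome(err_x):
    """Syndrome bits: a Z check flips iff it overlaps an odd number of X errors."""
    return H_Z @ err_x % 2

err = np.zeros(7, dtype=np.uint8)
err[4] = 1                    # single X error on data qubit 4
print(syndrome(err))          # prints [1 0 1]: two adjacent tiles flag the error
```

Each nonzero syndrome bit flags a tile adjacent to the error, which is exactly the data a decoder receives.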
This job can be split into a computationally easy task of determining a unitary that maps H_s(t) → H_L (a so-called 'pure error' [40]), and a computationally difficult task of determining a logical operation within H_L to undo any unwanted logical errors. The former task (known as 'excitation removal' [41]) can be performed by a 'simple decoder' [34]. The latter task is reduced, within the stabilizer formalism, to determining at most two parity bits per qubit, which is equivalent to determining the logical parity of the qubit upon measurement at time t [35].
We implement the color code [8,9] on a hexagonal lattice inside a triangle, see Fig. 1. (This is the 6,6,6 color code of Ref. 11.) One logical qubit is encoded by mapping vertices v to data qubits q_v, and tiles T to the stabilizers X_T = ∏_{v∈T} X_v, Z_T = ∏_{v∈T} Z_v. The simultaneous +1 eigenstate of all the stabilizers (the "code space") is twofold degenerate [13], so it can be used to define a logical qubit. The number of data qubits that encodes one logical qubit is n_data = 7, 19, or 37 for a code with distance d = 3, 5, or 7, respectively. (For any odd integer d, a distance-d code can correct (d − 1)/2 errors.) Note that n_data is less than for a surface code with the same d.
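The quoted data-qubit counts follow a simple closed form in the distance; the formula below is inferred from the listed values (7, 19, 37), not stated in the text, and can be checked directly:

```python
def n_data(d):
    # Data-qubit count of the triangular 6,6,6 color code. This closed form
    # is inferred from the values quoted in the text (7, 19, 37 for d = 3, 5, 7).
    return (3 * d * d + 1) // 4

for d in (3, 5, 7):
    print(d, n_data(d))   # prints 3 7 / 5 19 / 7 37
```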
An X error on a data qubit switches the parity of the surrounding Z_T stabilizers, and similarly a Z error switches the parity of the surrounding X_T stabilizers. These parity switches are collected in the binary vector of syndrome increments δs(t), such that δs_i = 1 signals an error on the qubits surrounding ancilla i. The syndrome increments themselves are sufficient for a classical decoder to infer the errors on the physical data qubits. Parity checks are performed by entangling ancilla qubits at the center of each tile with the data qubits around the border, and then measuring the ancilla qubits (see App. A for the quantum circuit).
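A minimal sketch of one common convention for building syndrome increments from raw syndromes (assuming the ancillas start in 0, so the first increment equals the first syndrome):

```python
import numpy as np

def syndrome_increments(s):
    """Given syndromes s[t] (one row per cycle), return the increments
    delta_s[t] = s[t] XOR s[t-1]; delta_s[0] is s[0] itself."""
    s = np.asarray(s, dtype=np.uint8)
    prev = np.vstack([np.zeros_like(s[:1]), s[:-1]])
    return s ^ prev

# A persistent parity switch shows up in exactly one increment.
s = [[0, 0, 0], [1, 0, 1], [1, 0, 1], [1, 1, 1]]
print(syndrome_increments(s))
# prints [[0 0 0]
#         [1 0 1]
#         [0 0 0]
#         [0 1 0]]
```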
FIG. 1: Schematic layout of the distance-5 triangular color code. A hexagonal lattice inside an equilateral triangle encodes one logical qubit in 19 data qubits (one at each vertex). The code is stabilized by 6-fold X and Z parity checks on the corners of each hexagon in the interior of the triangle, and 4-fold parity checks on the boundary. For the parity checks, the data qubits are entangled with a pair of ancilla qubits inside each tile, resulting in a total of (3/2)d^2 qubits used to realize a distance-d code. Operations on the logical qubit can be performed along any side of the triangle, and two-qubit operations can be applied transversally to logical qubits on adjacent triangles.

B. Error model
We consider two types of circuit-level noise models, both of which incorporate flag qubits to signal hook errors. Firstly, a simple Pauli error model allows us to develop and test the codes up to distance d = 7. (For larger d the training of the neural network becomes computationally too expensive.) Secondly, the d = 3 code is applied to a realistic density-matrix error model derived for superconducting qubits.
In the Pauli error model, one error correction cycle of duration t_cycle = N_0 t_step consists of a sequence of N_0 = 20 steps of duration t_step, in which a particular qubit is left idle, measured, or acted upon with a single-qubit rotation gate or a two-qubit conditional-phase gate. Before the first cycle we prepare all the qubits in an initial state, and we reset the ancilla qubits after each measurement. We allow for an error to appear at each step of the circuit and during the preparation, including the reset of the ancilla qubits, with probability p_error. For the preparation errors, idle errors, or rotation errors we introduce the possibility of an X, Y, or Z error with probability p_error/3. Upon measurement, we record the wrong result with probability p_error. Finally, after the conditional-phase gate we apply with probability p_error/15 one of the following two-qubit errors: I ⊗ P, P ⊗ I, P ⊗ Q, with P, Q ∈ {X, Y, Z}. We assume that p_error ≪ 1 and that all errors are independent, so that we can identify p_error ≡ ϵ_phys with the physical error rate per step.
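The error channels above can be sketched as a sampling routine; gate and qubit bookkeeping are omitted, and the string representation of Pauli errors is an illustrative choice:

```python
import random

SINGLE = ["X", "Y", "Z"]
# The 15 non-identity two-qubit Paulis (drop "II" from the 16 pairs).
TWO = [a + b for a in "IXYZ" for b in "IXYZ"][1:]

def sample_single_qubit_error(p, rng=random):
    """With probability p, draw X, Y or Z uniformly (p/3 each); else identity."""
    return rng.choice(SINGLE) if rng.random() < p else "I"

def sample_two_qubit_error(p, rng=random):
    """After a conditional-phase gate: one of the 15 non-identity Pauli pairs,
    each with probability p/15."""
    return rng.choice(TWO) if rng.random() < p else "II"
```

The independence assumption of the text corresponds to calling these samplers once per circuit location.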
The density matrix simulation uses the quantumsim simulator of Ref. 42. We adopt the experimental parameters from that work, which match the state-of-the-art performance of superconducting transmon qubits. In the density-matrix error model the qubits are not reset between cycles of error correction. Because of this, parity checks are determined by the difference between subsequent cycles of ancilla measurement. This error model cannot be parametrized by a single error rate, and instead we compare to the decay rate of a resting, unencoded superconducting qubit.

C. Fault-tolerance
The objective of quantum error correction is to arrive at a much smaller error rate ϵ_L of the encoded logical qubit. If error propagation through the syndrome measurement circuit is limited, and a "good" decoder is used, the logical error rate should exhibit the power law scaling [6]

ϵ_L = C_d ϵ_phys^((d+1)/2),    (2)

with C_d a prefactor that depends on the distance d of the code but not on the physical error rate. The so-called "pseudothreshold" [44,45] is the physical error rate

ϵ_pseudo : ϵ_L(ϵ_pseudo) = ϵ_pseudo,    (3)

below which the logical qubit can store information for a longer time than a single physical qubit.
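Under this power-law scaling, the pseudothreshold has a closed form, obtained by solving ϵ_L = ϵ_phys; the prefactor C_d used below is hypothetical:

```python
def logical_error_rate(eps_phys, d, C_d):
    # Power-law scaling of the logical error rate with a distance-dependent
    # prefactor C_d (the value of C_d here is illustrative only).
    return C_d * eps_phys ** ((d + 1) / 2)

def pseudothreshold(d, C_d):
    # Solve C_d * eps**((d+1)/2) = eps for eps > 0:
    # eps**((d-1)/2) = 1/C_d  =>  eps = C_d ** (-2 / (d - 1)).
    return C_d ** (-2 / (d - 1))

print(pseudothreshold(3, 10.0))   # prints 0.1: encoding helps below this rate
```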

D. Flag qubits
During the measurement of a weight-w parity check with a single ancilla qubit, an error on the ancilla qubit may propagate to as many as w/2 errors on data qubits. This reduces the effective distance of the code in Eq. (2). The surface code can be made resilient to such hook errors, but the color code cannot: hook errors reduce the effective distance of the color code by a factor of two.
To avoid this degradation of the code distance, we follow Refs. 28-32 by adding a small number of additional ancilla qubits, so-called "flag qubits", to detect hook errors. For our chosen color code with weight-6 parity checks, we require one flag qubit for each ancilla qubit used to make a stabilizer measurement. (This is a much reduced overhead in comparison to alternative approaches [25-27].) Flag and ancilla qubits are entangled during measurement and read out simultaneously. The circuits are given in App. A.

III. NEURAL NETWORK DECODER
Consider a logical qubit, prepared in a state |ψ_L⟩, kept for a certain time T, and then measured with outcome m ∈ {−1, 1} in the logical Z-basis. Upon measurement, phase information is lost. Hence, the only information needed in addition to m is the parity of bit flips in the measurement basis. (A separate decoder is invoked for each measurement basis.) If the bit flip parity is odd, we correct the error by negating m → −m.

FIG. 2: Architecture of the recurrent neural network decoder. After a body of recurrent layers the network branches into two heads, each of which estimates the probability p or p′ that the parity of bit flips at time T is odd. The upper head does this solely based on syndrome increments δs and flag measurements s_flag from the ancilla qubits, while the lower head additionally gets the syndrome increment δf from the final measurement of the data qubits. During training both heads are active; during validation and testing only the lower head is used. Ovals denote the two long short-term memory (LSTM) layers and the fully connected evaluation layers, while boxes denote input and output data. Solid arrows indicate data flow in the system (with h_t^(1) and h_t^(2) the short-term memory states of the first and second LSTM layers), and dashed arrows indicate the internal memory flow of the LSTM layers.

The task of decoding amounts to the estimation of the probability p that the logical qubit has had an odd number of bit flips. The experimentally accessible data for this estimation consists of measurements of ancilla and flag qubits, contained in the vectors δs(t) and s_flag(t) of syndrome increments and flag measurements, and, at the end of the experiment, the readout of the data qubits. From this data qubit readout a final syndrome increment vector δf(T) can be calculated. Depending on the measurement basis, it will only contain the X or the Z stabilizers. Additionally, from the data qubit readout we obtain the true bit flip parity p_true by comparing the measured logical state to the initial logical state that was prepared at the beginning of the experiment.
An efficient decoder must be able to decode an arbitrary and unspecified number of error correction cycles. It is not possible, in practice, to create a training data set which contains all possible sequence lengths, let alone train a neural network decoder efficiently on such a data set. Therefore, the decoder must be a cycle-based algorithm that is translationally invariant in time. To achieve this, we follow Ref. 35 and use a recurrent neural network of long short-term memory (LSTM) layers [46], with one significant modification, which we now describe.
The time-translation invariance of the error propagation holds for the ancilla qubits, but it is broken by the final measurement of the data qubits, since any error in these qubits will not propagate forward in time. To extract the time-translation invariant part of the training data, in Ref. 35 two separate networks were trained in parallel, one with and one without the final measurement input. Here, we instead use a single network with two heads, as illustrated in Fig. 2. The upper head sees only the translationally invariant data, while the lower head solves the full decoding problem.
The switch from two parallel networks to a single network with two heads offers several advantages: (1) The number of LSTM layers and the computational cost is cut in half; (2) The network can be trained on a single large error rate, then used for smaller error rates without retraining; (3) The bit flip probability from the upper head provides a so-called Pauli frame decoder [2].
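A toy sketch of the two-headed architecture, with a plain tanh recurrent cell and random untrained weights standing in for the trained LSTM layers. All sizes are illustrative assumptions; the actual decoder is built with TensorFlow. The point is the shared recurrent body: the upper head reads only the recurrent state, while the lower head also sees the final syndrome increment.

```python
import numpy as np

rng = np.random.default_rng(0)
N_SYND, N_HID = 9, 16                 # illustrative sizes, not the paper's

# Shared recurrent body (a plain tanh RNN cell stands in for the LSTM layers).
W_in = rng.normal(size=(N_HID, N_SYND)) * 0.1
W_rec = rng.normal(size=(N_HID, N_HID)) * 0.1
w_up = rng.normal(size=N_HID) * 0.1            # upper head: syndromes only
w_low = rng.normal(size=N_HID + N_SYND) * 0.1  # lower head: + final increment

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode(increments, final_increment):
    """Return (p_upper, p_lower): two estimates of the odd-bit-flip probability."""
    h = np.zeros(N_HID)
    for ds in increments:             # one recurrent step per error-correction cycle
        h = np.tanh(W_in @ ds + W_rec @ h)
    p_up = sigmoid(w_up @ h)
    p_low = sigmoid(w_low @ np.concatenate([h, final_increment]))
    return p_up, p_low

incs = rng.integers(0, 2, size=(20, N_SYND))   # 20 cycles of fake increments
p_up, p_low = decode(incs, rng.integers(0, 2, size=N_SYND))
print(p_up, p_low)                    # two probabilities in (0, 1)
```

Because the same cell is applied at every cycle, the decoder handles sequences of any length, which is the time-translation invariance argued for above.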
In the training stage the bit flip probabilities p and p′ ∈ [0, 1] from the upper and lower head are compared with the true bit flip parity p_true ∈ {0, 1}. By adjusting the weights of the network connections, a cost function is minimized in order to bring p, p′ close to p_true. We carry out this machine learning procedure using the TensorFlow library [47]; see App. B for details of the implementation.
After the training of the neural network has been completed we test the decoder on a fresh data set. Only the lower head is active during the testing stage. If the output probability p′ < 0.5, the parity of bit flip errors is predicted to be even, and otherwise odd. We then compare this to p_true and average over the test data set to obtain the logical fidelity F(t). Using a two-parameter fit [42] to

F(t) = 1/2 + 1/2 (1 − 2ϵ_L)^(t − t_0),    (4)

we determine the logical error rate ϵ_L per step of the decoder.
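Assuming the standard two-parameter fidelity decay F(t) = 1/2 + 1/2 (1 − 2ϵ_L)^(t − t_0) used for such fits in Ref. 42, the rate can be extracted by linear regression on log(2F − 1); a sketch with synthetic data:

```python
import numpy as np

def fit_logical_error_rate(t, F):
    """Two-parameter fit of F(t) = 1/2 + 1/2 (1 - 2*eps_L)**(t - t0),
    linearized as log(2F - 1) = (t - t0) * log(1 - 2*eps_L)."""
    y = np.log(2 * np.asarray(F, float) - 1)
    slope, intercept = np.polyfit(t, y, 1)
    eps_L = (1 - np.exp(slope)) / 2
    t0 = -intercept / slope
    return eps_L, t0

# Synthetic check: recover the rate used to generate the curve.
t = np.arange(1, 200)
eps_true, t0_true = 2e-3, 0.5
F = 0.5 + 0.5 * (1 - 2 * eps_true) ** (t - t0_true)
print(fit_logical_error_rate(t, F))   # ≈ (0.002, 0.5)
```

On noisy experimental data a weighted nonlinear fit would be preferable; the linearization here keeps the sketch dependency-free.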

IV. NEURAL NETWORK PERFORMANCE

A. Power law scaling of the logical error rate
Results for the distance-3 color code are shown in Fig. 3 (with similar plots for the distance-5 and distance-7 codes in App. C). These results demonstrate that the neural network decoder is able to decode a large number of consecutive error correction cycles. The dashed lines are fits to Eq. (4), which allow us to extract the logical error rate ϵ_L per step, for different physical error rates ϵ_phys per step.
Figure 4 shows that the neural network decoder follows the power law scaling of Eq. (2), with d fixed to the code distance. This shows that the decoder, once trained using a single error rate, operates equally efficiently when the error rate is varied, and that our flag error correction scheme is indeed fault-tolerant. The corresponding pseudothresholds, Eq. (3), are listed in Table I.

B. Implementation in a physical model
To assess the performance of the decoder in a realistic setting, we have applied it to a density matrix-based simulator of an array of superconducting transmon qubits [42]. In Fig. 5 we compare the decay of the fidelity of the logical qubit as it results from the neural network decoder with the fidelity extracted from the simulation [42]. The latter fidelity determines via Eq. (4) the logical error rate ϵ_optimal of an optimal decoder. For the distance-3 code we find ϵ_L = 0.0148 and ϵ_optimal = 0.0132 per microsecond, resulting in a decoder efficiency ϵ_optimal/ϵ_L [42] of 0.89. The dashed gray line is the average fidelity (following Eq. (4)) of a single physical qubit at rest, corresponding to an error rate of 0.0164 per microsecond [42]. This demonstrates that, even with realistic experimental parameters, a logical qubit encoded with the color code has a longer life-time than a physical qubit.

FIG. 4: The dashed line through the data points has the slope given by Eq. (2). In gray: error rate of a single physical (unencoded) qubit. The error rates at which this line intersects with the lines for the encoded qubits are the pseudothresholds.

TABLE I: Pseudothresholds calculated from the data of Fig. 4, giving the physical error rate below which the logical qubit can store information for a longer time than a single physical qubit.

distance d | pseudothreshold ϵ_pseudo
3 | 0.0034
5 | 0.0028
7 | 0.0023

FIG. 5: Same as Fig. 3, but for a density matrix-based simulation of an array of superconducting transmon qubits. Each point is an average over 10^4 samples. The density matrix-based simulation gives the performance of an optimal decoder, with a logical error rate ϵ_optimal = 0.0132 per microsecond. From this, and the error rate ϵ_L = 0.0148 per microsecond obtained by the neural network, we calculate the neural network decoder efficiency to be 0.91. The average fidelity of an unencoded transmon qubit at rest with the same physical parameters is plotted in gray.

V. CONCLUSION
We have presented a machine learning-based approach to quantum error correction for the topological color code. We believe that this approach to fault-tolerant quantum computation can be used efficiently in experiments on near-term quantum devices with relatively high physical error rates (so that the neural network can be trained with relatively small data sets). In support of this, we have presented a density matrix simulation [42] of superconducting transmon qubits (Fig. 5), where we obtain a decoder efficiency of η_d = 0.89. Independently of our investigation, three recent works have shown how a neural network can be applied to color code decoding. Refs. 38 and 41 only consider single rounds of error correction, and cannot be extended to a multi-round experiment or circuit-level noise. Ref. 39 uses the Steane and Knill error correction schemes when considering color codes, which are also fault-tolerant against circuit-level noise, but have larger physical qubit requirements than flag error correction. None of these works includes a test on a simulation of physical hardware.
For the density matrix simulation, neither ancilla qubits nor flag qubits are reset between cycles, leading to a more involved extraction process of both δs(t) and s_flag(t), as we now explain.
Let m(t) and m_flag(t) be the actual ancilla and flag qubit measurements taken in cycle t, and m⁰(t), m⁰_flag(t) be compensation vectors of ancilla and flag measurements that would have been observed had no errors occurred in this cycle. Then,

δs(t) = m(t) ⊕ m⁰(t),    s_flag(t) = m_flag(t) ⊕ m⁰_flag(t).    (A1)

Calculation of the compensation vectors m⁰(t) and m⁰_flag(t) requires knowledge of the stabilizer s(t − 1), and the initialization of the ancilla qubits m(t − 1) and the flag qubits m_flag(t − 1), being the combination of the effects of the individual non-zero terms in each of these.
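The extraction of increments from non-reset measurements reduces to a bitwise XOR between the measured bits and the compensation vector; a minimal sketch with made-up bit strings:

```python
import numpy as np

def syndrome_increment(m, m0):
    # The increment is the mismatch between the measured ancilla bits and the
    # compensation vector predicted for an error-free cycle.
    return np.bitwise_xor(m, m0)

m  = np.array([1, 0, 1, 1], dtype=np.uint8)  # measured ancilla bits (illustrative)
m0 = np.array([1, 0, 0, 1], dtype=np.uint8)  # expected bits had no error occurred
print(syndrome_increment(m, m0))             # prints [0 0 1 0]: one check mismatches
```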
Note that a flag qubit being initialized in |1⟩ will cause errors to propagate onto nearby data qubits, but these errors can be predicted and removed prior to decoding with the neural network. In particular, let us concatenate m(t), m_flag(t), and s(t) to form a vector d(t). The update may then be written as a matrix multiplication (all arithmetic mod 2):

(m⁰(t + 1), m⁰_flag(t + 1)) = M_f d(t),    (A2)

where M_f is a sparse, binary matrix. The syndromes s(t) may be updated in a similar fashion,

s(t + 1) = M_s d(t),    (A3)

where M_s is likewise sparse. Both M_f and M_s may be constructed by modeling the stabilizer measurement circuit in the absence of errors. The sparsity of both matrices reflects the connectivity between data and ancilla qubits; for a topological code, both M_f and M_s are local. The calculation of the syndrome increments δs(t) via Eq. (A1) does not require prior calculation of s(t).
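The mod-2 update through a sparse binary matrix can be sketched as below; the matrix here is random, standing in for the one derived from the measurement circuit, and serves only to illustrate that the update is linear over GF(2):

```python
import numpy as np

# Toy mod-2 linear update d -> M_f d (mod 2). M_f is a random sparse binary
# matrix standing in for the one built from the error-free measurement circuit.
rng = np.random.default_rng(1)
n = 6
M_f = (rng.random((n, n)) < 0.3).astype(np.uint8)   # ~30% nonzero entries

d = rng.integers(0, 2, size=n, dtype=np.uint8)      # (m, m_flag, s) concatenated
d_next = M_f @ d % 2
print(d_next)    # the compensation bits predicted for the next cycle
```

Linearity over GF(2) is what lets the compensation be composed from the individual non-zero terms, as stated in the text.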
Appendix B: Details of the neural network decoder

1. Architecture

The decoder consists of a double headed network, see Fig. 2, which we implement using the TensorFlow library [47]. It maps a list of syndrome increments δs(t), with t/t_cycle = 1, 2, ..., T, to a pair of probabilities p, p′ ∈ [0, 1]. (In what follows we measure time in units of the cycle duration t_cycle = N_0 t_step, with N_0 = 20.) The lower head gets as additional input a single final syndrome increment δf(T). The cost function I that we seek to minimize by varying the weights w and biases b of the network is the cross-entropy between these output probabilities and the true final parity p_true ∈ {0, 1} of bit flip errors:

I = −p_true ln p − (1 − p_true) ln(1 − p) − p_true ln p′ − (1 − p_true) ln(1 − p′) + c‖w_EVAL‖².    (B1)

The term c‖w_EVAL‖² with c ≪ 1 is a regularizer, where w_EVAL ⊂ w are the weights of the evaluation layers.
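A sketch of the two-head cross-entropy cost with the weight regularizer; the reduction over a training batch is omitted, and c = 10^−5 follows the value quoted later in this appendix:

```python
import numpy as np

def cost(p_up, p_low, p_true, w_eval, c=1e-5):
    """Cross-entropy of both heads against the true parity, plus the
    weight regularizer c*||w_EVAL||^2."""
    def xent(p):
        return -(p_true * np.log(p) + (1 - p_true) * np.log(1 - p))
    return xent(p_up) + xent(p_low) + c * np.sum(w_eval ** 2)

# Confident, correct heads give a small cost; uncertain heads a larger one.
print(cost(0.9, 0.95, 1, np.zeros(4)))
print(cost(0.5, 0.5, 1, np.zeros(4)))
```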
The body of the double headed network is a recurrent neural network, consisting of two LSTM layers [46,48]. Each of the LSTM layers has two internal states, representing the long-term memory c_t^(i) ∈ R^N and the short-term memory h_t^(i) ∈ R^N, where N = 32, 64, 128 for distances d = 3, 5, 7. The first LSTM layer gets the syndrome increments δs(t) as input, and outputs its internal states h_t^(1). These states are in turn the input to the second LSTM layer.
The heads of the network consist of a single layer of rectified linear units, whose outputs are mapped onto a single probability using a sigmoid activation function. The input of the two heads is the last short-term memory state of the second LSTM layer, subject to a rectified linear activation function ReL(h_T). For the lower head we concatenate ReL(h_T) with the final syndrome increment δf(T).

2. Training and evaluation

The training dataset is generated at an error rate of p = 10^−3 for distances 3 and 5, and consists of 5·10^6 sequences for distance 7. At the end of each sequence, it contains the final syndrome increment δf(T) and the final parity of bit flip errors p_true. After each training epoch, consisting of 3000 to 5000 mini-batches of size 64, we validate the network (using only the lower head) on a validation dataset consisting of 10^3 sequences of 30 different lengths between 1 and 10^4 cycles. The error rates in the validation datasets are 1·10^−4, 2.5·10^−4, and 4·10^−4 for distances 3, 5, and 7 respectively, chosen such that they are the largest error rates for which the expected logical fidelity is larger than 0.6 after 10^4 cycles (see Fig. 7). If the logical error rate reaches a new minimum on the validation dataset, we store this instance of the network. To keep the computational effort tractable, for the density matrix-based simulation (Fig. 5) we only train on 10^6 sequences of lengths between T = 1 and T = 20 cycles, and validate on 10^4 sequences of lengths between T = 1 and T = 30 cycles. For the density matrix-based simulation, all datasets have the same error rate.
We train using the Adam optimizer [49] with a learning rate of 10^−3. To avoid over-fitting and reach a better generalization of the network to unseen data, we employ two additional regularization methods: dropout and weight regularization. Dropout with a keep probability of 0.8 is applied to the output of each LSTM layer and to the output of the hidden units of the evaluation layers. Weight regularization, with a prefactor of c = 10^−5, is only applied to the weights of the evaluation layers, but not to the biases.
After training is complete we evaluate the decoder on a test dataset consisting of 10^3 (10^4 for the density matrix-based simulation) sequences of lengths such that the logical fidelity decays to approximately 0.6, but no more than T = 10^4 cycles. Unlike for the training and validation datasets, for the test dataset we sample a final syndrome increment and the corresponding final parity of bit flip errors after each cycle. We then select, for evaluation, an evenly distributed subset of t_n = nΔT < T_max cycles, where ΔT is the smallest integer for which the total number of points is less than 50. This is done in order to reduce the needed computational resources. The logical error rate per step is determined by a fit of the fidelity to Eq. (4).
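The evenly spaced evaluation subset can be computed as below, following the stated rule that ΔT is the smallest integer giving fewer than 50 points:

```python
def evaluation_cycles(T_max, max_points=50):
    """Cycles t_n = n*dT < T_max, with dT the smallest integer spacing
    for which the total number of points is below max_points."""
    dT = 1
    while (T_max - 1) // dT >= max_points:
        dT += 1
    return [n * dT for n in range(1, (T_max - 1) // dT + 1)]

cycles = evaluation_cycles(10**4)
print(len(cycles), cycles[0], cycles[-1])   # prints 49 200 9800
```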

Pauli frame updater
We operate the neural network as a bit-flip decoder, but we could have alternatively operated it as a Pauli frame updater.We briefly discuss the connection between the two modes of operation.
Generally, a decoder executes a classical algorithm that determines the operator P(t) ∈ Π_n (the so-called Pauli frame) which transforms |ψ_L(t)⟩ back into the logical qubit space H_0 = H_L. Equivalently (with minimal overhead), a decoder may keep track of logical parity bits p that determine whether the Pauli frame of a 'simple decoder' [34] commutes with a set of chosen logical operators for each logical qubit.
The second approach of bit-flip decoding has two advantages over Pauli frame updates: firstly, it removes the gauge degree of freedom of the Pauli frame (SP(t) is an equivalent Pauli frame for any stabilizer S). Secondly, the logical parity can be measured in an experiment, where no 'true' Pauli frame exists (due to the gauge degree of freedom).
Note that in the scheme where flag qubits are used without reset, the errors from qubits initialized in |1⟩ may be removed by the simple decoder without any additional input required by the neural network.

FIG. 6: Top left: schematic of a 6-6-6 color code with distance 3. Top right: circuit for stabilizer measurements at a boundary. Bottom left: partial schematic of a 6-6-6 color code with distance larger than 3. Bottom right: circuit for stabilizer measurements.

FIG. 7: Same as Fig. 4. The blue ellipse indicates the error rates used during training, and the green ellipse indicates the error rates used for validation.

Figures 8 and 9 show the decay curves for the d = 5 and d = 7 color codes, similar to the d = 3 curves of Fig. 3 in the main text.