Deep Bayesian Experimental Design for Quantum Many-Body Systems

Bayesian experimental design is a technique that allows to efficiently select measurements to characterize a physical system by maximizing the expected information gain. Recent developments in deep neural networks and normalizing flows allow for a more efficient approximation of the posterior and thus the extension of this technique to complex high-dimensional situations. In this paper, we show how this approach holds promise for adaptive measurement strategies to characterize present-day quantum technology platforms. In particular, we focus on arrays of coupled cavities and qubit arrays. Both represent model systems of high relevance for modern applications, like quantum simulations and computing, and both have been realized in platforms where measurement and control can be exploited to characterize and counteract unavoidable disorder. Thus, they represent ideal targets for applications of Bayesian experimental design.


Introduction
We are currently witnessing rapid scaling in the number of components for quantum technology platforms.Fulfilling the promise of fault-tolerant quantum computation will eventually require millions of qubits, and while current implementations still fall far short of that goal, the specific challenges of scaling are already apparent for the present-day devices including on the order of hundred qubits [1].Other areas like integrated photonic circuits [2] are following a similar trajectory, for applications such as neuromorphic computing [3] or sensing and more fundamental studies in areas like topological transport [4].There, large networks of beamsplitters, waveguides and resonators, again with component counts on the order of dozens or hundreds, are being fabricated and deployed.Similar large-scale networks have now been fabricated and investigated for coupled mechanical resonators, producing phononic circuits with local access to vibrational modes.
Characterizing any of these devices is a very important but nontrivial task, especially if it is to be done in an efficient manner, within a limited time budget [5].
Active learning [6], in the form of optimal experimental design can help, provided that one can employ techniques able to deal with the large number of parameters that are to be determined.
These parameters comprise resonance frequencies of optical or microwave modes, couplings between components (e.g. between resonators and qubits or waveguides, or beamsplitter transparencies), nonlinearities, propagation phase shifts, matrix elements for the effect of external drives, and decay rates.Many different measurement approaches can be drawn upon, each of them coming with its respective parameters that can be adjusted prior to each new measurement.In linear devices, specific components of the scattering matrix can be measured by injecting waves in some port and performing a homodyne measurement on another port.Here, the frequency would be a continuous parameter, while, depending on the setup, the choice of ports could be another, discrete parameter.In nonlinear systems, such as circuits comprised of qubits, manipulation via pulses in Ramsey-type schemes is the natural choice, with a final projective measurement of one or several qubits.The drive amplitude and pulse times would then represent the measurement parameters to be optimized.Systems that couple qubits and resonators or waveguides can also be characterized via nonlinear transmission, with the amplitude and drive frequency being treated as adjustable.
In all these cases, there is unavoidable fundamental noise in the measurement outcomes, namely shot noise in the case of wave transmission measurements and quantum projection noise in the case of qubit measurements [7].This noise can naturally be reduced by extending the measurement duration, increasing the wave amplitude, or repeating multiple times the qubit pulse sequence together with its final projective qubit measurement (multi-shot measurement).As a consequence, the information gain per such extended measurement increases.However, there are limits to the usefulness of this naive optimization: wave amplitudes can be increased only so far before entering nonlinear regimes or heating up the device.In addition, one cannot keep measuring at only one single choice of the measurement parameter, since that will eventually only pinpoint a particular function of the many underlying setup parameters and not allow to resolve all the parameters individually.That is where active learning, i.e. choosing next measurement settings efficiently, can offer true benefits.
In this paper, we investigate recent and promising techniques from machine learning [8,9] to efficiently Figure 1.Sketch of the model of a system's response and summary of the parameter estimation procedure.At the top: the observed system produces a noisy observation given the measurement settings and the system's parameter λ.At the bottom: starting from the prior we have about the system, we can perform multiple steps.In each step, given the prior and the system's likelihood, we approximate the posterior distribution by minimizing the Barber-Agakov bound (3).At the same time, the minimization also gives us the optimal measurement setting to perform the measurement at.After the measurement has been performed, the approximated posterior, conditioned on the measured value and outcome, becomes the new prior, and a new iteration can be performed.
and accurately make the best parameter prediction given past measurement, and propose the next best measurement to perform.We show applications of these techniques to the above-mentioned quantum devices.We focus especially on quantum systems, such as chains of coupled cavities and arrays of qubits.The presented framework is nonetheless completely general and appliable in a similar way to other settings.

Related Work
The idea of finding the best experiment to perform is known as Active Learning in the machine learning literature [10,6,11] and as Bayesian optimal experimental design [12] in more specific parameter estimation applications [13,14].It is very common in science to have a class of possible models, dependent on a set of parameters, and to perform experiments to find those that better describe the true system.The outcome of each experiment does not give direct knowledge on the parameters themselves because of noise and measurement errors, but it provides partial information.Therefore, it is generally not enough to perform as many measurements as the number of unknown parameters of the system, and more experiments are required.When enough information is collected through measurements, we can identify the true parameters with a certain confidence.When experiments are expensive, either in terms of cost, time and effort, it can be important to be efficient in the number of experiments required to characterize the given physical system.In those cases, it is desirable to exploit the information obtained from each experiment as much as possible, and to choose the sequence of experiments in such a way that the smallest number of experiments is required.
Information Theory provides satisfactory answers to this problem, at least from the theoretical point of view.In particular, the Bayes theorem shows the proper way to include a new measurement into our knowledge and update our predictions of the parameters of the system, and expected information gain [15] is the quantity to look at to find the best possible experiments to perform.
There have already been many remarkable applications in science [16,17] and in physics in particular, ranging from devising quantum metrological procedures [18] to characterizing quantum dots [19,20], superconducting quantum processors [21], or sensors [22], multi-phase estimation [23,24], and their use is becoming more and more common.Parameter estimation can then be further generalized to the complete design of the experiment.For example, [25,26,27] completely automate the search for the experiment that solves a given task, in the case of quantum optics.
Nevertheless, optimal experimental design techniques have not been used in their full power up to recently, even though these techniques have been known for decades.The main reason is that the exact evaluation of those statistical quantities can be very expensive.Therefore, very rough approximations were usually made, including the use of empirical heuristics, the use of Gaussian processes [28] and Gaussian posterior approximation [29,30] and the optimization through maximum likelihood estimation [31].The approximation usually depends on where the bottleneck is in practice, according to the specific problem: highdimensional parameter space, large number of different possible experiments to perform, large number of measured quantities of each experiment, experiment execution time and cost, and total allowed number of experiments.
Modern neural networks [32] that employ normalizing flows [33,34] for the approximation of the posterior distribution, developed in the past few years, recently allowed for quite more efficient estimations [8,9] that do not require such rough (and often unjustified) assumptions.The price to pay for the increased precision of the approximation is clearly a larger computational effort.

Methods
Bayesian experimental design can be implemented as follows.The settings of the experiment, which could be, for example, a measurement at a particular frequency, are described by a vector x.
The physical system is identified by hidden parameters λ, which we would like to determine.Because of measurement noise, the outcome of the experiment, y, is not deterministic, but distributed according to a distribution P (y|λ, x), called likelihood of an observation.We can imagine, for example, y = f (x, λ) + ϵ, where f is a deterministic function and ϵ is a random variable, e.g.Gaussian distributed.We can update our knowledge of the parameters of the system with the Bayes rule [35]: given our inferred distribution P n (λ|M n ) after n measurements M n = {(x 1 , y 1 ), . . ., (x n , y n )} (to which we will refer to as prior at step n + 1), and the subsequent measurement (x n+1 , y n+1 ), the updated distribution (posterior at step n + 1) is The initial prior distribution P 0 (λ) is chosen arbitrarily, and it reflects our initial assumptions on the parameters of the system.
To choose the next measurement to perform, it is possible to define a query function, which assigns to each possible experiment x its expected value: the x for which the query function is maximized is the one that is expected to be the most useful to measure [6].A possible choice of the query function is the expected information gain when measuring at x: This can be interpreted as the mutual information between y and λ given a measurement at x, or in other words, as the entropy reduction after measuring (x, y).However, Eq. ( 2) is hard to estimate.First, it requires to be able to sample from the prior distribution P 0 (λ), and second, it requires the value itself of the posterior probability distribution P 1 (λ|y 0 , x 0 ), since it appears inside the logarithm.In principle, we could obtain the posterior distribution with the Bayes rule, Eq.
(1), but it is not efficient to calculate its normalization factor P 0 (y|x) = dλP 0 (λ) log P 0 (λ).Overall, to get Eq.( 2) we would require two nested Monte Carlo estimates, one for the evidence and one for the mutual information estimation.Then, we should optimize over x to get the best measurement to perform.
Modern neural network techniques allow implementing variational bounds and perform a much more efficient estimate [36].For example, we can avoid evaluating the posterior [37] explicitly by introducing a new function Q(λ|y, x) and calculating the quantity If Q is a probability distribution, we know that KL(P 1 (λ|y, x)||Q(λ|y, x)) ≥ 0 thus This is the so-called Barber-Agakov bound of mutual information [38].We can parametrize Q(λ|y, x) with a neural network, so that we do not need to find a different Q function for every different (x, y), but we can interpolate, amortizing the costs of the evaluation.
In particular, we use a conditional normalizing flow [39]: we start from a normal Gaussian distribution G(z) and perform a series of invertible transformations (parametrized by neural networks) having a Jacobian that is simple to estimate.Optimizing over Q(λ|y, x) means to optimize over the parameters of the neural networks that implement the transformations.Normalizing flows can be easily implemented in common deep learning frameworks like TensorFlow [40].In particular, we use TensorFlow Probability [41].The procedure is summarized in figure 1. Please refer to Appendix A for more details.
There are multiple ways to estimate the performance of the proposed result (and compare with alternative techniques).The sum of the information gained at step n + 1 after measuring (x n+1 , y n+1 ), can be a measure for how much we understood about the system.However, in contrast to (3), which represents the expected information gain averaged over all the possible measurement outcomes, this quantity can also be negative.Indeed, it is possible that an unexpected outcome of a measurement increases the global uncertainty on the parameters.Properties of the final posterior distribution such as its variance, or the comparison of its mean (or most likely value) with the true parameters are also a possible metric to consider.On the other hand, the best prediction for an observation is actually the average over all the allowed parameters It is possible to compare this prediction with the likelihood calculated at the true parameters λ * , for example using the Kullback-Leibler divergence.This gives an idea of how much the predictions of our model can differ from the observations.It is important to emphasize that if the physical model is correct, we would expect only a single parameter set to be relevant, corresponding to the true parameters of the system.However, if the model does not represent well the true system, the use of multiple values may be helpful to produce a better approximation.

Applications and discussion
We have already mentioned in the introduction the wide variety of quantum technology platforms, system parameters, and potential measurement approaches that could benefit in principle from active learning approaches.In the following, we have chosen two illustrative examples that are sufficiently distinct in their characteristics.As a first example, we consider a linear device, where wave transmission is measured to extract the resonance frequencies of a coupledcavity array.
Such cavity arrays [42,43,4,44] have been considered in studies of transport as coupled-resonator optical waveguides in optical setups or quantum many-body physics of photons, when combined with nonlinear elements like qubits.In our second example, we turn to an array of qubits, which is a widespread platform used for quantum simulations and quantum computing [43,45,46,47,44,48].There, we consider a typical pulsed scheme followed by a projective measurement to extract qubit parameters.Appendix B provides additional technical details about the implementations of the examples.

Coupled cavities
We consider a cavity array with the following Hamiltonian: where the sum runs over all but the last of the cavity modes in this chain with open boundary conditions.
Here and in what follows, we consider h = 1.
Since the setup we are dealing with is linear, it is sufficient to solve the classical equations of motion for the coupled modes, driven by a wave entering from a waveguide coupled to the first cavity.To this end, we consider the classical coherentstate amplitudes a j corresponding to the quantum operators âj , including drive and decay as prescribed by input-output theory [?].For brevity of our notation, we collect all these amplitudes in a vector and all frequencies in a matrix, where first-neighbour interactions are J and proper frequencies Ω i .Let our system also have some internal decay κ int and external decay κ ext , which only applies to the cavities coupled to the environment, for example the first and last.Given the entering fluctuating field a in , which can also contain a laser drive, the output field a out is given by [49] In our example, we imagine driving the first cavity and looking at the response at the last cavity.
Our goal is to extract the scattering matrix element for transmission from the first to the last cavity, given by the S 0N in (8).Being it a complex number, its knowledge corresponds to measuring both the amplitude and phase of the emitted light.By performing the smallest possible number of measurements at a frequency ω close to the resonance frequency, we want to discover the frequencies of all the cavities.For example, they may all be close to a common resonance frequency, but slightly detuned.Each measurement is affected by noise ϵ, which can be assumed to be Gaussian [7].This noise, being observed in a measurement of the field amplitude, ultimately arises from quantum vacuum noise being injected into the input ports of the device (and through the dissipation channels) and propagating linearly through the setup.
In figure 2, we show our results.We consider 6 coupled cavities, with each random coupling uniformly chosen in J j /κ ∈ [1,3].We assume a setting where the couplings J and the dissipation κ have already been calibrated and are therefore known, and our goal is to find the detunings.In the general notation of section 3, the measurement setting x corresponds in this case to the choice of the drive frequency ω, the response function f (x, λ) corresponds to the scattering matrix element S 0N , and the unknown parameters of the system λ are the frequencies of the cavities Ω i .
As more measurements are performed, we improve our estimate of the cavity frequencies.We see that the information provided by new measurements decreases as we make more observations.At the same time, the posterior distribution converges to a sharp peak around the true values.For example, figure 2e shows the evolution of the last cavity frequency distribution.It is important to emphasize that even if the posterior eventually converges to a Gaussian, it is not Gaussian at the beginning, as shown in figure 2d.This is why it is necessary to approximate the posterior using a neural network approximation.
By looking at the measurement frequencies that the active approach selects, we can identify a pattern, or measurement strategy: it measures the frequencies where the slope of the response is larger, alternating between different points.These points give indeed the largest information on the intrinsic frequencies of the cavities.
We compare the results of the inference with an active choice of the measurement with two other strategies: fixed and random.The fixed strategy is the simplest: all the measurements are performed at the same x value, and it is clearly often not possible to learn all the parameters from that.There will usually be a region of the response function, close to the measurement region, that is very well learned, while other regions will not be very accurate.A better naive measurement selection approach is the random one.In this case, measurements are chosen uniformly within the specified measurement range (here, x = ω/κ ∈ [ −12, 12]).We can see that in figure 3  active strategy, i.e. the one that chooses the next measurement greedily by maximizing the expected information gain, learns the parameters of the system faster than the random one.Asymptotically, both the random and active strategy will converge to the right values.However, active selection allows to save a lot of measurements.

Array of qubits
We consider an array of N qubits, whose many-body Hamiltonian reads like At the beginning, the system starts in its ground state |ψ 0 ⟩.We assume that qubits can be individually addressed, and can be subjected to microwave pulses and projective measurements.
In particular, we excite the first qubit of the array by applying a microwave pulse with pulse area π.This corresponds to implementing the unitary We let the system evolve freely for some time t, and then we perform a projective measurement on the first qubit.The evolution allows for the propagation of the external excitation.The outcome of the measurement is binary, according to probability where P(0) ↑ = | ↑⟩ 0 ⟨↑ | is the projection operator on the "up"-state of the first qubit (qubit number 0).In each experiment, we can choose the duration of the free evolution after the initial pulse, which we will denote as the "measurement pulse duration" x ∈ [0, 5].By performing the smallest possible number of experiments, we want to discover the frequencies ω i of all the qubits.
We considered a 4-qubit system and performed 100 measurements in series, choosing the pulse duration for each of them.In this case, each of those measurements is in itself a multishot measurement, being repeated 100 times to give better statistics (otherwise the outcome would only be either 0 or 1).Again, as shown in figure 4, we compare the active strategy with a random and a fixed one, and show the efficiency advantages of the active one in terms of gained information.In this case, we also show the performance of a uniform strategy, which takes sequential equally-spaced measurements, i.e. sweeping the pulse duration from 0 to the maximum allowed value.It is interesting to observe that this strategy eventually learns the parameters of the system, but it requires many more measurements than the random one.
Also, in figure 4d we see that the active strategy usually requires, on average, sensibly fewer measurements than the random strategy to provide the same total information gained.For example, to reach a total information gained of 10, we require about 10 measurements with the active approach or about 20 measurements with the random one.
The fixed strategy always measures at x = 0, i.e. does not allow any relaxation after the microwave pulse on the first qubit.As a consequence, this measurement effectively only provides information about the interacting ground state of the system, which already contains some limited information about the qubit frequencies.We emphasize, indeed, that since the interacting quantum many-body ground state is not an eigenstate of σ z acting on the first qubit, the probability of measuring ↑ at t = 0 is not 1.
Furthermore, it is possible to spot a sort of strategy in the pattern of the measurements performed by the active selection: a few measurement regions are alternated cyclically.Indeed, we can imagine that some regions give the largest information about some parameters.After we measure in one region, we increase the accuracy on some parameters, and another region becomes more relevant, until it is useful to measure again at the initial spot.This cycle is repeated until we reach the desired accuracy on the parameters.

Outlook
In this paper, we have introduced deep optimal Bayesian experimental design for modern quantum technologies.
This approach approximates the posterior update with a variational bound and a deep neural network and allows extracting the optimal measurement to perform at each step.We have shown the application of this technique to two promising quantum platforms, cavity arrays and qubit chains.In both cases, the active measurement selection technique allows learning the parameters of the system with fewer measurements than other strategies.
Depending on the application, different assumptions should be made, leading to different tradeoffs between the approximation precision and the time efficiency.The approach we propose is especially advantageous in the setting where each measurement is very expensive and it is worth spending a long calculation time to really find the optimal next measurement setting.
The main challenge at present is the time it takes to update the neural network representing the Bayes posterior distribution as well as optimizing over possible measurement settings.This time is still too large in order for this technique to be deployed economically "as is" in the given scenarios, where individual measurements can happen on microsecond time scales, and even extended sequences used to reduce shot noise will probably not last longer than a millisecond for one measurement setting, except when extreme accuracy is called for.In the present (not yet very optimized) technique, an optimization run still takes on the order of hundreds of seconds per measurement step, which is currently a drawback of Bayesian Optimal Experimental Design based on deep neural networks in general.However, a remedy would consist in performing the optimization on a number of different example scenarios, in simulations, and then train a neural network in a supervised fashion to learn the suitable choice of the next measurement setting based on all the previous settings and outcomes.In this way, we could really develop a sort of policy that can be deployed to characterize unknown systems of the same class.
A specific cost function may also be added to take into account that some measurement values may require more resources and thus be more expensive than others [6].
A further generalization can be to drop the likelihood assumption and try to develop a likelihoodfree active inference technique.This would be similar to [24].However, we do not want to discretize the input space x, since it would not scale up to large dimensions.Finally, we notice that we are employing a greedy strategy which always chooses the single next best measurement to perform.Alternative approaches may include providing an overall measurement budget (e.g. total number of measurements) or set a target total information gain to achieve.It would be interesting to study how these sequential strategies [50], perhaps approximated with a reinforcement learning agent [51,22], may suggest better measurement strategies than the greedy one we applied in this paper.
From the broader perspective, we can imagine in the future to generalize the measurement setting x to become an entire experimental setup, and to perform the search on a broader class of experiments.This would be a very useful aspect of a future "artificial scientist" that tries to explore the world and learn a model by performing experiments autonomously.
We consider a cavity array with the following Hamiltonian: where the sum runs over all but the last of the cavity modes in this chain with open boundary conditions.We consider the classical coherent-state amplitudes a j corresponding to the quantum operators âj , including drive and decay as prescribed by inputoutput theory [53].Given first-neighbour interactions J i and proper frequencies Ω i , we can define the matrix We also include some internal decay κ int and external decay κ ext , which only applies to the cavities coupled to the environment, for example the first and last.The overall decay can be represented by a vector Let a in be the entering fluctuating field, which can also contain a laser drive, and a out the output field.The behavior of the system is described by the input-output relations [49] B1.
Parameters for the coupled cavity example discussed in the main text.On the left, the true frequencies we want to discover; on the right, the known parameters of the system.All the frequencies in the bulk of the spectrum are doubly degenerate, and the spacing becomes smaller near the boundaries of the spectrum.This picture changes, and the degeneracies are broken, when we introduce variation in the onsite frequencies Ω j , which will be the generic situation we want to explore.
The system we discuss in the main text has couplings J uniformly sampled in [1,3] for each cavity, and proper frequencies sampled from a Gaussian distribution with zero mean and unit variance.The values for the specific example in figure 2 and figure 3 are shown in table B1.We optimize the posterior neural network for 8000 steps after each measurement, using a batch size of 1500 samples in (3).

Appendix B.3. Qubit chain
Table B2 shows the parameters employed for the qubit chain example.We optimize the posterior neural network for at least 2500 steps after each measurement, using a batch size of 700 samples in (3).When also optimizing over x (i.e.optimal measurement setting), we keep optimizing the loss until the change in 500 steps is smaller than 0.05.This choice allows decreasing the training time for this more computationally-demanding system, since we train for a longer time only when required.Table B2.Parameters for the qubit chain example discussed in the main text.On the left, the true frequencies we want to discover; on the right, the known parameters of the system.In orange, curves given by sampling parameters from the posterior; in blue, the true response function.The initial best guess is given by sampling from the prior distribution, before any measurement is performed.
It is also interesting to look at how the prediction of the response function of a qubit system as in figure 4 improves after the measurements suggested by an active strategy.As shown in figure B1, starting from a very uncertain prediction, the distribution of possible response functions converges to the true curve.

Figure 2 .
Figure2.Deep Bayesian optimal experimental design applied to estimating the frequencies of an array of 6 coupled cavities.(a) Sketch of the system.(b) Measurements at each step.The red symbols show the chosen frequency x to measure at each step.In the background, the value of the expected information gain for each possible x (brighter is higher), normalized at each step, i.e.IG(x)/ maxx IG(x).(c) Expected information gain values for each possible measurement x.A peak represents an optimal value to measure.Different lines represent different measurement steps.(d) Inferred parameter distribution after 15 measurements.The diagonal shows the marginal distribution P (λ i ), the off-diagonal plots the two-variable slice P (λ i , λ j ).(e) Evolution of the marginalized posterior distribution of the last cavity frequency, until converging to a sharp peak around the true value.

FixedFigure 3 .
Figure 3.Comparison of different measurement strategies for the coupled cavities system.(a) Final response functions after inferring the parameters with the active, random, or fixed strategy.The orange curves represent the response function induced by various samples of parameters λ from the final posterior distribution.In blue, the true response.Red symbols mark the performed measurements and their outcome.(b) Evolution of the KL-divergence between the inferred response function distribution P (y|x) and the system likelihood during the various measurements.Brighter means higher.(c) Cumulative sum of information gained after each measurement, i.e.

N
n=1 log Pn(λ|y, x) − log P n−1 (λ).Higher values represent learning more about the system.Results are averaged over 3 runs.

Figure 4 .
Figure 4. Deep Bayesian experimental design applied to the estimation of the frequencies of an array of 4 qubits.(a) Sketch of the system.(b) Chosen pulse duration x at each step.The red symbols show the chosen x to measure at each step.In the background, the value of the expected information gain for each possible x (brighter is higher), normalized at each step maximum.(c) Response functions after 10 measurements, comparing the active, random, fixed and uniformly spaced approaches.The orange curves represent the response function induced by various samples of parameters λ from the final posterior distribution.In blue, the true response.Red symbols mark the performed measurements and their outcome.(d) Sum of information gained after each measurement, i.e.N n=1 log Pn(λ|y, x)−log P n−1 (λ).Higher values imply learning more about the system.Results are averaged over 3 runs.

Figure B1 .
Figure B1.Prediction of the response function after a different number of steps.In orange, curves given by sampling parameters from the posterior; in blue, the true response function.The initial best guess is given by sampling from the prior distribution, before any measurement is performed.