This site uses cookies. By continuing to use this site you agree to our use of cookies. To find out more, see our Privacy and Cookies policy. Close this notification
Paper The following article is Open access

Structured filtering

and

Published 16 August 2017 © 2017 IOP Publishing Ltd and Deutsche Physikalische Gesellschaft
, , Citation Christopher Granade and Nathan Wiebe 2017 New J. Phys. 19 083014 DOI 10.1088/1367-2630/aa77cf

Download Article PDF
DownloadArticle ePub

You need an eReader or compatible software to experience the benefits of the ePub3 file format.

1367-2630/19/8/083014

Abstract

A major challenge facing existing sequential Monte Carlo methods for parameter estimation in physics stems from the inability of existing approaches to robustly deal with experiments that have different mechanisms that yield the results with equivalent probability. We address this problem here by proposing a form of particle filtering that clusters the particles that comprise the sequential Monte Carlo approximation to the posterior before applying a resampler. Through a new graphical approach to thinking about such models, we are able to devise an artificial-intelligence based strategy that automatically learns the shape and number of the clusters in the support of the posterior. We demonstrate the power of our approach by applying it to randomized gap estimation and a form of low circuit-depth phase estimation where existing methods from the physics literature either exhibit much worse performance or even fail completely.

Export citation and abstract BibTeX RIS

1. Introduction

Across a range of physical sciences, much of the work of the experimentalist centers around learning from observed data. In metrology, for instance, experimental data is used to infer parameters of interest such as magnetic fields, temperature, or other physical quantities [1]. The process by which these parameters are learned from data is thus of critical importance to tasks as diverse as metrology and quantum information processing [26].

The importance of learning physical parameters has motivated developing and making practical advances in statistically-principled approaches to parameter estimation. Bayesian methods in general and Bayesian inference in particular have proven to provide a compelling framework for drawing inferences from experimental observations in a rigorous, robust and practical manner [7]. Numerical algorithms such as particle filtering then provide a general and practical framework for approximate Bayesian inference, as well as for statistical computation more generally [8].

In practice, existing approaches suffer from a great deficiency: they implicitly assume that the probability distribution that describes the current state of knowledge about the parameters has a particular structure to it. These assumptions are often reasonable, such as the assumption that the probability distribution is Gaussian or, weaker still, unimodal. While these reasonable sounding assumptions often work well, they can fail in catastrophic ways when evidence is provided that favors a multitude of equivalent (or near-equivalent) hypotheses. We refer to such learning problems as degenerate because of the strong analogy that degenerate likelihoods of models have with degenerate Hamiltonians in quantum mechanics.

The challenges posed by degeneracy can sometimes be circumvented through the use of cunningly designed experiments, but this places additional experimental demands, and is not always possible. As a result, there are broad classes of parameter estimation problems for which we have no robust automated methods for parameter estimation.

In this paper, we address these challenges by allowing the inference algorithm to learn the structure of its posterior distribution over the parameters of interest. We show that doing so also allows for relaxing these experimental constraints. Specifically, we demonstrate that this can allow such algorithms to learn when traditional particle filtering methods such as [9] fail. Our algorithm augments these traditional approaches with a dynamically generated tree describing the structure of a Bayesian posterior as a hierarchal clustering of particle filters. We rely on statistically principled approaches to model selection to remove structural elements that are inconsistent with the observed data, ensuring that the trees generated by our algorithm usefully represent its state of knowledge.

The advantages offered by our structured filtering algorithm are especially relevant in the case of degeneracy, which presents a significant challenge to existing work. Conventional methods are either inefficient or subtly bias the inference towards unimodal distributions. If the model for a learning problem contains two sets of parameters that are nearly equally likely for the data observed, these efficient inference algorithms usually fail in catastrophic ways. Methods for dealing with this, such as annealing [10], qualitatively fail to give us a solution because they cause the solution to focus on just one of the families of degenerate models. Similarly, techniques such as expectation propagation [11] work well in cases where the model has a fixed number of known degeneracies but can fail when the number of degeneracies is not known. Commonly used solutions such as these are therefore, at best, imperfect solutions for estimating posterior distributions over quantum models. Furthermore, these limitations prevent the application of them to cases where dynamical systems are probed using uninformative experiments, which often yield outcomes that are consistent with several hypothetical dynamical models; that in turn places several experimental constraints that are relaxed by our algorithm.

We begin in section 2 by reviewing Bayesian inference as a framework for learning parameters from experimental observations, then introduce our structured filtering algorithm for Bayesian inference in section 3. In sections 4 and 5, we present numerical evidence for the efficiency and correctness of our approach before concluding in section 6.

2. Bayesian inference

Bayesian inference has become a fundamental tool for modeling quantum systems. The basic object in Bayesian inference is the prior distribution. The prior distribution is a probability distribution over a set of hypotheses that could describe the system. In particular, let us assume that we have a model for a data set that is parameterized by ${\boldsymbol{x}}$ then the prior distribution is $P({\boldsymbol{x}})$. The prior distribution represents any beliefs that the experimenter may hold about the true model before processing any subsequent data. While the prior distribution is subjective, the Bernstein–von Mises theorem [12] states that under most circumstances a poorly chosen prior will simply slow down the inference. This subjectivity has made the use of prior distributions a source of contention between frequentists and Bayesian statisticians. If one dislikes the Bayesian interpretation of the prior it is quite often possible to eschew the discussion of the prior (almost) completely by making the initial prior uniform and avoiding adaptive strategies. Moreover, by performing Bayesian inference on an artificial likelihood function, one can derive a useful approximation of maximum-likelihood estimation [7], such that the utility of Bayesian inference does not hinge on adopting a particular philosophical view. That said, we will take a Bayesian interpretation of probability in the following as a matter of convenience.

The likelihood function is the second component of Bayesian inference. Its purpose is to compute the probability of observing a vector of experimental outcomes ${\bf{D}}$ given that a hypothesis ${\boldsymbol{x}}$ is true. It is denoted $P({\bf{D}}| {\boldsymbol{x}};t)$. For example, in quantum mechanics consider ${\boldsymbol{x}}=[\omega ]$ to be a Rabi frequency for the Hamiltonian $H({\boldsymbol{x}})=\omega {\sigma }_{x}/2$. Then given a state ${{\rm{e}}}^{-{\rm{i}}{H}({\boldsymbol{x}})t}| 0\rangle $ and that ${\bf{D}}=[0]$ is observed, the likelihood function for this experiment is given by

Equation (1)

The likelihood function is then used to update the user's prior beliefs, conditioned on the observed data, using Bayes' theorem:

Equation (2)

The probability density $P({\boldsymbol{x}}| {\bf{D}};t)$ is known as the posterior distribution, and can be interpreted as the probability distribution that an experimenter should hold after being confronted with data ${\bf{D}}$ given their preconceptions, which are represented through the prior $P({\boldsymbol{x}})$.

Except in a few special cases, such as when conjugate priors are used, exact Bayesian inference is intractable. This is often addressed by using sequential Monte Carlo (SMC) [8] approximations, also known as particle filters. SMC works by approximating the probability density using a convex combination of Dirac-delta functions. In particular, the probability density at each step is approximated by

Equation (3)

This notably reduces the integral in Bayes' theorem to a discrete sum over ${N}_{\mathrm{part}}$ discrete hypotheses, each of which is called a particle. The wj are positive real numbers called weights, and satisfy ${\sum }_{j}{w}_{j}=1$. Thus the SMC approximation can be interpreted as a discrete probability distribution.

A drawback of SMC is that as the algorithm proceeds, the vast majority of the weights tend to zero. This is because, with high probability, none of the initial particles correspond to the true hypothesis. Thus, in the case of Bayesian inference, the particle filter approximation above will require a number of particles that is exponential in the length of ${\boldsymbol{x}}$ to estimate the true hypothesis within fixed error.

The exponential scaling of naïve particle filter approximations can be addressed by moving the particles and changing their weights during the inference process. This allows us in many cases to achieve an arbitrarily accurate approximation of the true model with a small (in some cases constant) number of particles. Conventionally, this is done by resampling particles to draw a new set of particles representing the same posterior distribution. Resampling algorithms exploit a duality in the SMC approximation: any probability density can be described either by choosing the weights of particles appropriately or by moving them such that their concentrations represent the probability density. Resampling takes the latter tack. It resamples the particles from a new distribution that maintains certain structural features of the posterior and assigns all the new particles to have the same weight. This in effect re-concentrates the particles in regions where higher probability regions while removing them from places where the probability is low.

One such resampler that has gained great popularity for physics applications is the Liu–West resampler [9]. The Liu–West resampler draws ${N}_{\mathrm{part}}$ particles from the probability distribution $P({\boldsymbol{x}})$. The mean μ and covariance matrix Σ are then computed for the SMC approximation. The resampler then, for parameter $a\in [0,1]$, shifts the particle slightly towards the mean of the SMC distribution, letting ${\mu }_{j}\leftarrow a{{\boldsymbol{x}}}_{j}+(1-a)\mu $. Finally it draws a new particle from the distribution ${ \mathcal N }({\mu }_{j},[1-{a}^{2}]{\rm{\Sigma }})$ and assigns a weight of $1/{N}_{\mathrm{part}}$ to the particle.

In cases where a = 1, Liu–West resampling is equivalent to the bootstrap filter [13], in which each new particle is drawn from the original SMC distribution with replacement. If a = 0, the resampler simply draws particles from a Gaussian that matches the posterior distributions mean and covariance, as is useful for rejection filtering [14]. Typically $a=0.98$ works well in practice [9]; although the optimal value can depend on the likelihood function. Furthermore, for any $a\in [0,1]$, the distribution of accepted samples has the same mean and covariance as the initial SMC approximation. Thus the Liu–West resampler is specifically designed to preserve the first two moments, keeping much of the structure-preserving features of the bootstrap filter, while allowing a model to be estimated with exponentially fewer particles than the bootstrap filter requires.

The Liu–West resampling algorithm works in a wide range of applications [5, 1618], providing a practical means of implementing Bayesian inference in experimental practice. There is a major drawback to this approach though, in that the benefits of the Liu–West algorithm do not extend to multi-modal distributions. Though techniques such as Hamiltonian Monte Carlo [19] can readily sample from multi-modal posterior distributions and are well-supported by high-quality software tools [20], we focus here on particle filtering with Liu–West resampling as it is exceptionally well suited to online applications such as in experimental physics. From this perspective, then, understanding the mechanism by which Liu–West fails in problems of interest will allow us to formulate an improved approach that extends online Bayesian inference to multi-modal posteriors.

In particular, by choosing to perturb the resampled particles towards the mean of the SMC distribution, the Liu–West resampler implicitly biases the distribution towards a unimodal posterior. In cases where the posterior distribution is multi-modal this strategy can fail horribly because very little probability density is actually located near the mean. We see this in figure 1, where we consider learning the Rabi frequency given by (1), using Liu–West resampling for $\omega \in [-1,1]$. In this case, the likelihood function assigns equal values for both positive and negative frequencies. We refer to likelihood functions that have such symmetries as degenerate. This means that regardless of the true frequency, exact Bayesian inference on a uniform prior will always yield a bimodal distribution with zero mean. Since the Liu–West resampler always moves particles towards the mean, we expect Liu–West resampling to fail [21] and indeed notice such a failure in the numerical experiments given in figure 1.

Figure 1.

Figure 1. Failure of Liu–West resampling to capture multi-modal posterior for the Rabi likelihood (1). The LW-approximated and exact posteriors are shown for 40 single-shot measurements at times ${t}_{k}={(9/8)}^{k}$ following the suggestion of Ferrie et al [15], with a uniform prior on $\omega \in [-1,1]$.

Standard image High-resolution image

In some ways this is a trivial problem, since the experiment cannot possibly distinguish between positive and negative frequency. Thus, if the user is aware of this degeneracy, they can choose $\omega \in [0,1]$ and learn the sign in subsequent experiments. We see that this approach is successful in figure 2. However, not all degeneracies are so obvious or so easily countered. For example, the degeneracies that appear in randomized gap estimation (RGE) [22] are significantly harder to incorporate than in the Rabi example. Rather than placing the onus of data processing on the user, it is highly desirable to have a method for automatically addressing such problems as they appear.

Figure 2.

Figure 2. Success of Liu–West resampling for capturing the unimodal posterior for the Rabi likelihood (1). The LW-approximated and exact posteriors are shown for 40 single-shot measurements at times ${t}_{k}={(9/8)}^{k}$, with a uniform prior on $\omega \in [0,1]$.

Standard image High-resolution image

Similar challenges emerge when learning with nearly degenerate likelihood functions. Current practices for mitigating this include using significantly more particles and less frequent resampling steps, avoiding transient failures of the Liu–West algorithm [23]. In such cases, Liu–West resampling will ultimately be successful once enough data has been accumulated to break the degeneracy. However, until the degeneracy can be resolved the algorithm needs to keep track of all the potential hypotheses that could explain the distribution, mandating a much larger number of particles. Liu–West resampling will often fail long before it is able to break the degeneracy. There are a range of different approaches for addressing this problem, which largely fall into one of two categories. Approaches such as annealing, as distinct from simulated annealing, with multiple restarts [10], work somewhat generally, but do not a good estimate of the posterior distribution and require substantial fine tuning. On the other hand, approaches such as expectation–maximization [7] are more robust for finding maximimum a posteriori estimates in highly multimodal cases, but require specific models for multimodality such as a Gaussian mixture model. Instead, it would be very useful to have a resampler that infers and preserves the actual structure of the posterior distribution without requiring domain knowledge. That is, we desire an approach that provides its user with introspection into the structure of a parameter estimation problem. This is especially useful when no domain knowledge is available which might provide prior insight into the structure of a posterior, such that a user might then learn properties useful for future experiment or algorithm design.

In the next section, we detail our algorithm for finding appropriate mixture models and applying resampling on the resultant structure in a way that preserves even highly multi-modal posteriors. In doing so, we describe how our algorithm improves upon existing methods by automatically providing a description of posterior structure through introspection into the inference process.

3. Structured filtering

How does one discover the structures that need to be preserved in the posterior distribution to allow resampling to succeed for degenerate distributions? One simple approach is to cluster the posterior using an algorithm like k-means clustering [7]; however, unless we have domain knowledge it may not be clear a priori how many clusters to use. Here we take a somewhat bold step and use an AI-based approach that allows a computer to entertain multiple potential clusterings and reason about which clustering is best. By allowing the algorithm to decide upon the structures that are most relevant and are most parsimonious with future data, the problem of dealing with spurious near-degeneracies can be avoided.

We discuss the components of this AI-based solution below. First, we discuss clustering. Second, we discuss our method for selecting a model for the posterior. Third we discuss how clustering and selection can be combined together in a single graphical framework. Finally we discuss the details of how our structured filtering algorithm combines these features to learn a model for the posterior distribution.

3.1. Weighted and unweighted clustering

Clustering seeks to solve a problem that is often second nature to humans: clustering data points into groups of related examples. While humans excel at this in low dimensions, getting computers to effectively cluster data with the same robustness that humans exhibit can be comparably challenging. Perhaps the most popular algorithm for clustering data is the k-means clustering algorithm. The algorithm seeks to divide a set of N points into k clusters that minimizes the intra-cluster variances of the clusters. Specifically we define Sp to be the pth cluster, and then seek cluster centroids $\{{{\boldsymbol{y}}}_{p}\,:p=1,\,\ldots ,\,k\}$ to satisfy

Equation (4)

An exact solution to (4) is unlikely to be found in general. In fact, finding an optimal cluster assignment is known to be NP-complete, which means that the existence of any efficient algorithm for finding the optimal clusters would imply P = NP. Since it is widely conjectured that ${\mathsf{P}}\ne {\mathsf{NP}}$, it is fair to assume that k-means clustering is hard in general. Despite the apparent difficulty of such clustering problems, it can be shown that small perturbations about the hard instances can render them efficiently solvable [25]. Thus the average complexity of clustering is polynomial, which is why at least approximate clustering is not a computationally challenging task.

The k-means clustering algorithm is simple. Given k clusters of vectors compute the centroids of each cluster and assign $\{{\boldsymbol{y}}\}$ to these values. For each vector (data point) in the set find the cluster centroid that is closest to the example with respect to the Euclidean norm and take each Sp to be the corresponding cluster. This procedure is then repeated until convergence is reached.

The standard k-means clustering algorithm cannot be directly applied to weighted data, such as particles in a SMC approximation, as seen in figure 3. The reason for this is that the objective function in (4) applies a penalty solely on the distance between the vectors and the centroid. This is to say that it considers all vectors in the data set equally important, irrespective of their weights.

Figure 3.

Figure 3. Comparison of unweighted (left) and weighted (right) k-means clustering on 5000 weighted particles. The positions of each particle were drawn uniformly at random, while the weights were chosen to represent a mixture of four Gaussian distributions. The magnitude of each weight is visualized by the size of the points in the figure, with very low-weight particles being ommited for visual clarity. The centroids found by each algorithm are indicated by stars. The unweighted clustering was performed using SciKit-Learn [24], while the weighted clustering was performed using algorithm 2.

Standard image High-resolution image

Fortunately we can address this by using a weighted k-means algorithm. This algorithm instead seeks to minimize

Equation (5)

The weighted k-means algorithm proceeds exactly as the unweighted version except the cluster centroids are computed as the expectation values of the vectors in each cluster (after renormalizing the weights into a probability distribution) rather than simply by summing the results. We note in figure 3 that this modification allows multi-modal weighted data to be appropriately divided into clusters. A formal algorithm for this procedure is given as algorithm 2 in appendix B.

We utilize the weighted k-means algorithm with the k-means++ heuristic for initial centroid selection [26] to divide our cluster into a set of different clusters. In practice, practitioners will often choose which value of k to use in modeling distributions by plotting the objective function (4) and choosing an inflection point, as such inflection points are suggestive of overfitting. By contrast, model selection provides a more formal approach to reasoning about overfitting [27]. In particular, below we discuss the Bayes factor [28], which allows us to algorithmically decide on the correct cluster number k for application to posterior distributions.

3.2. Model selection for the posterior

Ultimately, the task of selecting the number of clusters to use to represent a SMC approximation to a posterior is one of model selection. For example, we could have one model that uses k = 2 another that uses k = 4 and we wish to know which model (i.e. clustering) does a better job of representing the data. In this case there's a straight forward approach: consider both clusterings for the posterior and only make a decision between the two when sufficient evidence mounts for the superiority of one of the two models.

We use Bayes factors to compare the validity of two models. The Bayes factor can be thought of as a generalization of the likelihood ratio test, and is defined for models M1 and M2 and data record D as

Equation (6)

Here $K\gt 1$ implies model 1 is superior to model 2 and vice versa. It reduces to the likelihood ratio test when the prior over models is uniform ($P({M}_{1})=P({M}_{2})$), and when priors within each model are chosen such that

and similarly for the second model. However, Bayes factors have a very nice feature absent from the likelihood ratio test. The values of the integrals in (6) depend on the volumes of the parameter spaces of the models. This penalizes models with more parameters, and gives us a simple alternative to model selection that does not involve maximization as in the Bayesian information criteria [29]. This means that Bayes factors can easily be computed using SMC approximations from the particle weights and the likelihood function [30].

In practice, we also use Bayes factors to choose one option among several possible competing options. In such cases, we compare the values of K for all possible pairs of models that we want to compare and then select a model once ${\max }_{k}({\min }_{j}K({M}_{j},{M}_{k}))\geqslant {K}_{\mathrm{champ}}$ where ${K}_{\mathrm{champ}}$ is a user specified threshold. Typically a value of ${K}_{\mathrm{champ}}\geqslant 100$ is considered strong evidence in favor of one of the models. In practice, we usually take ${K}_{\mathrm{champ}}\geqslant 2000$ to be our threshold. We do this because excluding the correct number of clusters can often have a worse impact on the algorithm performance than entertaining a hypothesis than is likely false and because we perform many such tests in structured filtering.

3.3. Graphical models for posteriors

In order to convert the problem of automatically choosing the optimal number of clusters in the data into one that a computer can easily solve, we introduce a graphical model for describing the reasoning process that is employed to decide between different clusterings of the posterior distribution. The graphs we consider are directed rooted trees with edges pointed away from the root node, but more general directed acyclic graphs could also be considered. We provide examples of our graphical notation in figure 4.

Figure 4.

Figure 4. Examples of graphical models for (a) conventional particle filtering, (b) model selection between particle filters using distinct likelihood functions, and (c) model selection between two different clusterings of the posterior into mixture models. In each example, particle filtering is indicated by square nodes, model selection is indicated by circle nodes, and mixture models are indicated by triangle nodes. The shades of each edge correspond to their weights with darker edges corresponding to higher weight.

Standard image High-resolution image

The vertices in these graphs are given one of three distinct labels, in addition to their index. The first such label denotes that the vertex is a filter node, which serves as a container for a set of SMC particles that we assume are approximately unimodal. These nodes are denoted by squares. It is important to note that the filter nodes do not necessarily need to have the same likelihood functions, allowing for the inclusion of more general model selection problems. Furthermore, we do not need to use Liu–West resampling inside each of the filter nodes. Other filters such as the bootstrap filter, assumed density filtering [11] or rejection filtering [14] can be used in its place.

The second type is a mixture node. The mixture node defines a distribution that is a weighted mixture of the subtrees that descend from it, but contains no particles itself. For example, a two cluster approximation to the posterior would be described by a mixture node with two filter nodes as leaves. We denote mixture nodes as triangles. The ability to mix multiple nodes has several interesting properties. First, in principle we can emulate a filter node consisting of many particles with a mixture of single-particle filter nodes. Second, we can have mixtures of mixtures. This allows us to generate very rich clusterings even if the maximum degree for the graph is 3 (that is, if we restrict each node to have at most two children and one parent).

The final label that a vertex can be assigned corresponds to a decision node. A decision node is a mixture node but is put in place explicitly to test between the hypotheses described by the subtrees that descend from it. Such nodes are denoted with a circle. By convention, the root node is always a decision node . The edges in our graphical model describe the relationships between the different types of nodes. By definition a filter node has no children, and is hence always a leaf node; however, mixture nodes and decision nodes must have children. The edges between any two nodes are used to assign weights. In a mixture node these weights are used to set the weights properly for the particles that reside in the filter nodes within the subtrees that descend from them. In particular, the actual weights are the products of the weights within the filter nodes and all outgoing edges of mixture nodes that connect it to the root. In a decision node these weights serve to represent the algorithm's confidence that one of the competing hypotheses, described by the descendant subtrees, is correct.

The inference process will often uncover structure in the posterior that was not apparent in earlier steps of the learning process. For instance, posterior modes at a particular step might be inconsistent with new data, such that they may be safely eliminated to focus on those modes still supported by the posterior. Alternatively, two modes that are distinct at later steps may appear at early steps as a single mode, such that we must be able to refine the structure currently inferred for a posterior distribution. These concerns necessitate adapting our graphs in order to match the changing structure. We do this using three simple rules, which we demonstrate in figures 5 and 6. The champion rule and floor pruning rules are designed to simplify the graph when certain structures are not needed to represent the data. The champion rule states that when the weight of one edge overwhelms the sum total of the other weights sufficiently then all other hypothetical subtrees are disregarded except for the 'champion.' The floor prune rule immediately eliminates a subtree if the edge weight is smaller than a threshold. This rule is useful because it allows the algorithm to free up memory and processing to address more fruitful clusterings when one has been effectively excluded by the data.

Figure 5.

Figure 5. Examples of the champion and floor pruning rules applied to a model selection node with three initial child particle filter nodes. All such replacement rules work identically if the filter node is replaced with a subtree.

Standard image High-resolution image
Figure 6.

Figure 6. An example of a splitting move, in which a single particle filter node is replaced by a model selection over different clusterings, each of which is represented by an appropriate mixture model. In this example, the structured filter considers ${n}_{{\rm{clusters}}}\in \{1,2\};$ since the ${n}_{{\rm{clusters}}}=1$ mixture model node would be redundant, it is immediately eliminated, promoting its only particle filter child to be a child of the root model selection node. Similarly, the new decision node over the number of clusterings is redundant with the root decision node, and is also eliminated immediately.

Standard image High-resolution image

The splitting move, on the other hand, increases the complexity of the graph. It takes a particle filter node and replaces that node by a subtree, as illustrated in figure 6. Specifically, we replace the node with a decision node that has the original filter node as a child and also a mixture node that has at least two filter nodes as children. Although splitting moves could be performed at any time, we only deploy them during a resampling step to optimize the performance of the algorithm. These moves are implemented in our algorithm by taking the particle filter in the original node and using weighted k-means clustering, using weighted k-means++ to divide it into two or more clusters. This guarantees that the new mixture of particle filters is equivalent to the initial particle filter before resampling is applied. After splitting, each particle filter leaf node is then resampled independently. Thus, in the branch of the decision node corresponding to the correct number of clusters, the resampling takes place only locally within each cluster and preserves the multi-modal structure of the posterior.

A complication comes in with the number of particles assigned to each cluster. By default we divide the particles into two sub-clusters, assigning each original particle to one of the new clusters using the labels returned by the k-means algorithm. Thus, we retain the total number of particles through a splitting move. However, this presents the danger that a cluster with a small number of particles will become numerically unstable, even if the weight assigned to that cluster is large. To remedy this, we also allow the number of particles used to vary dynamically by setting a minimum number of particles in each cluster. For example, imagine we wish to split a cluster with 1800 particles into three clusters and we set the minimum number of particles per cluster to be 1000. Then, at least one cluster would ordinarily be assigned less than the minimum number of particles that we have decided upon. Our algorithm will draw additional particles with correspondingly smaller weights for such clusters, preserving numerical stability while introducing no new approximations. This results in three clusters with at least 1000 particles each. Since we choose to apply restructuring moves (such as splitting) only when a resample would be triggered by the Liu–West resampler, generating additional particles is easy to do by following the same perscription used in the resampler.

Applied directly, the splitting move would generate exponentially-large trees for even simple models. We limit this by imposing a maximum depth dmax at which the splitting move may be applied. If a particle filter node with depth $d\geqslant {d}_{\max }$ must be resampled, then we apply traditional Liu–West resampling at that node instead. In this way, we can control the maximum size of the structure that our algorithm is allowed to explore. Importantly, by imposing a maximum depth, we rely on the champion and floor pruning rules to allow our algorithm to reduce the current depth and make room for further refinement through application of the splitting rule. Thus, the pruning rules are not only important to efficiently find structure, but are also needed in order to correctly learn posterior structure.

Though these three moves are sufficient to correctly implement structured filtering, we also consider two other pruning moves which reduce graph complexity without additional approximation. These moves allow the algorithm to reduce the depth of the tree dynamically as the floor and champion rules discussed above eliminate uninformative branches of the tree. In particular, the only-child and single-child pruning rules shown in figure 7 replace the current tree with a simpler tree describing the same structure by eliminating redundant intermediate nodes. Under the only-child pruning rule, a node is eliminated if it is the only child of its parent and has one or more children, such that those children can be attached directly to its parent. Similarly, the single-child rule removes any node with exactly one child, and places that child directly onto its grandparent. Trees matching the preconditions for only-child and single-child pruning are generated by applying the champion and floor pruning rules, as each of those eliminate leaf and intermediate notes that do not contribute substantially to the final estimate. By removing these intermediate nodes, the depth of nodes relevant to the final estimate can be decreased, allowing for the splitting move to be applied again.

Figure 7.

Figure 7. Examples of the only-child and single-child pruning moves, used to eliminate redundant intermediate nodes and reduce the depth of the structured filtering tree. Each of the pruning rules shown above exactly preserve the structure with a simpler tree.

Standard image High-resolution image

Importantly, the floor, champion, and splitting moves are each parameterized, offering quality parameters for the particle filter approximation represented by the output of these moves. To allow these pruning and splitting parameters to dynamically depend on the tree structure, we encode them as properties of each node, collectively called a context. In this way, our algorithm can be customized with various starting trees representing prior knowledge about initial clustering, model selection problems of interest, or other structure of interest. Structured filtering recognizes the following context parameters, each of which may be specified at a given node, or inherited from the a node's parent:

  • Prune (boolean): if this context parameter is set to false, then no pruning is applied at this node; in particular, no floor, champion, only-child or single-child rules are applied. For brevity, we omit this context parameter in the algorithm below.
  • Mixture floor (real): this context parameter sets the minimum edge weight from a mixture node to one of its children that will be preserved by floor pruning.
  • Decision floor (real): this context parameter sets the minimum edge weight from a decision node to one of its children that will be preserved by floor pruning.
  • Decision champion ${K}_{\mathrm{champ}}$ (real): this context parameter sets the ratio by which the edge weight from a decision node to one of its children must exceed the sum of all other outgoing edge weights from the same decision node before that child will be considered a champion node.
  • Decision region estimation champions (integer): this context parameter controls the number of children of each decision node that will be kept when reporting region estimates, as described in section 3.4.

3.4. Structured region estimation

As described by Ferrie [31], clustering can also be used to report region estimates of higher posterior density than unclustered methods. We use and generalize this observation by exploting the tree structure generated by splitting and pruning moves to form powerful credible region estimators. In particular, each particle filter (leaf) node already yields conventional region estimators such as covariance ellipsoids, convex hulls, and minimum volume enclosing ellipsoids [5], such that we can complete our region estimation procedure by specifying region estimators for each decision and mixture node, recursively.

Following this strategy, at each mixture node, our estimation procedure assigns a region estimate that is the union of the region estimates for each of its children. At each decision node, our procedure reports the union of the first n of its children's region estimates, with its children arranged in descending order by their weights, and with n obtained from the corresponding context parameter, as described in section 3.3. The final region estimate Xα can thus be interpreted as guaranteeing that the probability the model vector ${\boldsymbol{x}}$ is within Xα, conditioned on the model with the highest posterior probability, is at least α. Importantly, the credibility parameter α is only used at the leaf nodes as a parameter to the local region estimation procedure.

3.5. Structured filtering algorithm

Our algorithm depends on two global parameters as well as the context parameters described above.

  • dmax (integer): This global parameter sets the maximum depth that a node is allowed to be from the root in the structure graph.
  • $\{{n}_{\mathrm{cluster},i}\}$ (set of integers): This global parameter sets degrees of each vertex in the structure graph that will be considered by splitting moves. This imposes a limit on the maximum number of clusters for the data of ${({{\rm{\max }}}_{i}{n}_{\mathrm{cluster},i})}^{{d}_{{\rm{\max }}}-1}$.

With this description in place, we now present our algorithm in full as algorithm 1. Important subroutines are listed separately in appendix B. The splitting move used in algorithm 1 is demonstrated graphically in figure 6.

The numerical examples shown in sections 4, 5 and appendix A were obtained using an implementation in Python 2.7 (Anaconda distribution), with the NumPy [32], SciPy [33], and QInfer [34] libraries.

Algorithm 1. Structured filtering algorithm.

Input: Number of particles ${n}_{{\rm{part}}}$, minimum number of particles ${n}_{\min ,{\rm{part}}}$, max depth dmax, set of cluster numbers to consider $\{{n}_{\mathrm{cluster},i}\}$, initial context ${\bf{c}}$, local resampler R, local resample threshold $r\in [0,1]$, number of experiments ${n}_{{\rm{\exp }}}$, prior distribution $\pi ({\boldsymbol{x}})$, likelihood function $\Pr (d| {\boldsymbol{x}})$.
function StructuredFilter (${n}_{{\rm{part}}}$, dmax, $\{{n}_{\mathrm{cluster},i}\}$, ${\bf{c}}$, R, D, π, $\Pr (d| {\boldsymbol{x}})$)
Initialization.
Create a new filter node ϕ by drawing ${n}_{{\rm{part}}}$ particles from π.
Create a new decision node ρ with ϕ as its only child, and assign ${\bf{c}}$ as its context.
Set the weight of the edge $\rho \to \phi $ to one.
Data processing.
for ${i}_{{\rm{\exp }}}\in \{1,...,{n}_{{\rm{\exp }}}\}$ do
Experiment design and data collection.
for each filter node ϕ descended from ρ do
Let $\mathrm{bf}(\phi )$ be the product of the weights of each edge leading from ρ to ϕ.
end for
Let ${\phi }_{{\rm{\min }}}=\arg \,{{\rm{\min }}}_{\phi }\mathrm{bf}(\phi )$ be the filter node with minimal Bayes factor.
Draw unique $\{{{\boldsymbol{x}}}_{1},{{\boldsymbol{x}}}_{2}\}\sim \phi $.
Let $t=1/\parallel {{\boldsymbol{x}}}_{1}-{{\boldsymbol{x}}}_{2}\parallel $.
Perform the experiment ${\boldsymbol{e}}=(t,{{\boldsymbol{x}}}_{1})$, collecting the outcome d.
Update via tree traversal.
for each node ν in a depth-first traversal from ρ do
if ν is a filter node then
Let $\{{w}_{i}\}$ and $\{{{\boldsymbol{x}}}_{i}\}$ be the weights and particles at ν.
Update each wi as
${w}_{i}\mapsto {w}_{i}\times \Pr (d| {{\boldsymbol{x}}}_{i};{\boldsymbol{e}}).$
Multiply the weight of the edge to ν from its parent by ${\sum }_{i}{w}_{i}$.
Renormalize ${w}_{i}\mapsto {w}_{i}/{\sum }_{i}{w}_{i}$.
else ▷Push weights up the tree.
Multiply the weight of the edge to ν from its parent by the sum of the edge weights outgoing from ν.
Renormalize the weights outgoing from ν to sum to 1.
end if
end for
Pruning.
for each node ν descended from ρ do
if ν has at least two children then ▷See also: figure 5(a).
Let w be the largest weight of an edge outgoing from ν.
if $w/(1-w)\,\gt $ the current context's champion threshold then
Let χ be the child of ν with edge weight w.
Remove all children of ν except for χ.
Set the weight of $\nu \to \chi $ to 1.
end if
end if
if ν is a selection node then ▷See also: figure 5(b).
for each child χ of ν do
if weight of $\nu \to \chi \,\lt $ the current context's floor threshold then
Remove χ.
end if
end for
Renormalize the weights outgoing from ν to sum to 1.
end if
if ν is the only child of its parent then ▷See also: figure 7(a).
Let α be the parent of ν.
for each child χ of ν do
Move χ to be a child of α, keeping the weight of the current edge $\nu \to \chi $.
end for
Remove ν from the children of α.
end if
If ν has exactly one child then ▷ See also: figure 7(b).
Let χ and α be the child and parent of ν.
Append χ as a child of α, with weight given by the current edge $\alpha \to \nu $.
Remove ν from the children of α.
end if
end for
Structured resampling.
for each filter node ϕ descended from ρ do
Let $\{{w}_{i}\}$ and $\{{{\boldsymbol{x}}}_{i}\}$ be the weights and particles at ϕ.
Let ${n}_{\mathrm{ess}}(\phi )=1/{\sum }_{i}{w}_{i}^{2}$ be the effective sample size of ϕ and $n(\phi )$ be the number of particles at ϕ.
if ${n}_{\mathrm{ess}}/n(\phi )\lt r$ then
if depth of $\phi \lt {d}_{\max }$ then
▷Perform a splitting move (figure 6).
Replace ϕ in its parent by a new decision node δ.
for ${n}_{\mathrm{cluster}}\in \{{n}_{\mathrm{cluster},i}\}$ do ▷ Make new mixture nodes to represent each possible number of clusters.
if ${n}_{\mathrm{cluster}}=1$ then
Append a copy $\phi ^{\prime} $ of ϕ to δ.
Locally resample $\phi ^{\prime} $ using R.
else
Let $\{{{\ell }}_{i}\},\{{y}_{j}\}=$ WeightedKMeans $\{{{\boldsymbol{x}}}_{i}\}$, $\{{w}_{i}\}$, ncluster. ▷ See also algorithm 2.
Make a new mixture node μ and append it as a child of δ.
for $j\in \{1,...,{n}_{\mathrm{cluster}}\}$ do ▷Use local resampler to populate new filter nodes.
Make a new filter node ${\phi }_{j}$ and append it as a child of μ.
Let ${I}_{j}=\{i\in \{1,...,n(\phi )\}| {{\ell }}_{i}=j\}$ be the indices of the $j{\rm{t}}{\rm{h}}$ cluster.
Set the weight of $\mu \to {\phi }_{j}$ to ${\sum }_{i\in {I}_{j}}{w}_{i}$.
Draw $\max ({n}_{\min ,{\rm{part}}},| {I}_{j}| )$ particles $\{{{\boldsymbol{x}}}_{i}^{{\prime} }\}$ from $R(\{{w}_{i}\,:i\in {I}_{j}\}),\{{{\boldsymbol{x}}}_{i}\,:i\in {I}_{j}\})$.
Set the particles at ${\phi }_{j}$ to be $\{{{\boldsymbol{x}}}_{i}^{{\prime} }\}$, with uniform weights $1/| {I}_{j}| $.
end for
end if
end for
Set the weights of edges outgoing from δ to be uniform and summing to 1.
else
Locally resample ϕ with R.
end if
end if
end for
end for
end function

4. Application to RGE

To demonstrate the effectiveness of our structured filtering algorithm, we consider the RGE model of Zintchenko and Wiebe [22]. An RGE experiment consists of choosing a state $| \psi \rangle =U| 0\rangle $ for a Haar-random unitary U (or U sampled from a 2-design) and fixed preparation $| 0\rangle $, evolving under the unknown Hamiltonian for a time t, then measuring $\{| \psi \rangle \langle \psi | ,{\mathbb{1}}-| \psi \rangle \langle \psi | \}$. Labeling the measurement outcome $| \psi \rangle \langle \psi | $ as 0, we obtain the likelihood function for RGE,

Equation (7)

where $\dim \,H=k+1$ and where H has eigenvalues eigenvalues $\{{\lambda }_{0},\mathrm{...},\,{\lambda }_{k}\}$. RGE is significant because it gives a way to estimating the gaps ${{\rm{\Delta }}}_{i,j}:={\lambda }_{i}-{\lambda }_{j}$ in the spectrum of an unknown Hamiltonian without requiring entanglement with an external qubit. While these gaps can be inferred directly using SMC, not all gaps will be self consistent. It is therefore easier to learn the eigenvalues, which are unconstrained, than it is to impose the appropriate constraints on the gaps.

Importantly, the RGE likelihood function is highly degenerate, as (7) only depends on $\{{{\rm{\Delta }}}_{i,j}\}$ and not on the eigenvalues of interest. These degeneracies can be included analytically, so that we can verify the estimates obtained by structured filtering. It is worth noting that in cases where the eigenvalues are randomly distributed then with high probability there will only exist two degenerate orderings and all other orderings can in principle be resolved by solving the turnpike problem. Despite this, many approximate degeneracies are likely to occur in such settings and these can be just as hazardous to learning as exact degeneracies. As such, there is still a major need for structured filtering even in unstructured gap estimation problems.

As a final point, in practice it is likely that experimental data will be provided for RGE in batches if processing time for each update is long relative to data acquisition time. We deal with this by assuming the data is acquired using ${n}_{\mathrm{meas}}$ measurements and the resulting distribution of 0 or 1 measurements is given by a binomial distribution with $p=\Pr (0| H;t)$. In practice, we take three measurements in our numerical experiments.

There are many ways that we could pick the experimental parameter t. The best method is to choose θ to minimize the Bayes risk using a numerically optimized strategy. This approach, while highly successful, is computationally expensive [5]. Another approach is to use a heuristic known as the particle guess heuristic (PGH) [18]. The PGH, which is appropriate for periodic likelihoods like (7), guesses $t\propto 1/\sigma $ where σ is an estimate of the uncertainty in the posterior distribution. Here we approximate this by drawing unique ${{\boldsymbol{x}}}_{1}$ and ${{\boldsymbol{x}}}_{2}$ from the prior distribution and take

Equation (8)

The PGH is known to lead to asymptotically optimal scaling for phase estimation and slight variants of this have been shown to come close to saturating the Bayesian Cramér–Rao bound under the assumption of a Gaussian prior; however, less is known about its performance in multi-modal settings.

While this sampling procedure is straight forward for unstructured filtering, it is slightly more complicated in structured resampling. This is because the structured filter is simultaneously entertaining a number of different hypotheses about the clusters that compose the posterior distribution. We avoid this problem by computing the Bayes factors for each such hypothesis supported by the structured filter and only apply the PGH on the hypothesis that has the smallest Bayes factor. That is, we purposely choose the experimental time based on the least probable explanation for the data. We make this choice because it is likely to be conservative and also because it demonstrates the algorithms ability to learn despite being provided with inferior data. The samples are then drawn from the remaining filter nodes according to their weights as per the PGH.

We first benchmark the performance of structured filtering as well as unstructured filtering for estimating the gaps in a three-level system (qutrit) where we have assumed without loss of generality that the lowest energy eigenvalue is ${\lambda }_{0}=0$. We then define the eigenvalue gaps to be ${{\rm{\Delta }}}_{i,j}={\lambda }_{j}-{\lambda }_{i}$. For convenience, let us assume that ${\lambda }_{2}\geqslant {\lambda }_{1}$. The three gaps in the problem are then ${{\rm{\Delta }}}_{\mathrm{0,1}},{{\rm{\Delta }}}_{\mathrm{1,2}},{{\rm{\Delta }}}_{\mathrm{2,0}}$. The gaps that are learned can specify, up to an additive constant, the largest energy eigenvalue i.e. $\max \{{\lambda }_{1},{\lambda }_{2}\}={{\rm{\Delta }}}_{\mathrm{0,2}}={{\rm{\Delta }}}_{\mathrm{0,1}}+{{\rm{\Delta }}}_{\mathrm{1,2}}$. They cannot, however, uniquely give the first excited state. This is because both ${\lambda }_{1}={{\rm{\Delta }}}_{\mathrm{0,1}}$ and ${\lambda }_{1}={{\rm{\Delta }}}_{\mathrm{1,2}}$ are consistent with the data. There is also an obvious permutation symmetry in this problem wherein ${\lambda }_{1}\leftrightarrow {\lambda }_{2}$ leaves the likelihood invariant. In order to address these degeneracies in our assessment of the algorithm, we define a canonical ordering of the eigenvalues and assignment of the gaps that in effect removes the four-fold degeneracy. We then compute the canonical quadratic loss as

Equation (9)

which we use as our figure of merit for the inference. Here each wi is a particle weight taken from the clustered posterior distribution.

The multi-modal nature of the posterior distribution can be seen in figure 8, wherein we consider a structured posterior that consists of a choice between one and two clusters at each splitting, with ${d}_{\max }=4$, such that our algorithm includes the optimal number of clusters four as a possibility. The figure provides the posterior distribution of every particle in the filter wherein the size of the particle denotes the weight assigned to it in the filter node and all preceding mixture and decision nodes in the structure graph. See supplementary information available online at stacks.iop.org/NJP/19/083014/mmedia for an animation of the complete inference process. We observe that after only 100 updates, the structuring algorithm has clearly identified the structure of the ideal posterior, although it has not yet conclusively learned the number of clusters. After 150 updates we observe peaks in the probability density begin to congeal around the true eigenvalues (in particular near ($0.75,0.15$)).

Figure 8.

Figure 8. Posterior distributions for using structured filtering to learn potential eigenvalues for randomized gap estimation with three eigenvalues where one is without loss of generality taken to be 0. The true model has unknown eigenvalues (0.75, 0.15) and the symmetries of the problem imply the final posterior should ideally be concentrated at (0.75, 0.15), (0.15, 0.75), (0.6, 0.75), (0.75, 0.6). The floor threshold was set to 0.1 and champion threshold was set to 20. For this data and the graph was set to have maximum degree 3 and maximum depth 4. Liu–West resampling with $a=0.98$ was used to cluster each cluster in the posterior and a minimum of 1000 particles was assigned to each cluster. A video showing the operation of our algorithm is provided in the supplementary material, or online at goo.gl/4NKYaX.

Standard image High-resolution image

At 250 updates we see that the structured filtering algorithm correctly identifies all four of the equivalent degenerate solution for the eigenvalues. The algorithm further recognizes that the four cluster model for the data is by far the best model for the posterior distribution. This is significant because we make no apriori assumption that the ideal posterior is composed of four clusters. Furthermore, apart from giving an accurate point estimate of the eigenvalues, we obtain a posterior distribution from which uncertainties in these four estimations can be gleaned. This information is of great value in experiments because principled estimates of uncertainty are often hard to find. By contrast, our algorithm outputs them by default.

In addition to outputting the posterior distribution, structured filtering can output α-credible regions and also structural information about the posterior distribution. Figure 9(a) provides an $\alpha =0.95$ credible estimate of a region where the true eigenvalues can be found. Notably, because this region estimate does not output a single pair of eigenvalues its interpretation remains consistent regardless whether structured or unstructured filtering is used. The region estimate also notably overlaps with the four modes of the ideal posterior distribution in this example.

Figure 9.

Figure 9. (a) Region estimate and (b) structural tree for the example shown in figure 8. Note the three mixture nodes that directly descend from the root in subfigure (b) indicate that the posterior consists of at least four clusters, as each of the four children of these mixture nodes are decision nodes representing at least one cluster.

Standard image High-resolution image

Figure 9(b) gives the structure graph for the posterior after 300 updates. The three mixture nodes that directly descent from the root node reveal that the algorithm has completely excluded one, two and three cluster models as viable explanations for the data. This structure again was not imposed upon the posterior, nor was it directly included via our splitting rule. Instead this structure emerged through the course of the 300 updates by following the rules set out by structured filtering for both growing as well as pruning the structure tree. This shows that structured filtering is not only able to learn patterns present in a complex posterior without substantial coaxing by the user, but also convey these inferred structures in a concise human-parsable format. Additionally, by tracing through the structure graph from the root node, we can see that a four cluster (or approximately four cluster) model is preferred in each branch considered. In particular, each of the four children of the mixture nodes closest to the root are decision nodes that prefer single-cluster explanations of the posterior. This illustrates that structured filtering is capable of more than just inferring the structure of the posterior distribution: it is also capable of reporting it in a human readable format.

Figure 10 shows that this degeneracy in the likelihood function causes LW to fail in both the mean and median ${{ \mathcal L }}_{\mathrm{can}}$ after roughly 200 experiments where it saturates at a value on the order of ${10}^{-3}$. This is to be expected because the unimodality assumption implicitly made in Liu–West resampling will generically fail here just as it did in figure 1.

Figure 10.

Figure 10. Comparison of Liu–West resampling and structured filtering for randomized gap estimation (RGE). Means and medians are computed using 1000 random RGE instances, with ${\lambda }_{1}$ and ${\lambda }_{2}$ sampled from the initial prior which is uniform on ${[0,1]}^{2}$.

Standard image High-resolution image

Structured filtering, on the other hand, can learn the correct eigenvalues within numerical precision with high probability after only 1000 experiments, and further we see clear evidence of exponential scaling of the mean canonical loss. This implies that even the worst case behavior of the algorithm is not pathological. We find from linear regression that after 400 experiments the mean canonical loss obeys $\mathrm{mean}({{ \mathcal L }}_{\mathrm{can}})\approx {{\rm{e}}}^{-0.0083n-6.7}$. The data for the median does not demonstrate a single clear asymptotic scaling. The appearance of two different time scales for the problem likely correspond to the timescales needed to distinguish the two degenerate possibilities for the eigenvalues (i.e. solve the turnpike problem on the gaps) and the timescale needed to refine this knowledge once a degenerate set of solutions to the turnpike problem is found.

This clearly shows that in cases where we have a multi-modal posterior that structured filtering can succeed where the gold-standard particle filtering methods used in quantum experiments fail. It succeeds here because structured filtering is capable of recognizing the multi-modal nature of the posterior and adjust the particle filter accordingly. LW cannot succeed because it always assumes a unimodal posterior and here the ideal posterior has 4 modes. For this reason, we expect structured filtering to be a broadly applicable method capable of dealing both with the degeneracies that appear in RGE and also approximate degeneracies that appear in other applications.

5. Application to collapse-free phase estimation

Phase estimation has become a ubiquitous algorithm in quantum computing because of its ability to learn eigenvalues of a unitary using quadratically fewer queries to that unitary. Of the many variants of phase estimation, iterative phase estimation has gained in popularity owing to the fact that it does not require a large qubit register to store the estimated eigenvalues [35]. It works by replacing the quantum interference step used in traditional phase estimation with an adaptive inference algorithm to learn the most likely phase given a set of measurements. This approach requires a number of queries to the blackbox that is optimal to within a small multiplicative factor and is thus also the preferred technique if speed is also an issue.

Both iterative and traditional phase estimation require long sequences of gates in order to learn the eigenstate, which is perhaps the biggest reason why most simulation experiments eschew this approach at present [36]. These long sequences arise because long experimental times are needed to unambiguously collapse the state onto an eigenvalue with a small number of experiments according to energy/time uncertainty. One approach to address is this issue is to shift the burden of collapsing the state from the quantum computer to the classical inference algorithm by applying phase estimation on a fresh copy of the initial state with each experiment. We call this approach collapse-free phase estimation as it does not collapse the quantum state onto an eigenstate of the Hamiltonian.

A challenge facing collapse-free phase estimation is that signal is received from multiple eigenvalues when the input state is a superposition of different eigenstates. Thus any estimate of a single eigenvalue for the distribution will have to disambiguate data that comes from distinct eigenvalues. One approach to dealing with this issue is to mimic wave function collapse by implicitly introducing biases against data that is more likely to come from other eigenvalues. This approach is taken by by Wiebe and Granade [37] and by Santagati et al [38], wherein a Bayesian form of phase estimation is used to introduce such biases through applying a unimodal approximation to the posterior.

While unimodal approaches do not utterly fail here, they are not expected to perform as well because of the multi-modal nature of the ideal posterior if multiple eigenvalues are present. To examine this further, let us first define

Equation (10)

where Ej is the jth eigenvalue of a Hamiltonian H. Given such an initial state, the likelihood of measuring 'zero' in the two outcome phase estimation experiment for ${{\rm{e}}}^{-{\rm{i}}{H}{t}}$ given experimental parameters t and θ is

Equation (11)

In collapse-free phase estimation, we do not know how many eigenvalues there are in the support of $| \psi \rangle $. We also do not want to track them explicitly because there are exponentially many in the worst case. Instead, we use the following likelihood function to model the experiment:

Equation (12)

where $h\in [0,1]$ is a hedging parameter that is used to combat overconfidence in an update. This model in essence assumes that all the data arises from a single eigenvalue, and aims to estimate the most likely single E given data that arises from (11).

We similarly use (8) to choose the experimental times and take $\theta ={{\boldsymbol{x}}}_{2}$ in the PGH. Since this heuristic is known to provide asymptotically optimal scaling in cases where only one eigenvalue is present [18], we anticipate that it will provide similar advantages here as well.

We consider two eigenvalues with uniformly distributed values between 0 and 1 and attempt to learn one of the two eigenvalues. If the two eigenvalues are E1 and E2 then the loss function we consider is:

Equation (13)

where xi is the position of particle i each of which corresponds to an eigenvalue of H. We then consider the mean and the median over 1000 randomly pairs of eigenvalues, where we pick the worst case scenario of ${a}_{1}={a}_{2}=1/\sqrt{2}$.

We examine the performance of structured filtering and Liu–West resampling for this problem in figure 11. Specifically we see that while both methods are capable of rapidly learning one of the eigenvalues, Liu–West resampling performs substantially worse. This is not simply because structured filtering used more particles; Liu–West resampling does not give significantly better results when we allow it more than 12 000 particles. These differences are, unsurprisingly, most striking in the mean. With Liu–West resampling linear regression shows that the mean minimum quadratic loss decays roughly as ${{ \mathcal L }}_{\min }\approx [8.1\times {10}^{-3}]{{\rm{e}}}^{-0.0095n}$ whereas the quadratic loss decays for structured filtering as ${{ \mathcal L }}_{\min }\approx [6.7\times {10}^{-3}]{{\rm{e}}}^{-0.0164n}$, where n is the number of experiments. This shows that the two asymptotic scalings differ polynomially, specifcially by a power of rougly 3/2. This shows that structured filtering can improve the performance of collapse-free phase estimation compared to Liu–West resampling, which is arguably the most powerful efficient inference method previously used for such problems.

Figure 11.

Figure 11. Comparison of Liu–West resampling and structured filtering for collapse-free phase estimation with two true eigenvalues.

Standard image High-resolution image

6. Conclusion

Current efficient methods for parameter estimation in quantum systems are implicitly biased via structural assumptions about the posterior distribution [6, 9, 14, 39]. Here we show that even the subtle biases introduced through filtering with Liu–West resampling can catastrophically fail to estimate parameters for problems where the experiments chosen are incapable of distinguishing between two equivalent hypotheses. We address this by introducing a method that can adaptively reason about the structure of the posterior distribution and break it up into clusters that individually are ammenable to Liu–West filtering or other methods that are appropriate for unimodal distributions. Specifically we represent the algorithm's beliefs about the structure of the posterior as a weighted graph that describes the different possible clusterings for the posterior distribution, and allow it to invent new hypotheses or eliminate previous ones via a discrete set of graph manipulations. By using this approach we grant structured filtering the ability to introspect on its own beliefs about the true model parameters. This introspection is key to the success of our algorithm.

We note that by allowing the inference algorithm to adapt its assumptions about the structure of its current state of knowledge we allow Bayesian inference to succeed in problems, such as RGE [22], where existing leading methods fail unless strong prior information is provided. It also manages to outperform its unstructured counterpart in learning eigenphases of unitary operations in settings where the data comes from multiple eigenvalues. This illustrates the power and broad applicability of the technique.

Looking forward, there are many ways that these methods can be built upon. Firstly, it is important to improve our understanding of the failure modalities of our algorithm. In particular, we have designed our approach such that, in worst cases such as an insufficient tree depth or overly aggressive pruning rules, we revert to the failure modalities of Liu–West. Better understanding of the conditions which lead to reverting in this way thus allows for more robustly realizing the advantages of our approach over Liu–West. Moreover, we note that our approach also admits new less severe failure modalities than existing approaches, in that overly aggressive pruning rules can cause the incorrect truncation of an individual mode, while leaving other modes correctly intact. This allows our algorithm to correctly report some important properties of a posterior, even in cases where numerical approximations have partially failed.

Secondly, while our approach does an excellent job of approximating the posterior distribution it does a less perfect job of approximating the regions of importance in the posterior distribution. In particular, our rules for restructuring the graph assume that regions of hypothesis space that have low probability have low importance; however, since Bayes' rule is additive in logarithmic space, wild fluctuations can occur in the posterior probability. Such fluctuations can be common in cases like collapse-free phase estimation because the approximate likelihood does not match the true likelihood. This means that it may be important for the structured filters to learn how to distribute particles to accommodate data that is given low probability by the assumed likelihood function.

Furthermore, structured filtering can be viewed as an application of a family of techniques known as probabilistic programming. Further work will be needed to see if subsequent insights from probabilistic programming may yield even more powerful representations for the posterior distribution.

Finally, while we have provided a proof of principle that structured filtering can solve many of the problems plaguing SMC approximations in physics, it remains to apply them experimentally. It is our hope that these methods may prove to be of great use estimating Hamiltonian parameters that have subtle influence on experimental likelihoods, such as those that appear in second order corrections to spectral line splittings. Structured resampling, and approaches like it, may finally have enough power and robustness to tackle such problems in an automated fashion relieving the experimentalist of much of the burden of designing clever experiments to learn hard to measure quantities.

Acknowledgments

CG was supported by the US Army Research Office grant numbers W911NF-14-1-0098 and W911NF-14-1-0103, and by the Australian Research Council Center of Excellence for Engineered Quantum Systems. We thank Sarah Kaiser and Chris Ferrie for helpful comments, and Okabe and Ito [40] for their suggestion of a colorblind-safe palette for figures.

Appendix A.: Additional numerical experiments

While considerable experience has been built over the last few years studying how to optimally learn parameterizations of periodic likelihoods, the meta parameters needed to allow structured filtering to succeed in these cases is not known. Furthermore, it is not clear how robust structured filtering is to the choice of parameters. Here we perform experiments to probe these issues for RGE.

Figure 12 provides numerical experiments that probe the mean and median performance of structured filtering for RGE as a function of ${K}_{\mathrm{champ}}$. We see from the median data that as ${K}_{\mathrm{champ}}$ shrinks the performance of the algorithm improves. This improvement comes about in part because of a coupling with the version of the PGH that we choose. Since we guess experimental times based on the worst covariances kept in the problem, choosing a high champion ratio can retain poor models which leads to less informative experiments yielded by this variant of the PGH. This is why we expect, and observe, that taking ${K}_{\mathrm{champ}}=2$ provides the best performance in the median.

Figure 12.

Figure 12. Median and mean canonical quadratic loss for RGE with 3 eigenvalues as a function of the champion ratio. All structured parameters are applied at the root context.

Standard image High-resolution image

The mean canonical loss given in figure 12 tells a different story. Since the mean is not a robust statistic, it is sensitive to rare instances where the errors are much larger than the typical cases. We see from this data that while the performance tends to improve as ${K}_{\mathrm{champ}}\to 20$, the algorithm is much more likely to catastrophically fail for ${K}_{\mathrm{champ}}=2$. Again this is expected because as ${K}_{\mathrm{champ}}$ goes to 1 we expect that the probability of falsely concluding that the most probable model is the correct one increases. Thus the optimal choice of ${K}_{\mathrm{champ}}$ is analogous to trading off type 1 and type 2 errors in hypothesis testing.

For our numerics we wished to avoid catastrophic errors that could dominate the scaling of the mean so we chose ${K}_{\mathrm{champ}}=2000$. While this value dramatically increases the number of experiments needed to solve RGE problems in the median according to figure 12 relative to ${K}_{\mathrm{champ}}=2$ or 20. In principle, these drawbacks may be addressable with improved guess heuristics for the experimental times here. We leave a more thorough investigation of such questions for subsequent work.

Figure 13 addresses the question of how the experimental times should be guessed for RGE experiments using structured filtering. Previous work, suggested that for alternative filtering strategies such as rejection filtering [22], a pgh constant of 2 performs well. We see similar results here, with a constant of 2 outperforming the standard choice used in previous Hamiltonian learning work of 1. We see that this improved performance is exhibited in both the median and the mean here, although the mean performance for a constant of 1 seems much more variable than it is for larger values of the constant. While this is not necessarily a negative thing because it shows that the error is both low and dominated by a few outliers, it makes a Monte Carlo estimations of the mean more expensive. For this reason we retain the pgh constant of 1 instead of 2 and note that again these results can be optimized by choosing more intelligent experiments.

Figure 13.

Figure 13. Median and mean canonical quadratic loss for RGE with 3 eigenvalues as a function of the constant for the PGH used. A PGH constant of 1, i.e. $t=1/\parallel {x}_{1}-{x}_{2}{\parallel }_{2}$ for x1 and x2 sampled from the prior, is used in the remainder of the paper. All structured parameters are applied at the root context.

Standard image High-resolution image

Figure 14 investigates the role of the floor weight rule here. We see that the data is relatively insensitive to the choice of floor weights. This implies that usually the champion rule is responsible for the majority of the pruning of the structure graph.

Figure 14.

Figure 14. Median and mean canonical quadratic loss for RGE with 3 eigenvalues as a function of the floor weights. All structured parameters are applied at the root context.

Standard image High-resolution image

Appendix B.: Pseudocode

Algorithm 2. Weighted k-means unsupervised clustering algorithm.

Input: Particle locations $\{{{\boldsymbol{x}}}_{i}\,:i\in \{1,...,{n}_{{\rm{part}}}\}\}$, particle weights $\{{{\boldsymbol{x}}}_{i}\,:i\in \{1,...,{n}_{{\rm{part}}}\}\}$, number of clusters k.
Output: Cluster labels $\{{{\ell }}_{i}\}\subseteq \{1,...,k\}$, cluster centroids $\{{{\boldsymbol{\mu }}}_{j}:j\in \{1,...,k\}\}$.
function WeightedKMeans($\{{{\boldsymbol{x}}}_{i}\}$, $\{{w}_{i}\}$, k)
Initialization.
Let $\{{\mu }_{j}\}\,\leftarrow $ KMeans++$\{{{\boldsymbol{x}}}_{i}\}$. ▷ Initialize centroids.
Let ${{\ell }}_{i}\leftarrow {\mathrm{argmin}}_{j}\parallel {{\boldsymbol{x}}}_{i}-{{\boldsymbol{\mu }}}_{j}{\parallel }^{2}$ for each $i\in \{1,...,{n}_{{\rm{part}}}\}$. ▷ Initialize labels from centroids.
Iterative improvement.
for ${i}_{\mathrm{iter}}\in 1\to {n}_{\mathrm{iters}}$ do
▷ Note that the next line is where weighted and unweighted k-means differ, in that we consider the weights wi.
Let ${{\boldsymbol{\mu }}}_{j}\leftarrow {\sum }_{i{\rm{s.t.}}{{\ell }}_{i}=j}{w}_{i}{{\boldsymbol{x}}}_{i}/{\sum }_{i{\rm{s.t.}}{{\ell }}_{i}=j}{w}_{i}$ for each $j\in \{1,...,k\}$. ▷ Recompute centroids from previous labels.
Let ${{\ell }}_{i}\leftarrow {\mathrm{argmin}}_{j}\,\parallel {{\boldsymbol{x}}}_{i}-{{\boldsymbol{\mu }}}_{j}{\parallel }^{2}$ for each $i\in \{1,...,{n}_{{\rm{part}}}\}$. ▷ Recompute labels from new centroids.
if no labels changed this iteration then
return $\{{{\ell }}_{i}\}$, $\{{{\boldsymbol{\mu }}}_{j}\}$.
end if
end for
Error handling.
raise an error indicating too many iterations were used.
end function

Algorithm 3. k-means++ procedure [26] for initializing centroid locations.

Input: Particle locations $\{{{\boldsymbol{x}}}_{i}\,:i\in \{1,\mathrm{...},\,{n}_{{\rm{part}}}\}\}$, number of clusters k.
Output: Initial cluster centroids $\{{{\boldsymbol{\mu }}}_{i}:i\in \{1,\mathrm{...},k\}\}$.
function KMeans++($\{{{\boldsymbol{x}}}_{i}\}$, k)
Initialization.
Draw i uniformly at random from $\{1,...,{n}_{{\rm{part}}}\}$.
Let ${{\boldsymbol{\mu }}}_{1}={{\boldsymbol{x}}}_{i}$.
Iteration.
for $j\in 2\to k$ do
Draw i from $\{1,...,{n}_{{\rm{part}}}\}$ with probability ${D}^{2}(i)/{\sum }_{i}{D}^{2}(i)$, where
$D(i)\,:=\,{\min }_{j^{\prime} \in \{1,...,j-1\}}\parallel {{\boldsymbol{x}}}_{i}-{{\boldsymbol{\mu }}}_{j^{\prime} }\parallel .$
Let ${{\boldsymbol{\mu }}}_{j}={{\boldsymbol{x}}}_{i}$.
end for
Finalization.
return $\{{{\boldsymbol{\mu }}}_{j}\}$
end function

Please wait… references are loading.