Anomaly detection in aeronautics data with quantum-compatible discrete deep generative model

Deep generative learning can be used not only to generate new data with statistical characteristics derived from input data but also for anomaly detection, by separating nominal and anomalous instances based on their reconstruction quality. In this paper, we explore the performance of three unsupervised deep generative models—variational autoencoders (VAEs) with Gaussian, Bernoulli, and Boltzmann priors—in detecting anomalies in multivariate time series of commercial-flight operations. We created two VAE models with discrete latent variables (DVAEs), one with a factorized Bernoulli prior and one with a restricted Boltzmann machine (RBM) with novel positive-phase architecture as prior, because of the demand for discrete-variable models in machine-learning applications and because the integration of quantum devices based on two-level quantum systems requires such models. To the best of our knowledge, our work is the first that applies DVAE models to anomaly-detection tasks in the aerospace field. The DVAE with RBM prior, using a relatively simple—and classically or quantum-mechanically enhanceable—sampling technique for the evolution of the RBM's negative phase, performed better in detecting anomalies than the Bernoulli DVAE and on par with the Gaussian model, which has a continuous latent space. The transfer of a model to an unseen dataset with the same anomaly but without re-tuning of hyperparameters or re-training noticeably impaired anomaly-detection performance, but performance could be improved by post-training on the new dataset. The RBM model was robust to change of anomaly type and phase of flight during which the anomaly occurred. Our studies demonstrate the competitiveness of a discrete deep generative model with its Gaussian counterpart on anomaly-detection problems.
Moreover, the DVAE model with RBM prior can be easily integrated with quantum sampling by outsourcing its generative process to measurements of quantum states obtained from a quantum annealer or gate-model device.


Introduction
The field of machine learning has experienced an explosion in the development of deep-learning methods at the beginning of the 21st century, due to the flexibility, scalability, and superior performance of deep learning in classification, prediction, data generation, anomaly detection, and other applications [Hinton et al., 2006, Bengio et al., 2006, LeCun et al., 2015, Goodfellow et al., 2016]. The phenomenal success of deep learning, which refers to machine-learning techniques that use artificial-neural-network models with many layers, has been enabled by the widespread availability of specialized graphics processing units (GPUs) to perform computing-intensive linear-algebra operations on vectors, matrices, and tensors. These algebraic structures hold the numerical values of the input, intermediate, and output layers of the (deep) network, as well as the values of the biases and weights that are applied to the hidden and output units (neurons) of the network. During network training, manipulations of vectors, matrices, and higher-order tensors are performed ubiquitously to compute the network's output in the forward pass and the gradients used to update the biases and weights in the backward pass [LeCun et al., 2015, Witten et al., 2017].
One way to characterize the training process of a neural network is to differentiate between supervised and unsupervised learning [Dayan and Abbott, 2005, Witten et al., 2017]. In supervised learning, a 'teacher' imposes a set of desired input-output relationships on the network. For example, the training set might contain an extra column that specifies the desired output of the network, such as a class label. The class label or other supervisory information is not available during testing, when the performance of the network is evaluated. In unsupervised learning, no such oversight is provided, and the network's response is self-organized and solely relies on the interplay between external input, intrinsic connectivity, network dynamics, and the value of a cost function that the network attempts to minimize. Unsupervised learning is computationally more complex than supervised learning and remains a largely unresolved problem in machine learning. It has attracted considerable research effort [Hinton et al., 1995, Bengio et al., 2006, Vincent et al., 2008] because it holds the potential to uncover the statistical structure and hidden correlations of large unlabeled datasets, which constitute the predominant form of today's data.
Generative modeling is a machine-learning technique widely used in unsupervised learning. Generative modeling attempts to estimate the probability distribution of a dataset. To accomplish this goal, generative models frequently employ a set of latent variables that represent unobserved factors that influence the values of observed variables. Deep generative models such as generative adversarial networks [Goodfellow et al., 2020], variational autoencoders (VAEs) [Kingma and Welling, 2013], and deep belief networks [Hinton and Salakhutdinov, 2006] have been widely applied to machine-learning use cases in science and engineering. In the studies reported in this paper, we use VAEs for generative modeling because a VAE possesses an efficient inference mechanism, incorporates regularization via a prior, maximizes a lower bound on the log likelihood, and allows estimation of the log likelihood via importance sampling [Kingma and Welling, 2013, Burda et al., 2015]. VAEs employ the evidence lower bound (ELBO) as a variational lower bound on the exact log likelihood. The (negative) ELBO is a well-defined, fully differentiable loss function whose gradients are used to efficiently optimize network weights through backpropagation, permitting competitive performance in mining large datasets.
The majority of VAE and other generative-model designs reported in the literature use continuous latent spaces, both because of the widespread applicability of the (continuous) normal distribution, courtesy of the central limit theorem, and because of the difficulty of propagating gradients through discrete variables. However, many deep-learning use cases rely on discrete latent variables to represent the required distributions, such as in applications in supervised and unsupervised learning, attention models, language models, and reinforcement learning [Kingma et al., 2014, Jang et al., 2016, Maaløe et al., 2017]. In particular, if the values of latent variables are to be computed by quantum computers, the latent variables need to be discrete because projective qubit measurements in the computational basis produce eigenvalues of -1 or +1. See supplementary section S1 for a more in-depth account of the importance of discrete-variable models.
In previous studies, discrete VAEs (DVAEs) and quantum VAEs were used to generate new data from samples from the VAE's latent space after the VAE had been trained on a dataset such as MNIST or Omniglot, and the quality of generation (fit of the VAE's model distribution to the distribution of the input data) was assessed by estimating the log likelihood of test data [Rolfe, 2016, Vahdat et al., 2018a,b, Khoshaman et al., 2019, Khoshaman and Amin, 2018, Vinci et al., 2020, Vahdat et al., 2020]. In the studies reported in this paper, we use VAE models with continuous and discrete latent space to detect anomalies in aeronautics data. The datasets used comprise primarily 1-Hz recordings of operationally significant flight metrics from commercial flights. Subject matter experts analyzed the recorded data and identified operationally and safety-relevant anomalies during takeoff and approach to landing. The identification of flight-operations anomalies is important because they can foreshadow potentially serious aviation incidents or accidents.
The application of VAEs to anomaly-detection tasks has become increasingly popular in recent years. An and Cho [2015] suggested an anomaly-detection method in which the anomaly score of a VAE is used as a Monte Carlo estimate of the reconstruction log likelihood (called "reconstruction probability" in the paper). Xu et al. [2018] used a VAE for the detection of anomalies in univariate time series, preprocessed with sliding time windows, representing seasonal key performance indicators in web applications. Based on the success of deep recurrent neural networks (RNNs) in machine-learning applications with sequential data, several studies have incorporated RNNs in VAEs by equipping the VAE's encoder and decoder with long short-term memory constructs [Chen et al., 2019, Wang et al., 2020, Zhang and Chen, 2019, Zhang et al., 2019, Park et al., 2018]. The LSTM-VAE approach was also applied to anomaly detection in telemetry data from the Soil Moisture Active Passive (SMAP) satellite and the Mars Curiosity rover [Su et al., 2019]. However, the training of a VAE equipped with an RNN architecture on multidimensional time series is computationally costly and may overlook local temporal dependencies. To remedy these shortcomings, Memarzadeh et al. [2020] designed a convolutional VAE (CVAE) and tested its performance in detecting anomalies in time series of various Yahoo! benchmark datasets and in time series spanning the takeoff phase of commercial flights, a task on which the model achieved state-of-the-art performance.
We developed convolutional VAEs with Gaussian, Bernoulli, and Boltzmann priors. The VAE with Gaussian prior has a continuous latent space, whereas the models with Bernoulli or Boltzmann prior have a discrete latent space. The Boltzmann prior is implemented as a restricted Boltzmann machine (RBM) [Smolensky, 1986], that is, as a network of stochastic binary units with full connectivity between visible and hidden units but no connectivity between visible units or between hidden units. The VAE with Gaussian prior and the RBM network of the VAE model with RBM prior are derivations of the CVAE model presented in Memarzadeh et al. [2020] and of the DVAE model depicted in Vinci et al. [2020], respectively. Overall, our studies aim to determine and compare the anomaly-detection performance of the Gaussian, Bernoulli, and RBM models.¹ We want to find out if the anomaly-detection performance of a VAE with discrete latent space is competitive with that of a VAE with Gaussian prior and continuous latent variables, the standard choice of VAE type. Also, if a classical deep generative model with discrete latent variables exhibits a performance that is comparable or superior to that of a continuous-variable counterpart, it is worth exploring if a quantum-enhanced version of the discrete model can achieve a performance that exceeds that of the fully classical discrete model.
We report the results of three sets of experiments. Using a dataset with a drop-in-airspeed anomaly during takeoff, our baseline study explores the behavior of the VAE models during training, with an emphasis on the training behavior of the RBM model, and compares the models' anomaly-detection performance when operating with either optimized or nonoptimal hyperparameters, as given by the performance metrics precision, recall, and F1 score. Our second study investigates the ability of our trained models to generalize (transfer) to a new dataset containing the same anomaly without re-tuning of hyperparameters and without or with post-training on the new dataset. Finally, we examine if the performance of the RBM model is robust to changes in anomaly type and phase of flight, by evaluating the model's performance on a new dataset with delay-in-flap-deployment anomaly during approach to landing; for this study, the model's hyperparameters were re-tuned and the model was re-trained on the dataset used.
The structure of the paper is as follows. In section 2, we review causal generative models, that is, probabilistic models that reconstruct input data from latent variables. The concept of variational inference, as an approximation to an intractable posterior distribution, and prior distributions used in generative modeling with continuous and discrete latent variables are described. Section 3 covers VAEs with continuous and discrete latent spaces. We describe the β-VAE, used in our experiments, a type of VAE model that allows a weighting of the autoencoding and Kullback-Leibler (KL)-divergence terms in the variational ELBO objective. Alternative formulations of RBM prior networks in the VAE's latent space are also introduced. In section 4, we describe the methodology we used to evaluate our models' anomaly-detection performance. In section 5, we present the experimental findings of our three studies, outlined in the preceding paragraph. We discuss model design and performance in section 6 and present our conclusions in section 7. The appendix and supplementary material contain additional information on concepts and experiments.

Causal generative modeling
The goal of generative modeling is to estimate the probability distribution of the input data, p(x), which is unknown but assumed to exist. The distribution of synthetic data points, estimated by the model, is called the marginal distribution, p_θ(x) (where θ denotes the model parameters). The goal is to make p_θ(x) as close to p(x) as possible. In order to accomplish this objective, representational modeling computationally analyzes the statistical structure of a dataset and attempts to identify a set of latent (unobserved) variables z that represent the dominant features of the dataset. Latent variables are also known as 'causes' [Dayan and Abbott, 2005]. A graphical model of a directed generative model with latent variables is depicted in figure 1. The generative model (decoder) reconstructs the input data from the latent variables [Dayan and Abbott, 2005, Kingma and Welling, 2019]. The statistics of the generative model's output are given by the marginal distribution:

p_θ(x) = ∫ p_θ(x, z) dz.    (1)

Here, the joint distribution of the input variables x and the latent variables z, p_θ(x, z), defines the model distribution.
In a directed generative model, the model distribution is explicitly factored into the generative distribution p_θ(x|z), the distribution of the generative model's output given a latent-space realization, and the model's prior distribution p_θ(z).
Figure 1. Generative models with latent variables, z, can be represented as probabilistic graphical models that depict conditional relationships among variables. In a directed generative model, the model distribution, p_θ(x, z) = p_θ(x|z) p_θ(z), is explicitly factored into the generative (decoder) distribution, p_θ(x|z), and the model's prior distribution over latent variables, p_θ(z). The marginal distribution, p_θ(x), is obtained by marginalizing (integrating or summing) over the latent variables. Since the computation of the true posterior, p_θ(z|x), is intractable, variational inference substitutes it with an approximating posterior (encoder), q_φ(z|x).
Black, solid arrows denote the generative model and blue, dashed arrows the variational approximation to the intractable posterior.
In the VAE models used to conduct the experiments described in this paper, the posterior and prior over latent variables are either (a) continuous Gaussian distributions or (b) discrete Bernoulli or Boltzmann distributions. We use the symbolic expression z^d to denote a discrete latent variable.
The marginal distribution is then obtained by integrating ('marginalizing') over the latent variables (see figure 1). If the latent variables are discrete, the integration in (1) is replaced by summation. Generative modeling is well suited for unsupervised learning: Lacking supervisory information, a model's performance is determined by the ability of its latent variables to represent and reproduce the statistical structure of the input variables [Dayan and Abbott, 2005].
In representational learning, input-dependent latent variables (the 'code') are identified by a second model, the recognition model (encoder). The lower the number of latent variables, the more compressed the model's latent space, with the degree of compression given by the ratio of input to latent variables. The posterior distribution over the latent variables given the data follows from Bayes' rule:

p_θ(z|x) = p_θ(x|z) p_θ(z) / p_θ(x).    (2)

Models that allow the tractable computation of the posterior distribution are called invertible and those that do not, noninvertible [Dayan and Abbott, 2005]. Noninvertible models do not allow gradient computations and concomitant optimization. Deep latent-variable models are noninvertible because no analytic solution or efficient estimation procedure exists for the marginal probability given in (1). Since the marginal is intractable, so is the posterior given in (2), as it requires the marginal in its denominator [Kingma and Welling, 2019]. Approximate-inference techniques approximate the true posterior p_θ(z|x) and marginal p_θ(x) [Dayan and Abbott, 2005, Kingma and Welling, 2019]. The approximate posterior is written as q_φ(z|x) (see figure 1). Variational inference is one such approximation technique.
In variational inference, the marginal log likelihood is decomposed as follows [Kingma and Welling, 2013, 2019]:

log p_θ(x) = L(θ, φ; x) + D_KL(q_φ(z|x) || p_θ(z|x)).    (3)

The expression D_KL(p||q) ≡ E_p log[p/q] stands for the Kullback-Leibler (KL) divergence, an asymmetric 'distance' between two probability distributions. The term L(θ, φ; x) denotes the ELBO, a variational approximation, from below, to the true marginal log likelihood. As the KL divergence is non-negative, the closer the approximating posterior is to the true posterior, the greater the ELBO and the tighter its bound on the log likelihood. Hence, maximizing the ELBO simultaneously increases the (log) likelihood and decreases the distance between the variational and true posteriors.
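The decomposition of the marginal log likelihood into ELBO and KL terms can be verified numerically on a toy model. The following sketch uses a single binary latent variable and illustrative probability values (not taken from the paper's models):

```python
import numpy as np

# Tiny worked example of the decomposition log p_theta(x) = ELBO + KL(q || posterior).
# All distributions are over one binary latent z for a single fixed input x.
p_joint = np.array([0.4, 0.1])        # p_theta(x, z) for z = 0, 1 (x fixed)
p_x = p_joint.sum()                   # marginal p_theta(x): a sum over discrete z
p_post = p_joint / p_x                # true posterior p_theta(z|x)
q = np.array([0.6, 0.4])              # approximating posterior q_phi(z|x)

elbo = np.sum(q * np.log(p_joint / q))    # E_q[log p_theta(x, z) - log q_phi(z|x)]
kl = np.sum(q * np.log(q / p_post))       # D_KL(q_phi(z|x) || p_theta(z|x))

assert np.isclose(elbo + kl, np.log(p_x))  # the decomposition holds exactly
assert kl >= 0 and elbo <= np.log(p_x)     # hence the ELBO is a lower bound
```

Because the KL term is non-negative, any increase in the ELBO at fixed log p_θ(x) must shrink the gap between q_φ(z|x) and the true posterior.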

Prior distributions
The simplest generative-model priors are factorized standard normal or Bernoulli distributions:

p_θ(z) = ∏_l N(z_l; 0, 1)    or    p_θ(z^d) = ∏_l B(z^d_l; 0.5),    (4)

where the subscript l denotes a latent variable. Also, we use the symbolic expression z^d to denote a discrete latent variable (to delineate it from a continuous latent variable, expressed as z). A more expressive discrete prior can be obtained from a Boltzmann machine (BM) [Amin et al., 2018] by stipulating a bipartite connectivity between the groups of visible units z^d_v and hidden units z^d_h of the BM, without lateral connections between the units within either group, i.e., an RBM [Smolensky, 1986]. The visible and hidden units correspond to the input and latent variables, respectively, of an undirected generative model and are binary (0 or 1) in the RBMs we employed in the experiments reported in this paper. An RBM prior is given by [Welling et al., 2004, Salakhutdinov et al., 2007, Khoshaman et al., 2019]:

p_θ(z^d_v, z^d_h) = exp[−E_θ(z^d_v, z^d_h)] / Z_θ,
E_θ(z^d_v, z^d_h) = −a^T z^d_v − b^T z^d_h − (z^d_v)^T W z^d_h.    (5)

In (5), E_θ(z^d_v, z^d_h) is the energy function of visible and hidden units and Z_θ the normalizing constant (partition function) of prior p_θ(z^d_v, z^d_h); W, a, and b are the weight matrix between visible and hidden units and the visible and hidden bias vectors, respectively. The conditional distributions of the hidden given the visible units and of the visible given the hidden units are then given by [Witten et al., 2017]:

p(z^d_{h,l} = 1 | z^d_v) = σ(b_l + W^T_{•l} z^d_v),
p(z^d_{v,k} = 1 | z^d_h) = σ(a_k + W_{k•} z^d_h),    (6)

where σ denotes the logistic function, W^T_{•l} is a vector consisting of the transpose of the lth column of weight matrix W, and W_{k•} is a vector consisting of the kth row of W.
Because of the absence of lateral connections between visible units and between hidden units, the conditional probabilities in (6) can be computed for all units in parallel, using block Gibbs sampling. In block Gibbs sampling, the values of all hidden units are updated at once by sampling from their conditional distribution given the visible units [top equation of (6)]. Then, the values of the visible units are updated analogously [bottom equation of (6)]. This process can be repeated for an arbitrary number of iterations. When Gibbs sampling is performed for an infinite number of steps, it is guaranteed to converge to the stationary distribution p_θ(z^d_v, z^d_h) of the RBM model [Hinton, 2012, Fischer and Igel, 2014], and computationally efficient techniques to learn the model distribution have been developed [Hinton, 2002, Tieleman, 2008, Tieleman and Hinton, 2009].
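The RBM energy of (5) and the block Gibbs updates of (6) can be sketched in a few lines; the dimensions, weight scale, and number of steps below are illustrative choices, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal RBM: energy function of eq. (5) and block Gibbs sampling via the
# conditional distributions of eq. (6).
n_v, n_h = 4, 3
W = rng.normal(0, 0.1, (n_v, n_h))   # visible-hidden weight matrix
a = np.zeros(n_v)                    # visible biases
b = np.zeros(n_h)                    # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy(v, h):
    # E_theta(v, h) = -a^T v - b^T h - v^T W h
    return -(a @ v + b @ h + v @ W @ h)

def gibbs_step(v):
    # Block update: all hidden units at once given v, then all visible given h.
    h = (rng.random(n_h) < sigmoid(b + v @ W)).astype(float)
    v = (rng.random(n_v) < sigmoid(a + W @ h)).astype(float)
    return v, h

v = rng.integers(0, 2, n_v).astype(float)
for _ in range(10):                  # a short chain of alternating block updates
    v, h = gibbs_step(v)

assert set(np.unique(v)) <= {0.0, 1.0} and set(np.unique(h)) <= {0.0, 1.0}
assert np.isfinite(energy(v, h))
```

Because each layer is updated as a whole, one Gibbs step costs only two matrix-vector products regardless of the number of units in a layer.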

Variational autoencoders
VAEs are directed generative models with latent variables that approximate the intractable true posterior p_θ(z|x) via variational inference and maximize an ELBO objective L(θ, φ; x) (see section 2.1 and figure 1). The ELBO can be re-written as [Kingma and Welling, 2013]:

L(θ, φ; x) = E_{q_φ(z|x)}[log p_θ(x|z)] − D_KL(q_φ(z|x) || p_θ(z)).    (7)

The first term of (7) is the autoencoding term. Maximizing it maximizes the fidelity of reconstruction because the greater the autoencoding term, the greater the similarity between the data distribution p(x) and the generative distribution p_θ(x|z) when z is sampled from the approximate posterior distribution q_φ(z|x) of the encoder. Conversely, the second term is maximized by minimizing the KL divergence between the approximate posterior and prior, which corresponds to minimizing the mutual information between x and z. Consequently, the autoencoding term attempts to maximize the mutual information between data and latents, and the KL term seeks to minimize it. Eventually, the latent space's information content will depend on the trade-off between the two terms, which, in turn, is determined by the flexibility and expressiveness of the variational approximation q_φ(z|x), the structure of the generative model, and the training method [Khoshaman et al., 2019, Vinci et al., 2020].
Given a training set D = {x^(n)}_{n=1}^{N} consisting of N i.i.d. samples from p(x), the ELBO for a minibatch of training data is given as the average of the ELBOs of the minibatch instances [Kingma and Welling, 2013]:

L(θ, φ; M^(i)) = (1/M) Σ_{m=1}^{M} L(θ, φ; x^(i,m)),    (8)

where the minibatch M^(i) = {x^(i,m)}_{m=1}^{M} contains M data points randomly sampled from D with N data points.

VAE with factorized Gaussian prior
The ELBO objective given in (7) contains expectations of functions of the latent variables z with regard to the variational posterior q_φ(z|x), which can be written as E_{z∼q_φ(z|x)}[f(z)], where f denotes an arbitrary function. To train the model with minibatch stochastic gradient descent, starting from random initializations of the model parameters θ and φ, we need to calculate gradients of such expectations [Kingma and Welling, 2019, Khoshaman et al., 2019].
Procedures to obtain unbiased gradients with respect to θ and φ are described in Appendix A.
We estimated the Gaussian parameters µ and log σ² by means of linear layers at the top of the VAE's encoder (see Appendix D). The log σ² estimate was then routed through a softplus activation. Hence, the mean and log variance of the approximate posterior, µ and log σ², are nonlinear functions of the input data x and the variational parameters φ [Kingma and Welling, 2013].
Moreover, when the VAE prior is given by a factorized Gaussian, the KL-divergence term in the ELBO objective [(7)] can be expressed in closed form. The ELBO is then estimated as:

L(θ, φ; x) ≈ (1/S) Σ_{s=1}^{S} log p_θ(x|z^(s)) + (1/2) Σ_l (1 + log σ_l² − µ_l² − σ_l²),    (9)

where z^(s) = µ + σ ⊙ ε^(s), ε^(s) ∼ N(0, I), and l indexes a latent variable [Kingma and Welling, 2013]. Higgins et al. [2017] modified the VAE objective to reduce the entanglement between latent variables. Each latent variable z_l is to represent a meaningful domain-specific attribute that varies along a continuum when z_l is varied. In order to promote this disentangling property in the latent variables z ∼ q_φ(z|x), the authors introduce a constraint over the posterior q_φ(z|x) from which they are derived, making it more similar to a prior p_θ(z), and thus restrain latent-space capacity and stimulate statistical independence between individual latent-space variables. The ELBO objective [(7)] of a β-VAE is given as (see Appendix B for derivation):

L_β(θ, φ; x) = E_{q_φ(z|x)}[log p_θ(x|z)] − β D_KL(q_φ(z|x) || p_θ(z)).    (10)

β-VAE
The parameter β is used to balance the trade-off between the fit of the reconstructed data, x̂, to the input data, x, imposed by the autoencoding (reconstruction) term (low β), and the fit of the posterior, q_φ(z|x), to the prior, p_θ(z), via the KL term (high β). If β = 0, the β-VAE model is identical to an autoencoder, and if β = 1, the model corresponds to a regular VAE. We would like to note that we do not use β to explicitly disentangle the latent space but employ it as a regularization hyperparameter that requires tuning, an approach pioneered in Memarzadeh et al. [2020].
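The Gaussian reparameterization trick, the closed-form KL term, and the β weighting described above can be sketched as follows; the posterior parameters, β value, and reconstruction term are placeholder values for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Reparameterization: z = mu + sigma * eps keeps the sample differentiable
# in mu and sigma; mu/log_var stand in for the encoder's linear-layer outputs.
mu = np.array([0.5, -1.0])           # posterior means (hypothetical values)
log_var = np.array([-0.2, 0.3])      # posterior log variances
sigma = np.exp(0.5 * log_var)

eps = rng.standard_normal(mu.shape)
z = mu + sigma * eps

# Closed-form KL between N(mu, sigma^2) and the standard-normal prior N(0, I):
kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))

beta = 1.5                           # regularization weight of the beta-VAE
recon_ll = -0.5                      # placeholder reconstruction log likelihood
elbo_beta = recon_ll - beta * kl     # beta-weighted objective

assert kl >= 0.0                     # KL divergence is non-negative
# KL vanishes exactly when the posterior equals the prior (mu = 0, log_var = 0):
kl_zero = -0.5 * np.sum(1 + np.zeros(2) - np.zeros(2) - np.ones(2))
assert np.isclose(kl_zero, 0.0)
```

Raising β thus penalizes any posterior that departs from the prior more heavily, shrinking the effective capacity of the latent code.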

VAE with discrete latent space
Several approaches have been developed to circumvent the non-differentiability problem affecting models with discrete latent units [Mnih and Gregor, 2014, Paisley et al., 2012, Gu et al., 2015, Bengio et al., 2013].In VAE models, the reparameterization trick has been extended by either the incorporation of smoothing functions [Rolfe, 2016] or the relaxation of discrete latent variables into continuous ones [Jang et al., 2016, Maddison et al., 2016, Khoshaman and Amin, 2018].In this work, we employ the Gumbel-softmax trick, which relaxes a discrete categorical distribution into a continuous concrete (or Gumbel-softmax) distribution [Maddison et al., 2016, Jang et al., 2016].
If each individual latent variable is sampled as z^d ∼ B(p), where B denotes the Bernoulli distribution and p its parameter, the concrete relaxation can be expressed as [Maddison et al., 2016]:

z = σ([log α + log ρ − log(1 − ρ)] / λ),    (11)

where σ denotes the logistic function, σ(x) = 1/(1 + e^{−x}), α the odds of the Bernoulli probability, α = p/(1 − p), ρ a continuous uniform random variable on the interval [0, 1], and λ the temperature of the concrete distribution. The temperature parameter λ controls the degree of continuous relaxation: the greater λ, the greater the relaxation and departure from a Bernoulli random variable, whereas small λ values produce continuous variates close to 0 and 1.
In our studies, we consistently use a λ of 0.1, which we determined through hyperparameter tuning to give good performance and which introduces only a small bias. The Gumbel-softmax relaxation was applied only during training and not during evaluation (validation or testing), because differentiability of the objective function is required only during the training phase and we sought to retain, where possible, the discrete character of the model and avoid the (slight) estimation bias introduced by the continuous relaxation. We estimated the log odds of the approximate relaxed Bernoulli posterior probabilities, log α_q, by means of a linear layer at the top of the VAE's encoder (see Appendix D).
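The effect of the temperature λ on the concrete relaxation of (11) can be demonstrated directly; the sample sizes and the high-temperature value below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Concrete (Gumbel-softmax) relaxation of a Bernoulli variable:
# z = sigmoid((log alpha + log rho - log(1 - rho)) / lambda).
def relaxed_bernoulli(p, lam, size, rng):
    alpha = p / (1.0 - p)                 # odds of the Bernoulli probability
    rho = rng.random(size)                # uniform noise on [0, 1)
    logits = (np.log(alpha) + np.log(rho) - np.log(1.0 - rho)) / lam
    return 1.0 / (1.0 + np.exp(-logits))

p = 0.3
z_hot = relaxed_bernoulli(p, lam=5.0, size=100_000, rng=rng)   # strong relaxation
z_cold = relaxed_bernoulli(p, lam=0.1, size=100_000, rng=rng)  # temperature used here

assert np.all((z_hot > 0) & (z_hot < 1))          # relaxed samples live in (0, 1)
# At low temperature the variates concentrate near 0 and 1 ...
assert np.mean((z_cold < 0.01) | (z_cold > 0.99)) > 0.7
# ... and their mean approaches the Bernoulli parameter p:
assert abs(z_cold.mean() - p) < 0.02
```

In the λ → 0 limit the relaxed variable converges in distribution to the original Bernoulli variable, which is why a small λ introduces only a small bias.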
Discrete VAEs can be implemented with Bernoulli [bottom equation in (4)] or RBM [(5)] priors. Based on the probability mass function of a Bernoulli random variable, B(k; p) = p^k (1 − p)^{1−k}, the KL term of the ELBO of a VAE with a Bernoulli prior can be expressed as:

D_KL(q_φ(z^d|x) || p_θ(z^d)) = E_{z^d∼q_φ(z^d|x)} Σ_l [z^d_l log(q_l/0.5) + (1 − z^d_l) log((1 − q_l)/0.5)],    (12)

where q_l stands for the parameter of the latent Bernoulli variable z^d_l ∼ B(q_l), considered to be discrete, and 0.5 is the parameter of the Bernoulli prior distribution, B(p = 0.5). The KL term can be implemented using the BCEWithLogitsLoss(•) function of the PyTorch nn module, with the log-odds parameter vector of the respective relaxed Bernoulli distribution [log α_q (approximate posterior) or log α_p = 0 (prior)] and the relaxed latent vector z (training) or hard latent vector z^d (evaluation) as input arguments. We used this (stochastic) method [a Monte Carlo estimator of (12)] in all experiments in which the VAE had a Bernoulli prior. An analytic expression for the KL term in the ELBO objective [(7)] of a VAE with Bernoulli prior is given in Appendix C.
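The Monte Carlo estimator of the Bernoulli KL term can be reproduced in NumPy using the same numerically stable formula that PyTorch's BCEWithLogitsLoss implements; the posterior log odds below are illustrative stand-ins for encoder outputs:

```python
import numpy as np

# BCE-with-logits of a latent vector against Bernoulli log odds equals the
# negative Bernoulli log likelihood of that vector, so the KL Monte Carlo
# sample is simply bce(prior logits) - bce(posterior logits).
def bce_with_logits(logits, target):
    # Stable -[t*log(sigmoid(x)) + (1-t)*log(1-sigmoid(x))], summed over units.
    return np.sum(np.maximum(logits, 0) - logits * target
                  + np.log1p(np.exp(-np.abs(logits))))

rng = np.random.default_rng(3)
log_alpha_q = np.array([1.2, -0.8, 0.3])        # posterior log odds (hypothetical)
q = 1.0 / (1.0 + np.exp(-log_alpha_q))          # posterior Bernoulli parameters

n, kl_mc = 20_000, 0.0
for _ in range(n):
    z_d = (rng.random(3) < q).astype(float)     # hard sample from q_phi(z^d|x)
    log_q = -bce_with_logits(log_alpha_q, z_d)  # log q_phi(z^d|x)
    log_p = -bce_with_logits(np.zeros(3), z_d)  # log p_theta(z^d); log odds 0 <=> p = 0.5
    kl_mc += (log_q - log_p) / n

# Closed-form Bernoulli KL against the B(0.5) prior, for comparison:
kl_exact = np.sum(q * np.log(q / 0.5) + (1 - q) * np.log((1 - q) / 0.5))
assert abs(kl_mc - kl_exact) < 0.03
assert kl_exact >= 0.0
```

During training, the hard samples z^d would be replaced by the relaxed latents z, which keeps the estimator differentiable in the posterior log odds.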
Since the applied concrete (Gumbel-softmax) relaxation replaces discrete latent variables z^d with continuous variables z, the KL term of the ELBO of a VAE with a relaxed Bernoulli posterior and an RBM prior can be expressed as:

D_KL(q_φ(z|x) || p_θ(z)) = E_{q_φ(z|x)}[log q_φ(z|x)] − E_{q_φ(z|x)}[log p_θ(z)].    (13)

Following [Khoshaman and Amin, 2018, Song and Kingma, 2021], unbiased gradients of the KL term with regard to the generative and variational parameters can be obtained as:

∇_{θ,φ} E_{q_φ(z|x)}[−log p_θ(z)] = E_ρ[∇_{θ,φ} E_θ(z(φ, ρ))] − E_{z̄^d∼p_θ(z̄^d)}[∇_θ E_θ(z̄^d)],    (14)

where the gradients of the log prior probability are given, as usual, as the difference between a positive and a negative phase. The symbolic expression z̄^d denotes 'fantasy states,' i.e., values of the latent variables produced by the RBM model (prior) distribution, which remain discrete and are not relaxed during training [Khoshaman and Amin, 2018, Vinci et al., 2020]. In the expression above, we have highlighted the fact that the positive-phase energy [E_θ(z(φ, ρ))] and the negative-phase energy [E_θ(z̄^d)] are calculated [according to the bottom equation of (5)] using relaxed posterior samples (z) and discrete model samples (fantasy states z̄^d), respectively. The objective of training is to make the model distribution, p_θ(z^d), as similar as possible to the posterior distribution, q_φ(z|x). We used the persistent-contrastive-divergence (PCD) algorithm [Tieleman, 2008] to evolve the conditional distributions of the 'visible' and 'hidden' layers of fantasy states of the negative phase by means of Gibbs sampling, starting from initialization to zero. In PCD, the chains of the fantasy states' values are persistent and continue to evolve over cycles of training (minibatches), without re-initialization at the beginning of each cycle. The PCD algorithm is characterized by short mixing times (fast convergence to the stationary distribution) because the weight updates repel the persistent chains from their current states by raising the energies of those states [Hinton, 2012]. It should be noted that this form of training does not require knowledge of the (intractable) partition function Z_θ. Hence, our VAE model with RBM prior is an energy-based model whose training is based on (unnormalized) prior energies rather than (normalized) prior probabilities.
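The PCD training loop for the RBM prior can be sketched as follows. The sizes, learning rate, number of Gibbs steps k, and the stand-in for the relaxed posterior samples are all illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(4)

# Persistent contrastive divergence (PCD) for an RBM prior: the negative-phase
# fantasy chains persist across minibatches and are advanced by k block Gibbs steps.
n_v, n_h, n_chains, k, lr = 6, 6, 8, 1, 0.05
W = rng.normal(0, 0.01, (n_v, n_h))
a, b = np.zeros(n_v), np.zeros(n_h)
v_neg = np.zeros((n_chains, n_v))            # persistent chains, initialized to zero

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(50):                          # training cycles (one "minibatch" each)
    # Positive phase: relaxed posterior samples stand in for the visible units
    # (placeholder uniforms here); hidden conditional probabilities given the
    # visibles [top equation of (6)] serve as the positive-phase statistics.
    v_pos = rng.random((n_chains, n_v))
    h_pos = sigmoid(b + v_pos @ W)
    # Negative phase: advance the persistent fantasy chains by k Gibbs steps,
    # without re-initializing them at the start of the cycle.
    for _ in range(k):
        h_neg = (rng.random((n_chains, n_h)) < sigmoid(b + v_neg @ W)).astype(float)
        v_neg = (rng.random((n_chains, n_v)) < sigmoid(a + h_neg @ W.T)).astype(float)
    # Positive minus negative phase: gradient of the expected log prior w.r.t. W.
    grad_W = (v_pos.T @ h_pos - v_neg.T @ h_neg) / n_chains
    W += lr * grad_W                         # ascend the log prior probability

assert W.shape == (n_v, n_h) and np.isfinite(W).all()
assert set(np.unique(v_neg)) <= {0.0, 1.0}   # fantasy states remain discrete
```

Note that the update uses only energy statistics of samples, never the partition function Z_θ, which is what makes this energy-based training tractable. (Bias updates are omitted for brevity; they follow the same positive-minus-negative pattern.)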

Latent Boltzmann networks
In a VAE with an RBM prior, the RBM network is located in the latent space; there are no visible RBM units corresponding to input data as in a standalone RBM. Also, by necessity, the latent variables z of the positive phase are continuous (because they are sampled from the approximate posterior distribution, which is relaxed during training via the Gumbel-softmax procedure described above in order to make the ELBO differentiable). On the other hand, the RBM model samples (fantasy states) z̄^d remain discrete variables, as indicated by the superscript, and are not relaxed during training.
Rolfe [2016] first developed a DVAE with RBM prior. The model applied the spike-and-exponential transformation to the posterior latents to make them differentiable. Khoshaman and Amin [2018] then modified the DVAE/RBM model by using the Gumbel-softmax trick to bring about the continuous relaxation of the DVAE's latent variables; the authors termed their DVAE model with RBM prior and Gumbel-softmax relaxation 'GumBolt.' Vinci et al. [2020] introduced a quantum version of the GumBolt model, based on Amin et al. [2018] and Khoshaman et al. [2019]. These authors split up the posterior latent units z into two portions of equal size (denoted as z_l and z_r in figure 2) to implement between them the positive phase of the RBM model according to (14) and the energy function given in (5). A corresponding approach is taken for the fantasy states z̄^d of the negative phase. We designate an RBM model with such variables a 'bipartite latent-space RBM.' In such an RBM model, there is no difference in kind between the 'visible' and 'hidden' units (for example, z_l and z_r, respectively, in the positive phase). The fantasy states are evolved via PCD Gibbs sampling, whereas the posterior latent states remain 'clamped' to their original values (VAE encoder output) and are not subjected to Gibbs sampling.
Figure 2. Contrasting architectures of (a) a prior model in which an RBM is implemented between two subdivisions of the VAE's latent space, obtained by dividing the latent units z and z̄^d, sampled from, respectively, the approximate posterior q_φ(z|x) and model prior p_θ(z^d), into two equally sized parts each (here termed 'bipartite latent-space RBM') [Rolfe, 2016, Khoshaman and Amin, 2018, Khoshaman et al., 2019, Vinci et al., 2020] and of (b) a prior RBM model in which a true hidden layer is added to the layer of posterior latents, which constitute the 'visible' layer of the RBM's positive phase (here termed 'RBM with augmented positive phase'). Here, the symbolic expressions for the approximate posterior and prior distributions [q_φ(z|x) and p_θ(z^d), respectively] highlight the fact that the posterior latents are relaxed into continuous random variables z whereas the prior latents (fantasy states) remain discrete variables z̄^d. In a bipartite latent-space RBM, the latent states z of the approximate posterior remain 'clamped' to their original values (VAE encoder output) whereas the fantasy states z̄^d of the negative phase are evolved via PCD Gibbs sampling. In this work, we treat the latent variables z of the approximate posterior as the visible units of the RBM (z_v) and add an equal number of hidden units (z^d_h), thus augmenting the RBM's positive phase. The values of z^d_h are determined by one-step Gibbs sampling, given the values of z_v. The values of the 'visible' and 'hidden' units of the negative phase are evolved by PCD Gibbs sampling, as in the bipartite latent-space model. The symbol k denotes the number of Gibbs-sampling steps.
In the studies conducted for this paper, we have adopted a slightly different approach. We consider the posterior latent variables z to be the inputs of the latent-space RBM (z_v) and add an equal number of hidden units (z_h^d) to the model. The values of the hidden units of the positive phase are determined by one-step Gibbs sampling, given the values of the visible units [see the top equation of (6) for the hidden units' distribution].² The values of the 'visible' and 'hidden' units of the negative phase (z_v^d and z_h^d, respectively) are evolved by PCD Gibbs sampling, as in the bipartite latent-space model. We call a model with such features an 'RBM with augmented positive phase.' The contrasting architectures of the two models are shown in figure 2. We chose the model shown in figure 2(b) over the model in (a) to conduct our studies because it had displayed a more dynamic training and slightly better performance in preliminary experiments on the baseline dataset, using otherwise identical hyperparameters, including the number of VAE latents. When the model with augmented positive phase was used as the RBM prior, the mean latent values and mean energies of the positive and negative phases fluctuated more over minibatches of training before converging to their final values than when the bipartite latent-space RBM was used. Similarly, the components of the RBM bias vector b and the elements of the weight matrix W sometimes changed from decreasing to increasing, or vice versa, while converging to their final values over the epochs of training, whereas the parameters of the bipartite latent-space RBM converged to their final values without much intermediary change of direction. The more lively training of the model depicted in figure 2(b) indicated to us that it might more easily switch between different modes of the system's energy landscape than the faster or more directly converging model illustrated in (a). Lastly, the model in (b) produced slightly better performance metrics (precision, recall, and F1 score) in preliminary test runs on the baseline dataset. We would like to emphasize that we did not comprehensively evaluate and compare the training behavior and performance of the two models because such an assessment and comparison was not the focus of our studies. Both models performed similarly overall, as was to be expected considering their similar structures.

² During training, the visible units of the positive phase z, which correspond to the VAE's latent variables, are continuous because they are Bernoulli variables relaxed via the Gumbel-softmax trick to make the objective function differentiable. The fantasy states of the negative phase z_d, by contrast, remain discrete variables and are not relaxed during training. However, the variable type (continuous or discrete) of the hidden units of the positive phase is not obvious. The values of the positive phase's visible and hidden units are used to compute the positive phase's energy E_θ(z_v, z_h^(d)), which does not depend on the variational parameters φ [see the bottom equation of (5)]. We evaluated all combinations of continuous and discrete units in the formula for the energy of the positive phase (continuous visible and continuous hidden, continuous visible and discrete hidden, and discrete visible and discrete hidden). The combination of continuous visible units and discrete hidden units produced the best performance, and we chose this combination, based on this empirical observation, as indicated in figure 2. We will address this question more rigorously in future research.
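The sampling logic of the augmented positive phase and the PCD negative phase can be sketched as follows. This is a minimal NumPy illustration only (our models are implemented in PyTorch); the class and method names are illustrative and not part of our implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class AugmentedRBMPrior:
    """Toy RBM prior with an augmented positive phase (illustrative sketch)."""

    def __init__(self, n_units, n_fantasy=500):
        self.a = np.zeros(n_units)                      # visible biases
        self.b = np.zeros(n_units)                      # hidden biases
        self.W = 0.01 * rng.standard_normal((n_units, n_units))
        # persistent fantasy states of the negative phase (PCD)
        self.zv_d = rng.integers(0, 2, (n_fantasy, n_units)).astype(float)

    def energy(self, zv, zh):
        # E(z_v, z_h) = -a.z_v - b.z_h - z_v.W.z_h, computed per instance
        return -(zv @ self.a + zh @ self.b
                 + np.einsum('ij,jk,ik->i', zv, self.W, zh))

    def positive_phase(self, zv):
        # zv: relaxed (continuous) posterior latents, clamped as visible units;
        # the hidden units are obtained by one Gibbs step and kept discrete
        p_h = sigmoid(zv @ self.W + self.b)
        zh = (rng.random(p_h.shape) < p_h).astype(float)
        return self.energy(zv, zh)

    def negative_phase(self, k=20):
        # evolve the persistent chains by k Gibbs updates; both layers discrete
        for _ in range(k):
            p_h = sigmoid(self.zv_d @ self.W + self.b)
            zh_d = (rng.random(p_h.shape) < p_h).astype(float)
            p_v = sigmoid(zh_d @ self.W.T + self.a)
            self.zv_d = (rng.random(p_v.shape) < p_v).astype(float)
        return self.energy(self.zv_d, zh_d)
```

During training, the difference between the mean positive-phase and mean negative-phase energies drives the gradient updates of the RBM parameters a, b, and W.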

Evaluation of anomaly-detection performance
For the experiments reported in this paper, we trained 16 models independently (10 for the experiments assessing performance on the baseline dataset with nonoptimal hyperparameters; section 5.1.2), and we report the mean ± standard deviation. The reconstruction error of the ELBO [the negative of the left term in (9)] can be operationalized by the mean squared error (MSE) between the training data and the decoder output (the reconstructed training data). We used this error metric when the input data were normalized by mean centering and scaling to unit variance (z score), which was the case for the datasets with a drop-in-airspeed anomaly during takeoff. The MSE between the reconstructed and original training data is

MSE = (1/N) Σ_i ||x_i − x̂_i||²,   (15)

where x symbolizes the input data and x̂ their reconstructions, and the sum is taken over a minibatch of N training instances.
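As an illustration, the per-instance MSE score can be computed as follows (a minimal sketch; the assumed array layout of instances, time steps, and features is ours):

```python
import numpy as np

def mse_scores(x, x_hat):
    """Per-instance mean squared reconstruction error.

    x, x_hat: arrays of shape (instances, time steps, features).
    """
    return ((x - x_hat) ** 2).mean(axis=(1, 2))
```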
We used a different error metric to estimate the reconstruction error when the training data were normalized to lie between zero and one using the transformation

x′ = (x − min(x)) / (max(x) − min(x)),   (16)

where x′ symbolizes the transformed input data. This was the case for the dataset with a delay-in-flap-deployment anomaly during approach to landing. In this case, we used the binary cross entropy (BCE) to estimate the reconstruction error:

BCE = −Σ_i Σ_j [x′_ij log x̂′_ij + (1 − x′_ij) log(1 − x̂′_ij)],   (17)

where the sums are over input instances and features. Our experiments showed that the BCE error metric captured the reconstruction error more accurately when the input data were scaled to the interval [0, 1].
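The min-max scaling and the per-instance BCE score can be sketched as follows (function names and the per-feature scaling axes are our assumptions):

```python
import numpy as np

def minmax_scale(x):
    # per-feature scaling to [0, 1] over all instances and time steps
    lo = x.min(axis=(0, 1), keepdims=True)
    hi = x.max(axis=(0, 1), keepdims=True)
    return (x - lo) / (hi - lo)

def bce_scores(x, x_hat, eps=1e-7):
    # per-instance binary cross entropy, summed over time steps and features;
    # clipping by eps guards against log(0)
    x_hat = np.clip(x_hat, eps, 1.0 - eps)
    return -(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat)).sum(axis=(1, 2))
```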
Since nominal data points are much more prevalent than anomalous ones, the model primarily learns patterns exhibited by nominal data, and, therefore, their reconstruction errors (MSE or BCE) tend to be smaller than the errors of anomalous points. However, a powerful encoder-decoder model will also fit anomalous data points, which is undesirable when the reconstruction error is used as the metric to identify anomalies. To discourage the fitting of anomalous instances, our models are variational (rather than pure) autoencoders. The KL-divergence term in the ELBO of a VAE [(7)] penalizes the divergence between the approximate posterior q_φ(z|x) and the prior p_θ(z). We introduce the hyperparameter β into the objective function [see (10)], based on the method developed in Higgins et al. [2017], to regulate the relative weighting of the autoencoding and KL terms of the ELBO. When β is chosen properly, such that the reduction in KL divergence due to the similarity between posterior and prior outweighs the rise in reconstruction error due to the lack of fit to the (few) anomalous data points, the model can be induced to preferentially fit data in the nominal majority class. Hence, optimal anomaly detection depends on careful tuning of the hyperparameter β.
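Schematically, the per-minibatch training loss is the negative β-ELBO; a minimal sketch with hypothetical argument names:

```python
def beta_elbo_loss(recon_error, kl_div, beta):
    """Negative beta-ELBO: reconstruction term plus beta-weighted KL term.

    recon_error: MSE or BCE estimate of the reconstruction error;
    kl_div: estimate of KL(q(z|x) || p(z)). A larger beta strengthens the
    pull of the posterior toward the prior, discouraging the model from
    fitting the few anomalous instances.
    """
    return recon_error + beta * kl_div
```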
Once a VAE model is trained with empirically optimized hyperparameters, we use the reconstruction errors per training instance, e, to determine an anomaly-score threshold. The threshold is based on the assumption that the data (nominal and anomalous instances) are normally distributed and is specified as

thr = ⟨e⟩ + z Δ(e),   (18)

where the angle brackets and Δ denote the mean and standard deviation, respectively, and z the z score, derived from the (known) percentage of anomalies in the training set using the quantile function. Once the threshold is determined based on the training data, we identify anomalies in the test data by calculating the anomaly score (reconstruction error) for each test instance and comparing it to the above threshold; instances with an anomaly score below the threshold are classified as nominal, and instances with a score above the threshold are considered anomalous. Anomaly scores given by the BCE error metric are reasonably normally distributed. However, anomaly scores corresponding to the MSE metric are considerably skewed to the right and possess a long right tail. We applied various normalizing transformations to such anomaly scores, including the square-root, natural-logarithm, and inverse (reciprocal) transformations. On average, the logarithm produced the best anomaly-detection performance, and for this reason, we apply a log transformation to the anomaly scores from these datasets. We sampled both the threshold (based on the training set) and the (log-transformed) anomaly scores (based on the test set) ten times each and used the respective average values to classify data as nominal or anomalous. The thus predicted data labels, determined in an unsupervised way, are then compared with the known true data labels to compute performance metrics.
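The thresholding rule can be sketched as follows (a minimal sketch; the function names are ours, and `NormalDist.inv_cdf` supplies the standard-normal quantile function):

```python
import numpy as np
from statistics import NormalDist

def anomaly_threshold(train_scores, anomaly_frac, log_transform=True):
    # mean + z * std of the (optionally log-transformed) training scores,
    # with z the standard-normal quantile of the known anomaly fraction
    s = np.log(train_scores) if log_transform else np.asarray(train_scores)
    z = NormalDist().inv_cdf(1.0 - anomaly_frac)
    return s.mean() + z * s.std()

def classify(test_scores, thr, log_transform=True):
    s = np.log(test_scores) if log_transform else np.asarray(test_scores)
    return (s > thr).astype(int)    # 1 = anomalous, 0 = nominal
```

With a right-skewed (lognormal-like) score distribution, taking the logarithm before applying the normal-theory threshold is what makes the rule approximately calibrated.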
In all studies, we assume that the nominal data are the negative class and that the anomalous data are the positive class.
We assess model performance with three metrics, precision, recall, and F1 score, specified as

precision = TP / (TP + FP),  recall = TP / (TP + FN),  F1 = 2 · precision · recall / (precision + recall),   (19)

where TP represents true positives (correctly identified anomalies), FP false positives (alarms; nominal instances incorrectly categorized as anomalous), and FN false negatives (missed anomalies, i.e., instances incorrectly classified as nominal).
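These metrics can be computed directly from the binary label vectors (a minimal sketch):

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 from binary labels (1 = anomalous)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```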

Experimental results
All VAE models were implemented in Python using the PyTorch deep-learning library [Paszke et al., 2019]. Our VAE model with Gaussian prior is an upgraded version of the CVAE model introduced in Memarzadeh et al. [2020], which achieved state-of-the-art performance in detecting anomalies in aviation time series. In all models, the Adam optimizer was used, with a learning rate of 3 × 10^−4 and default momentum parameters [Kingma and Ba, 2014]. We used minibatch-based optimization. Minibatches of mutually exclusive training-set instances were re-shuffled for each epoch of training. All minibatches comprised 128 training-set instances, except for the minibatches used for post-training in the transferability study (section 5.2), which contained 32 instances. We used validation sets to determine combinations of hyperparameter values with good performance. Except for the top layer of the encoder, which outputs the estimated parameters of the approximate posterior and the reparameterized latent variables (and effects the relaxation of discrete latent variables in discrete-variable models), all models use the same encoder and decoder architecture, presented in Appendix D.

Baseline study: Drop in airspeed during takeoff
We determined the baseline performance of the VAE models with Gaussian, Bernoulli, and RBM priors on a dataset of departing flights with a drop-in-airspeed anomaly. Subject-matter experts ascertained that if the speed of an aircraft drops by more than 20 knots in the first 60 s after becoming airborne, an adverse event might ensue; therefore, data points with such a property are classified as anomalous. The dataset contains flight-operations data of commercial flights. It comprises primarily 1-Hz recordings for each flight that cover a variety of systems, including the state, orientation, and speed of the aircraft. The data are acquired in real time aboard the aircraft and downloaded by the airline once the aircraft has reached the destination gate. Each data instance is a 60-s-long recording of seven sensor outputs (five continuous, two discrete) during the initial ascent after becoming airborne. The drop-in-airspeed anomaly is not necessarily the only type of anomaly in the dataset, and the true number of operationally significant anomalies is unknown.

Model training
For this baseline study, we divided the data into training (60%), validation (20%), and test (20%) sets. We used the training set to train the models, the validation set to monitor overfitting and select hyperparameters, and the test set to assess model performance in an unbiased way. Since the input data were normalized by mean centering and scaling to unit variance, the MSE [(15)] was used to estimate the reconstruction error. Models were trained for 400 epochs; model weights tended to converge at about 100 epochs. We did not observe any overfitting with increasing training time, as visualized by the change of the model's loss [the negative of the β-ELBO objective (10)] over time when evaluated on the validation set (figure 1 in the supplementary material).
We assessed many combinations of hyperparameters, separately for each model type (Gaussian, Bernoulli, RBM), and selected the hyperparameter values that produced the best overall performance in precision, recall, and F1 score. Our approach of using anomaly-detection performance on a validation set, assessed by comparing the data labels predicted by a model with the validation set's known true labels, to tune model hyperparameters is comparable to the hyperparameter-optimization strategies employed in Memarzadeh et al. [2020] and Li et al. [2021]. The hyperparameters chosen for each model are given in table 1. The RBM model, which has a more flexible and expressive prior than the models with standard-normal or Bernoulli priors, performed optimally at a lower latent-space dimensionality than the Gaussian and Bernoulli models. We also explored the application of a loss penalty to the RBM coupling weights W; the use of a sampling replay buffer, in which 5% of the fantasy states, randomly chosen, are not determined by the persistent chains but randomly set to 0 or 1 with equal probability [Du and Mordatch, 2019]; KL-term annealing ('warm-up') [Sønderby et al., 2016]; a hierarchical (conditional) posterior [Vahdat et al., 2020, Vinci et al., 2020]; multiple sampling and averaging of the ELBO and its gradients [Vinci et al., 2020]; and the continuous Bernoulli distribution to normalize the ELBO's reconstruction term [Loaiza-Ganem and Cunningham, 2019]. In the end, we did not apply any of these modifications because none of them improved the performance of our models, where applicable.
Table 1. Hyperparameters used for the Gaussian, Bernoulli, and RBM models.

To assess model performance, the training and validation sets were combined, and the models were re-trained on the combined training/validation set. We then evaluated the performance of the models on the test set. The times needed to train the Gaussian, Bernoulli, and RBM models on the combined training/validation set for 400 epochs on a Skylake GPU-enhanced node of the Pleiades supercomputer³ at the NASA Ames Research Center are shown in table 2. The Gaussian and Bernoulli models, as well as the RBM model with one PCD Gibbs-sampling update during the negative phase, require about the same average training time [Gaussian: 1 h 0 min 29 s; Bernoulli: 58 min 27 s; RBM (k = 1): 1 h 4 min 13 s]. On the other hand, the RBM model with 20 Gibbs updates requires, on average, more time to train [RBM (k = 20): 1 h 34 min 17 s].
Figure 3 shows the values of the latent units of the positive phase [z ∼ q_φ(z|x)] and of the negative phase [z_d ∼ p_θ(z_d)], averaged over latent dimensions and minibatch instances during a typical training run, as well as the corresponding mean energies according to the bottom equation of (5), where the visible and hidden units reside in the VAE's latent space. Values for each minibatch (171 per epoch) are shown. The figure illustrates that the mean negative-phase values (of the latent variables and energy) closely follow their positive-phase counterparts. However, the mean negative-phase values fluctuate less and are more centered; these findings indicate that the (free-running) negative phase reproduces a smoothed and partially averaged version of the structure of the VAE latent units of the (clamped) positive phase. Training of the RBM biases and weights was dynamic, suggesting that the PCD algorithm explored well the energy landscape of the configurations of the system given by the dataset and model (figures 2 and 3 in the supplementary material). Histograms of log-transformed anomaly scores of the RBM model are shown in figure 4 for two training modes. We sampled both the anomaly-score threshold based on the training data and the test data's anomaly scores ten times and used the sample statistics to gauge model performance, as described in section 4.
We observed that models enter different modes during training and differ in their anomaly-detection performance depending on the selected mode. The mode with a threshold of about 5.7 for log-transformed anomaly scores produced the best model performance, with F1 scores > 0.65. Training in modes with a threshold ≳ 5.8, on the other hand, resulted in inferior model performance (F1 score < 0.65). The superior performance of the mode with thr ≈ 5.7 is illustrated by the cleaner separation of nominal and anomalous data. Modes with thr ≳ 5.8, on the other hand, are characterized by a greater number of false positives (nominal data to the right of the anomaly-score threshold). Other modes, with thresholds between 5.7 and 5.8, were also observed but are less common.

Model performance
The performance of the VAE models with Gaussian, Bernoulli, and RBM priors in the baseline study is shown in figure 5. The RBM model achieved a mean precision of 0.563, a mean recall of 0.817, and a mean F1 score of 0.666. The Gaussian model achieved a similar performance (pr 0.579, rc 0.778, f1 0.663). A notable observation is that the Bernoulli model lags behind both the RBM and the Gaussian model (pr 0.425, rc 0.596, f1 0.495). Our models demonstrate an excellent performance considering that the training was unsupervised and that the similar unsupervised CVAE model with Gaussian prior developed by Memarzadeh et al. [2020], which achieved state-of-the-art anomaly-detection performance on aviation datasets, achieved a precision of 0.31 and a recall of 0.53 on a similar dataset. The Bernoulli model is the discrete counterpart of the Gaussian model: its prior is a factorized Bernoulli distribution, with a parameter of 0.5. The models' similar training times also attest to their comparability (see table 2). Comparing the two models with simple factorized priors, the continuous (Gaussian) model performs better on this dataset, which is dominated by continuous time-series inputs (5 continuous time series, 2 discrete/binary ones).
On the other hand, the performance deficit of the simple Bernoulli model is offset by the more expressive RBM model, both of which are discrete VAE models. It is precisely the RBM model's flexibility and capability to adapt to the posterior distribution that push it to a level of performance on par with the continuous Gaussian model. This observation highlights the performance boost accorded by energy-based modeling and MCMC sampling of the persistent states of the negative phase.
While training and evaluation with optimized hyperparameters allow a fair comparison between models, this approach leads to an overestimation of model performance when no validation set with labeled instances is available, as is frequently the case in practice, including in aeronautics applications. To give an impression of the performance of our models when nonoptimal hyperparameters are used, we trained models with all combinations of four different values for the latent-space dimensionality and five different settings of the hyperparameter β. The performance of models with 32, 64, 128, and 256 latent units as well as β values of 1, 10, 25, 50, and 100 was evaluated. We trained ten Gaussian, Bernoulli, and RBM models each in this way for 300 epochs, which allowed model weights to converge to their final values, and averages of the resultant F1 scores are displayed as heatmaps in figure 6. The figure demonstrates the performance degradation that occurs when nonoptimal hyperparameters are used. A β value of 50 or 100 is associated with moderate performance (F1 score between 0.314 and 0.536 across all three models), while a β value of 1, 10, or 25 leads to poor performance (F1 score between 0.202 and 0.328). The influence of the hyperparameter β on model performance is nonlinear, with a value of 1 or 25 resulting in better performance than a β of 10. Performance differences due to different numbers of latent units are less pronounced. All latent-space dimensions investigated in this experiment (32, 64, 128, and 256) produce good performance and correspond to the latent-space dimensions employed in all studies described in this paper (see table 1).
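The grid evaluation described above amounts to the following loop (a sketch; `train_and_score` is a hypothetical stand-in for training a model with the given hyperparameters and returning its mean validation F1 score):

```python
from itertools import product

def hyperparameter_grid(train_and_score,
                        latent_dims=(32, 64, 128, 256),
                        betas=(1, 10, 25, 50, 100)):
    # evaluate every (latent_dim, beta) combination and return the best one
    scores = {(d, b): train_and_score(d, b)
              for d, b in product(latent_dims, betas)}
    best = max(scores, key=scores.get)
    return best, scores
```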

Model transferability
The DASHlink dataset⁵ used in this experiment consists of time series of sensor data collected during the first 60 s of the takeoff phase of commercial flights. This dataset was collected independently of the dataset used in the baseline study (section 5.1) but contains the same input time series as the baseline dataset. It also contains the same anomaly (a drop in airspeed during takeoff by more than 20 knots) as the baseline dataset. The transferability study tests the ability of the models tuned and trained on the baseline data to transfer to another dataset containing the same input attributes and anomaly. All transferability experiments were conducted with the hyperparameters determined by training on the baseline dataset's training set and assessing model performance on the baseline dataset's validation set (table 1). We examined two versions of transferability: a strong version, in which a model trained on the baseline data was directly tested on the DASHlink takeoff data, and a relaxed version, in which a model trained on the baseline data was post-trained on the DASHlink takeoff data for 300 epochs, with model weights initialized to the values acquired after training the model to convergence on the baseline data. Overall, the model transfer markedly reduced model performance while still producing usable results. The transferability results are more difficult to interpret with regard to performance differences between the three models than the baseline results because the experiments from which they were obtained were more complex and involved additional factors (compared with the baseline experiments), since they involved two datasets. We report additional transferability experiments in supplementary section S2.

Robustness of RBM model: Delay in flap deployment during approach to landing
To deepen our insight into the behavior and performance of the VAE model with RBM prior, we conducted a further anomaly-detection study with this model on another dataset, including de novo determination of hyperparameters and training. The experimental approach for this study is similar to the one adopted for the baseline study (section 5.1). The anomaly in the dataset used for this study is a delay in the deployment of flaps, as judged by a subject-matter expert, during the final approach to landing, lasting approximately 160 s.⁶ As in the previous experiments, this labeled anomaly is not necessarily the only anomaly in the dataset, and the label is not among the model inputs during the unsupervised training. The dataset contains ten time series of aeronautical sensor outputs, relating to the position, orientation, and speed of the aircraft, as well as to the positions of the control surfaces.
The dataset was divided into training (60%), validation (20%), and test (20%) sets. Since the data were normalized to lie in the range of zero to one using (16), we used the BCE [(17)] to estimate the reconstruction error [the negative of the left term in (9)]. Training proceeded for 400 epochs, and the weights converged to their final values after about 100 epochs. We again independently trained 16 models and report the mean ± standard deviation of their performance. We did not observe any overfitting over the epochs of training, as visualized by the progression of the losses on the validation set.
Of the hyperparameter combinations tried, the one that produced the best overall performance in precision, recall, and F1 score, evaluated on the validation set, is given in table 1. To test model performance, the training and validation sets were combined for the purpose of model training, and the performance of the models trained in this way was then evaluated on the test set. It should be noted that in this study anomaly scores were not log-transformed because the original scores were already approximately normally distributed. Log transformation neither increased the scores' normality nor improved anomaly-detection performance. Additional information is provided in supplementary section S3. Table 3 shows the performance of the RBM model in this case study. The model achieved a mean precision of 0.591, a mean recall of 0.647, and a mean F1 score of 0.618. The mean F1 score is similar to (slightly lower than) the F1 value of 0.666 achieved by the RBM model in the baseline study, which was performed on a dataset with a drop-in-airspeed anomaly during takeoff. Consequently, the late-deployment-of-flaps study corroborates the excellent anomaly-detection performance of the unsupervised VAE model with RBM prior observed in the baseline study and illustrates the model's robustness to changes of anomaly type and phase of flight.

Model design
We developed two DVAE models. The Bernoulli model employs a factorized Bernoulli distribution as prior, in analogy to the common continuous VAE with Gaussian prior [(4)]. The RBM model is an attempt to make the DVAE more flexible and expressive. It places an RBM in the VAE's latent space, so that the VAE's latent units are the RBM's inputs, and uses the energy-based restricted Boltzmann distribution given in (5) as prior. The RBM model parameters a, b, and W are tuned during training by the interplay between positive- and negative-phase energies, and the 'visible' and 'hidden' units of the negative phase (the fantasy states z_v^d and z_h^d, respectively) are updated via MCMC Gibbs sampling from (6). DVAE models are less common than VAEs with continuous latent variables because an ELBO objective [(7)] containing discrete variables cannot be differentiated, thus precluding the computation of ELBO gradients, a necessary operation during the backward pass of training. To make DVAEs differentiable, the reparameterization trick [Kingma and Welling, 2013], which moves the variational parameters φ from the distribution q_φ(z|x) to a more easily differentiable deterministic function g_φ(ε, x), is extended by a smoothing function [Rolfe, 2016] or a continuous relaxation [Maddison et al., 2016, Jang et al., 2016]. Our Bernoulli and RBM models employ the Gumbel-softmax trick [Jang et al., 2016, Maddison et al., 2016] to relax the discrete posterior latents z_d ∼ q_φ(z_d|x).
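For binary latents, the Gumbel-softmax (binary-concrete) relaxation can be sketched as follows; as the temperature λ approaches zero, the samples saturate toward {0, 1} while remaining differentiable in the logits for λ > 0 (a NumPy sketch of the sampling step only; the function name is ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def relaxed_bernoulli(logits, lam=0.1):
    # binary-concrete sample: sigmoid((logits + logistic noise) / lam);
    # logits are the log-odds of the Bernoulli posterior, lam the temperature
    u = rng.random(np.shape(logits))
    noise = np.log(u) - np.log(1.0 - u)      # Logistic(0, 1) sample
    return 1.0 / (1.0 + np.exp(-(logits + noise) / lam))
```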
Several authors [Rolfe, 2016, Khoshaman and Amin, 2018, Khoshaman et al., 2019, Vinci et al., 2020] break up the VAE's latent space into two equally sized partitions and implement the positive phase of the RBM between the two halves of the latent units z sampled from the approximate posterior q_φ(z|x) and, similarly, the negative phase between the latent units z_d sampled from the RBM prior p_θ(z_d). In this paper, we have termed an RBM model with such characteristics a 'bipartite latent-space RBM.' In the experiments performed for this paper, we have adopted a somewhat different approach and introduced a hidden layer in the RBM's positive phase, which is obtained in one Gibbs-sampling step from the positive phase's visible layer, comprising the VAE's latent units. The evolution of the negative phase's fantasy states is accomplished by Gibbs sampling in both RBM versions (see figure 2). We chose this augmented model rather than the bipartite latent-space one to conduct our experiments because it had demonstrated a somewhat more dynamic training and a slightly better performance in preliminary experiments that did not comprehensively evaluate the training behavior and performance characteristics of the two RBM model versions (see the discussion at the end of section 3.3).
In addition to adding a true hidden layer to the RBM's positive phase, we also introduced the hyperparameter β into the ELBO objective function [see (10)]. While Higgins et al. [2017] devised the β-ELBO, in which β regulates the trade-off between the autoencoding and KL-divergence terms, as a way to promote disentanglement between latent dimensions by enforcing a minimum similarity between the variational posterior q_φ(z|x) and the prior p_θ(z), we tune β to optimize anomaly-detection performance, as measured by the performance metrics [(19)].

Model performance
We assessed the performance of our VAE models with Gaussian, Bernoulli, and RBM priors in the context of anomaly detection in commercial aeronautics. Prior DVAE models used generation performance on MNIST [LeCun, 1998] and other image datasets to gauge model performance [Rolfe, 2016, Khoshaman and Amin, 2018, Khoshaman et al., 2019, Vinci et al., 2020]. Generation performance is quantified by the test-set log likelihood or a surrogate such as the ELBO or Q-ELBO. In our anomaly-detection studies, we used the performance metrics precision, recall, and F1 score. The accurate and timely discovery of flight-operations anomalies is important because they can be precursors of potentially serious aviation incidents or accidents. To detect operationally significant anomalies in flight data and preempt future accidents, airlines and transportation agencies will have to rely increasingly on advanced machine-learning techniques applied to historical data or online data streams. Multifactorial and nonlinear anomalies defy traditional anomaly-detection methods, such as exceedance detection [Federal Aviation Administration, 2004, Dillman et al., 2015], and the volume and proportion of anomalies characterized by heterogeneous and high-order feature interactions are only expected to increase with increasing airspace complexity, characterized by increasing passenger volume, the integration of unmanned aircraft systems, and urban air mobility. Since labeled data (i.e., data classified as either nominal or anomalous) are costly to obtain and frequently not available, and many complex anomalies are unknown, unsupervised learning approaches, such as the one portrayed in this paper, are often the preferred or only feasible option.
Both the Gaussian and the Bernoulli models employ simple factorized priors and require similar times to train (table 2). In these respects, the Bernoulli model is the discrete equivalent of the continuous Gaussian model. The anomaly-detection performance of the three models compares favorably with the performance of a recently developed CVAE and of other unsupervised machine-learning methods, evaluated on a related anomaly-detection task, described in Memarzadeh et al. [2020]. Our three models achieved F1 scores of 0.663 ± 0.0700 (Gaussian), 0.495 ± 0.0785 (Bernoulli), and 0.666 ± 0.0345 (RBM) when tested on the baseline dataset with a drop-in-airspeed anomaly during takeoff. Consequently, the Gaussian and RBM models perform at a similar level, whereas the Bernoulli model falls short in performance. In other words, the performance of the discrete model with a simple factorized prior, which is inferior to that of the corresponding continuous model, is elevated to the level of the continuous model by the adoption of a more flexible and expressive energy-based RBM prior that is able to represent the statistical structure of the input data more accurately. Hence, we have implemented an unsupervised discrete deep generative model that performs on par with the analogous continuous generative model with Gaussian prior in detecting anomalies in an aeronautical dataset. The performance decrement due to the use of nonoptimal hyperparameters was also assessed (figure 6). Our continuous and discrete models performed similarly in the transferability study (section 5.2). Our RBM model is ready to be integrated with quantum computing: the discrete (fantasy) states of the negative phase can be obtained by quantum Boltzmann sampling, for example by a quantum annealer or a quantum-circuit Born machine (QCBM). On the other hand, a Gaussian VAE, the most common choice in classical deep generative learning, defies a straightforward integration with quantum sampling because it does not sample discrete states from a system whose generative process can be implemented by measuring a parameterizable density operator ρ_θ or wave function ψ_θ in the computational basis.
The proposed VAE model with RBM prior is robust to changes of anomaly type and phase of flight, as demonstrated by its performance on a dataset with a late-deployment-of-flaps anomaly during approach to landing, on which it achieved an F1 score of 0.618 ± 0.0554 (section 5.3). Transfer to a novel dataset without renewed tuning of hyperparameters markedly decreases model performance. The RBM model's F1 score on a dataset on which it was neither tuned nor trained, but which contained the same input data and anomaly and spanned the same phase of flight, was 0.355 ± 0.0486, i.e., 31.1 percentage points lower than the performance on the dataset from which the transfer occurred. While such a drop represents a significant deterioration in performance, the post-transfer performance is still respectable, considering that the datasets involved consist of multiple complex time series and the training was executed without supervision. Transferability was mildly improved by post-training on the target dataset: with model weights initialized to the (converged) values obtained at the end of training on the original dataset, post-training on the new dataset for 300 epochs raised the RBM model's average F1 score by 4.51 percentage points, to 0.400.
Latent-space dimensionality and the hyperparameter β have a strong effect on the performance of all three models and require optimization for each application. Models with 32-256 latent units generally exhibited good performance, whereas performance was very sensitive to the specific choice of β; hence, the value of β is application-specific. The temperature λ of the concrete distribution [Maddison et al., 2016, Jang et al., 2016], which determines the bias introduced by the continuous Gumbel-softmax relaxation and the differentiability of the model parameters, is also important for the discrete models (Bernoulli and RBM) and requires tuning. We used λ = 0.1 throughout our studies because it balances the trade-off between estimation bias and differentiability and proved optimal among the values tested. We did not use an annealing schedule to gradually reduce λ over the epochs of training, but we used discrete ('hard') Bernoulli variables for validation and testing. While the length of the persistent chains (fantasy particles) had a mild influence on the performance of the RBM model, the number of fantasy particles mattered little, owing to the averaging of the particles' (negative-phase) energies at the end of each series of Gibbs updates (per minibatch); we used 500 fantasy particles throughout. When optimal, or even adequate, hyperparameters are unknown and labeled data are not available, as is often the case in practice, the performance of unsupervised anomaly-detection models can frequently be improved by preparing a small labeled validation set that includes relevant anomalies and using it for hyperparameter tuning [Soenen et al., 2021, Antoniadis et al., 2022].
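The Gumbel-softmax (binary concrete) relaxation with temperature λ discussed above can be sketched in a few lines of NumPy. The function name `relaxed_bernoulli` is illustrative, not from the paper; the sketch assumes the standard binary-concrete parameterization of Maddison et al. and Jang et al., with logistic noise injected before a temperature-scaled sigmoid.

```python
import numpy as np

def relaxed_bernoulli(logits, lam, rng):
    """Sample from the binary concrete (relaxed Bernoulli) distribution.

    Gumbel-softmax reparameterization for binary variables:
    z = sigmoid((logits + logistic_noise) / lam).
    Small lam -> samples pushed toward {0, 1} (low bias, noisier gradients);
    large lam -> smoother but more biased samples.
    """
    u = rng.uniform(1e-7, 1 - 1e-7, size=np.shape(logits))
    logistic_noise = np.log(u) - np.log1p(-u)   # Logistic(0, 1) sample
    return 1.0 / (1.0 + np.exp(-(logits + logistic_noise) / lam))

rng = np.random.default_rng(0)
logits = np.zeros(100000)                                # p = 0.5 per unit
z_soft = relaxed_bernoulli(logits, lam=0.1, rng=rng)     # training: relaxed
z_hard = (z_soft > 0.5).astype(float)                    # validation/testing: 'hard'
```

At λ = 0.1, most relaxed samples already sit near 0 or 1, which is consistent with using hard Bernoulli variables at evaluation time.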
Mean negative-phase values of the latent variables and energy tightly followed the corresponding positive-phase values but were more centered, suggesting that the free-running negative phase reproduces a smoothed version of the clamped positive phase (figure 3). We were able to boost anomaly-detection performance by applying a normalizing transformation to skewed anomaly scores (MSEs) with a conspicuous right tail. We always used a log transformation, but less skewed scores might benefit more from a square-root transformation, and for more strongly skewed scores an inverse transformation might be optimal. We note that our VAE models with continuous and discrete priors are universal generative models, not limited in application to aeronautics data; with slight modification, they can also be applied to other time-series data.
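The normalizing log transformation of right-skewed anomaly scores can be illustrated as follows. The helper names `normalize_scores` and `skewness` are ours, not the paper's, and the lognormal scores below are synthetic stand-ins for reconstruction MSEs:

```python
import numpy as np

def normalize_scores(mse_scores):
    """Log-transform right-skewed anomaly scores (MSEs) toward normality.

    A log transform suits conspicuously right-tailed scores; milder skew
    may call for a square-root transform, stronger skew for an inverse one.
    """
    s = np.asarray(mse_scores, dtype=float)
    return np.log(s + 1e-12)        # small offset guards against zero scores

def skewness(x):
    """Sample skewness g1 = m3 / m2^(3/2); approximately 0 for symmetric data."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return (d**3).mean() / (d**2).mean()**1.5

rng = np.random.default_rng(1)
scores = rng.lognormal(mean=0.0, sigma=1.0, size=50000)  # synthetic right-skewed MSEs
log_scores = normalize_scores(scores)                    # near-symmetric after transform
```

A symmetrized score distribution makes a single anomaly-score threshold separate the nominal bulk from the anomalous tail more cleanly.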

Conclusions
We developed unsupervised VAE models with convolutional encoder and decoder layers and a latent space based on Gaussian, Bernoulli, and RBM priors. The discrete Bernoulli and RBM versions are an attempt to design models whose latent space captures the discrete nature of objects processed by machine-learning models, such as semantic classes, causal factors, digital samples, and other discrete entities. The RBM model allows a straightforward integration with quantum computing because the fantasy states of the RBM's negative phase can be obtained by measuring the states of a quantum generative process, such as a quantum Boltzmann machine (QBM) implemented on an annealer or a QCBM on a circuit. Our training objective is the maximization of the β-ELBO, in which the hyperparameter β regulates the trade-off between reconstruction fidelity and regularization via a prior distribution placed on the latent variables.
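For reference, the β-ELBO takes the standard form below (a sketch in the notation of the figure captions, with encoder $q_\phi(z\mid x)$, decoder $p_\theta(x\mid z)$, and prior $p_\theta(z)$; the paper's exact equation numbering is not reproduced here):

```latex
\mathcal{L}_{\beta}(\theta,\phi;x)
  \;=\; \mathbb{E}_{q_{\phi}(z \mid x)}\!\left[\log p_{\theta}(x \mid z)\right]
  \;-\; \beta \, D_{\mathrm{KL}}\!\left( q_{\phi}(z \mid x) \,\|\, p_{\theta}(z) \right)
```

The first term rewards reconstruction fidelity; the KL term, weighted by β, regularizes the approximate posterior toward the (Gaussian, Bernoulli, or RBM) prior.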
We tested our VAE models on datasets comprising time series of commercial flights that contain anomalies during the takeoff or approach-to-landing phases of flight. The use of unsupervised and semi-supervised machine-learning techniques to identify anomalies in aviation is expected to increase because of the high cost of labeling instances manually, the limitations of identifying anomalies with simple heuristics, and the ever-increasing airspace complexity over the course of the 21st century. While the estimated generative log likelihood or a related quantity is used to gauge generative-model performance in re-generation studies (e.g., of imagery), anomaly detection relies on metrics such as precision, recall, and F1 score to assess model performance.
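The threshold-based evaluation metrics mentioned above can be computed directly from anomaly scores and ground-truth labels. This is a generic sketch (the helper `detection_metrics` and the toy scores are ours, not from the paper):

```python
import numpy as np

def detection_metrics(scores, labels, threshold):
    """Precision, recall, and F1 from thresholded anomaly scores.

    Instances with score above the threshold are flagged anomalous
    (label 1 = anomalous, 0 = nominal).
    """
    pred = np.asarray(scores) > threshold
    truth = np.asarray(labels).astype(bool)
    tp = np.sum(pred & truth)           # anomalies correctly flagged
    fp = np.sum(pred & ~truth)          # nominal data flagged as anomalous
    fn = np.sum(~pred & truth)          # anomalies missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# toy example: anomalies tend to reconstruct worse (higher MSE)
scores = [0.2, 0.3, 0.9, 1.1, 0.25, 1.4]
labels = [0,   0,   1,   1,   0,    0  ]
p, r, f1 = detection_metrics(scores, labels, threshold=0.8)
```

In the toy example, both true anomalies are caught (recall 1.0) at the cost of one false positive, so precision, and with it F1, drops below 1.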
While all unsupervised models exhibit good overall performance, the discrete Bernoulli model performs more poorly than the other two. However, the more expressive RBM model, which during training employs unnormalized energies rather than probabilities and MCMC to sample from the prior distribution, is a discrete-variable model whose performance matches that of the continuous Gaussian model. Our models are robust to changes in anomaly type and phase of flight. Transferring a model to a new dataset with the same anomaly, without re-tuning of hyperparameters or re-training, markedly impairs anomaly-detection performance, which nevertheless remains acceptable, considering the nature of the data and training.
In future studies, we will use more advanced algorithms, such as parallel tempering and adaptive-step PCD, for negative-phase sampling, to see whether such schemes improve model performance. We will also devise an estimator of the log partition function, to compute log-likelihood estimates for future generation studies. We further intend to use conditional VAEs or other deep generative models to generate artificial anomalies, to enrich anomaly-detection datasets and, ultimately, improve the performance of anomaly-detection algorithms. We plan to integrate quantum capabilities by developing a QBM, implemented by quantum annealing or a QCBM. In addition, we intend to increase the expressiveness of our prior by replacing the relatively simple RBM network with a more sophisticated energy-based feedforward network, and we will attempt to integrate such an EBM with quantum computing.
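The baseline that these more advanced samplers would replace, persistent contrastive divergence (PCD) over the RBM's fantasy particles, can be sketched as follows. This is a minimal NumPy illustration under standard RBM conventions (energy E(v, h) = -aᵀv - bᵀh - vᵀWh); the function name and the toy dimensions are ours, not the paper's:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pcd_negative_phase(W, a, b, chains, k, rng):
    """One PCD update of the RBM's fantasy particles via k Gibbs sweeps.

    Persistent chains (binary visible states) carry over between minibatches;
    the negative-phase energy is averaged over particles after the k sweeps,
    mirroring the averaging that makes the particle count uncritical.
    """
    v = chains
    for _ in range(k):
        # sample hidden units given visible, then visible given hidden
        h = (rng.random((v.shape[0], b.shape[0])) < sigmoid(v @ W + b)).astype(float)
        v = (rng.random((h.shape[0], a.shape[0])) < sigmoid(h @ W.T + a)).astype(float)
    h_prob = sigmoid(v @ W + b)                     # mean-field hidden activations
    energy = -(v @ a + h_prob @ b + np.einsum('ij,jk,ik->i', v, W, h_prob))
    return v, energy.mean()                         # updated chains, avg neg.-phase energy

rng = np.random.default_rng(2)
n_vis, n_hid, n_particles = 8, 4, 500               # 500 fantasy particles, as in the paper
W = rng.normal(0.0, 0.1, (n_vis, n_hid))            # coupling matrix
a = np.zeros(n_vis)                                 # visible biases
b = np.zeros(n_hid)                                 # hidden biases
chains = (rng.random((n_particles, n_vis)) < 0.5).astype(float)
chains, neg_energy = pcd_negative_phase(W, a, b, chains, k=10, rng=rng)
```

Parallel tempering would run several such chains at different temperatures and swap states between them; a quantum sampler would replace the Gibbs sweeps entirely by measurements of a prepared quantum state.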

Supplementary material S1. The necessity of discrete-variable models
Real-world objects are frequently discrete: the objects in an image; phonemes in speech; symbols, words, and sentences in language. Our study on the unsupervised detection of anomalies in aeronautics data requires the differentiation of discrete classes (nominal vs. anomalous instances), and the input to our models consists of multivariate time series of discrete temporal data points. Internal discrete-variable representations are consequently often more interpretable and a more natural fit to the modalities of interest present in datasets [Chen et al., 2016, Van Den Oord et al., 2017]. When causes and/or consequences are discrete ('If the store doesn't have peaches, I'll buy apples.'), discrete variables are also a natural fit for complex reasoning, planning, and predictive learning [Van Den Oord et al., 2017].
Discrete variables also naturally represent typical machine-learning operations, such as choosing between models or variables [Khoshaman and Amin, 2018]. An unsupervised model with a latent space composed of discrete variables can learn to disentangle the content and style information of images [Makhzani and Frey, 2017]. In semi-supervised learning, discrete variables can be used to label distinct semantic classes and to train generative models with more meaningful representations [Kingma et al., 2014, Maaløe et al., 2017, Khoshaman et al., 2019]. Moreover, discrete variables pose a low burden on computer memory and are computationally efficient [Rae et al., 2016, Jang et al., 2016]; they are also a natural match to digital computation.
Rolfe [2016] points out that many practically relevant datasets contain information on discrete objects subject to continuous transformations, such as image datasets of natural objects that change position and orientation and experience variations in scene illumination. Such datasets comprise multiple disconnected smooth manifolds. The author observes that while continuous variations in pose and illumination are typical natural phenomena, it is very difficult to transform the image of a person into that of a car while remaining on the manifold of natural images. Rolfe therefore suggests modeling the selection of discrete real-world objects with discrete variables and the continuous transformations applied to each disconnected component with continuous variables, and implements such a framework with an autoregressive network consisting of layers of continuous latent variables conditioned on a layer of discrete variables. Our discrete convolutional VAEs also follow Rolfe's prescription by combining encoders and decoders built from continuous convolutional layers with a latent space of discrete variables.
While discrete-variable models promise gains in performance and efficiency in classical computing, they are outright obligatory for quantum computing. Quantum variables take the form of density matrices, the measurement of which produces discrete values (bitstrings), as projective measurements of qubits i = 1, 2, ..., n in the computational basis produce values m_i ∈ {−1, +1}. The physical correlates of logical qubits are two-level quantum systems such as the polarization direction of light, Fock states of bosons, quadratures of coherent states, the spin direction of atomic particles, or the flow direction of superconducting current [Nielsen and Chuang, 2000, Nakahara, 2008, Knill, 2010]. Consequently, the measured binary qubit values m_i constitute the interface between the quantum and classical computing worlds; their processing by the classical component of quantum-classical hybrid models requires discrete variables.

S4. Additional figures
Table column headers: Model prior; No. latents (a); beta (b); lambda (c); No. fant. part. (d); Len. pers. chains (e).
(a) Number of latent units. (b) Hyperparameter β controlling the balance between the autoencoding and KL terms. (c) Temperature of the relaxed Bernoulli distribution. (d) Number of fantasy particles / persistent chains. (e) Length of persistent chains.
indicate standard deviations. The symbol k stands for the number of Gibbs-sampling steps.

Figure 3. Values of latent units and energy, averaged over latent units and minibatch instances, of the positive phase [z ∼ q_φ(z|x)] and of the negative phase [z_d ∼ p_θ(z_d)]. The average negative-phase values of these quantities are more centered than the corresponding average values of the positive phase, indicating that the negative phase reproduces a smoothed version of the posterior latents.

Figure 4. Histograms of average log-transformed anomaly scores for two training modes of the RBM model. Data points to the right of the dashed threshold line are categorized as anomalies. The mode with a threshold of ∼5.70 demonstrates a good separation of nominal and anomalous data, whereas the mode with a threshold of ∼5.86 exhibits a poorer separation, with a relatively high number of nominal data classified as anomalies (false positives).

Figure 5. Performance of VAE models with Gaussian, Bernoulli, and RBM priors in the baseline study. Error bars indicate standard deviations of performance metrics among 16 independently trained models. Overall, the models demonstrate excellent performance on this unsupervised anomaly-detection task (average F1 score of 0.608). The more expressive discrete RBM model performs better than the simple discrete Bernoulli model and at a level comparable to the continuous Gaussian model. The DVAE appears to benefit from the flexibility and adaptability of the energy-based RBM prior.

Figure 6. Performance of the Gaussian, Bernoulli, and RBM models in the baseline study after training with nonoptimal hyperparameters. The figure shows heatmaps of average F1 scores for the three models across latent-space dimensionality and the hyperparameter β, which are common to all three models. Across the three models, F1 scores are noticeably lower than after training with hyperparameters tuned on validation-set performance, demonstrating the performance degradation incurred by using nonoptimal hyperparameters.

Figure 7. Performance of VAE models with Gaussian, Bernoulli, and RBM priors in the study examining the ability of the models tuned and trained on the baseline data to transfer to the DASHlink takeoff data. Error bars indicate standard deviations of performance metrics among 16 independently trained models. Performance was assessed based on a threshold derived from the DASHlink training set, (a) without or (b) with post-training on the new dataset for 300 epochs, with model weights initialized by training to convergence on the baseline dataset. The figure depicts similar performance for all models, trained without supervision: markedly worse than in the baseline experiment, but still in the useful range.
arXiv:2303.12302v1 [cs.LG] 22 Mar 2023

S2. Additional experiments relating to section 5.2 (Model transferability)

We show the performance of two additional transferability experiments in supplementary figure S4. The data differ from those shown in figure 7 in that the threshold for the data shown in supplementary figure S4(a) is based on the training set of the baseline dataset, and the threshold for the data depicted in supplementary figure S4(b) is derived from an average of the training-set thresholds of the baseline and DASHlink takeoff datasets. We used this mixed threshold because, in the transferability experiments with post-training, both the baseline and the DASHlink takeoff datasets were involved in the training of the models. The results of the experiment with baseline threshold and without post-training are quite imbalanced, in the sense that recall is much better than precision, and the overall average F1 score is 3.74 percentage points below the one in the experiment with DASHlink/takeoff threshold and without post-training [figure 7(a)]. The results of the experiment with mixed baseline-DASHlink/takeoff threshold and post-training are very similar to those of the post-training experiment with DASHlink threshold.

S3. Additional information relating to section 5.3 (Robustness of RBM model: delay in flap deployment during approach to landing)

Histograms of anomaly scores for two training modes of the RBM model, with anomaly-score thresholds of about 980 and 1011, are shown in supplementary figure S5. In this study, anomaly scores correspond to the BCE [(17)], which contains a log transformation of the reconstructed data, and the greater normality of the raw anomaly scores is presumably due to the BCE error metric; in the studies described in sections 5.1 and 5.2, we applied the MSE error metric to z-transformed reconstructed and original input data and, during evaluation of model performance, log-transformed the resultant right-skewed anomaly scores. As in the baseline study with a dataset containing a drop-in-airspeed anomaly (see section 5.1), we observed that models enter different modes during training and differ in their anomaly-detection performance depending on the selected mode. The mode with an anomaly-score threshold of about 980 produced good model performance, with F1 scores ∼0.64, whereas the mode with a threshold of about 1011 resulted in inferior performance (F1 scores ∼0.42). The superior performance of the mode with thr ≈ 980 is illustrated by the cleaner separation of nominal and anomalous data; the mode with thr ≈ 1011, on the other hand, is characterized by a greater number of false negatives (anomalous data to the left of the anomaly-score threshold). Other modes were not observed in this experiment.

Figure S1. Typical total loss (negative of the β-ELBO) as well as its two components, reconstruction loss and KL-divergence loss, during training and validation for 400 epochs. The KL-divergence loss (weighted with β) is negative because the log probability of the RBM prior, log p_θ(z), is not normalized. The loss due to the L2 penalty on the RBM's weight matrix W is zero because this feature, which did not produce any performance gain, was not used during training of the VAE/RBM model.

Figure S2. Evolution of the first four components of the RBM's bias vector a during model training, suggesting a dynamic exploration of the energy landscape of the system's configuration space.

Figure S3. Evolution of the first four elements of the first column of the RBM's weight matrix W during model training, suggesting a dynamic exploration of the energy landscape of the system's configuration space.
Figure S4. Performance of VAE models with Gaussian, Bernoulli, and RBM priors in the study examining the ability of the models tuned and trained on the baseline data to transfer to the DASHlink takeoff data. Error bars indicate standard deviations of performance metrics among 16 independently trained models. Performance was assessed based on (a) a threshold derived from the baseline training set, without post-training, or (b) the average of the thresholds derived from the baseline and DASHlink/takeoff datasets, with post-training on the new dataset for 300 epochs, with model weights initialized by training to convergence on the baseline dataset. The figure demonstrates an imbalance between precision and recall (recall ≫ precision) in the experiment without post-training and, in the post-training condition, findings similar to those of the post-training experiment with a threshold based on the DASHlink/takeoff training set.

Figure S5. Histograms of average anomaly scores for two training modes of the RBM model trained and tested on the dataset with late-deployment-of-flaps anomaly during approach to landing. Data points to the right of the dashed threshold line are categorized as anomalies. The mode with a threshold of ∼980 demonstrates a good separation of nominal and anomalous data, whereas the mode with a threshold of ∼1011 exhibits a poorer separation, with a relatively high number of anomalous data classified as nominal (false negatives).

Table 2. Training times (400 epochs) of VAE models with Gaussian, Bernoulli, and RBM priors when trained on the combined training/validation set.

Table 3. Performance of the VAE model with RBM prior in the experiment with late-deployment-of-flaps anomaly during final approach to landing.