The probabilistic tensor decomposition toolbox

Jesper L Hinrich; Kristoffer H Madsen; Morten Mørup

doi:10.1088/2632-2153/ab8241

1. Introduction

Tensors, higher-order or n-way arrays are increasingly encountered in all areas of science. While standard two-way matrix analysis methods can be applied by restructuring these higher order arrays into a matrix, such approaches fail to properly exploit the inherent multi-way structure. Instead, using multi-way or tensor methods, such as the Canonical Polyadic (CP) decomposition (also known as PARAFAC and CandeComp) [1–3], Tucker model [4, 5] or variations thereof are able to account for the intrinsic structure of the data. Tensors are prevalent in both research and industry (for reviews see [6–9]) and has a large heterogeneous research community and several tensor analysis toolboxes are publicly available. The most prominent being the Matlab based N-way Toolbox [10], Tensorlab [11], and Tensor Toolbox [12] enabling researchers to apply multi-way modeling across domains.

However, these existing prominent multi-way toolboxes are based solely on maximum likelihood (ML) estimation which only provides a point estimate of the underlying parameters and does not account for parameter uncertainty. Instead estimating uncertainty requires repeated model fitting using for instance jackknifing [13] or bootstrapping [14] which are increasingly expensive as the size of the data grows.

In contrast, Bayesian inference results in an approximation of the true posterior distribution either via sampling [15–17] or a direct approximation [18–20]. Apart from uncertainty quantification, benefits of Bayesian inference includes automatic penalty for increased model complexity, inference on model order, determining sparsity level, incorporation of more realistic noise assumptions, and a principled way to include prior information and performing cross-validation [16, 21] on left out slices or blocks of data.

1.1. Bayesian tensor decomposition

For comprehensive reviews of maximum likelihood based tensor decomposition methods the reader is referred to existing tensor decomposition reviews [6–9, 22, 23]. We presently provide a short overview of Bayesian tensor decomposition. For brevity, this section will drop the prefix Bayesian and unless otherwise stated all models are based on fully Bayesian inference.

Currently, probabilistic models are primarily based on the Tucker and CP decomposition. The earliest works using Tucker decomposition are found in [24, 25] where both the core array and factors follow a normal distribution. The latter also considered enforcing sparsity on the core array to automatically learn the most prominent multi-linear interactions. However, neither of these methods were fully Bayesian as they relied on maximum-a-posteriori estimation which provides a point estimate of the posterior distribution. The first fully Bayesian Tucker model with normal factors using Gibbs sampling was proposed in [26] and the indeterminacy and structure of the core array was explored in [27]. The Tucker decomposition was extended to handle missing values and sparse noise based on variational Bayesian (VB) inference [28] whereas the Tucker model with an infinite core size was explored in [29–31] with factors specified as Gaussian processes and inference based on VB. Extensions of the Tucker model to count data was explored in [32, 33] using the Poisson likelihood and either Gamma or Dirichlet factors. For categorical data, a Tucker model with multinomial likelihood and factors was considered in [34]. The connection between contingency tables, log-linear models, and the Tucker decomposition was further explored in [35], who proposed a collapsed Tucker core for higher-order data.

As in ML estimation, the Bayesian CP model has received more attention than Bayesian Tucker model. The earliest work appears to be a VB based 3-way CP with normal factors with the assumption of independent [36] or dependent [37] components. Its N-way extension was later proposed in [38, 39]. For time series data, temporal lag-1 dependence in a mode was explored in [40]. Encouraging sparse factors was investigated by placing a sparsity prior on normal distributed factors [41, 42]. An orthogonal CP was proposed in [43], but it relied on MAP estimates of the von Mises-Fisher matrix distribution.

Modelling non-negative data has been explored with factors following a rectified normal [44], truncated normal or exponential [45], Poisson [46], or Dirichlet distribution [47] for continuous data. Decomposition of count or discrete data has been explored using either the Poisson or multinomial likelihood [48–50], with factors following a Dirichlet distribution [48] or Poisson distribution [49, 50]. Finally, several recent works exist on improved scaling to high-dimensional or streaming data [51–53].

Beyond the CP and Tucker model, tensor regression and classification has been investigated [34, 54, 55] and the connection between tensor regression and Gaussian processes was explored in [56]. Analysing multiple matrices with one varying mode is possible using the VB based PARAFAC2 model [57, 58] whereas analysing multiple 3-way tensors with one or two varying modes was proposed in [59, 60]. Finally, a VB Tensor Train model has been proposed in [61].

1.2. Summary of contribution

This paper introduces the probabilistic tensor decomposition toolbox which gathers many of the existing tools for Bayesian tensor decomposition in one place, see https://github.com/JesperLH/prob-tensor-toolbox/. The toolbox interconnects different constraints proposed in the literature, provides easy access to probabilistic decomposition for researchers, and serves as a reference implementation for comparing existing and future tensor decomposition models. In particular, the toolbox provides implementations of Bayesian CP, Tucker, and Tensor Train decomposition and supports model inference using either variational Bayesian inference, Gibbs sampling, or a combination of the two. At the time of writing, Bayesian CP is the most versatile model, as it allows modelling factors following a normal, truncated normal or exponential, uniform, or von Mises-Fisher matrix distribution. These distributions facilitate factors which are real, non-negative, box constrained, or orthogonal, respectively. Furthermore, column-wise and element-wise sparsity is modelled through hyper-prior distributions. The CP implementation can account for both homoscedastic as well as mode-specific heteroscedastic noise. Currently, inference in the presence of missing values is only available for the CP model using either marginalization or Bayesian imputation.

2. Methods

Scalars will be indicated by lowercase letters x, vectors by bold lowercase letters ${\bf x}$ , matrices by bold upper case letters ${\bf X}$ , and N^th order (N > 2) tensors by bold calligraphic letters $\boldsymbol{\mathcal{X}}$ . Generally, roman letters will be used to represent data and random variables while Greek letters represent parameters of the probability distribution. A notable exception is $\boldsymbol{\theta}$ which will be used to indicate both the set of all parameters (random variables and distributional parameters) and the set of parameters for a specific distribution, i.e. $P(\boldsymbol{\theta}_i)$ for the distribution of $\boldsymbol{\theta}_i$ . Furthermore, let ×_n denote the n-mode matrix multiplication and $\otimes$ , $\odot$ , and ° respectively denote the Kronecker product, Khatri-Rao product, and element-wise product.

2.1. Bayesian tensor decomposition

Bayesian methods for tensor modelling presently includes the CP and Tucker model with factors (and core array) following different distributions, as presented in section 1.1. Additionally, the Tensor Train [61], PARAFAC2 [57], and multi-tensor factorization [59, 60] model were all recently developed using Bayesian inference.

Tucker decomposition is one of the core tensor models and is here used for illustrate some of the differences between maximum likelihood (ML) and Bayesian estimation. Consider an N^th order data tensor with I_n observations in the n^th mode, e.g. $\boldsymbol{\mathcal{X}}^{I_1 \times I_2 \times \ldots \times I_N}$ , the Tucker model can be written as

$\begin{equation} \boldsymbol{\mathcal{X}} = \boldsymbol{\mathcal{G}} \times_1 {\bf A}^{(1)} \times_2 {\bf A}^{(2)} \times_3 \ldots \times_N {\bf A}^{(N)} + \boldsymbol{\mathcal{E}}, \end{equation} \tag{ 1 }$

where ${\bf A}^{(n)} \in \mathbb{R}^{I_n \times D_n}, n = 1,2,\ldots,N$ are the N factor matrices, $\boldsymbol{\mathcal{G}} \in \mathbb{R}^{D_1 \times D_2 \times \ldots D_N}$ is the core array which contains the coefficients for all multi-linear interactions of the columns of the factor matrices, and $\boldsymbol{\mathcal{E}}$ is the noise. For brevity, let $\boldsymbol{\mathcal{M}}$ denote the model or data reconstruction where

$\begin{equation} \boldsymbol{\mathcal{M}} \equiv \boldsymbol{\mathcal{G}} \times_1 {\bf A}^{(1)} \times_2 {\bf A}^{(2)} \times_3 \ldots \times_N {\bf A}^{(N)}. \end{equation} \tag{ 2 }$

This provides the framing for stating the ML and Bayesian inference schemes below. Note, that the inference is stated as a least squares and Gaussian likelihood for ML and Bayesian inference, respectively. While other choices are possible this is the most common choice.

$\begin{eqnarray} &\underline{\textbf{Maximum Likelihood}} &\nonumber\\ &\mathop{\rm argmin}\limits_{\boldsymbol{\mathcal{G}}, {\bf A}^{(1)}, {\bf A}^{(2)}, \ldots,{\bf A}^{(N)}} ||\boldsymbol{\mathcal{X}} - \boldsymbol{\mathcal{M}}||_F^2&\nonumber\\ & \text{s.t. constraint on } \boldsymbol{\mathcal{G}}, {\bf A}^{(n)}\, \forall_{n \in N}& \end{eqnarray} \tag{ 3 }$

$\begin{eqnarray} &\underline{\textbf{Bayesian}}& \nonumber\\ &P(\boldsymbol{\theta}), \quad \boldsymbol{\theta} = \{\boldsymbol{\mathcal{G}}, {\bf A}^{(1)}, \ldots, {\bf A}^{(N)},\tau\}&\nonumber\\ &\mathcal{L}(\boldsymbol{\mathcal{X}} | \boldsymbol{\theta}) = \mathcal{N}(\mathrm{vec}(\boldsymbol{\mathcal{M}}), \tau^{-1}{\bf I})&\nonumber\\ &P(\boldsymbol{\theta}|\boldsymbol{\mathcal{X}}) = \dfrac{\mathcal{L}(\boldsymbol{\mathcal{X}} | \boldsymbol{\theta})P(\boldsymbol{\theta})}{\int_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\mathcal{X}} | \boldsymbol{\theta})P(\boldsymbol{\theta}) d \boldsymbol{\theta}}& \end{eqnarray} \tag{ 4 }$

The ML approach, as mentioned, is widely used for existing tensor methods and can be solved using alternating optimization [2, 62], all-at-once optimization [63], or non-linear least squares [64, 65]. Constraints such as non-negativity can be imposed using either active set procedures [66], or alternating direction method of multipliers [67, 68]. Historically, multiplicative updates [69] have also been used.

For Bayesian inference, the first step is to specify the prior distributions, $P(\boldsymbol{\theta})$ , which describe the a priori knowledge about the parameters. The next step is to specify the likelihood function, $\mathcal{L}(\boldsymbol{\mathcal{X}}|\boldsymbol{\theta})$ , which is a probability distribution specifying how the data is generated based on the reconstructed array by the imposed model $\mathcal{M}$ and assumptions regarding noise. The posterior distribution, $P(\boldsymbol{\theta}|\boldsymbol{\mathcal{X}})$ , is then calculated via Bayes rule as specified in equation (4).

The key challenge in Bayesian inference is that calculating the posterior is generally intractable, as calculating the marginal distribution or evidence, $P(\boldsymbol{\mathcal{X}}) = \int_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\mathcal{X}} | \boldsymbol{\theta})P(\boldsymbol{\theta}) d\boldsymbol{\theta}$ , is intractable. Therefore, most applications of Bayesian inference turn to approximate methods, which are primarily; 1) Markov Chain Monte Carlo (MCMC) sampling and its variants [15, 17, 70]. 2) Laplace approximation and its variants [20, 71]. 3) Variational approximation [18, 21].

This article and the presented toolbox focuses on variational Bayesian (VB) inference and MCMC Gibbs sampling due to their interconnection, ease of use, and wide application within the existing literature on Bayesian tensor modeling.

2.2. Gibbs sampling and VB approximation

Given the prior distribution, $P(\boldsymbol{\theta})$ , and the likelihood, $\mathcal{L}(\boldsymbol{\mathcal{X}}|\boldsymbol{\theta})$ , the posterior distribution, $P(\boldsymbol{\theta}|\boldsymbol{\mathcal{X}})$ is given by Bayes rule,

$\begin{equation} P(\boldsymbol{\theta}|\boldsymbol{\mathcal{X}}) = \dfrac{\mathcal{L}(\boldsymbol{\mathcal{X}} | \boldsymbol{\theta})P(\boldsymbol{\theta})}{\int_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\mathcal{X}} | \boldsymbol{\theta})P(\boldsymbol{\theta}) d \boldsymbol{\theta}} = \dfrac{\mathcal{L}(\boldsymbol{\mathcal{X}} | \boldsymbol{\theta})P(\boldsymbol{\theta})}{P(\boldsymbol{\mathcal{X}})}. \end{equation} \tag{ 5 }$

The posterior is generally intractable as the marginal likelihood or evidence is intractable, $P(\boldsymbol{\mathcal{X}})$ . Gibbs sampling is useful when it is not possible or too expensive to directly sample from the joint posterior, $P(\boldsymbol{\theta}|\boldsymbol{\mathcal{X}})$ , but the conditional distribution $P(\boldsymbol{\theta}_i | \boldsymbol{\theta}_{-i}, \boldsymbol{\mathcal{X}})$ has a closed form or is easy to sample from. Gibbs sampling then draws a sample from the conditional distribution,

$\begin{align} \boldsymbol{\theta}_i \sim P(\boldsymbol{\theta}_i | \boldsymbol{\theta}_{-i} , \boldsymbol{\mathcal{X}}), \end{align} \tag{ 6 }$

where $\boldsymbol{\theta}_i$ is the set of random variable which are sampled and $\boldsymbol{\theta}_{-i}$ are the random variables which are conditioned upon. A full sample $\boldsymbol{\theta}$ is obtained by cycling through all conditional distributions. When the sampler has converged, then these samples approximate the true posterior distribution [16].

For variational Bayesian inference, the main idea is to approximate the posterior, ${\rm l}\ P(\boldsymbol{\theta} | {\bf X})$ , using a set of variational distributions, $Q(\boldsymbol{\theta})$ . The Q distributions provides a lower-bound on the log evidence $\log P(\boldsymbol{\mathcal{X}})$ which is called the evidence lower bound (ELBO). Using Jensen's inequality, the ELBO is,

$\begin{align} \log P(\boldsymbol{\mathcal{X}}) = & \log \int_{\boldsymbol{\theta}} P(\boldsymbol{\mathcal{X}} , \boldsymbol{\theta})\frac{Q(\boldsymbol{\theta})}{Q(\boldsymbol{\theta})} d\boldsymbol{\theta} \geq \int_{\boldsymbol{\theta}} Q(\boldsymbol{\theta})\log\frac{P(\boldsymbol{\mathcal{X}},\boldsymbol{\theta})}{Q(\boldsymbol{\theta})} d\boldsymbol{\theta} = \mathtt{ELBO}\mathcal{(Q)}. \end{align} \tag{ 7 }$

The variational distributions are chosen such that the ELBO is tractable [18, 21]. This paper uses a mean-field approximation where the probability distributions are assumed to be independent between subsets of random variables, i.e. $Q\left(\boldsymbol{\theta}\right) = \prod_i Q\left(\boldsymbol{\theta}_i\right)$ . The optimal variational distribution for each subset $\boldsymbol{\theta}_i$ is then,

$\begin{align} Q(\boldsymbol{\theta}_i) \propto \mathrm{\textrm{exp}}\left\lbrace\, \left\langle \, \mathrm{\textrm{ln}}\, P(\boldsymbol{\mathcal{X}},\boldsymbol{\theta})\, \right\rangle \,_{\backslash Q(\boldsymbol{\theta}_i)}\right\rbrace \end{align} \tag{ 8 }$

where $\, \left\langle \,\cdot\, \right\rangle \,_{\backslash Q(\boldsymbol{\theta}_i)}$ is the expected value with respect to all variational distributions except $Q(\boldsymbol{\theta}_i)$ , see [72]. The optimal parameters of the variational distribution are then identified via moment matching in equation (8).

2.3. Probabilistic tensor decomposition toolbox

The probabilistic tensor decomposition toolbox is available at https://github.com/JesperLH/prob-tensor-toolbox/. Presently, it only considers the Gaussian likelihood which is analogous to the least squares error. Given a data tensor $\boldsymbol{\mathcal{X}}^{I_1 \times I_2 \times \cdots \times I_N}$ , the likelihood is,

$\begin{align} \mathcal{L}(\boldsymbol{\mathcal{X}} | \boldsymbol{\theta}) & = \mathcal{N}_A\left(\boldsymbol{\mathcal{X}} | \boldsymbol{\mathcal{M}}, \boldsymbol{\Sigma}_\epsilon^{(1)}, \boldsymbol{\Sigma}_\epsilon^{(2)}, \ldots, \boldsymbol{\Sigma}_\epsilon^{(N)}\right) \end{align} \tag{ 9 }$

$\begin{equation} \kern-13.5pt = \mathcal{N}\left(\mathrm{vec}(\boldsymbol{\mathcal{X}}) | {\bf m}, \boldsymbol{\Sigma}_\epsilon \right) \kern13.5pt \end{equation} \tag{ 10 }$

where $\boldsymbol{\mathcal{M}}$ is the mean array and ${\bf m} \equiv \mathrm{vec}(\boldsymbol{\mathcal{M}})$ its vectorization, $\mathcal{N}_A$ and $\mathcal{N}$ is the array and multivariate normal distribution, and $\boldsymbol{\Sigma}_\epsilon \equiv \boldsymbol{\Sigma}_\epsilon^{(N)} \otimes \boldsymbol{\Sigma}_\epsilon^{(N-1)} \otimes \cdots \otimes \boldsymbol{\Sigma}_\epsilon^{(1)}$ is the residual error following a Kronecker structured covariance matrix [26, 27, 73].

A simpler heteroscedastic noise model where the mode specific covariance is diagonal, $\boldsymbol{\Sigma}_\epsilon^{(n)}\equiv \mathrm{diag}\left(\boldsymbol{\sigma}_\epsilon^{(n)^2}\right)$ , is implemented for Bayesian CP. This noise structure allows quantifying mode-specific noise, such as noisy samples or unreliable features. For the Tucker and the Tensor Train decomposition only homoscedastic noise is considered, which is the same as specifying $\boldsymbol{\Sigma}_\epsilon^{(n)}\equiv \sigma^2{\bf I}$ and $\boldsymbol{\Sigma}_\epsilon^{(m)}\equiv {\bf I} \, \forall_{m\neq n}$ for an arbitrarily chosen n.

The CP, Tucker and Tensor Train (TT) models are all specified by the structure of the mean array $\boldsymbol{\mathcal{M}}$ . For the Tucker decomposition the mean array is defined as in equation (2). Similarly, CP is described by equation (2) but constraining the core array to the unit hyper-cube $\boldsymbol{\mathcal{G}} = \boldsymbol{\mathcal{I}}$ . The unit hyper-cube has value one along the hyper-diagonal and off-diagonal elements are zero. For tensor train decomposition, the mean array is,

$\begin{align} \boldsymbol{\mathcal{M}} = \boldsymbol{\mathcal{U}}^{(1)} \times_{[2,1]} \boldsymbol{\mathcal{U}}^{(2)} \times_{[3,1]} \boldsymbol{\mathcal{U}}^{(3)} \times_{[4,1]} \cdots \times_{[N-1,1]} \boldsymbol{\mathcal{U}}^{(N)} \end{align} \tag{ 11 }$

where ×_[a,b] is tensor contraction along mode a and b and $\boldsymbol{\mathcal{U}}^{(n)} \in \mathbb{R}^{D_{n-1} \times I_n \times D_n}$ is the latent factor or train cart for mode n = 1, ..., N. Details on tensor train and its probabilistic extension are found in [61, 74].

Handling mode-specific heteroscedastic noise is feasible both for the Tucker decomposition and the TT decomposition. However, when imposing orthogonality constraints on a latent factor heteroscedastic noise is not permitted as the prior on the factor (von Mises-Fisher Matrix distributed) and likelihood (heteroscedastic Gaussian distributed) are not conjugate, i.e. the posterior distribution of each factor no longer follows a von Mises-Fisher Matrix distribution.

2.3.1. Partially observed data

For partially observed data, the missing elements can either be imputed or marginalized. For the latter, the likelihood is specified only on the observed elements of the tensor,

$\begin{align} \mathcal{L}(\boldsymbol{\mathcal{X}} | \boldsymbol{\theta}) = & \prod_{{\mathbf{i}} \in \mathcal{O}}\mathcal{N}\left(x_{\mathbf{i}}|m_{\mathbf{i}}, \boldsymbol{\Sigma}_{\epsilon} \right) \end{align} \tag{ 12 }$

where $\mathcal{O}$ is the set of observed elements and $\mathbf{i} = (i_1, i_2, \ldots, i_N)$ is the vector index of a single element. Marginalization is preferred over imputation as it does not introduce any bias in the estimated posterior distribution and preserves any convergence guarantees of the chosen inference method. Imputing missing values is generally faster and works well in many practical applications. Unfortunately, imputation can fail unexpectedly as the number of missing values increase as the inference is highly influenced by past (poor) imputations.

For Bayesian inference, imputation is more principled than its maximum likelihood counterpart [16]. In Bayesian inference, the likelihood is just the probability of data $\boldsymbol{\mathcal{X}}$ under some model $\boldsymbol{\theta}$ , i.e. $\mathcal{L}(\boldsymbol{\mathcal{X}} | \boldsymbol{\theta})\equiv P(\boldsymbol{\mathcal{X}}|\boldsymbol{\theta})$ . In the presence of missing data, this can be rewritten as $P(\boldsymbol{\mathcal{X}}_{obs}, \boldsymbol{\mathcal{X}}_{miss} | \boldsymbol{\theta})$ and the probability of a missing element $x_{\mathbf{i}}$ is then,

$\begin{align} P(x_{\mathbf{i}} | \boldsymbol{\mathcal{X}}_{obs}) = & \int_{\boldsymbol{\theta}} P(x_{\mathbf{i}} |\boldsymbol{\theta}) P(\boldsymbol{\theta}| \boldsymbol{\mathcal{X}}_{obs})\, \delta\boldsymbol{\theta},\quad \forall_{\mathbf{i}\not\in \boldsymbol{\mathcal{O}}} \end{align} \tag{ 13 }$

Determining this distribution is intractable, but it can be approximated using VB [21, 28] or sampling [16]. The toolbox currently use a coarse approximation and assumes $P(x_{\mathbf{i}}, | \boldsymbol{\mathcal{X}}_{obs}) = g(\boldsymbol{\mathcal{M}})_{\mathbf{i}}$ for all missing elements $\mathbf{i} \not\in \mathcal{O}$ . For VB, $g(\boldsymbol{\mathcal{M}}) = \, \left\langle \,\boldsymbol{\mathcal{M}}\, \right\rangle \,$ is the reconstructed element using the current expected first moments. For sampling, $g(\boldsymbol{\mathcal{M}}) = \boldsymbol{\mathcal{M}}$ which is the reconstruction using the most recent sample of the random variables. Due to this coarse approximation, imputation should always be followed by at least one marginalization update to get valid posterior moments.

2.3.2. Factor matrices

The Bayesian Tensor Train model only supports orthogonal train carts and details are provided in the original reference [61]. Therefore, this section only concerns the specification of factor matrices for Bayesian CP and Tucker decomposition, i.e. ${\bf A}^{(n)}$ for mode n as defined in equation (2). Presently, the following priors are implemented in the toolbox,

$\begin{align} P\left({\bf A}^{(n)} | \boldsymbol{\mu}^{(n)}, \boldsymbol{\Lambda}^{(n)}\right) = & \prod_{i = 1}^{I_n}\mathcal{N}\left({\bf a}_i^{(n)^\top} | \boldsymbol{\mu}_i^{(n)^\top}, \boldsymbol{\Lambda}_i^{(n)^{-1}}\right) \end{align} \tag{ 14 }$

$\begin{align} P\left({\bf A}^{(n)} | \boldsymbol{\mu}^{(n)}, \boldsymbol{\Lambda}^{(n)} \right) = & \prod_{i = 1}^{I_n}\prod_{d = 1}^{D_n}\mathcal{N}_{[0,\infty]}\left(a_{id}^{(n)} | \mu_{id}^{(n)}, \lambda_{id}^{(n)^{-1}}\right) \end{align} \tag{ 15 }$

$\begin{align} P\left({\bf A}^{(n)} | \boldsymbol{\Lambda}^{(n)}\right) = & \prod_{i = 1}^{I_n}\prod_{d = 1}^{D_n} \mathrm{Exponential}\left(a_{id}^{(n)} | \lambda_{id}^{(n)}\right) \end{align} \tag{ 16 }$

$\begin{align} P\left({\bf A}^{(n)} | l, u \right) = & \prod_{i = 1}^{I_n}\prod_{d = 1}^{D_n}\mathrm{Uniform}\left(a_{id}^{(n)} | l, u\right) \end{align} \tag{ 17 }$

$\begin{align} P\left({\bf A}^{(n)} | {\bf K} \right) = & {\rm v}\mathcal{MF}\left({\bf A}^{(n)} | {\bf K}\right), \quad {\bf K}\equiv{\bf 0} \end{align} \tag{ 18 }$

where each n-mode factor matrix ${\bf A}^{(n)}$ has observations i = 1, 2, ..., I_n and latent components d = 1, 2, ..., D_n. Presently, both Tucker and CP support von Mises-Fisher Matrix ( ${\rm v}\mathcal{MF}$ ) distributed factor matrices which represent an orthogonal assumption on the factor matrix. For CP, the multivariate normal distribution is used to model unconstrained factors, while the truncated normal and exponential distribution are used to model non-negativity. Furthermore, the uniform distribution can be applied to achieve an uninformative box constraint.

2.3.3. The core array

For the Tucker decomposition an explicit core array $\boldsymbol{\mathcal{G}}^{D_1 \times D_2 \times \cdots \times D_N}$ is inferred which models multi-linear interactions between the components. Here, the core array is specified using its vectorized version ${\bf g} \equiv \mathrm{vec}(\boldsymbol{\mathcal{G}})$ and follows a normal distribution,

$\begin{align} P({\bf g} | \boldsymbol{\Psi}) = & \mathcal{N}\left({\bf g} | {\bf 0}, \boldsymbol{\Psi}^{-1}\right), \end{align} \tag{ 19 }$

where $\boldsymbol{\Psi}^{D_1 D_2 \cdots D_N \times D_1 D_2 \cdots D_N }$ is the precision matrix and the precision of a single element in $\boldsymbol{\mathcal{G}}$ is $\psi_{d_1, d_2, \ldots, d_N}$ . The toolbox considers learning a shared scale $\boldsymbol{\Psi}\equiv\psi {\bf I}$ , an automatic relevance determination (ARD) prior for slice activation $\mathrm{diag}(\boldsymbol{\Psi}) \equiv \boldsymbol{\psi}^{(N)} \odot \boldsymbol{\psi}^{(N-1)} \odot \cdots \odot \boldsymbol{\psi}^{(1)}$ [28], and element-wise sparsity $\mathrm{diag}(\boldsymbol{\Psi}) \equiv \boldsymbol{\psi}$ [25]. For these priors off-diagonal elements of $\boldsymbol{\Psi}$ are zero and the prior on diagonal elements is assumed independent and from a Gamma distribution, $P(\psi_{\cdot}) \sim {\cal G}(\alpha_\psi,\beta_\psi)$ with shape $\alpha_\psi$ and rate $\beta_\psi$ .

In the context of Bayesian analysis, a Kronecker structured precision matrix $\boldsymbol{\Psi} \equiv \boldsymbol{\Psi}^{(N)} \otimes \boldsymbol{\Psi}^{(N-1)} \otimes \cdots \otimes \boldsymbol{\Psi}^{(1)}$ was explored by [27] and for count data a core array of CP structured sub-cores [33]. The latter is related to the block term decomposition [75] and can be viewed as a special case for count data using Bayesian inference.

2.3.4. Component and Core Precision Matrix

The probabilistic toolbox allows specifying a prior $P(\boldsymbol{\Lambda}_i^{(n)})$ on the precision of each observation i and latent components D_n for factor matrix ${\bf A}^{(n)}$ . Similarly, for each element of the core array $g_{d_1,d_2,\ldots,d_N}$ it is possible to specify its precision $P(\psi_{d_1,d_2,\ldots,d_N})$ . At present time, only precisions following a Gamma distribution are considered. For the core array, the priors are

$\begin{align} P(\psi) = {\cal G}\left(\psi| \alpha_\psi, \beta_\psi\right), \quad P(\boldsymbol{\psi}) = \prod_{d = 1}^{D_1\cdot D_2 \cdots D_N}{\cal G}\left(\psi_d| \alpha_\psi, \beta_\psi\right)\nonumber\\ P(\boldsymbol{\psi}^{(n)}) = \prod_{d = 1}^{D_n}{\cal G}\left(\psi_d^{(n)} | \alpha_\psi, \beta_\psi\right), \quad n = 1,2,\ldots,N \end{align} \tag{ 20 }$

for learning a shared scale $\boldsymbol{\Psi} \equiv \psi{\bf I}$ , element-wise sparsity $\boldsymbol{\Psi} \equiv \mathrm{diag}(\boldsymbol{\psi})$ , and prune irrelevant slices $\boldsymbol{\Psi} \equiv \mathrm{diag}([\boldsymbol{\psi}^{(1)},\ldots \boldsymbol{\psi}^{(N)})$ , respectively. Note $\alpha_\psi$ and $\beta_\psi$ is the rate and shape of the Gamma distributions which are specified as broad priors.

For the latent factors, a full precision matrix for each observation i and mode n is rarely needed. Instead, it is modelled as

$\begin{align} \boldsymbol{\Lambda}_i^{(n)} \equiv \lambda {\bf I}, & \quad \boldsymbol{\Lambda}_i^{(n)} \equiv \mathrm{diag}\left(\boldsymbol{\lambda}\right), \quad \boldsymbol{\Lambda}_i^{(n)} \equiv \mathrm{diag}\left(\boldsymbol{\lambda}^{(n)}\right), \end{align} \tag{ 21 }$

$\begin{align} &\boldsymbol{\Lambda}_i^{(n)} \equiv \mathrm{diag}\left(\boldsymbol{\lambda}_i^{(n)}\right), \quad \text{or}\quad \boldsymbol{\Lambda}_i^{(n)} \equiv \boldsymbol{\Lambda}^{(n)}. \end{align} \tag{ 22 }$

These specifications model a shared scale, ARD shared over modes, mode-specific ARD, mode specific sparsity, and a mode specific interaction between the latent components. The latter $P(\boldsymbol{\Lambda}^{(n)}|\cdot)$ follow a Wishart distribution while the elements in each of the former specifications follow a Gamma distribution.

The Gamma distribution is used as it is the conjugate prior for the precision of a normal or truncated normal distribution, as well as for the rate of an exponential distribution. The Wishart prior is the conjugate prior of the precision matrix from a normal distribution and is only applicable to normal factor priors equation (14).

Gamma based ARD priors work well in many applications [25, 72, 76], but they do not necessarily provide the desired shrinkage [77, 78].

2.3.5. Inferring the Posterior Distribution

For a given decomposition model, its parameters or all the random variables of interest are denoted as $\boldsymbol{\theta}$ . As an example, consider Bayesian CP with three factors ${\bf A}^{(1)}, {\bf A}^{(2)}, {\bf A}^{(3)}$ , shared component precision $\boldsymbol{\lambda}$ , and homoscedastic noise τ for which $\boldsymbol{\theta} = \{{\bf A}^{(1)},{\bf A}^{(2)},{\bf A}^{(3)},\boldsymbol{\lambda},\tau\}$ is the set of random variables.

The probabilistic tensor decomposition toolbox seeks to characterize these random variables via their posterior distribution $P(\boldsymbol{\theta} | \boldsymbol{\mathcal{X}})$ . As discussed in section 2.2, exact inference is not feasible and the posterior is approximated using either variational Bayesian (VB) inference or Gibbs sampling, see also [16, 18] for details.

2.4. Prediction on Held-out Slices or Samples

Bayesian inference facilitates a principled way of predicting on held-out slices or samples from a tensor $\boldsymbol{\mathcal{X}}$ . This allows splitting the tensor into a disjoint training and test set, $\boldsymbol{\mathcal{X}} = [\boldsymbol{\mathcal{X}}_{train}, \boldsymbol{\mathcal{X}}_{test}]$ . This approach facilitate greater independence between training and test set when compared with leaving out elements or fibers of the tensor. Test set performance is assessed using the predictive posterior distribution [16, 21] where typically its logarithm is used,

$\begin{equation} \log P(\boldsymbol{\mathcal{X}}_{test}|\boldsymbol{\mathcal{X}}_{train}) = \log \int_{\boldsymbol{\theta}}\int_{\boldsymbol{\theta}^\star} P(\boldsymbol{\mathcal{X}}_{test},\boldsymbol{\theta},\boldsymbol{\theta}^\star|\boldsymbol{\mathcal{X}}_{train}) d\boldsymbol{\theta} d\boldsymbol{\theta}^\star. \end{equation} \tag{ 23 }$

Here $\boldsymbol{\theta}$ and $\boldsymbol{\theta}^\star$ are the random variables for the training and test set, respectively. Generally, this integral is intractable and the probabilistic tensor decomposition toolbox considers two approximation approaches inspired by [21].

The first approach relies on the assumption that the estimated variational posterior distribution is a good approximation of the true posterior distribution, i.e. $P(\boldsymbol{\theta}|\boldsymbol{\mathcal{X}}_{train}) \approx Q(\boldsymbol{\theta}|\boldsymbol{\mathcal{X}}_{train})$ . This provides the following lower-bound,

$\begin{align} \log P(\boldsymbol{\mathcal{X}}_{test} &| \boldsymbol{\mathcal{X}}_{train}) \approx \log \int_{\boldsymbol{\theta}}\int_{\boldsymbol{\theta}^\star} P(\boldsymbol{\mathcal{X}}_{test}|\boldsymbol{\theta},\boldsymbol{\theta}^\star)Q(\boldsymbol{\theta}|\boldsymbol{\mathcal{X}}_{train})P(\boldsymbol{\theta}^\star) d\boldsymbol{\theta} d\boldsymbol{\theta}^\star\nonumber\\ \geq& \int_{\boldsymbol{\theta}}\int_{\boldsymbol{\theta}^\star}Q(\boldsymbol{\theta}^\star)Q(\boldsymbol{\theta}|\boldsymbol{\mathcal{X}}_{train})\log \left(P(\boldsymbol{\mathcal{X}}_{test}|\boldsymbol{\theta},\boldsymbol{\theta}^\star)\frac{P(\boldsymbol{\theta}^\star)}{Q(\boldsymbol{\theta}^\star)}\right) d\boldsymbol{\theta} d\boldsymbol{\theta}^\star \end{align} \tag{ 24 }$

where $Q(\boldsymbol{\theta}^\star)$ is a variational approximation of the test set parameters.

The second approach assumes a point estimate of the training parameters $\boldsymbol{\theta}$ is a good approximation of the posterior distribution, e.g. $P(\boldsymbol{\theta}|\boldsymbol{\mathcal{X}}_{train}) \approx \, \left\langle \,\boldsymbol{\theta}\, \right\rangle \,$ where the mean value of $\boldsymbol{\theta}$ is used. This approximation then only requires evaluating the integral over testset parameters $\boldsymbol{\theta}^\star$ ,

$\begin{align} P(\boldsymbol{\mathcal{X}}_{test}|\boldsymbol{\mathcal{X}}_{train}) \approx & \int_{\boldsymbol{\theta}^{\star}} P(\boldsymbol{\mathcal{X}}_{test} | \boldsymbol{\theta}^\star,\, \left\langle \,\boldsymbol{\theta}\, \right\rangle \,)P(\boldsymbol{\theta}^\star) d\boldsymbol{\theta}^{\star} \end{align} \tag{ 25 }$

where $P(\boldsymbol{\mathcal{X}}_{test} | \cdot)$ is the likelihood of the test data and $P(\boldsymbol{\theta}^\star)$ the prior on the test parameters. This integral is sometimes tractable, but otherwise it is approximated via variational inference or sampling.

3. Results and discussion

This section contains a few experiments to illustrate the use of the probabilistic tensor decomposition toolbox. These experiments demonstrate the benefits of probabilistic modelling, but also serve to highlight potential limitations.

These illustrations are primarily based on two common fluorescence spectroscopy datasets. The Amino Acid dataset [79–81] contains five samples where fluorescence was measured as different emission levels (250–450 mm) and excitation levels (250–300 nm). The data is represented in the tensor $\boldsymbol{\mathcal{X}}^{5 \times 201 \times 61}$ where modes are samples, emission levels, and excitation levels. The samples are mixtures of three pure components and the concentrations are known.

The Sugar Process dataset [82, 83] consists of 268 equally spaced samples and for each sample fluorescence was measured at 571 emission levels and 7 excitation levels, the data is represented in the tensor $\boldsymbol{\mathcal{X}}^{268 \times 571 \times 7}$ . The samples were collected every 8 hours from a sugar plant and the dataset contains chemical shifts and is more noisy than the Amino Acid dataset.

For illustrating tensor completion, section 3.3, two additional datasets are considered. The Wakeman dataset [84, 85] which contains electroencephalogram (EEG) data from a visual experiment. Before analysis, the data was preprocessed as described in Chapter 42.3 of the SPM12 Manual (https://www.fil.ion.ucl.ac.uk/spm/) except for the addition of an initial filtering step for removing high frequency noise. This step used a band pass filter (FIR) [1,35]Hz with filter order 1815 and applied it to each channel and in both directions. This is further detailed in the script acq_wakeman_eeg_data.m. In this paper, only data from subject 7 and condition Famous is used. The data is represented as a tensor $\boldsymbol{\mathcal{X}}^{70 \times 181 \times 296}$ where the modes are channels, time points, and trials, respectively.

The Human Allen Brain dataset [86] consists of gene expression data from different brain tissues in six subjects, represented by $\boldsymbol{\mathcal{X}}^{58692 \times 414 \times 6}$ where the modes are genes, areas, and subjects, respectively. The data is available from http://human.brain-map.org/. For some brain areas, multiple tissue samples where gathered, but the present analysis ignores this aspect and averages over all samples within an area and subject. Previously, this preprocessing scheme was applied when analysing the data using a non-negative CP model based on either maximum likelihood [62] or Bayeasian inference [45]. Further details are found in the script acq_allen_geneexpr_data.m.

The toolbox is made to be extensible and future changes might affect how the developed functions are called. Therefore, specific function calls are not included. Instead, the name of the relevant demonstration file is given, i.e. demo_*.m.

3.1. Component extraction: amino acid and sugar process dataset

Tensor decomposition methods seek to factorize data, such that the extracted components represent the underlying structure of the data. Commonly, the components of this underlying structure are of interest as they provide a more compact and possibly interpretable representation of the data. The toolbox, presently, includes Bayesian Canonical Polyadic/PARAFAC (CP), Tucker, and Tensor Train (TT) decomposition. Calling these methods are demonstrated in the script demo_component_extraction.m which also visualizes the results obtained by each method.

This section only illustrates the CP decomposition, as neither Tucker nor TT decomposition have simple intuitive visualizations. The CP decomposition is applied to the Amino Acid and Sugar Process datasets, as described in demo_fluorescence.m which also generates figures 1(a)–(d).

Figure 1. Refer to the following caption and surrounding text. — **Figure 1.** **Component Extraction**: Estimation of Bayesian CP with either D_init = 10 or D_init = 100 initial components. For each scenario, the mode specific (samples, emission, and excitation) loadings are shown. The components are sorted by their estimated standard deviation ( $\, \left\langle \,\lambda_{d}\, \right\rangle \,^{-1/2}$ ) which is to the right.
Download figure:
Standard image High-resolution image

**Figure 1.** **Component Extraction**: Estimation of Bayesian CP with either D_init = 10 or D_init = 100 initial components. For each scenario, the mode specific (samples, emission, and excitation) loadings are shown. The components are sorted by their estimated standard deviation ( $\, \left\langle \,\lambda_{d}\, \right\rangle \,^{-1/2}$ ) which is to the right.
Download figure:
Standard image High-resolution image

Each dataset is standardized to have unit variance and a Bayesian CP with normal factors and shared column-wise precision (ARD) prior is fit. The random variables are $\boldsymbol{\theta} = \{{\bf A}^{(1)}, {\bf A}^{(2)}, {\bf A}^{(3)}, \boldsymbol{\lambda}, \tau\}$ and are inferred using variational inference.

To investigate if irrelevant components are pruned appropriately, the model is initialized with either 10 or 100 components, i.e. D_init = 10 or D_init = 100. The maximum number of iterations was 100 and 500 for the Amino Acid and Sugar Process datasets, respectively. Results are shown in figure 1.

For the Amino Acid dataset, the model correctly estimates the latent subspace and correctly identifies the components and concentrations of each sample, see figures 1(a) and (b). Note, that the ARD for D_init = 10 estimates 4 components, but the fourth component is a replicate of the yellow component and has near zero loadings in the sample mode.

For the Sugar Process dataset, seven components are estimated for D_init = 10. However, this is not the true chemical spectras, as samples contain chemical shifts that violate the tri-linear assumption of the CP model. Therefore, additional components are necessary to adequately represent the data. The effect of this is most visible in the emission mode, as seen in both figures 1(c) and (d).

Figure 1(d), the initial number of components are increased D_init = 100 and while most are pruned, it does not estimate the same number of components. This illustrates, that if the initial number of components are greatly over-specified, then the correct subspace might not be identified and it can be beneficial to refit the model with a lower number of initial components.

Generally, determining the true number of components for any dataset remains difficult, but the Bayesian-CP model has proven to provide more robust estimation when the number of components are over-specified [39, 45]. This added robustness of using a Bayesian formulation has also been observed for other tensor decomposition methods, for instance [27, 28, 57, 59–61].

3.1.1. Heteroscedastic noise estimation

Modelling mode-specific heteroscedastic noise allows quantification of data quality of the samples, as well as investigation of which part of the spectra the model has an uncertain representation of. This example shows an analysis of the Amino Acid data using a Bayesian CP with truncated normal factors, column-wise precision (ARD) prior on the sample mode, and heteroscedastic noise on all modes. The set of random variables is now $\boldsymbol{\theta} = \{{\bf A}^{(1)}, {\bf A}^{(2)}, {\bf A}^{(3)}, \boldsymbol{\lambda}, \boldsymbol{\tau}^{(1)}, \boldsymbol{\tau}^{(2)}, \boldsymbol{\tau}^{(3)}\}$ . The model is initialized with 10 components and results are shown in figure 2. The analysis is included in script demo_tb_heteroscedasticCP.m.

Figure 2. Refer to the following caption and surrounding text. — **Figure 2.** **Component Extraction**: Bayesian non-negative CP with heteroscedastic noise estimation and D_init = 10. The bottom row illustrates the estimated mode specific noise. Presently, the estimated noise has a scale indeterminacy and can only be interpreted within a mode.
Download figure:
Standard image High-resolution image

The estimated noise, presently, suffers from scale indeterminacy as $\boldsymbol{\tau}^{(1)}\odot \boldsymbol{\tau}^{(2)} = (\boldsymbol{\tau}^{(1)}\alpha^{-1})\odot (\boldsymbol{\tau}^{(2)}\alpha)$ for an arbitrary $\alpha \in \mathbb{R}_{\geq 0}$ . Therefore, the estimates are interpretable within each mode but not in absolute magnitude across modes. For comparison across modes, the estimate has to be invariant to scale changes, c.f. [27]. The heteroscedastic noise estimation is useful for down-weighting specific samples, emission levels, or excitation levels. However, it does not enforce a fixed threshold where an observation i_n for some mode n should be left out of the estimation.

3.2. Data denoising

Tensor decomposition methods can be used for data denoising where the goal is to remove unstructured and/or structured noise from the data. This is illustrated on the Amino Acid dataset using either a CP, Tucker or Tensor Train decomposition. This reproduces an experiment from the Bayesian TT article [61], but presently also including the Bayesian Tucker decomposition with orthogonal factors.

Similar to [61], the number of components are set to D = 3 in the CP model and D = (1, 6, 5, 1) in the TT model. For the Tucker decomposition, the core size is determined by exhaustive evaluation using maximally $D_1 = 5, D_2 = D_3 = 10$ resulting in 10²· 5 models for which the model achieving the highest evidence lowerbound had core size D = [5, 9, 9]. Each of the decomposition is fit until convergence or 100 iterations. The toolbox includes the script demo_denoising.m which reproduces this analysis.

The original data, reconstructed data, and the residual error for each sample using each decomposition are shown in figure 3. The differences between the decomposition methods are most apparent for sample 3 and 5 where Rayleigh scattering (structured noise) is also most pronounced.

Figure 3. Refer to the following caption and surrounding text. — **Figure 3.** **Data Denoising**: For the Amino Acid dataset, the raw data from each sample (top row) and the reconstructed sample based on either CP, Tucker, and TT decomposition are shown. The residual error under each decomposition is also shown.
Download figure:
Standard image High-resolution image

The CP model is the closest approximation to the ground truth model [79, 81] and provides the best denoised reconstruction. Structured noise is retained in both the Tucker and TT decomposition which is attributed to these models being more flexible than the CP model.

3.3. Tensor completion

The last example is missing value prediction or data completion for which a variety of tensor methods has been proposed, for a review see [87]. Presently, the developed toolbox only allows missing values for CP decomposition which limits this demonstration to the Bayesian CP model. For the Bayesian Tucker model, marginalizing over missing values was presented in [28].

The CP model is applied to the Sugar Process data, subject 7 from the Wakeman EEG data, and the Allen Human Brain data. Missing values are generated by holding out a subset of the observed data, the remaining data is used for training. Model performance is assessed on the held out data, i.e. test data, using the root mean square error (RMSE). Two missing value scenarios are displayed, first elements missing at random and then entire fibers missing. The missing fibers are emission spectras (mode-2 fibers), trials (mode-1 fibers), and samples (mode-1 fibers) for the Sugar Process, Wakeman, and Allen Brain data, respectively. In contrast to the Sugar Process and Wakeman data, the Allen Brain data is not fully observed and has 43.44% missing values due to missing samples (mode-1 fibers).

For each dataset, the true number of components is unknown and a CP with normal factors and ARD is fit using D = 10 initial components. For both VB and Gibbs samplings 250 iterations are run. To reconstruct the data, VB uses the expected value in the final iteration whereas for Gibbs sampling the average reconstruction error of the last 125 samples is used. To mitigate the effect of local minima, each model is fit 10 times and the median error is displayed. The analysis is reproduced by the script demo_tensor_completion.m and results are given in figure 4.

Figure 4. Refer to the following caption and surrounding text. — **Figure 4.** **Tensor Completion**: Data entries are missing at random (top row) or fibers are missing at random (bottom row). The percentage of missing values is varied and for each setting the root mean square error (RMSE) is given for the training and test set. The training-test split was generated such that there where no missing slices in the training data.
Download figure:
Standard image High-resolution image

When elements are missing at random, figure 4(a)-(c), the results are similar and performance is mostly unchanged from the initial 10% missing values until around 80% missing values. The test set performance then degrades for both VB and Gibbs sampling.

When entire fibers are missing at random, figure 4(d)–(f), the performance visibly differ between the datasets. Figure 4(d) and (f) show performance degrades earlier when entire fibers are missing, both for VB and Gibbs sampling. The transition for the Allen data is less abrupt than in the Sugar Process data. This is attributed to missing scheme matching the actual reason for missing data in the Allen data. In contrast, for the Sugar Process data, leaving emission fibers out rapidly removes all essential information and does not reflect a valid missing data mechanism.

For the EEG data, the missing fibers (trials) have little effect on the predictive performance. The reason for this is that the time-lock signal is never fully removed due to event related EEG data being highly correlated across channels and trials.

Consider leaving out all timepoints for channel i and trial j , i.e. a mode-2 fiber $\boldsymbol{\mathcal{X}}_{i,:,j}$ , then the left out channel is highly correlated with all other channels during trial j due to spatial correlation of the signal. Furthermore, the time locked nature of the experiment ensures that even if the maximum possible fibers are removed, such that each trial contains only one observed channel, the magnitude of the underlying signal at each time point can still be recovered as each channel has on average 4.2 observed trials.

This demonstration illustrates how the amount of missing data and its structure affects tensor completion. Furthermore, it highlights the importance selecting a test set which reflect the actual missing data mechanism. This was illustrated on the Allen data, were leaving out individual elements leads to over-confidence in the results.

4. Conclusion

In this paper, we introduced the probabilistic tensor decomposition toolbox which allows for Bayesian inference in the Canonical Polyadic/PARAFAC (CP), Tucker, and Tensor Train decomposition models. The toolbox offers flexibility in the choice of priors/constraints on factors, both homo- and heteroscadastic noise modelling and efficient inference via the variational approximation. The results indicate that Bayesian inference successfully prunes factors facing over-specified models. However, in practice violations of the model structure still leads to ambiguity with respect to the number of factors to be included. The results show that heteroscedastic noise modelling is useful for quantification of noise and provides additional useful information for model interpretation, which can help researchers identify noisy samples and indicate potential issues with the data acquisition process. In conclusion, the proposed Bayesian toolbox for probabilistic tensor decomposition naturally extends existing methods based on maximum likelihood estimation, and offers more comprehensive modelling accounting for uncertainty readily applicable in the many disparate fields in which tensor decomposition is already widely applied.

5. Data availability

Data sharing is not applicable to this article as no new data were created or analysed in this study. The analyzed data are available online.

The Amino Acid dataset [79–81] and Sugar Process dataset [82, 83] are available at http://www.models.life.ku.dk/datasets. The Human Allen Brain dataset [86] is available at http://human.brain-map.org/static/download. The Wakeman dataset [84, 85] is available at ftp://ftp.mrc-cbu.cam.ac.uk/personal/rik.henson/wakemandg_hensonrn and details on acquiring and processing it are given in Chapter 42 of the SPM12 Manual (https://www.fil.ion.ucl.ac.uk/spm/).

The scripts for reproducing the analysis are provided as part of the Probabilistic Tensor Toolbox, see https://github.com/JesperLH/prob-tensor-toolbox/.

6. VB and Gibbs based inference

This appendix describes how to infer each of the random variables using either Gibbs sampling [15, 16] or variational Bayesian inference [18, 21]. The former relies on determining the conditional distributions and the latter on finding a set of distribution that approximate the true posterior distribution. These two inference methods results in similar updates schemes. To avoid duplicating the appendix, the notation is abused and $\, \left\langle \,\cdot\, \right\rangle \,$ denote either a sample form the relevant conditional distribution (when using Gibbs sampling) or the expected value under the Q-distribution (when using VB).

Updating of the CP factor matrices under homoscedastic noise, heteroscedastic noise, and in the presence of missing values is presented in section 6.1. Section 6.2 concerns inferring the precision prior on factor matrices under the different prior choices. Section 6.3 concerns inferring Tucker factor matrices while the core array and precision prior are presented in section 6.4. Finally, section 6.5 gives updates for inferring homoscedastic or mode-specific heteroscedastic noise precision.

Inference of the Probabilistic Tensor Train model are stated in [61]. These are not restated as this paper does not expand on the model.

Most of the update rules presented in this appendix are not novel, but restating them in a common notation provides an overview and more easy comparison of their differences and similarities.

To monitor the progress of the inference procedure, variational inferences uses the evidence lowerbound (ELBO) which increases monotonically each iteration. Gibbs sampling monitors the log-joint distribution which should generally increase, but not monotonically. The expressions are calculated as,

$\begin{align} \text{ELBO}(\boldsymbol{\mathcal{X}},\boldsymbol{\theta}) = & \, \left\langle \,\log P(\boldsymbol{\mathcal{X}}|\boldsymbol{\theta})\, \right\rangle \,_Q + \, \left\langle \,\log P(\boldsymbol{\theta})\, \right\rangle \,_Q - \, \left\langle \,\log Q(\boldsymbol{\theta})\, \right\rangle \,_Q \end{align} \tag{ 26 }$

$\begin{align} \log P(\boldsymbol{\mathcal{X}},\boldsymbol{\theta}^{t}) = & \log P(\boldsymbol{\mathcal{X}}|\boldsymbol{\theta}^{t}) + \log P(\boldsymbol{\theta}^t) \end{align} \tag{ 27 }$

where $\, \left\langle \,\cdot\, \right\rangle \,_Q$ is expectation under the Q-distribution for variational inference. For sampling $\boldsymbol{\theta}^t$ is the sample of random variables at iteration t.

Notation

In addition to the previous notation, the sequential product of N matrices is defined as,

$\begin{align*} \bigotimes_{n = 1}^N {\bf A}^{(n)} = & {\bf A}^{(N)} \otimes {\bf A}^{(N-1)} \otimes \cdots \otimes {\bf A}^{(1)}, \qquad {\bf A}^{(n)} \in \mathbb{R}^{I_n \times J_n} \, \forall_n, \\ \bigodot_{n = 1}^N {\bf A}^{(n)} = & {\bf A}^{(N)} \odot {\bf A}^{(N-1)} \odot \cdots \odot {\bf A}^{(1)}, \qquad {\bf A}^{(n)} \in \mathbb{R}^{I_n \times J} \, \forall_n \\ \overset{N}{\underset{n = 1}{\bigcirc}} {\bf A}^{(n)} = & {\bf A}^{(N)} \circ {\bf A}^{(N-1)} \circ \cdots \circ {\bf A}^{(1)}, \qquad {\bf A}^{(n)} \in \mathbb{R}^{I \times J} \, \forall_n , \end{align*}$

which is the sequential Kronecker, Khatri-Rao, and element-wise product, respectively. For the Kronecker and Khatri-Rao products, the order of multiplication matters and here it is assumed to be numerically descending. Additionally, $\bigotimes_{n \neq m}$ and $\bigodot_{n \neq m}$ will be used for the sequential product of the $n = N,N-1,\ldots, m+1, m-1, \ldots,1$ matrices.

6.1. Updating factor priors

6.1.1. Prior follows a normal distribution

Let the n-mode factor matrix, ${\bf A}^{(n)}$ , follow a normal distribution $P({\bf A}^{(n)}|\boldsymbol{\theta}) = \prod_{i = 1}^{I_n} \mathcal{N}\left({\bf a}_i^{{(n)}^\top} | \boldsymbol{\mu}_i^{(n)^\top}, \boldsymbol{\Lambda}_i^{(n)^{-1}}\right)$ , then the Q-distribution follows

$\begin{align} Q({\bf A}^{(n)}) = & \prod_{i = 1}^{I_n} \mathcal{N}\left({\bf m}_i^{(n)^{\top}}, \boldsymbol{\Sigma}_i^{(n)}\right), \end{align} \tag{ 28 }$

where ${\bf m}_i^{(n)}$ and $\boldsymbol{\Sigma}_i^{(n)}$ are the estimated mean and covariance for observation i in mode n. Generally, the covariance is not observation specific, but this depends on the prior on precision matrix, missing values, and noise modelling.

Under a homoscedastic noise assumption, the factor is updated as

$\begin{align} \boldsymbol{\Sigma}_i^{(n)} = & \left(\, \left\langle \,\boldsymbol{\Lambda}_i^{(n)}\, \right\rangle \, + \, \left\langle \,\tau\, \right\rangle \,\bigcirc_{m\neq n}\, \left\langle \,{\bf A}^{(m)^\top}{\bf A}^{(m)}\, \right\rangle \,\right)^{-1} \end{align} \tag{ 29 }$

$\begin{align} {\bf m}_i^{(n)^{\top}} = & \boldsymbol{\Sigma}_i^{(n)}\left(\, \left\langle \,\tau\, \right\rangle \,\, \left\langle \,\texttt{MTTKRP}(\boldsymbol{\mathcal{X}}; \{{\bf A}^{(m)}\}_{m\neq n})\, \right\rangle \,_{i} + \, \left\langle \,\boldsymbol{\Lambda}_i^{(n)}\, \right\rangle \,\, \left\langle \,\boldsymbol{\mu}^{(n)^{\top}}_i\, \right\rangle \, \right). \end{align} \tag{ 30 }$

If heteroscedastic noise is modelled on all modes, then the factor is updated as

$\begin{align} \boldsymbol{\Sigma}_i^{(n)} = & \left(\, \left\langle \,\boldsymbol{\Lambda}_i^{(n)}\, \right\rangle \, + \, \left\langle \,\tau_i^{(n)}\, \right\rangle \,\bigcirc_{m\neq n}\, \left\langle \,{\bf A}^{(m)^\top}\mathrm{diag}(\boldsymbol{\tau}^{(m)}){\bf A}^{(m)}\, \right\rangle \,\right)^{-1} \end{align} \tag{ 31 }$

$\begin{align} {\bf m}_i^{(n)^{\top}} = & \boldsymbol{\Sigma}_i^{(n)}\left(\, \left\langle \,\tau_i^{(n)}\, \right\rangle \,\, \left\langle \,\texttt{MTTKRP}(\boldsymbol{\mathcal{X}}; \{\mathrm{diag}(\boldsymbol{\tau}^{(m)}){\bf A}^{(m)}\}_{m\neq n})\, \right\rangle \,_{i} + \, \left\langle \,\boldsymbol{\Lambda}_i^{(n)}\, \right\rangle \,\, \left\langle \,\boldsymbol{\mu}^{(n)^{\top}}_i\, \right\rangle \, \right) \end{align} \tag{ 32 }$

Marginalizing over missing values leads to

$\begin{align} \boldsymbol{\Sigma}_{i_n}^{(n)} = & \left(\, \left\langle \,\boldsymbol{\Lambda}_{i_n}^{(n)}\, \right\rangle \, + \, \left\langle \,\tau\, \right\rangle \,\sum_{\mathbf{i} \in \boldsymbol{\mathcal{O}}} \bigcirc_{m\neq n}\, \left\langle \,{\bf a}_{i_m}^{(m)^\top}{\bf a}_{i_m}^{(m)}\, \right\rangle \,\right)^{-1} \end{align} \tag{ 33 }$

$\begin{align} {\bf m}_{i_n}^{(n)^{\top}} = & \boldsymbol{\Sigma}_{i_n}^{(n)}\left(\, \left\langle \,\tau\, \right\rangle \, \, \left\langle \,\texttt{MTTKRP}(\boldsymbol{\mathcal{X}} \circ \boldsymbol{\mathcal{O}}; \{{\bf A}^{(m)}\}_{m\neq n})\, \right\rangle \,_{i_n} + \, \left\langle \,\boldsymbol{\Lambda}_{i_n}^{(n)}\, \right\rangle \,\, \left\langle \,\boldsymbol{\mu}^{(n)^{\top}}_{i_n}\, \right\rangle \, \right) \end{align} \tag{ 34 }$

where $\mathbf{i} = (i_1,\ldots, i_{n-1}, {i_n}, i_{n+1},\ldots,i_N)$ and indexes an observed element in $\boldsymbol{\mathcal{O}}$ . Here $\boldsymbol{\mathcal{O}}$ is both the set of observed elements and a binary indicator. The update rules are similar to the fully observed case, but here only the expectation when reconstructing present elements is included.

6.1.2. Prior follows a truncated normal, exponential, or uniform distribution

Let the n-mode factor matrix, ${\bf A}^{(n)}$ , follow $P({\bf A}^{(n)}|\boldsymbol{\theta}) = \prod_{i = 1}^{I_n} \prod_{d = 1}^{D_n} g(a_{id})$ where $g(a_{id}^{(n)})$ is either $\mathcal{N}_{[0,\infty]}\left(a_{id}^{{(n)}^\top} | \mu_{id}^{(n)^\top}, \lambda_{id}^{(n)^{-1}}\right)$ , $\mathrm{Exponential}\left(a_{id}^{{(n)}^\top} | \lambda_{id}^{(n)}\right)$ with rate $\lambda_{id}^{(n)}$ , or $\mathrm{Uniform}_{[\alpha,\beta]}\left(a_{id}^{{(n)}^\top} | \alpha, \beta\right)$ with lower and upper-bounds α and β. Under a normal likelihood, then each of these prior distributions has the following Q-distribution,

$\begin{align} Q({\bf A}^{(n)}) = & \prod_i^{I_n}\prod_d^D \mathcal{N}_{[0,\infty]}\left(m_{id}^{(n)}, \sigma_{id}^{(n)^{2}}\right) \end{align} \tag{ 35 }$

Under a homoscedastic noise assumption, the factor is updated via,

$\begin{align} \sigma_{id}^{(n)^2} = & \left(\sigma_\text{C}^2 + \, \left\langle \,\tau\, \right\rangle \,\prod_{m\neq n}\, \left\langle \,{\bf a}_d^{(m)^\top}{\bf a}_d^{(m)}\, \right\rangle \, \right)^{-1} \end{align} \tag{ 36 }$

$\begin{align} m_{id}^{(n)} = & \sigma_{id}^{(n)^2}\Big(\mu_\text{C} + \, \left\langle \,\tau\, \right\rangle \, \Big(\, \left\langle \,\bigodot_{m\neq n} {\bf a}_d^{(m)}\, \right\rangle \,^\top {\bf x}_{(n),i}^\top - \sum_{d'\neq d} \, \left\langle \,a_{id'}^{(n)} \, \right\rangle \,\, \left\langle \,\bigcirc_{m\neq n} {\bf a}_{d}^{(m)^\top} {\bf a}_{d'}^{(m)} \, \right\rangle \,\Big)\Big) \end{align} \tag{ 37 }$

where $\sigma_\text{C}^2$ and $\mu_\text{C}$ depend on the prior distribution, $P(a_{ij}^{(n)})$ , which slightly changes the resulting update. These values are, $(\mu_\text{C}, \sigma_\text{C}^2) = \left(\, \left\langle \,\mu_{id}^{(n)}\, \right\rangle \, \, \left\langle \,\lambda_{id}^{(n)}\, \right\rangle \,, \, \left\langle \,\lambda_{id}^{(n)}\, \right\rangle \,\right)$ for the truncated normal distribution, $(\mu_\text{C}, \sigma_\text{C}^2) = \left(-\, \left\langle \,\lambda_{id}^{(n)}\, \right\rangle \,, 0\right)$ for the exponential distribution, and $(\mu_\text{C}, \sigma_\text{C}^2) = (0,0)$ for the uniform distribution.

Similarly, under a heteroscedastic noise assumption, the factor is updated via,

$\begin{align} \sigma_{id}^{(n)^2} = & \left(\sigma_\text{C} + \, \left\langle \,\tau_i^{(n)}\, \right\rangle \,\prod_{m\neq n}\, \left\langle \,{\bf a}_d^{(m)^\top}\mathrm{diag}(\boldsymbol{\tau}^{(m)}){\bf a}_d^{(m)}\, \right\rangle \, \right)^{-1} \end{align} \tag{ 38 }$

$\begin{align} m_{id}^{(n)} = & \sigma_{id}^{(n)^2}\left(\mu_\text{C} + \, \left\langle \,\tau_i^{(n)}\, \right\rangle \, \left(\, \left\langle \,\bigodot_{m\neq n} \mathrm{diag}(\boldsymbol{\tau}^{(m)}){\bf a}_d^{(m)}\, \right\rangle \,^\top {\bf x}_{(n),i}^\top \right. \right. \nonumber\\[6pt] &\, \left. \left.- \sum_{d'\neq d} \, \left\langle \,a_{id'}^{(n)} \, \right\rangle \,\, \left\langle \,\bigcirc_{m\neq n} {\bf a}_{d}^{(m)^\top} \mathrm{diag}(\boldsymbol{\tau}^{(m)}) {\bf a}_{d'}^{(m)} \, \right\rangle \,\right)\right) \end{align} \tag{ 39 }$

Marginalizing over missing values is the same as only accounting for the contribution of present elements and leads to,

$\begin{align} \sigma_{{i_n}d}^{(n)^2} = & \left(\sigma_\text{C} + \, \left\langle \,\tau\, \right\rangle \,\sum_{\mathbf{i} \in \boldsymbol{\mathcal{O}}} \prod_{m\neq n}\, \left\langle \, a_{i_m,d}^{(m)^2} \, \right\rangle \, \right)^{-1} \end{align} \tag{ 40 }$

$\begin{align} m_{{i_n}d}^{(n)} = & \sigma_{{i_n}d}^{(n)^2}\left(\mu_\text{C} + \, \left\langle \,\tau\, \right\rangle \, \sum_{\mathbf{i} \in \boldsymbol{\mathcal{O}}}\left(\, \left\langle \,\prod_{m\neq n} a_{i_m,d}^{(m)}\, \right\rangle \,^\top {\bf x}_{\mathbf{i}} \right.\right. \nonumber\\[8pt] & \left. \left.- \sum_{d'\neq d} \, \left\langle \,a_{{i_n}d'}^{(n)} \, \right\rangle \,\prod_{m\neq n} \, \left\langle \, a_{i_m,d}^{(m)^\top} a_{i_m,d'}^{(m)} \, \right\rangle \,\right)\right) \end{align} \tag{ 41 }$

where $\mathbf{i} = (i_1,\ldots,i_{n-1}, {i_n}, i_{n+1},\ldots,i_N)$ and provides the index of each present element in $\boldsymbol{\mathcal{O}}$ .

The expression for log-prior and entropy contributions are easily derived following standard textbook definitions. However, evaluating tail probabilities in the truncated normal distribution can be numerically unstable and the toolbox implements the approach described in [88].

6.1.3. Prior follows a von Mises-Fisher matrix distribution

Let the n-mode factor matrix, ${\bf A}^{(n)}$ , follow a von Mises-Fisher distribution $P({\bf A}^{(n)}|{\bf F}_0) = {\rm v}\mathcal{MF}\left( {\bf A}^{{(n)}} | {\bf F}_0 \right)$ , then the Q-distribution follows,

$\begin{align} Q({\bf A}^{(n)}) = &{\rm v}\mathcal{MF}\left({\bf F}^{(n)}\right) \end{align} \tag{ 42 }$

Under a homoscedastic noise assumption, the concentration matrix is,

$\begin{align} {\bf F}^{(n)} = & \, \left\langle \,\tau\, \right\rangle \, \texttt{MTTKRP}\left(\boldsymbol{\mathcal{X}}; \, \left\langle \,\{{\bf A}^{(m)}\}_{m\neq n}\, \right\rangle \, \right) +{\bf F}_0 \end{align} \tag{ 43 }$

If any factor follow a ${\rm v}\mathcal{MF}$ -distribution, then marginalization of missing values is not supported. Instead, it is possible to impute missing values and perform updates as in the fully observed case. The usual caveats for imputation of missing values then apply.

Heteroscedastic noise is not supported on any mode following the ${\rm v}\mathcal{MF}$ -distribution. It is, however, still possible to model heteroscedastic noise on non-orthogonal modes. For heteroscedastic noise and non-orthogonal factors on all but mode n, the factor is updated as,

$\begin{align} {\bf F}^{(n)} = & \, \left\langle \,\tau\, \right\rangle \, \texttt{MTTKRP}\left(\boldsymbol{\mathcal{X}}; \, \left\langle \,\{\mathrm{diag}(\boldsymbol{\tau}^{(m)}){\bf A}^{(m)}\}_{m\neq n}\, \right\rangle \, \right) +{\bf F}_0 \end{align} \tag{ 44 }$

For variational inference, the expected value of the von Mises-Fisher matrix distribution is determined as in [89]. Determining the log prior and entropy requires evaluating the hyper-geometric function $_0F_1$ with a matrix argument for which efficient approaches are presented in [89, 90]. Presently, sampling is not included, but can be implemented following the work in [91].

6.2. Updating factor precision or rate prior

For factor matrices following a normal or truncated normal distribution it is possible to infer the precision matrix $\boldsymbol{\Lambda}_i^{(n)}$ or a restricted form of it as described in section 2.3.4. Similarly, for factors following an exponential distribution it is possible to infer the rate parameter. For the truncated normal and exponential distribution, it is possible to infer either the scale, component relevance (ARD), or element-wise sparsity. Similarly, the scale and component relevance can be inferred for the multivariate normal distribution, as well as the full precision matrix following a Wishart distribution. However, the toolbox presently does not allow multivariate factors with element-wise sparsity.

Note, uniform factors has no inferable parameters and for the von Mises-Fisher matrix distribution inferring the concentration matrix F₀ is not supported.

The toolbox allow sharing precision/rate priors across several modes, where the shared set is defined as n ∈ Ω_shared. For each distribution, a constant η_n affects how much the n^th factor contribute to the prior update. For the normal and truncated normal distribution η_n = 0.5 and for the exponential distribution η_n = 1. Dependent on the choice of prior P(·), the updates are as follows.

For learning the scale of n ∈ Ω_shared, let $\boldsymbol{\Lambda}_i^{(n)}\equiv \lambda{\bf I}_D$ and the distribution and updates are,

$\begin{align} P(\lambda) = & {\cal G}\left(\alpha_\lambda, \beta_\lambda\right), \qquad Q(\lambda) = {\cal G}\left(\tilde{\alpha}_\lambda, \tilde{\beta}_\lambda\right) \end{align} \tag{ 45 }$

$\begin{align} \tilde{\alpha}_\lambda = & \alpha_\lambda + D\sum_{n\in\Omega} I_n \eta_n , \qquad\tilde{\beta}_\lambda = \beta_\lambda + \sum_{n\in\Omega} \mathrm{trace}\left(\, \left\langle \,{\bf A}^{(n)^\top}{\bf A}^{(n)}\, \right\rangle \,\right)\eta_n \end{align} \tag{ 46 }$

For automatic relevance determination of the components, let $\boldsymbol{\Lambda}_i^{(n)}\equiv \mathrm{diag}(\boldsymbol{\lambda})$ and the distributions and their updates are,

$\begin{align} P(\boldsymbol{\lambda}) = & \prod_{d = 1}^D{\cal G}\left(\lambda_d|\alpha_\lambda, \beta_\lambda\right) , \qquad Q(\boldsymbol{\lambda}) = \prod_{d = 1}^D {\cal G}\left(\lambda_d|\tilde{\alpha}_\lambda, \tilde{\beta_d}_\lambda\right) \end{align} \tag{ 47 }$

$\begin{align} \tilde{\alpha}_\lambda = & \alpha_\lambda + \sum_{n\in\Omega} I_n \eta_n , \qquad \tilde{\boldsymbol{\beta}}_\lambda = \beta_\lambda + \sum_{n\in\Omega} \mathrm{diag}\left(\, \left\langle \,{\bf A}^{(n)^\top}{\bf A}^{(n)}\, \right\rangle \,\right)\eta_n \end{align} \tag{ 48 }$

For element-wise sparsity, a precision prior is placed on each observation, i.e. $\boldsymbol{\Lambda}_i^{(n)}\equiv \mathrm{diag}(\boldsymbol{\lambda}_i)$ . This gives the following update,

$\begin{align} P(\boldsymbol{\lambda}_i) = & \prod_{d = 1}^D{\cal G}\left(\lambda_{id}|\alpha_\lambda, \beta_\lambda\right) , \qquad Q(\boldsymbol{\lambda}_i) = \prod_{d = 1}^D {\cal G}\left(\lambda_{id}|\tilde{\alpha}_\lambda, \tilde{\beta_{id}}_\lambda\right) \end{align} \tag{ 49 }$

$\begin{align} \tilde{\alpha}_\lambda = & \alpha_\lambda + \sum_{n\in\Omega} \eta_n , \qquad \tilde{\boldsymbol{\beta}_i}_\lambda = \beta_\lambda + \sum_{n\in\Omega} \mathrm{diag}\left(\, \left\langle \,{\bf a}_i^{(n)^\top}{\bf a}_i^{(n)}\, \right\rangle \,\right)\eta_n \end{align} \tag{ 50 }$

For inferring the full precision matrix under multivariate normal factors, let $\boldsymbol{\Lambda}_i^{(n)}\equiv \boldsymbol{\Lambda}$ then,

$\begin{align} P(\boldsymbol{\Lambda}) = & \mathrm{Wishart}\left(w,{\bf V}\right) , \qquad Q(\boldsymbol{\Lambda}) = \mathrm{Wishart}\left(\tilde{w},\tilde{{\bf V}}\right) \end{align} \tag{ 51 }$

$\begin{align} \tilde{w} = & w + \sum_{n\in\Omega} I_n \eta_n , \qquad \tilde{{\bf V}} = {\bf V} + \sum_{n\in\Omega} \, \left\langle \,{\bf A}^{(n)^\top}{\bf A}^{(n)}\, \right\rangle \, \eta_n \end{align} \tag{ 52 }$

Note, the updates presented in this section are not affected by whether the noise is homoscedastic or heteroscedastic.

6.3. Updating tucker factor matrix

Specifying the prior on the factor matrices is done identically for the CP and Tucker model. However, updating factors in the Tucker model requires accounting for the core array which is not done in the updates presented in section 6.1.

Presently, the toolbox only implements factors following the von Mises-Fisher matrix distribution, i.e. orthogonal factors, where the prior distribution is $P({\bf A}^{(n)}) = {\rm v}\mathcal{MF}\left({\bf A}^{(n)} | {\bf F}_0\right)$ . The Q-distribution then is then,

$\begin{align} Q({\bf A}^{(n)}) = {\rm v}\mathcal{MF}\left({\bf A}^{(n)} | \tilde{{\bf F}}^{(n)}\right) \end{align} \tag{ 53 }$

and the estimated concentration matrix is

$\begin{align} \tilde{{\bf F}}^{(n)} = \, \left\langle \,\tau\, \right\rangle \, {\bf X}_{(n)} \left(\bigotimes_{m\neq n} \, \left\langle \,{\bf A}^{(m)}\, \right\rangle \, \right)^\top \, \left\langle \,{\bf G}_{(n)}^\top\, \right\rangle \, + {\bf F}_0. \end{align} \tag{ 54 }$

The toolbox assumes ${\bf F}_0 = {\bf 0}$ which is the uniform prior on the D_n-sphere.

6.4. Updating the core prior and its precision prior

Let the core array $\boldsymbol{\mathcal{G}}$ be represented by its vectorization ${\bf g}^{D_1 D_2\cdots D_N \times 1} = \mathrm{vec}(\boldsymbol{\mathcal{G}})$ whose prior follows a normal distribution $P({\bf g}|\boldsymbol{\mu}_{\boldsymbol{\mathcal{G}}}, \boldsymbol{\Psi}) = \mathcal{N}\left({\bf g}|\boldsymbol{\mu}_{\boldsymbol{\mathcal{G}}},\boldsymbol{\Psi}^{-1}\right)$ . Then, the Q-distribution has the following form,

$\begin{align} Q({\bf g}) = \mathcal{N}\left({\bf g}|\boldsymbol{\mu}_{\bf g},\boldsymbol{\Sigma}_{\bf g}\right) \end{align} \tag{ 55 }$

Under homoscedastic noise, the core is then updated as,

$\begin{align} \boldsymbol{\Sigma}_{\bf g} = & \left(\, \left\langle \,\boldsymbol{\Psi}\, \right\rangle \, + \, \left\langle \,\tau\, \right\rangle \, \, \left\langle \,\left(\bigotimes_{n = 1}^N {\bf A}^{(n)^\top}{\bf A}^{(n)}\right) \, \right\rangle \,\right)^{-1}, \end{align} \tag{ 56 }$

$\begin{align} \boldsymbol{\mu}_{\bf g} = & \boldsymbol{\Sigma}_{\bf g} \left(\, \left\langle \,\tau\, \right\rangle \, \left(\bigotimes_{n = 1}^N \, \left\langle \,{\bf A}^{(n)}\, \right\rangle \, \right)^\top \mathrm{vec}(\boldsymbol{\mathcal{X}}) + \, \left\langle \,\boldsymbol{\Psi}\, \right\rangle \,\boldsymbol{\mu}_{\boldsymbol{\mathcal{G}}}\right), \end{align} \tag{ 57 }$

Presently, the toolbox only considers $\boldsymbol{\mu}_{\boldsymbol{\mathcal{G}}} = {\bf 0}$ and ${\bf A}^{(n)}\sim {\rm v}\mathcal{MF}({\bf F}_0)$ . The latter makes updating equation (56) exceedingly simple as ${\bf A}^{(n)^\top}{\bf A}^{(n)} = {\bf I}$ result in $\boldsymbol{\Sigma}_{{\bf g}}$ being a diagonal matrix.

As presented in section 2.3.4, how $P(\boldsymbol{\Psi})$ is specified determines the functionality of the core. The possibilities are inferring the scale, determining element-wise sparsity, or slice-wise pruning.

For inferring the scale, then $\boldsymbol{\Psi} \equiv \psi{\bf I}_{D}$ and the distributions and updates are,

$\begin{align} P(\psi) & = {\cal G}\left(\psi | \alpha_\psi, \beta_\psi\right) , \qquad Q(\psi) = {\cal G}\left(\tilde{\alpha}_\psi, \tilde{\beta}_\psi \right) \end{align} \tag{ 58 }$

$\begin{align} \tilde{\alpha}_\psi & = \alpha_\psi + \frac{1}{2}\prod_{m = 1}^N D_m , \qquad \tilde{\beta}_\psi = \beta_\psi + \frac{1}{2}\, \left\langle \,{\bf g}^\top{\bf g}\, \right\rangle \, \end{align} \tag{ 59 }$

For inferring element-wise sparisty, the distributions and updates are,

$\begin{align} P(\boldsymbol{\Psi}) & = \prod_{\bf d} {\cal G}\left(\psi_{\bf d}|\alpha_\psi, \beta_\psi \right) ,\qquad Q(\boldsymbol{\Psi}) = \prod_{\bf d}{\cal G}\left(\psi_{\bf d}|\tilde{\alpha}_\psi, \tilde{\beta}_{\psi_{\bf d}} \right) \end{align} \tag{ 60 }$

$\begin{align} \tilde{\alpha}_\psi & = \alpha_\psi + \frac{1}{2} , \qquad \tilde{\beta}_{\psi_{\bf d}} = \beta_\psi + \frac{1}{2}\, \left\langle \,g_{\bf d}^2\, \right\rangle \, \end{align} \tag{ 61 }$

For slice-wise automatic relevance determination $\boldsymbol{\Psi} \equiv \bigotimes_{n = 1}^N \mathrm{diag} \left( \boldsymbol{\psi}^{(n)} \right)$ and the distribution and updates for each mode n are,

$\begin{align} P(\boldsymbol{\psi}^{(n)}) & = \prod_{d = 1}^{D_n} {\cal G}\left(\psi_d^{(n)} | \alpha_\psi, \beta_{\psi} \right), \qquad Q(\boldsymbol{\psi}^{(n)}) = \prod_{d = 1}^{D_n} {\cal G}\left(\psi_d^{(n)} | \tilde{\alpha}_\psi, \tilde{\beta}_{\boldsymbol{\psi}^{(n)}} \right) \end{align} \tag{ 62 }$

$\begin{align} \tilde{\alpha}_\psi & = \alpha_\psi + \frac{1}{2}\prod_{m\neq n} D_m , \quad \tilde{\beta}_{\boldsymbol{\psi}_d^{(n)}} = \beta_\psi + \frac{1}{2} \mathrm{diag}\left( \, \left\langle \,{\bf G}_{(n)} \left( \bigotimes_{m\neq n}^N \boldsymbol{\Psi}^{(m)}\right) {\bf G}_{(n)}^\top\, \right\rangle \, \right). \end{align} \tag{ 63 }$

An alternating update scheme is then used, such each mode n = 1, 2, ..., N is updated conditioned on all other random variables.

6.5. Updating noise precision prior

The toolbox assumes a normal likelihood with either homoscedastic or heteroscedastic noise variance, the latter is only support CP decomposition. For homoscedastic noise, the noise precision τ follows a Gamma distribution, i.e. $P(\tau) = {\cal G}\left(\alpha_\tau, \beta_\tau\right)$ with shape $\alpha_\tau$ and rate $\beta_\tau$ . The resulting Q-distribution is,

$\begin{align} Q(\tau) = {\cal G}\left(\tilde{\alpha}_\tau, \tilde{\beta}_\tau\right) \end{align} \tag{ 64 }$

For fully observed data, the shape parameter is $\tilde{\alpha}_\tau = \alpha_\tau + \frac{1}{2}\prod_{n = 1}^N I_n$ . For CP decomposition the rate is

$\begin{align} \tilde{\beta}_\tau = & \beta_\tau + \frac{1}{2}\left(\mathrm{vec}(\boldsymbol{\mathcal{X}})^\top\mathrm{vec}(\boldsymbol{\mathcal{X}}) + \bigcirc_{n = 1}^N \, \left\langle \,{\bf A}^{(n)^\top}{\bf A}^{(n)}\, \right\rangle \, \right. \nonumber\\ & \left.- 2\, \mathrm{vec}(\boldsymbol{\mathcal{X}})^\top \mathrm{vec}\left(\bigodot_{n = 1}^N \, \left\langle \,{\bf A}^{(n)}\, \right\rangle \,\right) \right) \end{align} \tag{ 65 }$

Tucker decomposition has a similar update, but differ as it has to account for the core array $\boldsymbol{\mathcal{G}}$ which changes the estimated rate,

$\begin{align*} \tilde{\beta}_\tau = & \beta_\tau + \frac{1}{2}\left(\mathrm{vec}(\boldsymbol{\mathcal{X}})^\top\mathrm{vec}(\boldsymbol{\mathcal{X}}) + \, \left\langle \,\mathrm{vec}(\boldsymbol{\mathcal{G}})^\top\left(\bigotimes_{n = 1}^N {\bf A}^{(n)^\top}{\bf A}^{(n)}\right)\mathrm{vec}(\boldsymbol{\mathcal{G}})\, \right\rangle \, \right. \nonumber\\ & \left. - 2 \, \mathrm{vec}(\boldsymbol{\mathcal{X}})^\top \left(\bigotimes_{n = 1}^N \, \left\langle \,{\bf A}^{(n)}\, \right\rangle \,\right)\mathrm{vec}(\boldsymbol{\mathcal{G}}) \right) \end{align*}$

For the CP model with partially observed data, the estimated shape changes to $\tilde{\alpha}_\tau = \alpha_\tau + \frac{1}{2}|\boldsymbol{\mathcal{O}}|$ while rate changes to,

$\begin{align} \tilde{\beta}_\tau = & \beta_\tau + \frac{1}{2}\sum_{(i_1,i_2,\ldots i_N) \in \boldsymbol{\mathcal{O}}}\left({\bf x}_{i_1,i_2,\ldots,i_N}^2 + + \mathrm{trace}\left(\bigcirc_{n = 1}^N \, \left\langle \,{\bf a}_{i_n}^{(n)^\top} {\bf a}_{i_n}^{(n)}\, \right\rangle \,\right)\right. \nonumber\\ & \left.- 2 \, x_{i_1,i_2,\ldots,i_N} \sum_{d = 1}^D \prod_{n = 1}^N \, \left\langle \,a_{i_n,d}^{(n)}\, \right\rangle \,\right) \end{align} \tag{ 66 }$

where $\boldsymbol{\mathcal{O}}$ is the set of present values, $|\boldsymbol{\mathcal{O}}|$ the number of present values, and $(i_1,i_2,\ldots i_N)$ indexes a single element in $\boldsymbol{\mathcal{O}}$ .

CP with Heteroscedastic Noise

Estimating mode-specific heteroscedastic noise assumes a special Kronecker structured covariance, see section 2.3, equation (9). Modelling heteroscedastic noise on mode n is done by defining $P\left(\boldsymbol{\tau}^{(n)}\right) = \prod_{i = 1}^{I_n}{\cal G}\left(\tau_i^{(n)}| \alpha_\tau, \beta_\tau\right)$ . The Q-distribution is then,

$\begin{align} Q(\boldsymbol{\tau}^{(n)}) = \prod_{i = 1}^{I_n} {\cal G}\left( \tau_i^{(n)} | \tilde{\alpha}_{\tau^{(n)}}, \tilde{\beta}_{\tau_i^{(n)}} \right) \end{align} \tag{ 67 }$

The noise precision is then estimated for each heteroscedastic mode using the following equations in an alternating update scheme.

$\begin{align} \tilde{\alpha}_{\tau^{(n)}} = & \alpha_\tau + \frac{1}{2}\prod_{m\neq n} I_m \end{align} \tag{ 68 }$

$\begin{align} \kern-35.5pt\tilde{\boldsymbol{\beta}}_{\boldsymbol{\tau}^{(n)}} & = \beta_\tau + \frac{1}{2}\left( \mathrm{diag}\left({\bf X}_{(n)} \mathrm{diag}\left(\bigodot_{m\neq n} \, \left\langle \,\boldsymbol{\tau}^{(m)}\, \right\rangle \, \right) {\bf X}_{(n)}^\top\right) \nonumber \right.\kern-12.5pt \end{align}$

$\begin{align} \kern32.5pt & + \mathrm{diag}\left(\, \left\langle \,{\bf A}^{(n)^\top}{\bf A}^{(n)}\, \right\rangle \, \circ \underset{m\neq n}{\bigcirc} \, \left\langle \,{\bf A}^{(m)^\top}\mathrm{diag}(\boldsymbol{\tau}^{(m)}){\bf A}^{(m)}\, \right\rangle \, \right) \nonumber\\ & -2\left. \mathrm{diag}\left({\bf X}_{(n)}\left(\bigodot_{m\neq n} \, \left\langle \,\mathrm{diag}(\boldsymbol{\tau}^{(m)})\, \right\rangle \, \, \left\langle \,{\bf A}^{(m)}\, \right\rangle \,\right)\, \left\langle \,{\bf A}^{(n)}\, \right\rangle \,^\top\right) \right)\kern-32.5pt \end{align} \tag{ 69 }$

where $\tilde{\boldsymbol{\beta}}_{\boldsymbol{\tau}^{(n)}}$ is a vector $\mathbb{R}_{>0}^{I_n\times 1}$ of rates.

Acknowledgments

The Probabilistic Tensor Toolbox was developed as part of the PhD thesis of the first author and supervised by both co-authors. The PhD was funded by a scholarship from the Technical University of Denmark, Department of Applied Mathematics and Computer Science.

Dates

Peer review information

2.3.1. Partially observed data

2.3.2. Factor matrices

2.3.3. The core array

2.3.4. Component and Core Precision Matrix

2.3.5. Inferring the Posterior Distribution

3.1.1. Heteroscedastic noise estimation

6.1. Updating factor priors

6.1.1. Prior follows a normal distribution

6.1.2. Prior follows a truncated normal, exponential, or uniform distribution

6.1.3. Prior follows a von Mises-Fisher matrix distribution

CP with Heteroscedastic Noise

The probabilistic tensor decomposition toolbox

Article metrics

Submit

Share this article

Dates

Peer review information

Abstract

1. Introduction

1.1. Bayesian tensor decomposition

1.2. Summary of contribution

2. Methods

2.1. Bayesian tensor decomposition

2.2. Gibbs sampling and VB approximation

2.3. Probabilistic tensor decomposition toolbox

2.3.1. Partially observed data

2.3.2. Factor matrices

2.3.3. The core array

2.3.4. Component and Core Precision Matrix

2.3.5. Inferring the Posterior Distribution

2.4. Prediction on Held-out Slices or Samples

3. Results and discussion

3.1. Component extraction: amino acid and sugar process dataset

3.1.1. Heteroscedastic noise estimation

3.2. Data denoising

3.3. Tensor completion

4. Conclusion

5. Data availability

6. VB and Gibbs based inference

Notation

6.1. Updating factor priors

6.1.1. Prior follows a normal distribution

6.1.2. Prior follows a truncated normal, exponential, or uniform distribution

6.1.3. Prior follows a von Mises-Fisher matrix distribution

6.2. Updating factor precision or rate prior

6.3. Updating tucker factor matrix

6.4. Updating the core prior and its precision prior

6.5. Updating noise precision prior

CP with Heteroscedastic Noise

Acknowledgments