Legendre Decomposition for Tensors

We present a novel nonnegative tensor decomposition method, called Legendre decomposition, which factorizes an input tensor into a multiplicative combination of parameters. Thanks to the well-developed theory of information geometry, the reconstructed tensor is unique and always minimizes the KL divergence from an input tensor. We empirically show that Legendre decomposition can more accurately reconstruct tensors than other nonnegative tensor decomposition methods.


Introduction
Matrix and tensor decomposition is a fundamental technique in machine learning; it is used to analyze data represented in the form of multi-dimensional arrays, and is used in a wide range of applications such as computer vision Terzopoulos, 2002, 2007), recommender systems (Symeonidis, 2016), signal processing (Cichocki et al., 2015), and neuroscience (Beckmann and Smith, 2005). The current standard approaches include nonnegative matrix factorization (NMF; Seung, 1999, 2001) for matrices and CANDECOMP/PARAFAC (CP) decomposition (Harshman, 1970) or Tucker decomposition (Tucker, 1966) for tensors. CP decomposition compresses an input tensor into a sum of rank-one components, and Tucker decomposition approximates an input tensor by a core tensor multiplied by matrices. To date, matrix and tensor decomposition has been extensively analyzed, and there are a number of variations of such decomposition (Kolda and Bader, 2009), where the common goal is to approximate a given tensor by a smaller number of components, or parameters, in an efficient manner.
However, despite the recent advances of decomposition techniques, a learning theory that can systematically define decomposition for any order tensors including vectors and matrices is still under development. Moreover, it is well known that CP and Tucker tensor decomposition include non-convex optimization and that the global convergence is not guaranteed. Although there are a number of extensions to transform the problem into a convex problem (Liu et al., 2013;Tomioka and Suzuki, 2013), one needs additional assumptions on data, such as a bounded variance.
Here we present a new paradigm of matrix and tensor decomposition, called Legendre decomposition, based on information geometry (Amari, 2016) to solve the above open problems of matrix and tensor decomposition. In our formulation, an arbitrary order tensor is treated as a discrete probability distribution in a statistical manifold as long as it is nonnegative, and Legendre decomposition is realized as a projection of the input tensor onto a submanifold composed of reconstructable tensors.
The key to introducing the formulation is to use the partial order (Davey and Priestley, 2002;Gierz et al., 2003) of indices, which allows us to treat any order tensors as a probability distribution in the information geometric framework.
Legendre decomposition has the following remarkable properties: It always finds the unique tensor that minimizes the Kullback-Leibler (KL) divergence from an input tensor. This is because Legendre decomposition is formulated as convex optimization, and hence we can directly use gradient descent, which always guarantees the global convergence, and the optimization can be further speeded up by using a natural gradient (Amari, 1998)  Furthermore, our formulation has a close relationship with statistical models, and can be interpreted as an extension of the learning of Boltzmann machines (Ackley et al., 1985). This interpretation gives new insight into the relationship between tensor decomposition and graphical models (Chen et al., 2018;Yılmaz et al., 2011;Yılmaz and Cemgil, 2012) as well as the relationship between tensor decomposition and energy-based models (LeCun et al., 2007). In addition, we show that the proposed formulation belongs to the exponential family, where the parameter θ used in our decomposition is the natural parameter, and η, used to obtain the gradient of θ, is the expectation of the exponential family.
The remainder of this paper is organized as follows. We introduce Legendre decomposition in Section 2. We define the decomposition in Section 2.1, formulate it as convex optimization in Section 2.2, describe algorithms in Section 2.3, and discuss the relationship with other statistical models in Section 2.4. We empirically examine performance of our method in Section 3, and summarize our contribution in Section 4.

The Legendre Decomposition
We introduce Legendre decomposition for tensors. We begin with a nonnegative Nth-order tensor X ∈ R I 1 ×I 2 ×···×I N ≥0 . To simplify the notation, we write the entry To treat X as a discrete probability mass function in our formulation, we normalize X by dividing each entry by the sum of all entries, yielding P = X/ v x v . In the following, we always work with a normalized tensor P.

Definition
Legendre decomposition always finds the best approximation of a given tensor P. Our strategy is to choose an index set B ⊆ [I 1 ] × [I 2 ] × · · · × [I N ] as a decomposition basis, where we assume (1, 1, . . . , 1) B for a technical reason, and approximate the normalized tensor P by a multiplicative combination of parameters associated with B.
We define a tensor Q ∈ R I 1 ×I 2 ×···×I N ≥0 as fully decomposable with B ⊆ Ω + if each entry q v ∈ Ω is represented in the form of using |B| parameters (θ v ) v ∈B with θ v ∈ R and the normalizer ψ(θ) ∈ R, which is always uniquely determined from the parameters (θ v ) v ∈B as This normalization does not have any effect on the decomposition performance, but rather it is needed to formulate our decomposition as an information geometric projection, as shown in the next subsection. There are two extreme cases for a choice of a basis B: If B = ∅, a fully decomposable Q is always uniform; that is, q v = 1/|Ω| for all v ∈ Ω. In contrast, if B = Ω + , any input P itself becomes decomposable.
We now define Legendre decomposition as follows: Given an input tensor P ∈ R I 1 ×I 2 ×···×I N

≥0
, the sample space Ω ⊆ [I 1 ] × [I 2 ] × · · · × [I N ], and a parameter basis B ⊆ Ω + , Legendre decomposition finds the fully decomposable tensor Q ∈ R I 1 ×I 2 ×···×I N ≥0 with a B that minimizes the Kullback-Leibler Figure 1[a]). In the next subsection, we introduce an additional parameter (η v ) v ∈B and show that this decomposition is always possible via the dual parameters ( (θ v ) v ∈B , (η v ) v ∈B ) with information geometric analysis. Since θ and η are connected via Legendre transformation, we call our method Legendre decomposition.
Legendre decomposition for second-order tensors (that is, matrices) can be viewed as a low rank approximation not of an input matrix P but of its entry-wise logarithm log P. This is why the rank of log Q with the fully decomposable matrix Q coincides with the parameter matrix , and t v = 0 otherwise. Thus we fill zeros into entries in Ω + \ B. Then we have log q v = u ∈↓v t u , meaning that the rank of log Q coincides with the rank of T . Therefore if we use a decomposition basis B that includes only l rows (or columns), rank(log Q) ≤ l always holds.

Optimization
We solve the Legendre decomposition by formulating it as a convex optimization problem. Let us assume that B = Ω + = Ω \ {(1, 1, . . . , 1)}, which means that any tensor is fully decomposable. Our definition in Equation (1) can be re-written as with −ψ(θ) = θ (1,1,...,1) , and the sample space Ω is a poset (partially ordered set) with respect to the partial order "≤" with the least element ⊥ = (1, 1, . . . , 1). Therefore our model belongs to the log-linear model on posets introduced by Sugiyama et al. (2016Sugiyama et al. ( , 2017, which is an extension of the information geometric hierarchical log-linear model (Amari, 2001;Nakahara and Amari, 2002). Each entry q v and the parameters (θ v ) v ∈Ω + in Equation (2) directly correspond to those in Equation (8) in Sugiyama et al. (2017). According to Theorem 2 in Sugiyama et al. (2017) for each v ∈ Ω + (see Figure 1 is always a dual coordinate system of the set of normalized tensors S = {P | 0 < p v < 1 and v ∈Ω p v = 1} with respect to the sample space Ω, as they are connected via Legendre transformation. Hence S becomes a dually flat manifold (Amari, 2009).
Here we formulate Legendre decomposition as a projection of a tensor onto a submanifold. Suppose that B ⊆ Ω + and let S B be the submanifold such that which is the set of fully decomposable tensors with B and is an e-flat submanifold as it has constraints on the θ coordinate (Amari, 2016, Chapter 2.4). Furthermore, we introduce another submanifold S P for a tensor P ∈ S and A ⊆ Ω + such that whereη v is given by Equation (3) by replacing q u with p u , which is an m-flat submanifold as it has constraints on the η coordinate.
The dually flat structure of S with the dual coordinate systems is always a singleton; that is, the tensor Q such that Q ∈ S B ∩ S P always uniquely exists, and Q is the minimizer of the KL divergence from P (Amari, 2009, Theorem 3): The transition from P to Q is called the m-projection of P onto S B , and Legendre decomposition coincides with m-projection ( Figure 2). In contrast, if some fully decomposable tensor R ∈ S B is given, finding the intersection Q ∈ S B ∩ S P is called the e-projection of R onto S P . In practice, we use e-projection because the number of parameters to be optimized is |B| in e-projection while it is |Ω \ B| in m-projection, and |B| ≤ |Ω \ B| usually holds.
The e-projection is always convex optimization as the e-flat submanifold S B is convex with respect to (θ v ) v ∈Ω + . More precisely, where H(P) is the entropy of P and independent of (θ

Algorithm
Here we present our two gradient-based optimization algorithms to solve the KL divergence minimization problem in Equation (4). Since the KL divergence D KL (P, Q) is convex with respect to each θ v , the standard gradient descent shown in Algorithm 1 can always find the global optimum, where ε > 0 is a learning rate. Starting with some initial parameter set (θ v ) v ∈B , the algorithm iteratively updates the set until convergence. The gradient of θ w for each w ∈ B is obtained as Algorithm 1: Legendre decomposition by gradient descent Algorithm 2: Legendre decomposition by natural gradient Compute the inverse G −1 of the Fisher information matrix G using Equation (5); where the last equation uses the fact that ∂ψ(θ)/∂θ w = η w in Theorem 2 in Sugiyama et al. (2017). This equation also shows that the KL divergence Thus the total complexity is O(h|Ω||B| 2 ) with the number of iterations h until convergence.
Although gradient descent is an efficient approach, in Legendre decomposition, we need to repeat "decoding" from (θ v ) v ∈B and "encoding" to (η v ) v ∈B in each iteration, which may lead to a loss of efficiency if the number of iterations is large. To reduce the number of iterations to gain efficiency, we propose to use a natural gradient (Amari, 1998), which is the second-order optimization method, shown in Algorithm 2. Again, since the KL divergence D KL (P, Q) is convex with respect to (θ v ) v ∈B , a natural gradient can always find the global optimum. More precisely, our natural gradient algorithm is an instance of the Bregman algorithm applied to a convex region, which is well known to always converge to the global solution (Censor and Lent, 1981). Let B = {v 1 , v 2 , . . . , v |B | }, θ = (θ v 1 , . . . , θ v |B| ) T , and η = (η v 1 , . . . , η v |B| ) T . In each update of the current θ to θ next , the natural gradient method uses the relationship, ∆η = −G∆θ, ∆η = η −η and ∆θ = θ next − θ, which leads to the update formula as given in Theorem 3 in Sugiyama et al. (2017). Note that natural gradient coincides with Newton's method in our case as the Fisher information matrix G corresponds to the (negative) Hessian matrix:

Relationship to Statistical Models
We demonstrate interesting relationships between Legendre decomposition and statistical models, including the exponential family and the Boltzmann (Gibbs) distributions, and show that our decomposition method can be viewed as a generalization of Boltzmann machine learning (Ackley et al., 1985). Although the connection between tensor decomposition and graphical models has been analyzed by Chen et al. (2018); Yılmaz et al. (2011); Yılmaz and Cemgil (2012), our analysis adds a new insight as we focus on not the graphical model itself but the sample space of distributions generated by the model.

Exponential family
We show that the set of normalized tensors S = {P ∈ R I 1 ×I 2 ×···×I N

>0
| v ∈Ω p v = 1} is included in the exponential family. The exponential family is defined as for natural parameters θ. Since our model in Equation (1) can be written as with θ u = 0 for u ∈ Ω + \ B, it is clearly in the exponential family, where ζ and ψ(θ) correspond to k and C(θ), respectively, and r(x) = 0. Thus, the (θ v ) v ∈B used in Legendre decomposition are interpreted as natural parameters of the exponential family. Moreover, we can obtain (η v ) v ∈B by taking the expectation of ζ(u, v): Thus Legendre decomposition of P is understood to find a fully decomposable Q that has the same expectation with P with respect to a basis B.

Boltzmann Machines
A Boltzmann machine is represented as an undirected graph G = (V, E) with a vertex set V and an edge set E ⊆ V × V, where we assume that V = [N] = {1, 2, . . . , N } without loss of generality. This V is the set of indices of N binary variables. A Boltzmann machine G defines a probability distribution P, where each probability of an N-dimensional binary vector x ∈ {0, 1} N is given as where θ i is a bias, θ i j is a weight, and Z(θ) is a partition function.
Therefore our Legendre decomposition is a generalization of Boltzmann machine learning in the following three aspects: 1. The domain is not limited to binary but can be ordinal; that is, {0, 1} N is extended to [I 1 ] × [I 2 ] × · · · × [I N ] for any I 1 , I 2 , . . . , I N ∈ N. 2. The basis B with which parameters θ are associated is not limited to B(V) ∪ B(E) but can be any subset of [I 1 ] × · · · × [I N ], meaning that higher-order interactions (Sejnowski, 1986) can be included. 3. The sample space of probability distributions is not limited to {0, 1} N but can be any subset of , which enables us to perform efficient computation by removing unnecessary entries such as missing values.
Hidden variables are often used in Boltzmann machines to increase the representation power, such as in restricted Boltzmann machines (RBMs; Smolensky, 1986;Hinton, 2002) and deep Boltzmann machines (DBMs; Salakhutdinov andHinton, 2009, 2012). In Legendre decomposition, including a hidden variable corresponds to including an additional dimension. Hence if we include H hidden variables, the fully decomposable tensor Q has the order of N + H. This is an interesting extension to our method and an ongoing research topic, but it is not a focus of this paper since our main aim is to find a lower dimensional representation of a given tensor P.
In the learning process of Boltzmann machines, approximation techniques of the partition function Z(θ) are usually required, such as annealed importance sampling (AIS; Salakhutdinov and Murray, 2008) or variational techniques (Salakhutdinov, 2008). This requirement is because the exact computation of the partition function requires the summation over all probabilities of the sample space Ω, which is always fixed to 2 V with the set V of variables in the learning process of Boltzmann machines, and which is not tractable. Our method does not require such techniques as Ω is a subset of indices of an input tensor and the partition function can always be directly computed.

Experiments
We empirically examine the efficiency and the effectiveness of Legendre decomposition using synthetic and real-world datasets. We used Amazon Linux AMI release 2018.03 and ran all experiments on 2.3 GHz Intel Xeon CPU E5-2686 v4 with 256 GB of memory. The Legendre decomposition was implemented in C++ and compiled with icpc 18.0.0 1.
Throughout the experiments, we focused on the decomposition of third-order tensors and used three types of decomposition bases in the form of being the set indices for the top l elements of the i 3 th frontal slice in terms of probability. In these bases, B 1 works as a normalizer for each mode, B 2 works as a normalizer for rows and columns of each slice, and B 3 highlights entries with high probabilities. We always assume that (1, . . . , 1) is not included in the above bases. The cardinality of a basis corresponds 1Implementation is available at: https://github.com/mahito-sugiyama/Legendre-decomposition to the number of parameters used in the decomposition. We used l to vary the number of parameters for decomposition in our experiments.
To examine the efficiency and the effectiveness of tensor decomposition, we compared Legendre decomposition with two standard nonnegative tensor decomposition techniques, nonnegative Tucker decomposition (Kim and Choi, 2007) and nonnegative CANDECOMP/PARAFAC (CP) decomposition (Shashua and Hazan, 2005). Since both of these methods are based on least square objective functions (Lee and Seung, 1999), we also included a variant of CP decomposition, CP-Alternating Poisson Regression (CP-APR; Chi and Kolda, 2012), which uses the KL-divergence for its objective function. We used the TensorLy implementation (Kossaifi et al., 2016) for the nonnegative Tucker and CP decompositions and the tensor toolbox (Bader et al., 2017;Bader and Kolda, 2007) for CP-APR. In the nonnegative Tucker decomposition, we always employed rank-(m, m, m) Tucker decomposition with the single number m, and we use rank-n decomposition in the nonnegative CP decomposition and CP-APR. Thus rank-(m, m, m) Tucker decomposition has (I 1 + I 2 + I 3 )m + m 3 parameters, and rank-n CP decomposition and CP-APR have (I 1 + I 2 + I 3 )n parameters.

Results on Synthetic Data
First we compared our two algorithms, the gradient descent in Algorithm 1 and the natural gradient in Algorithm 2, to evaluate the efficiency of these optimization algorithms. We randomly generated a third-order tensor with the size 20 × 20 × 20 from the uniform distribution and obtained the running time and the number of iterations. We set B = B 3 (l) and varied the number of parameters |B| with increasing l. In Algorithm 2, we used the outer loop (from line 3 to 8) as one iteration for fair comparison and fixed the learning rate ε = 0.1.
Results are plotted in Figure 4(a, b) that clearly show that the natural gradient is dramatically faster than gradient descent. When the number of parameters is around 400, the natural gradient is more than six orders of magnitude faster than gradient descent. The increased speed comes from the reduction of iterations. The natural gradient requires only two or three iterations until convergence in all cases, while gradient descent requires more than 10 5 iterations to get the same result. In the following, we consistently use the natural gradient for Legendre decomposition.
Next we examined the scalability compared to other tensor decomposition methods. We used the same synthetic datasets and increased the tensor size from 20 × 20 × 20 to 500 × 500 × 500. Results are plotted in Figure 4(c). Legendre decomposition is slower than Tucker and CP decompositions if the tensors get larger, while the plots show that the running time of Legendre decomposition is linear with the tensor size ( Figure 5). Moreover, Legendre decomposition is faster than CP-APR if the tensor size is not large.

Results on Real Data
Next we demonstrate the effectiveness of Legendre decomposition on real-world datasets of third-order tensors. We evaluated the quality of decomposition by the root mean squared error (RMSE) between the input and the reconstructed tensors. We also examined the scalability of our method in terms of the number of parameters.
First we examine Legendre decomposition and three competing methods on the face image dataset2. We picked up the first entry from the fourth mode (corresponds to lighting) from the dataset and the 2This dataset is originally distributed at http://www.cl.cam.ac.uk/research/dtg/attarchive/ facedatabase.html and also available from the R rTensor package (https://CRAN.R-project.org/ package=rTensor). first 20 faces from the third mode, resulting in a single third-order tensor with a size of 92 × 112 × 20, where the first two modes correspond to image pixels and the third mode to individuals. We set decomposition bases B as B = B 1 ∪ B 2 (l) ∪ B 3 (l). For every decomposition method, we gradually increased l, m, and n and checked the performance in terms of RMSE and running time.
Results are plotted in Figure 5(a, b). In terms of RMSE, Legendre decomposition is superior to the other methods if the number of parameters is small (up to 2,000), and it is competitive with nonnegative CP decomposition and inferior to CP-APR for a larger number of parameters. The reason is that Legendre decomposition uses the information of the index order that is based on the structure of the face images (pixels); that is, rows or columns cannot be replaced with each other in the data. In terms of running time, it is slower than Tucker and CP decompositions as the number of parameters increases, while it is still faster than CP-APR.
Next we used the MNIST dataset (LeCun et al., 1998), which consists of a collection of images of handwritten digits and has been used as the standard benchmark datasets in a number of recent studies such as deep learning. We picked up the first 500 images for each digit, resulting in 10 third-order tensors with the size of 28 × 28 × 500, where the first two modes correspond to image pixels. In Legendre decomposition, the decomposition basis B was simply set as B = B 3 (l) and removed zero entries from Ω. Again, for every decomposition method, we varied the number of parameters by increasing l, m, and n and evaluated the performance in terms of RMSE.
Means ± standard error of the mean (SEM) across all digits from "0" to "9" are plotted in Figure 5(c,  d). Results for all digits are presented in the supplementary material. Legendre decomposition clearly shows the smallest RMSE, and the difference is larger if the number of parameters is smaller. The reason is that Legendre decomposition ignores zero entries and decomposes only nonzero entries, while such decomposition is not possible for other methods. Running time shows the same trend as that of the face dataset; that is, Legendre decomposition is slower than other methods when the number of parameters increases.

Conclusion
In this paper, we have proposed Legendre decomposition, which incorporates tensor structure into information geometry. A given tensor is converted into the dual parameters (θ, η) connected via the Legendre transformation, and the optimization is performed in the parameter space instead of directly treating the tensors. We have theoretically shown the desired properties of the Legendre decomposition, namely, that its results are well-defined, unique, and globally optimized, in that it always finds the decomposable tensor that minimizes the KL divergence from the input tensor. We have also shown the connection between Legendre decomposition and Boltzmann machine learning.
We have experimentally shown that Legendre decomposition can more accurately reconstruct input tensors than three standard tensor decomposition methods (nonnegative Tucker decomposition, nonnegative CP decomposition, and CP-APR) using the same number of parameters. Since the shape of the decomposition basis B is arbitrary, Legendre decomposition has the potential to achieve even more-accurate decomposition. For example, one can incorporate the domain knowledge into the set B such that specific entries of the input tensor are known to dominate the other entries.
Our work opens the door to both further theoretical investigation of information geometric algorithms for tensor analysis and a number of practical applications such as missing value imputation. Figure 6: Experimental results on each digit from 0 to 9 in MNIST. Legendre decomposition is shown in red circles, nonnegative Tucker decomposition in blue triangles, nonnegative CP decomposition in green diamonds, and CP-APR in brown crosses.