Interaction Decompositions for Tensor Network Regression

It is well known that tensor network regression models operate on an exponentially large feature space, but questions remain as to how effectively they are able to utilize this space. Using a polynomial featurization, we propose the interaction decomposition as a tool that can assess the relative importance of different regressors as a function of their polynomial degree. We apply this decomposition to tensor ring and tree tensor network models trained on the MNIST and Fashion MNIST datasets, and find that up to 75% of interaction degrees are contributing meaningfully to these models. We also introduce a new type of tensor network model that is explicitly trained on only a small subset of interaction degrees, and find that these models are able to match or even outperform the full models using only a fraction of the exponential feature space. This suggests that standard tensor network models utilize their polynomial regressors in an inefficient manner, with the lower degree terms being vastly under-utilized.


Introduction
Tensor network regression has emerged as a promising and active area of machine learning research, having achieved impressive results on common benchmark tasks such as the Movie 100K [1], MNIST [2] [3][4] [5], and Fashion MNIST [3][4] [5] datasets. The effectiveness of these models can be attributed to the tensor-product transformation that is applied to the data features, which maps the original feature vector into an exponentially large vector space. By performing linear operations on this expanded feature space, tensor network models are able to generate regression outputs that are highly non-linear functions of the original features.
In most tensor network models, the tensor-product transformation is constructed from a set of vector-valued functions that each act on only a single data feature. The form of these functions is important to the operation of the model, as it determines how regression on the transformed space is related to regression on the original feature space. Conventional wisdom regarding the choice of these functions can be traced back to the parallel works of Stoudenmire and Schwab [2] and Novikov et al. [1], who each proposed a different transformation scheme. The method from [2] was inspired by techniques in quantum many-body physics, and mapped each feature x ∈ [0, 1] into the L2-normalized vector [cos( π 2 x), sin( π 2 x)]. The approach in [1], by contrast, was motivated by a desire to characterize interactions within categorical (discrete) data, and therefore had each feature mapped to the vector [1, x]. The advantage of this latter mapping is that every element of the transformed feature space is a product of some subset of the original features, which makes the resulting regression output easier to interpret.
The purpose of this work is to quantitatively assess how well tensor network models are able to utilize the exponential feature space induced by their tensor-product transformations. We shall focus specifically on models which are built upon the [1, x] featurization from Novikov et al. [1], since this allows us to easily interpret different regions of the transformed space in terms of interactions (products) between the original features. To this end, we introduce the interaction decomposition of a tensor network model, which casts the regression output as the sum of terms which each contain all feature products of a fixed degree. Here the degree of an interaction is defined as the number of features that are multiplied together, such that, e.g., interactions of degree three take the form x 1 x 2 x 3 . By applying this decomposition to tensor network models that were trained on a given machine learning task, we can determine the importance of each interaction degree to the final output of the model. Furthermore, by implementing new models that regress on only a subset of degrees, we can assess whether the tensor network models are under-utilizing those interactions.
The remainder of this paper has the following structure. Sec. 2 provides an overview of tensor network regression, starting with a review of tensor operations and ending with a description of the tensor ring and tree tensor network architectures that we used for our tests. In Sec. 3, we describe the motivation and mechanics of the interaction decomposition, and then apply it to tensor network classifiers trained on the MNIST and Fashion MNIST datasets. From these tests, we find that some models utilize up to three-quarters of all interaction degrees generated by the tensor-product transformation, which collectively contain roughly 10 19 different regressors. However, we also determine that the tensor network classifiers are vastly under-utilizing the lower-degree interactions, since separate models trained using only interactions less than, e.g., sixth degree are able to achieve classifications accuracies very near those of the full regression models. We discuss the implications of these results and directions for future work in Sec. 4. The Appendix contains technical details about the procedure used to carry out the interaction decompositions, as a well as a tabulation of important numerical results.

Tensor Overview
Throughout this work, we consider machine learning models that are constructed using tensors [6] [7]. For our purposes, a tensor is simply a multidimensional array of numbers, such that each number is indexed by a non-negative integer along every dimension. The order of a tensor is equal to the number of dimensions that it has, or equivalently the number of integers needed to specify one of its elements. From this perspective, vectors and matrices can be viewed as first-order and second-order tensors respectively. We will denote tensors with order greater than one using uppercase letters (A, B, C, ...), while vectors will be denoted using a lower case letter under an arrow ( a, b, c, ...). Elements of a tensor are specified using subscripts, so that an element of the third-order tensor A is given by A ijk , where i, j, k are non-negative integers (all dimensions are indexed starting from zero). When referring to elements of a vector, the arrow symbol is dropped. To specify the ith member of a set of tensors, we use a superscript with parentheses, e.g., A (i) .
There are a wide variety of operations that can be defined between tensors, but in this work we focus primarily on the tensor product and the tensor contraction. The tensor product C = A ⊗ B constructs a new tensor C from every pairwise product between elements of tensor A and elements of tensor B, such that each element of C is given by and the order of C is the sum of the orders of A and B. The tensor contraction between A and B is similar to the corresponding tensor product, except that it generates a new tensor C by taking elements of A ⊗ B and summing them along a set of specified dimensions. As an example, if A and B are both third order, then a contraction between the second dimension of A and the third dimension of B is written as Note that C is fourth order rather than sixth order, since two of the dimensions of A ⊗ B were summed together.

Tensor Networks
Although we have described the tensor product and the tensor contraction as operations which construct a new tensor C from existing tensors A and B, it is equally valid to take the reverse view and interpret Eqs. (1, 2) as representing an existing tensor C in terms of new components A and B. This approach is the foundation of the tensor network representation, in which a tensor of interest is decomposed into a set of contractions between component tensors [8] [9]. The number and order of these component tensors are ultimately arbitrary, but most networks are constructed such that the component orders are all significantly smaller than the order of the original tensor. Since the number of elements in a tensor scales exponentially with its order (for a fixed index size), these component tensors will generally contain far fewer elements than the original tensor. This can allow for operations to be performed on the network that would have been computationally intractable on the original, higher-order tensor.
As a very simple example of a tensor network, consider the sixth-order tensor C that is represented by the contraction of two fourth-order tensors A and B, such that the elements of C are given by Assuming that each index in Eq. (3) is of size t > 2, it is clear that tensor C has significantly more elements (t 6 ) than are contained in A and B combined (2t 4 ). Using this representation, operations on C which are localized to indices {i, j, k} or {l, m, n} can instead be performed on A or B respectively, dramatically reducing their computational cost. Further inspection of Eq. (3) shows that the contraction of A and B can yield a combined tensor that has rank at most t across dimension r, which is far smaller than the largest possible rank of t 3 . This implies that most sixthorder tensors can only be approximately represented by the contraction of A and B, with the quality of the approximation depending on both the underlying rank structure of C and the size constraints placed on A and B. Similar trade-offs occur across all tensor networks, with different contraction structures being better suited to represent different higher-order tensors [10] [11].
Due to its simplicity, the network formed from A and B in Eq. (3) can be clearly conveyed by simply listing out all of the indices explicitly. However, many tensor networks involve contractions Figure 1: Tensor network representing fifth-order tensor H, depicted using explicit summations (top) and a tensor diagram (bottom). The network is generated by four different contractions, each of which is indicated in the diagram by a shared leg. While it is possible to discern the contraction pattern of the component tensors by studying the index notation, the diagram makes it obvious at a glance.
between dozens or even hundreds of tensors, each with their own set of indices. For notational clarity, it is common to use tensor diagrams (also referred to as Penrose notation [12]) to represent the sets of contractions within more complicated tensor networks. In these diagrams, each tensor is denoted using a geometric shape, while each index is represented by a line or leg protruding outward from the shape. A tensor product is implied by placing two tensors next to one another, and a contraction is indicated by having those tensors share one or more legs. Figure 1 shows the relative simplicity of the diagram notation versus explicit summations when several tensors are involved in the network. We utilize tensor diagrams, along with explicit index expressions, throughout the remainder of this paper to help illustrate the relevant tensor operations.

Regression with Tensors
Our work focuses on the use of tensor networks as a means of performing regression [13]. In a regression task, the goal is to learn (or estimate) the relationship between a set of m independent variables {x i } m−1 i=0 called features and a set of n dependent variables {y i } n−1 i=0 called labels. We denote a joint sample of these m + n variables as ( x, y), where x ∈ R m is a vector containing the values of the features and y ∈ R n is a vector containing the values of the labels. In the case of parametric regression, the relationship between x and y is modeled by a function f such that where W is a set of learned parameters which determines the behavior of the function. We do not expect the relationship in Eq. (4) to be exact for any intuitive function f , except when the data is generated artificially. Indeed, an exact reconstruction will generally be undesirable for real-world data, since the labels are likely to contain noise that should not be directly copied into the model. Once f is learned, it can be used to make predictions about the value of y for an unlabeled sample x.
Tensor network regression can be understood as a specific form of tensor regression [14], in which a large tensor of regression coefficients is generated by a network of smaller component tensors. In tensor regression, the function f of Eq. (4) is expressed as the contraction of a data tensor X( x), which is a function of the features in a given sample, and a weight tensor W whose elements make up the set of parameters W. The data tensor can, in principle, take on any form, but it is usually constructed from the tensor product of a set of m vector-valued functions { h (i) } m−1 i=0 that each take as input a single feature: where x i is the ith element of x and thus the ith feature out of m. Note that the sequence of tensor products in Eq. (5) is similar to a tensor network, insofar as it expresses a higher-order tensor X using a set of order-one components which collectively contain exponentially fewer elements. The regression output f ( x; W) is computed by contracting the weight tensor W with X: where W contains an additional dimension k that indexes the output vector of the model. This form of regression is quite distinct from the standard deep learning paradigm, in that the transformation of the data is effectively set in advance here via the data tensor featurization X( x). All that is then left to optimize are the coefficients W ki0...im that should be assigned to each of the new regressors. The same delineation cannot in general be made for deep learning models, since they are formed from a composition of non-linear functions that has no clear relation to any series expansion.
The precise role that the original features {x i } m−1 i=0 play in Eq. (6) depends on the form of X and therefore on the set of functions { h (i) } m−1 i=0 that were chosen. For our work we follow Novikov et al. and use functions of the form which have been used in other implementations of tensor network regression [4][15] [16]. When Eq. (7) is used to construct X, the regression function f ( x; W) from Eq. (6) becomes where 0 0 = 1 is assumed. Eq. (8) shows that tensor regression, when using the definition of h (i) (x i ) from Eq. (7), is equivalent to linear regression on all possible products formed between the original features, plus a bias term when all of the indices are zero. Regression on the elements of X can therefore generate functions f which have non-zero mixed derivatives with respect to the original features. For example, These non-zero derivatives make f significantly more expressive than functions generated by linear regression directly on the original features in x, for which all mixed derivatives must vanish.

Regression using Tensor Rings and Tree Tensor Networks
Although straightforward mathematically, the approach to tensor regression outlined in Sec. 2 m parameters in W , which is exponential in the number of features. Given that many regression tasks involve data with hundreds or even thousands of features, this method of parameterization cannot be used. Tensor network regression offers an alternative approach, in which the weight tensor is decomposed into a set of low-order component tensors. The parameters W of the regression model are then taken to be the elements of these tensors, rather than the elements of W directly. This decomposition can be illustrated using Figure 1, in which H serves as the weight tensor while the components D, E, F , and G contain the actual parameters of the model. Since the data tensor X , it is possible for the contraction in Eq. (6) to be carried out using only the components of W and X.
There exist a wide variety of tensor network architectures which can be used for regression. Our work here will focus on two of the more popular designs: tensor rings [17] [18] and tree tensor networks [19] [20]. The structures of these networks are illustrated using tensor diagrams in Figure 2, and are described in the following subsections.

Tensor Rings
The tensor ring (TR) is a popular type of tensor network, having been utilized for neural network compression [21] [22] and image reconstruction [23][24] [25]. As depicted in Figure 2, a TR is constructed from a 1-D sequence of third-order tensors contracted along a set of virtual indices that link neighboring tensors together. The "ring" part of a TR references the fact that the tensors at the beginning and end of the sequence are also contracted together, forming a closed loop. In addition to its two virtual indices, each component tensor also has a physical index, which becomes an index of the higher-order tensor after contraction of the virtual indices. When a TR is used for regression, the physical indices are contracted with the { h (i) (x i )} m−1 i=0 from the data tensor, except for one tensor whose physical index is left uncontracted to serve as the index of f . The TR is closely related to the matrix product state (MPS), which is used heavily in quantum many-body physics [26] and also frequently utilized for tensor network regression [1][2] [15]. The structure of an MPS network, also referred to as a tensor train decomposition [27], is identical to that of a TR, except that the tensors at the ends of the chain are second-order and thus not contracted together. In this work we use TRs rather than MPSs due to the greater symmetry of the former, which allows us to employ simpler contraction algorithms.
For tensor network regression to be practical, the sequence of contractions between components of X and components of the network must be carried out such that all of the intermediate tensors are low-order. For a TR, this can be easily achieved by performing the contractions in two stages, i=0 first being contracted with their corresponding component tensors in the TR along the physical index, which produces a new set of second-order tensors. These matrices are then contracted together, along with the third-order tensor containing the regression output, using the virtual indices. For m = 4, the diagrams of these steps are given by where the legs colored in red are those that are contracted from one step to the next. When using the functions from Eq. (7), the total number of parameters in a TR regression model is 2(m + 1)r 2 , where r is the size of the virtual indices. Since the number of features m is generally fixed for a given regression task (after preprocessing), r serves as the primary tunable parameter in the model, with larger values of r placing fewer restrictions on the elements of W . If r is allowed to grow exponentially with m, then the TR can represent an arbitrary weight tensor W , although this generally defeats the purpose of using a tensor network. In practice, r is typically capped at between 10 -100 for regression.

Tree Tensor Networks
A common alternative to the TR/MPS network is the tree tensor network (TTN), in which the component tensors are arranged in a (typically binary) tree pattern. TTNs have been used for quantum simulation [19][28], efficient tensor representation [29] (where it is known as the hierarchical Tucker decomposition), and for regression [30] [4]. An example TTN is depicted in Figure 2, showing that each component tensor has both a horizontal and vertical position in the network. Similar to a TR, a TTN contains both virtual and physical indices, but only the lowest layer of component tensors are contracted directly with the data tensor X. While the structure of a TR can be applied easily to any value of m, a binary TTN works most efficiently when m = 2 l , where l is the number of layers in the tree. Although these networks can be modified to handle other values of m, our work here will only consider regression tasks where the number of features is a power of two.
When used for tensor regression, the components in the bottom layer of the TTN are first contracted with the { h (i) (x i )} m−1 i=0 of X, which generates a new layer of m 2 first-order tensors. This is shown in the second diagram of Eq. (11), where the new first-order tensors are depicted as dark blocks. These tensors are then contracted with the second layer of the tree, generating m 4 firstorder tensors. This process repeats layer-by-layer until the regression output f ( x) is generated by contracting the top tensor, which is shown in the third diagram of Eq. (11). The intermediate tensors created at each layer are always first-order, which ensures that the procedure will be computationally tractable. For m = 4, the tensor diagrams for the contractions are given by with the number of tensors being approximately halved after each layer is contracted. As in Eq. (10), the legs shown in red are contracted between each step. When using h (i) (x i ) of the form in Eq. (7), the number of parameters in a TTN is 2mr + ( m 2 − 1)r 3 , which scales as O(r 3 ) in contrast with the O(r 2 ) scaling of a TR. As a result, the size r of the virtual index is typically chosen to be on the order of 10. As with a TR, a TTN can represent an arbitrary weight tensor W if r is allowed to scale exponentially with m.

Motivation
Throughout our discussion of tensor network regression in Sec. 2.2, the weight tensor W and data tensor X were treated principally as abstract objects, in that they were only operated on numerically via their component tensors. This was necessary on practical grounds, since the exponential scaling of both W and X makes it virtually impossible to perform operations on either tensor when the data has even a modest number of features m. That said, there is an obvious mathematical clarity that comes from working directly with W and X via the decomposition of Eq. (8), since the elements of X are simply products of the original features while the elements of W are the corresponding linear regression coefficients. If, for example, we wished to perform regression using only a specific portion of the feature products, then we could just set the elements of W for all other feature products to zero and learn the remaining parameters as usual. Such a straightforward modification is generally not possible when representing the weight tensor as a tensor network, since each element of W is a complicated function of all of the parameters in the model.
In this section we introduce the interaction decomposition of a tensor network, with the aim of recovering some of the fine-tuned control and interpretability that comes from an element-wise representation of the weight tensor W . In an interaction decomposition, the terms of the sum in Eq. (8) are grouped together by the number of features included in their product, for a total of m + 1 groupings. The number of features in a given product is labeled its interaction degree, such that x 1 has degree 1, x 1 x 2 has degree 2, and so on, with the bias having degree 0. Under an interaction decomposition, the regression output f ( x; W) is written as where d (j) ( x; W) is the contribution to the regression output from all terms of degree j. As with f ( x; W), these contributions are functions of both the original features x and the parameters W of the decomposed network. We discuss ways of interpreting this decomposition in terms of vector subspaces in Sec. 3.2.
Using Eq. (12), the relative importance of the jth interaction degree can be assessed by analyzing the average magnitude of d (j) ( x; W), as well as its effect on the regression output. We carry out this analysis on TR and TTN models in Sec. 3.3. Furthermore, by choosing to keep only a specific subset D of the decomposition terms in Eq. (12), it is possible to construct a new type of regression model which we call a D-degree tensor network. These networks utilize the same parameterization scheme for W as normal tensor network models of the same architecture, but are restricted to generating only the feature products of degrees contained in D. By comparing the performance of a full tensor network with that of a D-degree version of the network, we can quantify how effectively the standard network is able to utilize interaction degrees within D. We introduce these models and perform numerical tests on them in Sec. 3.4.

Interaction Subspaces
In the context of vector spaces, the weight tensor W acts as a parameterized multilinear map between the data tensor space X, which we take to be R 2 m , and the label space Y. Under this construction, the data tensor X is simply a vector within X whose elements are generated from the features in x via the tensor-product operations of Eq. (5). The nature of the d (j) ( x; W) terms from Eq. (12) can be understood by considering the action of W on a particular subspace decomposition of X. Using the definition of h (i) (x i ) from Eq. (7), we can expand X on a standard basis of X as where { e . The coefficients of X on the bases in D (j) consist of feature products of degree j, so we therefore refer to D (j) as the degree-j subspace of X. The dimension of D (j) is given by which is the number of ways to draw j features from the total set of m features. The dimension of the combined feature space for a set D of degrees, denoted dim(D), is then which is simply the sum of the subspace dimension for each degree in D. If we denote the projection into the degree-j subspace as P (j) , then the interaction decomposition becomes where the regression output is a sum of contractions between W and the projection of X into each of the m + 1 degree subspaces. Due to the linearity of tensor contractions, this equality can be easily verified by performing the sum over j first, since j P (j) gives the identity.
The form of Eq. (16) provides two interpretations of the degree contributions d (j) ( x; W). If we consider the projector P (j) acting on X, as originally envisioned, then d (j) ( x; W) is the contraction of W with the portion of X that inhabits D (j) . This could be viewed as a form of data tensor preprocessing, where elements of X corresponding to interaction degrees other than j are removed. Alternatively, the tensor diagrams in Eq. (16) show that it is equally valid to consider P (j) acting on the weight tensor W . Under this interpretation, the set { d (j) ( x; W)} m j=0 consists of regressions on X performed by different models, each derived from a common tensor W by keeping only those elements corresponding to interactions of degree j. We will shift between these two interpretations freely throughout the remainder of this section, describing the interaction decomposition as a procedure which picks out different pieces of a tensor network model and which restricts the set of feature products that can be used for regression.

Interaction Decompositions of TR and TTN Models
The interaction decomposition of a TR or TTN regression model allows us to quantify the relative importance of the jth interaction degree to the overall value of the output. This information is not available when performing the standard contraction operations laid out in Eqs. (10,11), since the elements of the intermediate tensors are sums of contributions from a wide range of interaction degrees, making it impossible to separate out the impact of any one degree. Given the unfavorable scaling of Eq. (14), a brute force evaluation of each feature product in d (j) ( x; W) is also impractical for even modest values of j. In Appendix A.1, we describe an alternative procedure that efficiently contracts the TR and TTN component tensors in a manner that ultimately yields the same output as the standard contraction, but also separates out the various d (j) ( x; W) contributions.
In this section, we carry out these interaction decompositions on TR and TTN models that were trained to classify digits from the MNIST [31] and Fashion MNIST [32] datasets. These datasets have been widely used to evaluate tensor network models in the literature, and thus serve as reasonable benchmarks for our analysis. Given that the number of operations needed for a full interaction decomposition can scale quadratically with the number of features (see Appendix A.1), we resized each image from 28 × 28 pixels to 8 × 8 pixels in order to reduce the computational burden of the tests. The grayscale pixels were also normalized to floating-point values on the range [−0.5, 0.5] to improve the numerical stability of the networks. The bond dimension of the TR and TTN models was set to 20, providing them with sufficient representational power without excessive overfitting.
The regression output f ( x; W) was fit against one-hot encodings of the digit labels and optimized using gradient descent with a mean squared error loss function. During training the networks were contracted normally, with the interaction decomposition being performed at the end using the test dataset.
To begin our analysis, we focus first on the magnitudes of the different d (j) ( x; W) contributions. To produce a single magnitude for each degree, we computed the L1 norm of d (j) ( x; W) for each image in the test dataset, and then averaged over the set. Figure 3 shows the resulting magnitudes for ten TR models and ten TTN models, all trained using the same hyperparameters but with different initial values for the tensor elements. Across both datasets the TR and TTN plots show a similar pattern, with the degree magnitudes starting at approximately 10 −1 for j = 0 and then growing steadily to some maximum value before declining again at larger j. The size and location of the peak varies significantly between the MNIST and Fashion MNIST models, with the MNIST models peaking from 0.1 to 1 at around degrees 10 to 15 while the Fashion MNIST models peak from 10 2 to 10 4 at around degrees 17 to 23. After the peak, the magnitudes begin to drop off precipitously, with interaction degrees greater then 45 typically having contributions orders of magnitude smaller than those from degrees before the peak. The inset plots of Figure 3 show that there is a significant amount of variation between individual models of a given network type and dataset, with some models having magnitudes 10 or even 100 times larger than others.
A significant limitation of the magnitude analysis from Figure 3 is that it can be difficult to assess the true importance of a degree contribution using only its average magnitude. Indeed, even if a set of degrees all have small individual magnitudes, their cumulative effect on the output may still be important. To better assess the "usefulness" of the degree contributions, we computed the accuracy of the TR and TTN classifiers as a function of interaction degree, both individually and cumulatively. Figure 4 shows these accuracies after averaging over the models of each network type. The cumulative accuracy (shown using a solid line) of degree j denotes the accuracy of the output generated by the sum of all degree contributions less than or equal to j, while the individual accuracy of degree j (shown as a point on the scatter plot) gives the performance of d (j) ( x; W) alone. The dimension of the expanded feature space corresponding to each data point is determined by Eq. (15).  , averaged across the MNIST and Fashion MNIST test datasets, for ten TR models and ten TTN models trained using all interaction degrees. Each dashed line represents the average magnitudes from one of the models, which is plotted against the interaction degree j of the contribution. The trend for all models is broadly the same, with magnitudes starting near 10 −1 for the bias term and then gradually rising to a peak before dropping off significantly by degree 45. Large variations in magnitude can be seen when comparing individual TR models and TTN models. From the plots of cumulative accuracy, we can see that the average performance of both the TR and TTN networks plateaus at slightly over 98% on MNIST (98.31% for the TRs and 98.47% for the TTNs when all degrees are included), which is consistent with prior work [2][4] [15]. On Fashion MNIST the accuracies are signficantly lower, at 82.73% for the TR and 83.43% for the TTN. 1 These final accuracy values are of less significance to us than the interaction degree at which the curve flattens. On MNIST (Fashion MNIST) this occurs at approximately j = 31 (44) for the TRs, and at j = 38 (54) for the TTNs. Looking back at the magnitudes from Figure 3, this indicates that even contributions on the order of 10 −3 can still improve the performance of the classifier. Interestingly, all four of the accuracy curves show a temporary flattening before degree 10, followed by a second upward rise. This effect is least visible on the TR MNIST curve and most visible on the two Fashion MNIST curves, with the latter pair of curves seeing most of their accuracy gains after degree 10.
Based on the accuracies of the individual contributions d (j) ( x; W), which are shown in Figure 4 using scatter plots, it is clear that only the first few interaction degrees are having their coefficients optimized such that they can classify images independently. The remaining contributions, which constitute the vast majority of regressors, have accuracies close to 10% and therefore do not separate the different digit classes to any appreciable extent when used in isolation. This suggests that the higher-degree d (j) ( x; W) have been trained essentially to correct or finesse the cumulative output from the lower degrees, since the cumulative accuracy continues to increase as their outputs are incorporated. This trend is particularly marked for the Fashion MNIST models, where only degrees 1 and 2 have accuracies above 12%. Indeed, plots C and D from Figure 3 show that many of the regressors in these models are being used to cancel out the large magnitudes of the intermediate degrees, since the final regression output needs to be roughly in the range [0, 1] to achieve a reasonable loss value.

Interaction Decompositions as Regression Models
In Sec. 3.3, we used the interaction decomposition as a tool to analyze tensor network models that had been trained using standard methods. As a result, the parameters of each model were optimized under the assumption that every interaction degree would contribute to the final output, without any truncation or isolation. This offers the greatest flexibility to the model in principle, but it can also obscure the potential success that a single degree or subset of degrees might have had if the parameters of the network had been optimized to improve their performance specifically.
In light of this fact, we propose a new type of tensor network model called the D-degree tensor network. In these models, only interaction degrees in the set D are used to construct the regression output, such that Comparing this expression to the full interaction decomposition given in Eq. (12), it is clear that if D is the set of all interaction degrees (i.e. if D = {0, 1, ..., m}), then the corresponding D-degree network is equivalent to a standard tensor network with the same structure. However, we will focus our attention on models where D contains only a fraction of the m + 1 possible interaction degrees. By restricting the regression in this manner, we are effectively inducing sparsity in the weight tensor W by zeroing the coefficients for all interaction degrees not included in D. However, unlike in the 1 Tensor networks can achieve significantly higher accuracies on 28 x 28 Fashion MNIST [3][4] [5], but the decrease in performance at 8 x 8 is much more severe than for standard MNIST, due to the greater image complexity. For comparison to more state-of-the-art methods, an Inception convolutional network [33] can achieve accuracies of 99.09% on 8 × 8 MNIST and 86.5% on 8 × 8 Fashion MNIST (see Appendix A. 3) case of sparse neural networks [34] [35], this sparsity does not necessarily lead to a reduction in the number of trainable parameters or to an improvement in the computational overhead. Instead, the sparsity leads to a simplification in the structure of the regression function, which can yield a model that is more easily interpretable while still achieving the same level of performance.
Using the decomposition procedure described in Appendix A.1, it is possible to efficiently train D-degree models on the same regression tasks used in Sec. 3.3, and thus compare their classification accuracies with those shown in Figure 4. For our tests, we selected D-degree models that fell into two categories: the cumulative-j models, in which all degrees less than or equal to j are included in the output, and the degree-j models, in which the output is simply the contribution from the jth degree: These two groups describe only a small portion of the 2 m+1 − 1 possible D-degree models, but they will allow us to easily compare our results with the plots in Figure 4. For our numerical tests, we trained the models on 8 × 8 images from the MNIST and Fashion MNIST datasets prepared in the same manner described in Sec. 3.3. The D-degree models take somewhat longer to train than standard tensor network models due to the added complexity of the interaction decomposition, but their times are still on par with those of neural network models (see Appendix A.3). Figure 5 shows average accuracies of the cumulative-j (blue plots) and degree-j (orange plots) models as a function of j, with each data point representing an average across ten models. These averages are plotted alongside the cumulative data from Figure 4, which shows the performance of the full tensor network models as a reference. We emphasize that the cumulative-j and degree-j curves in Figure 5 are computed in precisely the same manner as the line and scatter plots from Figure 4, except that the models which generated Figure 4 were trained using all of the interaction degrees rather than just the specific subset being plotted. The plots for the D-degree models omit results for j = 0, since those models contain only the bias term and thus predict the same digit for every image. The data used to generate these plots is given in Appendix A.2.
The first trend to observe from Figure 5 is that both the cumulative-j and degree-j models have accuracies that are significantly greater than the corresponding cumulative accuracy at degree j from the full tensor network models. This performance gap is notable, because it implies that the standard models are utilizing the feature-product regressors in a highly inefficient manner. For MNIST in particular, the degree-j classifiers with j > 3 were able to perform within 0.5% of the full model. As a comparison, the cumulative MNIST accuracy of the regular TR and TTN models using degrees 0 -4 is only 71% and 61% respectively. The disparity is even larger when looking at the single-degree accuracies from Figure 4, which show that most individual d (j) ( x; W) were unable to classify images at all when trained as part of a full tensor network model. When those degree contributions were optimized directly, however, they were able to perform classification with more than 98% accuracy. The degree-j models did not perform as well on Fashion MNIST, though they still achieved accuracies that were vastly higher than the corresponding cumulative accuracies from Figure 4. The cumulative-j models, on the other hand, were able to closely match the performance of the standard models, with the cumulative-10 TR and cumulative-8 TTN models coming within 0.1% of their full-degree counterparts.
The dashed horizontal lines in Figure 5 mark the accuracy of the full tensor network models when every degree contribution is included. Using these values as a benchmark, we can see that several of the cumulative-j TTN models (6 ≤ j ≤ 10) and degree-j TTN models (7 ≤ j ≤ 10) actually outperformed the corresponding full model on the MNIST dataset. This is a counter-intuitive result, The degree-j models generally demonstrate worse performance than the cumulative-j models, though the difference is very small on MNIST. The cumulative-j models were able to closely match the accuracies of the full models on both datasets, and even slightly exceed their performance on MNIST.
as it suggests that regressing on all of the interaction degrees can actually yield slightly worse results than regressing on only a small subset of them. A cumulative-8 TTN model, for example, uses only one-billionth of the feature products contained within the data tensor X, yet achieves an average accuracy roughly 0.1% higher than a TTN model which has access to all of X.
Finally, we note that a comparison can be made between the performance of these D-degree network models, which constrain the feature product coefficients to all be generated by the same lowrank tensor network, and a more general multilinear regression model in which every coefficient can be determined arbitrarily. In Appendix A.3, we give results for this type of unconstrained regression on features products up to degree 4, which shows that the cumulative-j models achieve accuracies very near the arbitrary models of degree j, and can outperform them for larger values of j even when the tensor network models contain fewer trainable parameters. This demonstrates the utility of incorporating more interactions (up to a point), since constrained regression on higher-degree feature products is more effective than unconstrained regression on lower-degree feature products.

Discussion
The exponential feature space induced by the transformation in Eq. (5) lies at the heart of tensor network regression, and there is no doubt that these models utilize it to achieve a level of performance that far exceeds standard linear regression. That said, it is easy to feel incredulous toward the idea that tensor network models, or indeed any regression model, could truly make use of the 2 64 different regressors that are generated from an 8 × 8 image. The goal of our work here has been to develop the interaction decomposition as a tool to test this claim, and then apply it to tensor network models under a pair of standard machine learning tasks. By evaluating the magnitudes and accuracies of the different interaction degrees, we can begin to draw conclusions about how effectively the exponential space is being utilized.
To this end, our results from Sec. 3.3 show that more than half of the interaction degrees contributed meaningfully to the output of the classifiers, with the Fashion MNIST TTN models in particular using up to degree 50. In the language of Sec. 3.2, this indicates that the tensor network models are utilizing a portion of the expanded feature space X that has a dimension on the order of 10 19 , which can be computed by summing Eq. (14) across all significant degrees. It is important to note, however, that Figure 4 only shows the change in accuracy for the jth interaction degree when the entirety of subspace D (j) is incorporated into the regression. It may very well be that the models in Sec. 3.3 are utilizing only a small portion of this space, and thus the number of relevant feature products could be far smaller than the upper-bound of 10 19 . The interaction decomposition cannot separate out different parts of a given degree-j subspace, so future work might look into alternate algorithms that are able to divide up these spaces into meaningful components.
For a given interaction degree, one can ask not only if the set of feature products is being utilized by a tensor network model, but also how well the model is using them relative to some standard. In Sec. 3.4 we introduced the D-degree tensor network to serve as this standard, since its parameters could be trained to maximize performance using only a specific subset of interaction degrees. In our tests, the networks were limited to interaction degrees of at most 10, which corresponds to an expanded features space of dimension at most 10 11 as given by Eq. (15). The results shown in Figure 5 demonstrate that the full tensor network models are significantly under-utilizing the lowerdegree interactions, since the D-degree models are able to achieve accuracies up to 60 percentage points higher when constrained to those same degrees. This under-utilization was especially acute for Fashion MNIST, where the full models only reached cumulative accuracies of 25% for the first ten degrees, despite the fact that the cumulative-10 models had accuracies near 83%.
More significantly, some D-degree models trained using only the first six interaction degrees were able to achieve accuracies on MNIST that were greater than those of models trained using all degrees. While this could simply be due to more overfitting in the full models, it might also point to inherent limitation in the tensor network representation of W . We know, for example, that the regression coefficients in a tensor network model are necessarily coupled together by the elements of the component tensors, which may force the model to use suboptimal coefficients for the lower interaction degrees in order to avoid harmful contributions from the higher degrees. Given that detailed, "under-the-hood" analyses of these models are possible using methods such as the interaction decomposition introduced here, the existence and nature of this compromise seems like a promising area for further study.
Taken together, the results discussed here support the following two conclusions: 1. Common tensor network models are capable of utilizing regressors from a large portion of the expanded feature space generated by the featurization from [1].
2. However, a comparable level of performance may also be achieved by regression on a minuscule fraction of that same space.
For those looking to use tensor network models for machine learning, there is cause here for both optimism and caution. While the first conclusion makes it clear that tensor network regression models can incorporate useful information from a wide range of interaction degrees, the second conclusion implies that it is difficult for these models to extract any unique information from the higher-degree regressors. In light of this, we believe that the D-degree tensor network models, which have been absent in the literature up to this point, represent a promising approach for tensor network regression. Using these models, it is possible to exploit the representational efficiency of tensor networks while constraining the regression to a reasonable and interpretable set of feature products based on the inherent complexity of the dataset.
2. Construct (implicitly) a modified data tensorX using the mappings from Eq. (21), such that 3. Use degree-preserving contraction operations to contractX with the tensor network, following whichever efficient contraction scheme is appropriate for the network architecture of the model. 4. If the decomposition is being used to contract a D-degree network, then the degree index of all intermediate tensors can be be truncated to the largest degree in D.
Since the contraction of the network is done using degree-preserving contractions, the contributions from each interaction degree are kept separate throughout the entire process. The final output of the interaction decomposition (without truncation) is a second-order tensor of the form where d (j) ( x; W) is the degree-j contribution to the combined regression output f ( x; W). The computational cost of the procedure described above is best understood in terms of how much additional complexity it adds on top of a standard contraction of the network. This complexity comes from two sources: larger intermediate tensors due to the addition of the degree index, and an extra sum over the degree index that is present in the degree-preserving tensor product from Eq. (19). The first contribution is easy to characterize, since adding a degree index simply increases the size of the original tensors by a factor that is on the order of the maximum interaction degree j max in the decomposition. The second contribution is more subtle, since the number of terms in the tensor-product sum depends on the relative sizes of the degree indices of the two inputs. Consider again the tensor product between A and B from Eq. (19), and letj a andj b be the largest value of the degree index for A and B respectively, withj =j a +j b andj a ≤j b . Then it it can be shown that the number of terms s needed to generate allj + 1 slices of A⊗ B is given by s = (j a + 1)(j b + 1), which scales as O(j ajb ). This means that, for a fixedj, the value of s can range from a minimum ofj + 1 ifj a = 0 to a maximum of 1 4 (j) 2 +j + 1 for the fully symmetric case whenj a =j b = 1 2j . Given that the last contractions in the interaction decomposition will havej on the order of j max , this means that the most complex degree-preserving tensor products can either scale as O(j 2 max ) or O(j max ), depending on the amount of symmetry between the two input tensors. The fact that more symmetric contraction schemes can lead to worse scaling (quadratic rather than linear in j max ) is an interesting property of this method, although the use of such schemes may still be desirable due to other computational advantages.

A.2 Tabulation of D-degree Model Performance
The following two tables show the results of the numerical tests described in Sec. 3.4 and plotted in Figure 5, along with the relevant cumulative accuracy values for the full models from Figure 4. Table 1 gives the accuracies for MNIST, while Table 2 provides them for Fashion MNIST. Each value represents the average percent test accuracy across ten different initializations of the given model type, with the standard error of the last digit shown in parentheses. For the cumulative-j and degree-j models the column label denotes the value of j, while for the full models they denote the cumulative accuracy of the output up to the jth interaction degree.

A.3 Regression Model Comparisons
In Table 3, we compare our TR and TTN models with several low-order multilinear models and a deep learning model, in terms of their number of trainable parameters, computation time per epoch, and average accuracies on the 8 × 8 image datasets. The linear, bilinear, trilinear, and tetralinear regression models perform unconstrained regression on feature products of degree less than or equal to 1, 2, 3, and 4 respectively, which are the same regressors used by the cumulative-j models for 1 ≤ j ≤ 4. By "unconstrained", we mean that the coefficients for each feature product can be set arbitrarily rather than being generated by a low-rank tensor network. To offer a comparison with state-of-the-art neural network algorithms, we also provide the corresponding numbers for a convolutional neural network (CNN) model based on the Inception [33] architecture, which contains the most parameters and achieves the best performance on both datasets.