RG-Flow: A hierarchical and explainable flow model based on renormalization group and sparse prior

Flow-based generative models have become an important class of unsupervised learning approaches. In this work, we incorporate the key ideas of renormalization group (RG) and sparse prior distribution to design a hierarchical flow-based generative model, RG-Flow, which can separate information at different scales of images and extract disentangled representations at each scale. We demonstrate our method on synthetic multi-scale image datasets and the CelebA dataset, showing that the disentangled representations enable semantic manipulation and style mixing of the images at different scales. To visualize the latent representations, we introduce receptive fields for flow-based models and show that the receptive fields of RG-Flow are similar to those of convolutional neural networks. In addition, we replace the widely adopted isotropic Gaussian prior distribution by the sparse Laplacian distribution to further enhance the disentanglement of representations. From a theoretical perspective, our proposed method has $O(\log L)$ complexity for inpainting of an image with edge length $L$, compared to previous generative models with $O(L^2)$ complexity.


Introduction
One of the most important unsupervised learning tasks is to learn the data distribution and build generative models. Over the past few years, various types of generative models have been proposed. Flow-based generative models are a particular family of generative models with tractable distributions [1,2,3,4,5,6,7,8,9], yet their latent variables are on an equal footing and mixed globally. Here, we propose a new flow-based model, RG-Flow, which is inspired by the idea of the renormalization group in statistical physics. RG-Flow imposes locality and a hierarchical architecture on its bijective transformations. It allows us to access and modify information in the input data at different scales by controlling different latent variables, which offers better explainability. Combined with sparse prior distributions [10,11,12], we show that RG-Flow achieves hierarchical disentanglement of representations. (Preprint: arXiv:2010.00029v5 [cs.LG], 15 Aug 2022.)
Renormalization group (RG) is a powerful tool in physics to analyze statistical mechanics systems and quantum field theories [13,14]. It progressively extracts more coarse-scale statistical features of the physical system and decimates irrelevant fine-grained statistics at each scale. Typically, the transformations in RG are local, not bijective, and designed by physicists. Flow-based models also use cascaded transformations to progressively turn a complicated data distribution into a simple distribution, such as a Gaussian, but, on the contrary, those transformations are usually global, bijective, and automatically learned. In this work, we propose RG-Flow, which combines the key ideas from both RG and flow-based models. RG-Flow learns the optimal RG transformation from data using local bijective transformations, and also serves as a hierarchical generative model for the data. Latent representations are extracted gradually at different scales to capture the statistical features of the data at the corresponding scales, and are supplied jointly to invert the transformations when generating the data. This method has been recently explored in the physics community as NeuralRG [15,16].
Our main contributions are two-fold: First, we point out that RG-Flow can naturally separate the information at different scales in the input image distribution, and assign hierarchical latent variables on a hyperbolic tree. Taking the human face dataset CelebA [17] as an example, the network will not only find high-level representations like the gender and the emotion, but also mid-level and low-level ones like the shape of a single eye. To visualize representations at different scales, we adopt the concept of receptive field from convolutional neural networks (CNN) [18,19] and visualize their structures in RG-Flow. In addition, since the representations are separated in a hierarchical fashion, they can be mixed to different extents at different scales, thus enabling separate mixing of content and style in images. Second, we find that the sparse prior distribution is helpful to further disentangle representations and make them more explainable. As the widely adopted Gaussian prior is rotationally symmetric, each latent variable in a flow model can be arbitrarily mixed with others and lose any clear semantic meaning. Using a sparse prior, we clearly demonstrate the semantic meaning of the latent space.

Related work
Some flow-based generative models also possess multi-scale latent spaces [1,2,20,21,22,23], and recently hierarchies of features have been utilized in [24], where the high-level features are shown to perform strongly in the out-of-distribution detection task. However, previous models do not impose hard locality constraints in their multi-scale architectures. In Appendix C, the differences between globally connected multi-scale flows and RG-Flow are discussed, and we find that semantically meaningful receptive fields do not show up in the former. Recently, other more expressive bijective maps have been developed [6,9,25], and those methods can be incorporated into our proposed architecture to further improve the expressive power of RG-Flow.
Some other classes of generative models rely on a separate inference model to obtain the latent representation. Examples include variational autoencoders [26], adversarial autoencoders [27], InfoGAN [28], BiGAN [29,30], and progressive growing GAN [31]. Those techniques typically do not arrange latent variables in a hierarchical way, and their inference of latent variables is approximate. Notably, recent advances have suggested that introducing hierarchical latent variables in them can be beneficial [32]. In addition, the coarse-to-fine fashion of the generation process has also been discussed in other generative models, such as Laplacian pyramid of adversarial networks [33], and multiscale autoregressive models [34].
Typically, in flow-based generative models, a Gaussian distribution is used as the prior distribution for the latent space. Due to the rotational symmetry of the Gaussian prior, an arbitrary rotation of the latent space leaves the likelihood unchanged, so the latent variables can be arbitrarily mixed during training and lose their semantic meanings. Sparse priors [10,11,12] have been proposed as an important tool for unsupervised learning, leading to better explainability in various domains [57,58,59]. To break the symmetry of the Gaussian prior and improve explainability, we incorporate the sparse Laplacian prior in RG-Flow. Please refer to figure E1 for a quick illustration of the difference between the Gaussian prior and the sparse prior, where the sparse prior intuitively leads to better disentanglement of the features.
Renormalization group (RG) has a broad impact ranging from particle physics to astrophysics. Apart from the analytical studies in field theories [14,60,61], RG has also been useful in numerically simulating quantum states. The multi-scale entanglement renormalization ansatz (MERA) [62,63] implements the hierarchical architecture of RG in tensor networks to represent quantum states. The exact holographic mapping (EHM) [64,65,66] further extends MERA to a bijective (unitary) flow model between the latent product state and the observable entangled state. Recently, the MERA architecture and deep neural networks have been combined to design a flow-based generative model that learns EHM from statistical physics and quantum field theory actions [15,16]. In quantum machine learning, the recent development of quantum convolutional neural networks has also utilized the MERA architecture [67]. The similarity between RG and deep learning has been discussed in several works [68,69,70,71,72,73], and information theoretic objectives to guide machine-learning RG transformations have been proposed in recent works [74,16,75]. The semantic meaning of the emergent latent space has been related to quantum gravity [76,77], which leads to the exciting development of machine-learning holography [78,79,80,81,82].

Flow-based generative models
Flow-based generative models are a family of generative models with tractable distributions, which allow exact and efficient sampling and evaluation of the probability density [83,1,2,4]. The key idea is to build a bijective map x = G(z) between the observable variables x and the latent variables z, as shown in figure 1. The latent variables z follow a simple distribution p_Z(z) that can be easily sampled, for example the i.i.d. Gaussian distribution. In this way, data can be efficiently generated by first sampling z and then mapping it to x. Due to the equality p_X(x) dx = p_Z(z) dz, i.e. p_X(x) = p_Z(z) |det(∂z/∂x)|, the log-probability density is given by

log p_X(x) = log p_Z(z) + log |det(∂z/∂x)|.

The bijective map is usually composed of a series of bijectors, G(z) = G_1 ∘ G_2 ∘ ⋯ ∘ G_n(z), where each bijector G_i has a tractable Jacobian determinant and its inverse R_i = G_i^{-1} can be computed efficiently. R stands for "representation" and G stands for "generation". The two key ingredients in flow-based models are the design of the bijective map G and the choice of the prior distribution p_Z.

Figure 1. The observable variables x are the data that we want to model, which may follow a complicated probability distribution p_X(x). The goal of flow-based models is to learn a bijective mapping G between the observable space and the latent representation such that the latent variables follow a simple distribution p_Z(z). The red dashed circle on the right shows the ideal target distribution p*_Z(z) in the latent space; the red dashed circle on the left shows the inverse target distribution p*_X(x) in the data space. The loss function measures the discrepancy between the data distribution p_X(x) and the target distribution p*_X(x).
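As a concrete illustration of the change-of-variables formula, the following sketch (ours, not from the paper) implements a toy elementwise affine bijector in NumPy and evaluates log p_X(x) through the log-determinant of its Jacobian; the parameters s and t are hypothetical stand-ins for a trained bijector.

```python
import numpy as np

# Toy bijector: elementwise affine map x = G(z) = exp(s) * z + t.
# s and t are arbitrary stand-ins for trained parameters.
rng = np.random.default_rng(0)
s, t = rng.normal(size=4), rng.normal(size=4)

def G(z):            # generation: latent -> data
    return np.exp(s) * z + t

def R(x):            # representation: data -> latent (inverse of G)
    return (x - t) * np.exp(-s)

def log_prob_x(x):
    """log p_X(x) = log p_Z(R(x)) + log|det dR/dx|, standard Gaussian prior."""
    z = R(x)
    log_pz = -0.5 * np.sum(z**2) - 0.5 * len(z) * np.log(2 * np.pi)
    log_det = -np.sum(s)          # dR/dx is diagonal with entries exp(-s)
    return log_pz + log_det

x = G(rng.normal(size=4))
assert np.allclose(G(R(x)), x)    # bijectivity: R inverts G exactly
```

Any stack of such bijectors with tractable Jacobians (e.g. coupling layers) accumulates the log-determinants additively in the same way.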

Structure of RG-Flow
Much of previous research has focused on designing more powerful bijective blocks for the generator G to improve its expressive power and achieve better approximations of complicated probability distributions. In this work, we focus on designing an architecture that arranges the bijective blocks in a hierarchical fashion to separate features at different scales in the data and disentangle latent representations.
Figure 2. (a) Each layer of the forward RG transformation extracts coarse-grained information (decimation) to send to the next level for further processing, and splits out the independent fine-grained features, which are left at the current level. RG progressively builds a hierarchical representation of the input image at different scales. (b) The inverse RG transformation generates the fine-grained image from latent variables. At each layer, the coarse-grained information from the higher level is merged with the independent fine-grained information at the current level, and the inverse transform of the merged representation is sent to the lower level.
Our design is motivated by the idea of RG in physics, which progressively separates the coarse-grained information in input data from independent fine-grained information by local transformations at different scales. Let x be the observable variables, or the input image (level 0), denoted as x^(0) ≡ x. A step of the RG transformation extracts the coarse-grained information x^(1), sends it to the next layer (level 1), and splits out the remaining independent fine-grained information as auxiliary variables z^(0), which are left at the current level. This procedure is described by the following recurrence equation (at level h for example):

x^(h+1), z^(h) = R_h(x^(h)),

which is illustrated in figure 2 (a), where dim(x^(h+1)) + dim(z^(h)) = dim(x^(h)). At each level, the transformation R_h is a local bijective map, constructed by stacking trainable bijective blocks, whose details we specify later. The split-out information z^(h) can be viewed as latent variables arranged at the corresponding scale. By the bijectivity of R_h, the inverse RG transformation step G_h ≡ R_h^{-1} generates the fine-grained image:

x^(h) = G_h(x^(h+1), z^(h)).

At the highest level h_L, the most coarse-grained image x^(h_L) = G_{h_L}(z^(h_L)) can be considered as generated directly from the latent variables z^(h_L) without referring to any higher-level image, where h_L = log_2 L − log_2 m for an original image of size L × L and local transformations acting on kernels of size m × m. Therefore, given the latent variables z = {z^(h)} at all levels h, the original image can be restored by the composite map

x ≡ x^(0) = G_0(G_1(G_2(. . . , z^(2)), z^(1)), z^(0)) ≡ G(z),

as illustrated in figure 2 (b). RG-Flow is a flow-based generative model that uses this composite bijective map G as its generator.
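The recurrence above can be sketched with plain pixel bookkeeping. In the sketch below (ours), R_h is replaced by a trivial partition with no learned transformation, just to make the dimension counting dim(x^(h+1)) + dim(z^(h)) = dim(x^(h)) and the number of levels h_L concrete.

```python
import numpy as np

# Stand-in for one RG step: keep a stride-2 coarse grid (1/4 of pixels)
# and split out the other 3/4 as "latent" variables. A real R_h would
# apply trained local bijectors before this partition.
def rg_split(x):
    coarse = x[::2, ::2]                      # decimated coarse grid x^(h+1)
    mask = np.ones(x.shape, dtype=bool)
    mask[::2, ::2] = False
    fine = x[mask]                            # split-out variables z^(h)
    return coarse, fine

L, m = 32, 4
x = np.arange(L * L, dtype=float).reshape(L, L)
zs = []
h_L = int(np.log2(L) - np.log2(m))            # number of RG steps
for h in range(h_L):
    x, z = rg_split(x)
    zs.append(z)

assert x.shape == (m, m)                      # most coarse-grained image
# Bijectivity: total degrees of freedom are conserved.
assert sum(z.size for z in zs) + x.size == L * L
```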
To model the RG transformation, we arrange the bijective blocks in a hierarchical network architecture. Figure 3 (a) shows the side view of the network, where each green or yellow block is a local bijective map.
Following the notation of MERA, the green blocks are the disentanglers, which reparametrize local variables to reduce their correlations, and the yellow blocks are the decimators, which split the decimated features out as latent variables. If there were no disentanglers, the structure would simply be a tree. Disentanglers are added to reduce the local correlations of variables and encourage the decimated variables to be noise-like fine-grained information. The blue dots at the bottom are the observable variables x, or the input image, and the red crosses are the latent variables z. We omit the color channels of the images at each level in the illustration, since we keep the number of channels unchanged through the transformation. Figure 3 (b) shows the top-down view of a step of the RG transformation, in which the disentanglers (green blocks) and the decimators (yellow blocks) are interwoven on top of each other. The covered area of a disentangler or decimator defines the kernel size m × m of the bijector; for example, in figure 3 (b), the kernel size is 4 × 4. After each decimator, three fourths of the degrees of freedom are decimated into latent variables (red crosses), so the edge length of the image is halved.
As a precise mathematical description of figure 3, consider a single RG transformation step R_h at level h, comprising a disentangler layer and a decimator layer. In a spatial block (p, q), labeled by p, q = 0, 1, . . . , L/(2^h m) − 1, the mapping from x^(h) to (x^(h+1), z^(h)) is given by

y_{(pm+m/2, qm+m/2) + 1_m} = R_h^dis (x^(h)_{(pm+m/2, qm+m/2) + 1_m}),
(x^(h+1)_{(pm, qm) + 2_m}, z^(h)_{(pm, qm) + 1_m \ (pm, qm) + 2_m}) = R_h^dec (y_{(pm, qm) + 1_m}),

where y is the intermediate result after the disentangler but before the decimator, (pm+m/2, qm+m/2) + 1_m and (pm, qm) + 1_m are the pixel positions that the disentangler and the decimator act on respectively, and k_m = {(ka, kb) | a, b = 0, 1, . . . , m/k − 1} denotes the set of pixels in an m × m square with stride k. The notation x^(h)_(i,j) stands for the variable (a vector of all color channels) at RG level h and spatial position (i, j), and similarly for y and z.
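A minimal sketch of the index sets involved, assuming the decimator blocks sit at (pm, qm) and the disentangler blocks are offset by half a kernel; the helper pixel_set is hypothetical and only illustrates the k_m notation.

```python
# k_m is the set of pixels in an m x m square with stride k, anchored at
# a given origin. Decimator blocks are aligned to the grid; disentangler
# blocks are offset by half a kernel (m/2 in each direction).
def pixel_set(origin, m, k=1):
    oi, oj = origin
    return {(oi + k * a, oj + k * b) for a in range(m // k) for b in range(m // k)}

m = 4
dec_block = pixel_set((0, 0), m)              # decimator at (pm, qm) + 1_m
dis_block = pixel_set((m // 2, m // 2), m)    # disentangler, half-kernel offset
kept = pixel_set((0, 0), m, k=2)              # stride-2 subset (pm, qm) + 2_m

assert len(dec_block) == m * m
assert len(kept) == (m // 2) ** 2             # 1/4 kept, 3/4 decimated out
assert kept <= dec_block                      # kept pixels lie inside the block
```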
The disentanglers R_h^dis and the decimators R_h^dec can be any bijective neural networks. In practice, we use the coupling layer proposed in the Real NVP architecture [1] to build them, with a detailed description in Appendix A. By specifying the above R_h, the inverse transformation step G_h is automatically specified.

Training objective
After decomposing the input data into multiple scales, we still need to enforce that the latent variables at the same scale are disentangled. We take the usual approach that the latent variables z are independent random variables, described by a factorized prior distribution

p*_Z(z) = ∏_l p(z_l),

where the index l = (h, i, j, c) labels every latent variable by its RG level h, spatial position (i, j), and color channel c. This prior gives the network an incentive to minimize the mutual information between latent variables. This minimal bulk mutual information (minBMI) principle has been proposed as the information theoretic principle that defines the RG transformation [15,16]. The particular choice of the distribution p(z_l) is discussed in 3.6.
To model the data distribution, we minimize the negative log-likelihood of inputs x sampled from the dataset. The loss function reads

L = −E_{x∼p_X} log p*_X(x) = −E_{x∼p_X} [log p*_Z(R(x)) + log |det(∂R(x)/∂x)|] = D_KL(p_X ∥ p*_X) + H(p_X),

where R ≡ G^{-1} is the forward RG transformation containing the trainable parameters, and R(x) = z is the latent representation obtained by transforming the input sample. Note that the second term in the last expression, the entropy H(p_X), is a constant, as it only involves the data distribution p_X(x). Minimizing the negative log-likelihood evaluated on the training data is equivalent to maximizing the probability of generating those data from the neural network. As shown in figure 1, our goal is to transform a complicated data distribution p_X(x) to a simple target distribution p*_Z(z) in the latent space, shown by the red dashed circle on the right. The transformation G inverts p*_Z(z) to p*_X(x) in the data space, shown by the red dashed circle on the left. The maximum likelihood loss function thus measures the discrepancy between the target distribution p*_X(x) and the data distribution p_X(x) by the KL divergence.
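The loss can be sketched as follows for the toy elementwise affine bijector, with either a Gaussian or a Laplacian factorized prior; s and t are hypothetical stand-ins for trained parameters, and the batch is random data rather than a real dataset.

```python
import numpy as np

# Negative log-likelihood under a factorized prior, for a toy affine
# bijector R(x) = (x - t) * exp(-s) standing in for the forward RG map.
rng = np.random.default_rng(1)
s, t = rng.normal(size=8), rng.normal(size=8)

def R(x):
    return (x - t) * np.exp(-s)

def log_prior(z, dist="gaussian"):
    if dist == "gaussian":
        return np.sum(-0.5 * z**2 - 0.5 * np.log(2 * np.pi))
    return np.sum(-np.abs(z) - np.log(2.0))   # Laplacian with b = 1

def nll(batch, dist="gaussian"):
    """Loss = -mean_x [ log p*_Z(R(x)) + log|det dR/dx| ]."""
    log_det = -np.sum(s)                      # diagonal Jacobian
    return -np.mean([log_prior(R(x), dist) + log_det for x in batch])

batch = rng.normal(size=(16, 8))
loss_gauss = nll(batch, "gaussian")
loss_laplace = nll(batch, "laplacian")
assert np.isfinite(loss_gauss) and np.isfinite(loss_laplace)
```

Swapping the prior only changes log_prior; the flow and the log-determinant bookkeeping are untouched, which is why the sparse prior of 3.6 is a drop-in replacement.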

Receptive fields of latent variables
Considering the hierarchical and local architecture in our network, we define the generation causal cone for a latent variable to be the affected area in the observable image when that latent variable is changed. This is illustrated as the red cone in figure 3 (c).
To visualize the representations in the latent space, we define the receptive field for a latent variable z_l as

RF_l = E_{z∼p_Z} |∂G(z)/∂z_l|_c,

where |·|_c denotes the 1-norm over the color channel dimension, so that RF_l is a two-dimensional array for each l. The receptive field shows the response of the generated image x = G(z) to an infinitesimal change of the latent variable z_l, averaged over the latent distribution p_Z. By this definition, the receptive field of any latent variable is always contained in its generation causal cone, and the receptive fields of higher-level latent variables are larger than those of lower-level ones. In particular, if the receptive fields of two latent variables do not overlap, which is often the case at low levels, those latent variables are automatically disentangled.
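The receptive field can be estimated by Monte Carlo finite differences. The sketch below (ours) uses a random fixed linear map as a stand-in generator G, for which RF_l reduces exactly to |W[:, l]|; a real RG-Flow generator would be nonlinear and local.

```python
import numpy as np

# Monte Carlo estimate of RF_l = E_{z ~ p_Z} | dG(z)/dz_l |, using
# finite differences. The color-channel 1-norm is omitted since this
# toy generator has a single channel.
rng = np.random.default_rng(2)
n_latent, n_pixels = 16, 64
W = rng.normal(size=(n_pixels, n_latent))     # stand-in linear generator

def G(z):
    return W @ z

def receptive_field(l, n_samples=32, eps=1e-4):
    rf = np.zeros(n_pixels)
    for _ in range(n_samples):
        z = rng.normal(size=n_latent)
        dz = np.zeros(n_latent); dz[l] = eps
        rf += np.abs(G(z + dz) - G(z)) / eps  # |dG/dz_l| per pixel
    return rf / n_samples

rf = receptive_field(0)
# For a linear G the receptive field is exactly |W[:, 0]|.
assert np.allclose(rf, np.abs(W[:, 0]), atol=1e-3)
```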

Image inpainting and error correction
Another advantage of the locality in our network can be demonstrated in the inpainting task. Similar to the generation causal cone, we can define the inference causal cone, shown as the blue cone in figure 3 (d). If we perturb a pixel at the bottom of the inference causal cone, all the latent variables inside the cone will be affected, while those outside cannot be affected. An important property of the hyperbolic tree-like network is that higher levels contain exponentially fewer latent variables. Even though the inference causal cone expands as we reach higher levels, the number of latent variables dilutes exponentially as well, resulting in a constant number of latent variables covered by the inference causal cone at each level.
Therefore, if a small local region of the input image is corrupted, only O(log L) latent variables can be affected by the corruption, where L is the edge length of the entire image, and we only need to adjust those latent variables to inpaint the region. In contrast, for globally connected networks, all O(L^2) latent variables have to be adjusted.
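A back-of-the-envelope count makes the scaling explicit; the per-level cone cross-section of m × m latents is an illustrative assumption, not the paper's exact count.

```python
import numpy as np

# A single corrupted pixel affects a roughly constant number of latents
# per level (assumed here to be an m x m cross-section), over
# h_L = log2(L) - log2(m) levels: O(log L) latents in total, versus
# all O(L^2) latents for a globally connected model.
def cone_latents(L, m=4):
    h_L = int(np.log2(L) - np.log2(m))
    return (m * m) * h_L          # constant per level x number of levels

for L in [32, 64, 128, 256]:
    assert cone_latents(L) < L * L    # grows like log L, not L^2
```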

Sparse prior distribution
In 3.3, we have implemented the minBMI principle of the RG transformation by using a factorized prior distribution, i.e. p*_Z(z) = ∏_l p(z_l). The common choice for p(z_l) is the standard Gaussian distribution, which makes p*_Z(z) spherically symmetric. Therefore, if we apply an arbitrary rotation to the latent space, which transforms a latent vector z to z′, the prior probability p*_Z(z′) remains the same as p*_Z(z), so there is no incentive for the network R to map the input x to z rather than z′. However, each basis vector of z′ is a linear combination of those of z, so the latent representations can be arbitrarily mixed by the rotation.
To overcome this issue, we propose an anisotropic sparse prior distribution for p*_Z(z), which breaks the spherical symmetry of the latent space. The sparsity means that the prior has a heavy tail along each basis vector, incentivizing the network to map each semantically meaningful feature in the observable space to a single variable in the latent space, which helps the disentanglement of representations. In this work, we choose the Laplacian distribution p(z_l) = (1/2b) exp(−|z_l|/b), which is sparser than the Gaussian distribution, as the former has a larger kurtosis than the latter. In Appendix E, we show an example of a two-dimensional pinwheel distribution to illustrate this intuition.

Figure 4. In each group of subfigures, the panel on the left shows the receptive field of a latent variable, and the two rows show the variation of two sample images when that latent variable is varied.
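The kurtosis claim can be checked numerically: the Laplacian has excess kurtosis 3 (heavier tails, i.e. sparser), while the Gaussian has excess kurtosis 0.

```python
import numpy as np

# Excess kurtosis estimated from samples: 0 for the Gaussian,
# 3 for the Laplacian p(z) = (1/2b) exp(-|z|/b) with b = 1.
rng = np.random.default_rng(3)

def excess_kurtosis(x):
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2) ** 2 - 3.0

g = rng.normal(size=500_000)
l = rng.laplace(size=500_000)
assert abs(excess_kurtosis(g)) < 0.2          # Gaussian: ~0
assert abs(excess_kurtosis(l) - 3.0) < 0.5    # Laplacian: ~3
```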

Synthetic multi-scale image datasets
To illustrate RG-Flow's ability to disentangle representations at different scales and spatial positions, we propose two synthetic image datasets with multi-scale features, which we name MSDS1 and MSDS2. Their samples are shown in Appendix B. In each image, there are 16 ovals with different colors and orientations. In MSDS1, all ovals in the same image have almost the same color, while their orientations are randomly distributed, so the color is a global feature and the orientations are local ones. In MSDS2, on the contrary, the orientation is a global feature and the colors are local ones.
We implement RG-Flow as shown in figure 3. After training, we find that RG-Flow can easily capture the characteristics of those datasets. Namely, the ovals in each image from MSDS1 have almost the same color, and those from MSDS2 almost the same orientation. In figure 4, we plot the receptive fields of some latent variables at different levels, and the effect of varying them. For MSDS1, if we vary a high-level latent variable, the color of the whole image will change, which shows that the network has captured the global feature of this dataset. Meanwhile, if we vary a low-level latent variable, the orientation of only one oval at the corresponding position will change. As the ovals are spatially separated, their low-level representations are disentangled by the generation causal cones of RG-Flow. Similarly, for MSDS2, if we vary a high-level latent variable, the orientations of all ovals will change. If we vary a low-level latent variable, the color of only one oval will change. For comparison, we also train Real NVP on the two datasets, and find that it fails to learn the global and the local characteristics of those datasets. Quantitative results are reported in Appendix B.

Human face dataset
Next, we apply RG-Flow to more complicated multi-scale datasets. Most of our experiments use the human face dataset CelebA [17], and we crop and scale the images to 32 × 32 pixels. Details of the network and the training procedure are described in Appendix A. Experiments on other datasets and quantitative evaluations are reported in Appendix G.
After training, the network learns to progressively generate finer-grained images, as shown in figure 5 (a). Note that the colors in the coarse-grained images are not necessarily the same as those at the corresponding positions in the fine-grained images, because there is no constraint to prevent the RG transformation from mixing color channels.

Receptive fields
To visualize the latent space representation, we compute the receptive field for each latent variable, and list some of them in figure 5 (b). The receptive field is small for low-level variables and large for high-level ones, as indicated by the generation causal cone. At the lowest level (h = 0), the receptive fields are merely small dots. At the second lowest level (h = 1), small structures emerge, such as an eyebrow, an eye, and a mouth. At the middle level (h = 2), there are more complex structures like a pair of eyebrows and a facial expression. At the highest level (h = 3), each receptive field grows to cover the whole image. We will investigate those explainable latent representations in 4.4. For comparison, we show receptive fields of Real NVP in Appendix C. Even though Real NVP has a multi-scale architecture, since it is not locally constrained, it fails to capture semantic representations at different levels.

Figure 5. (a) Progressive generation of three sample images from CelebA using the inverse RG transformation. At high levels, the images being generated are coarse-grained. The inverse RG adds details to the images at each level, and the images become fine-grained at low levels. (b) Receptive fields of typical latent variables from low levels to high levels. The strength of each receptive field is rescaled to one for better visualization. (c) Histograms of receptive field strengths of latent variables at each level. The x-axis shows the receptive field strength, defined as the pixel-wise average of the latent variable's receptive field, and the y-axis shows the logarithm of the number of latent variables in the histogram bin.

Learned features on different scales
In this section, we show that some of those emergent structures correspond to explainable latent features. A flow-based generative model is a maximal encoding procedure, because it uses bijective maps to preserve the number of variables before and after the encoding. It is believed that the images in the dataset live on a low-dimensional manifold, and we do not need all dimensions of the original dataset to encode them. In figure 5 (c) we show the histograms of the receptive fields' strengths, defined as RF l averaged over all pixels. Most latent variables produce receptive fields with small strengths, meaning that if we vary them the generated images will not be significantly affected. We focus on those latent variables with receptive field strengths greater than one, which have visible effects on the generated images. We use h to label the RG level of latent variables. For example, the lowest-level latent variables have h = 0, whereas the highest-level ones have h = 4. Therefore, we focus on h = 1 (low-level), h = 2 (mid-level), and h = 3 (high-level).
For high-level latent representations, we find 30 latent variables in total that have visible effects, and six of them are identified with disentangled and explainable semantics. Those factors are gender, emotion, light angle, azimuth, hair color, and skin color. In figure 6 (a), we plot the effects of varying those six high-level variables, together with their receptive fields. The samples are generated using mixed temperatures, as described in Appendix D. For mid-level latent representations, we plot four of them in figure 6 (b), which control eyes, eyebrows, hair bang, and collar respectively. For low-level representations, we plot two of them in figure 6 (c), controlling an eyebrow and an eye respectively. They are fully disentangled because their receptive fields do not overlap.

Image mixing in scaling direction
Given two images x_A and x_B, the conventional way of image mixing in the latent space takes a linear combination z = λz_A + (1 − λ)z_B between z_A = G^{-1}(x_A) and z_B = G^{-1}(x_B), with λ ∈ [0, 1], and generates the mixed image x = G(z). In RG-Flow, direct access to the latent variables z^(h) at any level h of the hyperbolic tree-like latent space enables us to mix images in a different manner, which we call "hyperbolic mixing".
We mix the large-scale (high-level) features of x_A and the small-scale (low-level) features of x_B by combining their latent variables at the corresponding levels:

z^(h) = z_A^(h) for h ≥ Θ,   z^(h) = z_B^(h) for h < Θ,

where Θ serves as a threshold of the scales. As shown in figure 7 (a), when we vary Θ from 0 to 4, more low-level information from the blonde-hair image is mixed with the high-level information from the black-hair image. In particular, when Θ = 3, the mixed face has eyebrows, eyes, nose, and mouth similar to the blonde-hair image, while the highest-level information, such as face orientation and hair color, is taken from the black-hair image. Note that this mixing is not symmetric under the interchange of z_A and z_B; see figure 7 (b) for comparison. This hyperbolic mixing achieves a similar effect to StyleGAN [49,51], where we can mix the style information from one image and the content information from another. In figure 7 (c), we show more examples of mixing faces.
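The threshold rule is simple to state in code; the sketch below operates on per-level latent dictionaries, a hypothetical layout of ours rather than the paper's implementation.

```python
# "Hyperbolic mixing": latents at levels h >= theta come from image A
# (large-scale content), levels h < theta from image B (small-scale
# details). Latents are stored per RG level h.
def hyperbolic_mix(z_A, z_B, theta):
    return {h: (z_A[h] if h >= theta else z_B[h]) for h in z_A}

z_A = {h: f"A{h}" for h in range(4)}   # stand-in latents per level
z_B = {h: f"B{h}" for h in range(4)}

assert hyperbolic_mix(z_A, z_B, 0) == z_A              # all levels from A
assert hyperbolic_mix(z_A, z_B, 4) == z_B              # all levels from B
assert hyperbolic_mix(z_A, z_B, 3) == {0: "B0", 1: "B1", 2: "B2", 3: "A3"}
```

Unlike linear interpolation, each level is taken wholesale from one image, so varying Θ sweeps a sharp content/style boundary rather than blending every latent variable.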

Image inpainting and error correction
When we have a small and local region to be inpainted in the input image, the inference causal cone of RG-Flow ensures that at most O(log L) latent variables will be affected by that region. In figure 8, we show that RG-Flow can faithfully recover the corrupted region (marked as red) using only the latent variables located inside the inference causal cone, which are around one third of all latent variables. For comparison, if we randomly pick the same number of latent variables in Real NVP and constrain the recovery to modify only those, it fails to recover the region, as shown in figure 8 (Constrained Real NVP). To achieve a recovery with Real NVP of similar quality to RG-Flow, as shown in figure 8 (Real NVP), all latent variables need to be modified, which number O(L^2). See Appendix F for more details about the inpainting task and its quantitative evaluations.
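A toy version of this constrained inpainting, with a linear stand-in generator and a hypothetical choice of which latents lie in the cone: only the cone latents are refit, by least squares on the uncorrupted pixels, while all other latents stay frozen.

```python
import numpy as np

# Inpainting sketch: adjust only the O(log L) latents inside the
# inference causal cone of the corrupted region; everything else is
# frozen. G is a linear stand-in for the real generator.
rng = np.random.default_rng(4)
n_latent, n_pixels = 8, 16
W = rng.normal(size=(n_pixels, n_latent))
G = lambda z: W @ z

z_true = rng.normal(size=n_latent)
x_true = G(z_true)
corrupt = np.array([0, 1, 2, 3])                  # corrupted pixel indices
known = np.setdiff1d(np.arange(n_pixels), corrupt)
cone = np.array([0, 1, 2])                        # latents in the cone (toy choice)
rest = np.setdiff1d(np.arange(n_latent), cone)

# Corruption scrambles the cone latents; refit them on known pixels only.
z = z_true.copy()
z[cone] = 0.0
b = x_true[known] - W[np.ix_(known, rest)] @ z[rest]
z[cone], *_ = np.linalg.lstsq(W[np.ix_(known, cone)], b, rcond=None)

assert np.allclose(G(z), x_true, atol=1e-8)       # full image recovered
```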

Discussion and conclusion
In this paper, we have combined the ideas of renormalization group and sparse prior distribution to design RG-Flow, a flow-based generative model with a hierarchical architecture and an explainable latent space. In the analysis of its structure, we have shown that it can separate information in the input image at different scales and encode it in latent variables living on a hyperbolic tree. To visualize the latent representations in RG-Flow, we have defined the receptive fields for flow-based models in analogy to those for CNNs. The experiments on synthetic datasets have shown the disentanglement of representations at different levels and spatial positions, and the receptive fields serve as a visual guidance to find explainable representations. Richer applications have been demonstrated on the CelebA dataset, including efficient image inpainting that utilizes the locality of the image, and mixing of style and content. In contrast, such semantically meaningful representations of mid-level and low-level structures do not emerge in globally connected flow models like Real NVP. The versatile architecture of RG-Flow can incorporate any bijective map to further improve its expressive power.
In RG-Flow, two low-level representations are fully disentangled if their receptive fields do not overlap. For high-level representations with large receptive fields, we use a sparse prior to encourage their disentanglement. However, we find that if the dataset contains only a few high-level factors, such as the 3D Chairs dataset [84] shown in Appendix G, it is hard to find disentangled and explainable high-level representations, because of the redundant nature of the maximal encoding in flow-based models. Incorporating information-theoretic criteria to disentangle high-level representations will be an interesting future direction.

Acknowledgements
H Y H and Y Z Y are supported by a UC Hellman Fellowship. H Y H is also grateful for the support from the Swarma-Kaifeng Project, which is sponsored by the Swarma Club and the Kaifeng Foundation. Y C and B O are supported by NSF-IIS-1718991.

Data availability statement
Our code is available at https://github.com/hongyehu/RG-Flow. The data that support the findings of this study are openly available at the following URLs:
CelebA: https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
CIFAR-10: https://www.cs.toronto.edu/kriz/cifar.html
3D Chairs: https://www.di.ens.fr/willow/research/seeing3Dchairs/

Appendix A. Details of network and training procedure

Figure A1. The details of a disentangler or decimator in our network.
As shown in figure 3 (a), each RG step includes a disentangler (green block) and a decimator (yellow block), each of which is parameterized by a bijective neural network. The blocks in each horizontal level share parameters, so we refer to them as the same block in the discussion. In our reported experiments, disentanglers and decimators in different RG steps do not share parameters. However, we can also make them share parameters, which implements a scale-invariant coarse-graining process.
For each disentangler or decimator, we split the image into 4 × 4 patches as shown in figure 3 (b), stack them along the batch dimension, feed them into the network, and merge the output patches into a new image. After each RG step, the edge length of the image is halved (the number of black dots in a row, compared to red crosses), except for the last (highest-level) RG step, which decimates all variables.
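The patch splitting and merging above amount to a pair of inverse reshape operations. A minimal numpy sketch (the function names are ours, not from the released code):

```python
import numpy as np

def split_patches(x, p=4):
    """Split an image batch (B, C, H, W) into non-overlapping p x p patches
    stacked along the batch dimension: (B * H//p * W//p, C, p, p)."""
    B, C, H, W = x.shape
    x = x.reshape(B, C, H // p, p, W // p, p)
    x = x.transpose(0, 2, 4, 1, 3, 5)   # (B, H//p, W//p, C, p, p)
    return x.reshape(-1, C, p, p)

def merge_patches(x, B, H, W, p=4):
    """Exact inverse of split_patches: reassemble patches into (B, C, H, W)."""
    C = x.shape[1]
    x = x.reshape(B, H // p, W // p, C, p, p)
    x = x.transpose(0, 3, 1, 4, 2, 5)   # (B, C, H//p, p, W//p, p)
    return x.reshape(B, C, H, W)

img = np.random.randn(2, 3, 32, 32)
patches = split_patches(img)            # shape (128, 3, 4, 4)
assert np.array_equal(merge_patches(patches, 2, 32, 32), img)
```

Because both functions are pure reshapes and transpositions, the round trip is lossless, which is required for the overall map to stay bijective.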
The choice of the bijective neural networks for the disentanglers and decimators is versatile, and the performance of RG-Flow strongly depends on them. Since tuning the expressive power of those blocks is not the focus of our work, we take the coupling layer from Real NVP [1], denoted as the RNVP block in the following, which has great expressive power and is easy to invert. Figure A1 illustrates the architecture of a disentangler or decimator in our implementation. Each RNVP block is shown as a red block. It takes a 4 × 4 image patch x as input and splits it into x_1 and x_2 using the checkerboard mask m. They are coupled to produce the output x_2' = x_2 ⊙ exp(s(x_1)) + t(x_1), where ⊙ is the element-wise product. The scale network s(·) and the translation network t(·) can be arbitrarily complex to enhance the expressive power, as long as the number of variables is the same for their input and output. Then we use x_2' to transform x_1 in the same manner to obtain x_1', and combine them to produce the output x' of the RNVP block. We use residual networks with n_res residual blocks as s(·) and t(·), shown as blue blocks in figure A1, and we choose n_res = 4 in our implementation. Each residual block has 3 linear layers with sizes 24 → 512, 512 → 512, and 512 → 24, where 24 = (1/2) × 4 × 4 × 3 is the number of variables in a masked image patch. Between linear layers, we insert SiLU activations [85], which are reported to give better results than ReLU and softplus, and whose smoothness benefits our further analysis with higher-order derivatives. We use Kaiming initialization [86] and weight normalization [87] on the linear layers.
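The affine coupling map and its exact inverse can be sketched in a few lines of numpy. The toy s and t below are hypothetical stand-ins for the residual networks; only the coupling structure is the point:

```python
import numpy as np

def checkerboard_mask(h, w):
    """Binary checkerboard mask m selecting half of the pixels of a patch."""
    return (np.indices((h, w)).sum(axis=0) % 2).astype(float)

def coupling_forward(x, s, t, mask):
    """One affine coupling step: the masked half x_1 = mask * x passes
    through unchanged and parameterizes an element-wise affine map on the
    other half, x_2' = x_2 * exp(s(x_1)) + t(x_1)."""
    x1 = x * mask
    return x1 + (1 - mask) * (x * np.exp(s(x1)) + t(x1))

def coupling_inverse(y, s, t, mask):
    """Exact inverse: recover x_2 = (y_2 - t(x_1)) * exp(-s(x_1))."""
    x1 = y * mask                      # masked half is unchanged by forward
    return x1 + (1 - mask) * ((y - t(x1)) * np.exp(-s(x1)))

# Toy element-wise networks standing in for the residual nets (illustrative).
s = lambda z: 0.1 * np.tanh(z)
t = lambda z: 0.2 * z

mask = checkerboard_mask(4, 4)
x = np.random.randn(4, 4)
y = coupling_forward(x, s, t, mask)
assert np.allclose(coupling_inverse(y, s, t, mask), x)
```

Invertibility holds for any s and t because the passive half x_1 is available unchanged on both sides of the map; this is what lets s(·) and t(·) be arbitrarily complex.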
The CelebA dataset contains rich information at different scales, including high-level information like gender and emotion, mid-level information like the shapes of eyes and noses, and low-level information like details of hair and wrinkles. Because lower-level RG steps take larger images as input, we heuristically put more parameters in them. The numbers of RNVP blocks in the RG steps are n_layer = 8, 6, 4, 2, respectively.
To preprocess the dataset, we use the aligned images from CelebA, crop a 148 × 148 patch at the center of each image, downscale the patch to 32 × 32 using bicubic downsampling, and randomly flip it horizontally.
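The preprocessing pipeline can be sketched as follows. This is a numpy illustration under the assumption that the CelebA aligned images are 218 × 178 arrays; the bicubic downscaling step would be delegated to an image library (e.g. PIL's Image.resize with bicubic resampling) and is only noted in a comment:

```python
import numpy as np

def center_crop(img, size=148):
    """Crop a size x size patch at the center of an (H, W, C) image array."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def random_hflip(img, rng):
    """Flip the image horizontally with probability 1/2."""
    return img[:, ::-1] if rng.random() < 0.5 else img

# CelebA aligned images are assumed 218 x 178 here.  After cropping, the
# 148 x 148 patch would be downscaled to 32 x 32 with bicubic resampling.
rng = np.random.default_rng(0)
img = np.zeros((218, 178, 3), dtype=np.uint8)
patch = random_hflip(center_crop(img), rng)
print(patch.shape)  # (148, 148, 3)
```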
We use the AdamW optimizer [88] with learning rate 10^-3 and weight decay 5 × 10^-5. To further stabilize training, we use gradient clipping with global norm 1. Between coupling layers, we use checkpointing [89] to reduce memory usage.
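The gradient-clipping step can be sketched as follows; this is a numpy stand-in for what torch.nn.utils.clip_grad_norm_ does with max norm 1, not the paper's training code:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so that their joint (global) L2
    norm is at most max_norm; gradients below the threshold are untouched."""
    total = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads, total

grads = [np.full(4, 3.0), np.zeros(5)]
clipped, norm = clip_by_global_norm(grads)
print(round(norm, 2))  # 6.0 -> rescaled so the clipped global norm is 1.0
```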
The maximal batch size can be set to 1024 on an Nvidia Titan RTX with the current setup, which has approximately a million parameters. In our experiments, the batch size is set to 64, and a training step takes about 1.2 seconds.

Appendix B. Details of synthetic multi-scale datasets
To illustrate RG-Flow's ability to disentangle representations at different scales and spatial positions, we propose two synthetic image datasets with multi-scale features, which we name MSDS1 and MSDS2, as shown in figure B1. Each dataset contains 10^5 images of 32 × 32 pixels. In each image, there are 16 ovals with different colors and orientations, and their positions have small random variations that deform the 4 × 4 grid. In MSDS1, all ovals in the same image have almost the same color, while their orientations are randomly distributed, so the color is a global feature and the orientations are local ones. In MSDS2, on the contrary, the orientation is a global feature and the colors are local ones. We train RG-Flow on the two datasets, with n_layer = 4 and n_res = 4, and other hyperparameters as described in Appendix A. For comparison, we also train Real NVP on those datasets, with approximately the same number of trainable parameters. The generated images are shown in figure B2, where we can see intuitively that RG-Flow has learned the characteristics of the two datasets: the ovals in each image from MSDS1 have almost the same color, and those from MSDS2 the same orientation. In contrast, Real NVP fails to capture the global and local features of those datasets. The metrics of bits per dimension (BPD) and Fréchet Inception distance (FID) are listed in table B1. Note that FID may not reflect semantic properties well for such synthetic datasets. The results also show that RG-Flow with the Laplacian prior captures the disentangled representations better than that with the Gaussian prior, because of the sparsity of the Laplacian distribution, which is discussed further in Appendix E.

Figure E1. Two-dimensional pinwheel distribution.
learn the target distribution. In the second column, we sample 100 points and color them by the quadrant of the prior distribution in which they live. Then we map them to the observable space using the trained model, as shown in the third column. We see that the four quadrants of the latent space are approximately mapped to the four legs of the pinwheel in the observable space if the Laplacian prior is used. In the case of the Gaussian prior, since it has rotational symmetry, the points in different quadrants are mixed more in the observable space, which makes the mapping harder to explain. The quantitative comparison between the Laplacian and Gaussian priors on the MSDS datasets is reported in table B1.

Appendix F. Details of inpainting experiments
For the inpainting experiments shown in figure 8, we randomly choose a corrupted region of 10 × 10 pixels on the ground-truth image, marked as the red patch in the second row of figure 8. We generate an image x_g from latent variables z_g and use its corresponding region to fill in the corrupted region. Then we map the filled image x_f back to the latent variables z_f and compute its log-likelihood. To recover the ground-truth image, we optimize z_g to maximize the log-likelihood. For RG-Flow, we only vary the latent variables living inside the inference causal cone, which are about 1200 out of 3072 latent variables. For the constrained Real NVP, we randomly pick the same number of latent variables and allow them to be optimized, and we find that it generally fails to inpaint the image. As a check, we find that Real NVP can successfully inpaint the images if we optimize all latent variables, as shown in the last row of figure 8.
We use the Adam optimizer for the optimization. During the optimization procedure, we find that the optimizer can be trapped in local minima. Therefore, for all experiments, we first randomly draw 200 initial samples of the latent variables that are allowed to be optimized, then pick the one with the largest log-likelihood as the initialization.
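The random-restart initialization can be sketched as follows. For a runnable illustration we score candidates by the Laplacian prior log-density alone; in the real experiment the score is the full model log-likelihood of the filled image, and the function names here are ours:

```python
import numpy as np

def laplace_logp(z):
    """Log-density of the standard Laplacian prior, summed over all latents."""
    return float(-np.abs(z).sum() - z.size * np.log(2.0))

def init_free_latents(z_fixed, free_idx, n_restarts=200, rng=None):
    """Random-restart initialization: resample only the latents inside the
    inference causal cone (indexed by free_idx), keep the rest fixed, and
    return the candidate with the largest log-likelihood as the starting
    point for gradient-based optimization."""
    rng = rng if rng is not None else np.random.default_rng(0)
    best, best_lp = None, -np.inf
    for _ in range(n_restarts):
        z = z_fixed.copy()
        z[free_idx] = rng.laplace(size=len(free_idx))
        lp = laplace_logp(z)
        if lp > best_lp:
            best, best_lp = z, lp
    return best, best_lp

# E.g. 1200 free latents inside the causal cone, out of 3072 in total.
z0, lp0 = init_free_latents(np.zeros(3072), np.arange(1200))
```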
To quantitatively evaluate the quality of the inpainted images, we compute their peak signal-to-noise ratio (PSNR) against the ground-truth images, and take the average over the 15 samples shown in figure 8. To further incorporate semantic properties in the evaluation, we also use the Inception-v3 network [90]
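The PSNR metric is the standard definition (not specific to our implementation) and can be computed as:

```python
import numpy as np

def psnr(img, ref, peak=255.0):
    """Peak signal-to-noise ratio (in dB) between an inpainted image and
    the ground truth: 10 * log10(peak^2 / MSE)."""
    mse = np.mean((img.astype(float) - ref.astype(float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

a = np.zeros((8, 8), dtype=np.uint8)
b = np.full((8, 8), 16, dtype=np.uint8)
print(round(psnr(b, a), 2))  # MSE = 256, so 10 * log10(255^2 / 256) ≈ 24.05
```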

Appendix G. Experiments on other datasets
For the CIFAR-10 dataset [91], we use the same hyperparameters as described in Appendix A. For the 3D Chairs dataset [84], we use n_layer = 8 for all RG steps, because there is less low-level information than in CelebA and CIFAR-10. We also train Real NVP on those datasets for comparison, with approximately the same number of trainable parameters. The metrics of bits per dimension (BPD) and Fréchet Inception distance (FID) are listed in table G1.

Table G1. Bits per dimension (BPD) from RG-Flow and Real NVP trained on various datasets.

Figure G1. Samples from RG-Flow trained on the CelebA dataset. We use T = 0.9 when sampling.