
Stabilizing invertible neural networks using mixture models

Published 12 July 2021 © 2021 IOP Publishing Ltd
Citation: Paul Hagemann and Sebastian Neumayer 2021 Inverse Problems 37 085002, DOI 10.1088/1361-6420/abe928


Abstract

In this paper, we analyze the properties of invertible neural networks, which provide a way of solving inverse problems. Our main focus lies on investigating and controlling the Lipschitz constants of the corresponding inverse networks. Without such a control, numerical simulations are prone to errors and not much is gained against traditional approaches. Fortunately, our analysis indicates that changing the latent distribution from a standard normal one to a Gaussian mixture model resolves the issue of exploding Lipschitz constants. Indeed, numerical simulations confirm that this modification leads to significantly improved sampling quality in multimodal applications.


1. Introduction

Reconstructing parameters of physical models is an important task in science. Usually, such problems are severely underdetermined and sophisticated reconstruction techniques are necessary. Whereas classical regularisation methods focus on finding just the most desirable or most likely solution of an inverse problem, more recent methods aim at analyzing the complete distribution of possible parameters. In particular, this provides us with a way to quantify how trustworthy the obtained solution is. Among the most popular methods for uncertainty quantification are Bayesian methods [17], which build on evaluating the posterior via Bayes' theorem, and Markov chain Monte Carlo (MCMC) [41]. Over the last years, these classical frameworks and questions have been combined with neural networks (NNs). To give a few examples, the Bayesian framework has been extended to NNs [1, 38], auto-encoders have been combined with MCMC [9] and generative adversarial networks have been applied for sampling from conditional distributions [40]. Another approach in this direction relies on so-called invertible neural networks (INNs) [4, 7, 14, 32] for sampling from the conditional distribution. Such models have been successfully applied for stellar parameter estimation [36], conditional image generation [5] and polyphonic music transcription [29].

Our main goal in this article is to deepen the mathematical understanding of INNs. Let us briefly describe the general framework. Assume that the random vectors $(X,Y):{\Omega}\to {\mathbb{R}}^{d}{\times}{\mathbb{R}}^{n}$ are related via f(X) ≈ Y for some forward mapping $f:{\mathbb{R}}^{d}\to {\mathbb{R}}^{n}$. Clearly, this setting includes the case of noisy measurements. In practice, we often only have access to the values of Y and want to solve the corresponding inverse problem. To this end, our aim is to find an invertible mapping $T:{\mathbb{R}}^{d}\to {\mathbb{R}}^{n}{\times}{\mathbb{R}}^{k}$ such that

  • Ty (X) ≈ Y, i.e., that Ty is an approximation of f;
  • T−1(y, Z) with $Z\sim \mathcal{N}(0,{I}_{k})$, k = d − n, is distributed according to P(X|Y)(y, ⋅).

Throughout this paper, the mapping T is modeled as an INN, i.e., as an NN with an explicitly invertible structure. Therefore, no approximations are necessary and evaluating the backward pass is as simple as the forward pass. Here, Z should account for the information loss when evaluating the forward model f. In other words, Z parameterizes the ill-posed part of the forward problem f.

Our theoretical contribution is twofold. First, we generalize the consistency result for perfectly trained INNs given in [4] toward latent distributions Z depending on the actual measurement y. Here, we want to emphasize that our proof does not require any regularity properties for the distributions. Next, we analyze the behavior of Lip(T−1) for a homeomorphism T that approximately maps a multimodal distribution to the standard normal distribution. More precisely, we obtain that Lip(T−1) explodes as the amount of mass between the modes and the approximation error decrease. In particular, this implies for INNs that using $Z\sim \mathcal{N}(0,{I}_{k})$ can lead to instabilities of the inverse map for multimodal problems. As the results are formulated in a very general setting, they could also be of interest for generative models, which include variational auto-encoders [33] and generative adversarial networks (GANs) [20]. Another interesting work that analyzes the Lipschitz continuity of different INN architectures is [8], where the effects of numerical non-invertibility are also illustrated.

The performed numerical experiments confirm our theoretical findings, i.e., that $Z\sim \mathcal{N}(0,{I}_{k})$ is unsuitable for recovering multimodal distributions. Note that this is indeed a well-known problem in generative modeling [26]. Consequently, we propose to replace $Z\sim \mathcal{N}(0,{I}_{k})$ by a Gaussian mixture model (GMM) with learnable probabilities that depend on the measurements y. In some sense, this enables the model to decide how many modes it wants to use in the latent space. Therefore, splitting or merging distributions should not be necessary anymore. Such an approach has led to promising results for GANs [24, 30] and variational auto-encoders [13]. In order to learn the probabilities corresponding to the modes, we employ the Gumbel trick [28, 39] and an L2-penalty on the network weights, enforcing the Lipschitz continuity of the INN. More precisely, the L2-penalty ensures that splitting or merging distributions is unattractive. Similar approaches have been successfully applied for generative modeling, see [13, 15].

1.1. Outline

We recall the necessary theoretical foundations in section 2. Then, we introduce the continuous optimization problem in section 3 and discuss its theoretical properties. In particular, we investigate the effect of different latent space distributions on the Lipschitz constants of an INN. The training procedure for our proposed model is discussed in section 4. For minimizing over the probabilities of the GMM, we employ the Gumbel trick. In section 5, we demonstrate the efficiency of our model on three inverse problems that exhibit a multimodal structure. Conclusions are drawn in section 6.

2. Preliminaries

2.1. Probability theory and conditional expectations

First, we give a short introduction to the required probability theoretic foundations based on [2, 6, 16, 34]. Let $({\Omega},\mathfrak{A},P)$ be a probability space and ${\mathbb{R}}^{m}$ be equipped with the Borel σ-algebra $\mathcal{B}({\mathbb{R}}^{m})$. A random vector is a measurable mapping $X:{\Omega}\to {\mathbb{R}}^{m}$ with distribution ${P}_{X}\in \mathcal{P}({\mathbb{R}}^{m})$ defined by PX (A) = P(X−1(A)) for $A\in \mathcal{B}({\mathbb{R}}^{m})$. For any random vector X, there exists a smallest σ-algebra on Ω such that X is measurable, which is denoted by σ(X). Two random vectors X and Y are called equal in distribution if PX = PY and we write $X\overset{d}{=}Y$. The space $\mathcal{P}({\mathbb{R}}^{m})$ is equipped with the topology of weak convergence, i.e., a sequence ${({\mu }_{k})}_{k}\subset \mathcal{P}({\mathbb{R}}^{m})$ converges weakly to μ, written μk ⇀ μ, if

${\int }_{{\mathbb{R}}^{m}}f\enspace \mathrm{d}{\mu }_{k}\to {\int }_{{\mathbb{R}}^{m}}f\enspace \mathrm{d}\mu \quad \text{for all}\enspace f\in {\mathcal{C}}_{b}({\mathbb{R}}^{m}),$
where ${\mathcal{C}}_{b}({\mathbb{R}}^{m})$ denotes the space of continuous and bounded functions. The support of a probability measure μ is the closed set

$\mathrm{supp}(\mu ){:=}\left\{x\in {\mathbb{R}}^{m}:\mu (U){ >}0\enspace \text{for every open neighborhood}\enspace U\enspace \text{of}\enspace x\right\}.$
If there exists ${\rho }_{\mu }:{\mathbb{R}}^{m}\to \mathbb{R}$ with μ(A) = ∫A ρμ  dx for all $A\in \mathcal{B}({\mathbb{R}}^{m})$, this ρμ is called the probability density of μ. For $\mu \in \mathcal{P}({\mathbb{R}}^{m})$ and 1 ⩽ p < ∞, let ${L}^{p}({\mathbb{R}}^{m})$ be the Banach space (of equivalence classes) of complex-valued functions with norm

${\Vert}f{{\Vert}}_{{L}^{p}}{:=}{\left({\int }_{{\mathbb{R}}^{m}}\vert f(x){\vert }^{p}\enspace \mathrm{d}\mu (x)\right)}^{1/p}.$
In a similar manner, we denote the Banach space of (equivalence classes of) random vectors X satisfying $\mathbb{E}(\vert X{\vert }^{p}){< }\infty $ by ${L}^{p}({\Omega},\mathfrak{A},P)$. The mean of a random vector X is defined as $\mathbb{E}(X)={(\mathbb{E}({X}_{1}),\dots ,\mathbb{E}({X}_{m}))}^{\mathrm{T}}$. Further, for any $X\in {L}^{2}({\Omega},\mathfrak{A},P)$ the covariance matrix of X is defined by $\mathrm{Cov}(X)={(\mathrm{Cov}({X}_{i},{X}_{j}))}_{i,j=1}^{m}$, where $\mathrm{Cov}({X}_{i},{X}_{j})=\mathbb{E}(({X}_{i}-\mathbb{E}({X}_{i}))({X}_{j}-\mathbb{E}({X}_{j})))$.

For a measurable function $T:{\mathbb{R}}^{m}\to {\mathbb{R}}^{n}$, the push-forward measure of $\mu \in \mathcal{P}({\mathbb{R}}^{m})$ by T is defined as T# μ := μ ∘ T−1. A measurable function f on ${\mathbb{R}}^{n}$ is integrable with respect to ν := T# μ if and only if the composition f ∘ T is integrable with respect to μ, i.e.,

${\int }_{{\mathbb{R}}^{n}}f\enspace \mathrm{d}{T}_{\#}\mu ={\int }_{{\mathbb{R}}^{m}}f{\circ}T\enspace \mathrm{d}\mu .$
The product measure $\mu \otimes \nu \in \mathcal{P}({\mathbb{R}}^{m}{\times}{\mathbb{R}}^{n})$ of $\mu \in \mathcal{P}({\mathbb{R}}^{m})$ and $\nu \in \mathcal{P}({\mathbb{R}}^{n})$ is the unique measure satisfying μν(A × B) = μ(A)ν(B) for all $A\in \mathcal{B}({\mathbb{R}}^{m})$ and $B\in \mathcal{B}({\mathbb{R}}^{n})$.

If $X\in {L}^{1}({\Omega},\mathfrak{A},P)$ and $Y:{\Omega}\to {\mathbb{R}}^{n}$ is a second random vector, then there exists a unique σ(Y)-measurable random vector $\mathbb{E}(X\vert Y)$ called conditional expectation of X given Y with the property that

$\mathbb{E}({1}_{C}X)=\mathbb{E}\left({1}_{C}\enspace \mathbb{E}(X\vert Y)\right)\quad \text{for all}\enspace C\in \sigma (Y).$
Further, there exists a PY -almost surely unique, measurable mapping $g:{\mathbb{R}}^{n}\to {\mathbb{R}}^{m}$ such that $\mathbb{E}(X\vert Y)=g{\circ}Y$. Using this mapping g, we define the conditional expectation of X given Y = y as $\mathbb{E}(X\vert Y=y)=g(y)$. For $A\in \mathfrak{A}$ the conditional probability of A given Y = y is defined by $P(A\vert Y=y)=\mathbb{E}({1}_{A}\vert Y=y)$. More generally, regular conditional distributions are defined in terms of Markov kernels, i.e., there exists a mapping ${P}_{(X\vert Y)}:{\mathbb{R}}^{n}{\times}\mathcal{B}({\mathbb{R}}^{m})\to \mathbb{R}$ such that

  • (a)  
    P(X|Y)(y, ⋅) is a probability measure on $\mathcal{B}({\mathbb{R}}^{m})$ for every $y\in {\mathbb{R}}^{n}$,
  • (b)  
    P(X|Y)(⋅, A) is measurable for every $A\in \mathcal{B}({\mathbb{R}}^{m})$ and for all $A\in \mathcal{B}({\mathbb{R}}^{m})$ it holds
    ${P}_{(X\vert Y)}(y,A)=P({X}^{-1}(A)\vert Y=y)\quad \text{for}\enspace {P}_{Y}\text{-a.e.}\enspace y\in {\mathbb{R}}^{n}.\qquad (1)$

For any two such kernels P(X|Y) and Q(X|Y) there exists $B\in \mathcal{B}({\mathbb{R}}^{n})$ with PY (B) = 1 such that for all yB and $A\in \mathcal{B}({\mathbb{R}}^{m})$ it holds P(X|Y)(y, A) = Q(X|Y)(y, A). Finally, let us conclude with two useful properties, which we need throughout this paper.

  • (a)  
    For random vectors $X:{\Omega}\to {\mathbb{R}}^{m}$ and $Y:{\Omega}\to {\mathbb{R}}^{n}$ and measurable $f:{\mathbb{R}}^{n}\to {\mathbb{R}}^{m}$ with $f(Y)X\in {L}^{1}({\Omega},\mathfrak{A},P)$ it holds E(f(Y)X|Y) = f(Y)E(X|Y), where the products are meant component-wise. In particular, this implies for $y\in {\mathbb{R}}^{n}$ that
    $\mathbb{E}(f(Y)X\vert Y=y)=f(y)\enspace \mathbb{E}(X\vert Y=y).\qquad (2)$
  • (b)  
    For random vectors $X:{\Omega}\to {\mathbb{R}}^{m}$, $Y:{\Omega}\to {\mathbb{R}}^{n}$ and a measurable map $T:{\mathbb{R}}^{m}\to {\mathbb{R}}^{p}$ it holds
    ${P}_{(T{\circ}X\vert Y)}(y,\cdot )={T}_{\#}{P}_{(X\vert Y)}(y,\cdot )\quad \text{for}\enspace {P}_{Y}\text{-a.e.}\enspace y\in {\mathbb{R}}^{n}.\qquad (3)$

2.2. Discrepancies

Next, we introduce a way to quantify the distance between two probability measures. To this end, choose a measurable, symmetric and bounded function $K:{\mathbb{R}}^{m}{\times}{\mathbb{R}}^{m}\to \mathbb{R}$ that is integrally strictly positive definite, i.e., for any $f\ne 0\in {L}^{2}({\mathbb{R}}^{m})$ it holds

${\int }_{{\mathbb{R}}^{m}}{\int }_{{\mathbb{R}}^{m}}K(x,y)f(x)f(y)\enspace \mathrm{d}x\enspace \mathrm{d}y{ >}0.$
Note that this property is indeed stronger than K being strictly positive definite.

Then, the corresponding discrepancy ${\mathcal{D}}_{K}(\mu ,\nu )$ is defined via

${\mathcal{D}}_{K}^{2}(\mu ,\nu )={\int }_{{\mathbb{R}}^{m}}{\int }_{{\mathbb{R}}^{m}}K(x,y)\enspace \mathrm{d}\mu (x)\mathrm{d}\mu (y)-2{\int }_{{\mathbb{R}}^{m}}{\int }_{{\mathbb{R}}^{m}}K(x,y)\enspace \mathrm{d}\mu (x)\mathrm{d}\nu (y)+{\int }_{{\mathbb{R}}^{m}}{\int }_{{\mathbb{R}}^{m}}K(x,y)\enspace \mathrm{d}\nu (x)\mathrm{d}\nu (y).\qquad (4)$

Due to our requirements on K, the discrepancy ${\mathcal{D}}_{K}$ is a metric, see [43]. In particular, it holds ${\mathcal{D}}_{K}(\mu ,\nu )=0$ if and only if μ = ν. In the following remark, we require the space ${C}_{0}({\mathbb{R}}^{m})$ of continuous functions decaying to zero at infinity.

Remark 1. Let K(x, y) = ψ(xy) with $\psi \in {C}_{0}({\mathbb{R}}^{m})\cap {L}^{1}({\mathbb{R}}^{m})$ and assume there exists an $l\in \mathbb{N}$ such that

where $\hat{\psi }$ denotes the Fourier transform of ψ. Then, the discrepancy ${\mathcal{D}}_{K}$ metrizes the weak topology on $\mathcal{P}({\mathbb{R}}^{m})$, see [43].

2.3. Optimal transport

For our theoretical investigations, Wasserstein distances turn out to be a more appropriate metric than discrepancies. This is mainly due to the fact that spatial distance is directly encoded into this metric. Actually, both concepts can be related to each other under relatively mild conditions, see also remark 1. A more detailed overview on the topic can be found in [3, 42]. Let $\mu ,\nu \in \mathcal{P}({\mathbb{R}}^{m})$ and $c\in C({\mathbb{R}}^{m}{\times}{\mathbb{R}}^{m})$ be a non-negative and continuous function. Then, the Kantorovich problem of optimal transport (OT) reads

$\mathrm{OT}(\mu ,\nu ){:=}\underset{\pi \in {\Pi}(\mu ,\nu )}{\mathrm{min}}{\int }_{{\mathbb{R}}^{m}{\times}{\mathbb{R}}^{m}}c(x,y)\enspace \mathrm{d}\pi (x,y),\qquad (5)$

where Π(μ, ν) denotes the weakly compact set of joint probability measures π on ${\mathbb{R}}^{m}{\times}{\mathbb{R}}^{m}$ with marginals μ and ν. Recall that the OT functional $\pi {\mapsto}{\int }_{{\mathbb{R}}^{m}{\times}{\mathbb{R}}^{m}}c\enspace \mathrm{d}\pi $ is weakly lower semi-continuous, so that (5) has a solution, and every such minimizer $\hat{\pi }$ is called an OT plan. Note that OT plans are in general not unique.

For $\mu ,\nu \in {\mathcal{P}}_{p}({\mathbb{R}}^{m}){:=}\left\{\mu \in \mathcal{P}({\mathbb{R}}^{m}){:\int }_{{\mathbb{R}}^{m}}\vert x{\vert }^{p}\enspace \mathrm{d}\mu (x){< }\infty \right\}$, p ∈ [1, ∞), the p-Wasserstein distance Wp is defined by

${W}_{p}(\mu ,\nu ){:=}{\left(\underset{\pi \in {\Pi}(\mu ,\nu )}{\mathrm{min}}{\int }_{{\mathbb{R}}^{m}{\times}{\mathbb{R}}^{m}}\vert x-y{\vert }^{p}\enspace \mathrm{d}\pi (x,y)\right)}^{1/p}.\qquad (6)$

Recall that this defines a metric on ${\mathcal{P}}_{p}({\mathbb{R}}^{m})$, which satisfies Wp (μm, μ) → 0 if and only if ${\int }_{{\mathbb{R}}^{m}}\vert x{\vert }^{p}\enspace \mathrm{d}{\mu }_{m}(x){\to \int }_{{\mathbb{R}}^{m}}\vert x{\vert }^{p}\enspace \mathrm{d}\mu (x)$ and μm ⇀ μ. If $\mu ,\nu \in {\mathcal{P}}_{p}({\mathbb{R}}^{m})$ and at least one of them has a density with respect to the Lebesgue measure, then (6) has a unique solution for any p ∈ (1, ∞). For $\mu ,\nu \in \mathcal{P}({\mathbb{R}}^{m})$ and sets $A,B\subset {\mathbb{R}}^{m}$, we estimate

Equation (7)

Remark 2. As mentioned in remark 1, ${\mathcal{D}}_{K}$ metrizes the weak convergence under certain conditions. Further, convergence of the pth moments always holds if all measures are supported on a compact set. In this case, weak convergence, convergence in the Wasserstein metric and convergence in discrepancy are all equivalent. As these requirements are usually fulfilled in practice, most theoretical properties of INNs based on W1 carry over to ${\mathcal{D}}_{K}$.
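
For intuition, the following minimal sketch (not taken from the paper) computes W1 between two one-dimensional empirical measures with SciPy; for multivariate samples one would resort to a dedicated OT package instead.

```python
# Minimal sketch (not from the paper): W_1 between two empirical measures on the
# real line. For multivariate samples a dedicated OT library would be used instead.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
mu_samples = rng.normal(0.0, 1.0, size=5000)                  # standard normal samples
nu_samples = np.concatenate([rng.normal(-1.0, 0.1, 2500),     # bimodal mixture samples
                             rng.normal(1.0, 0.1, 2500)])

# the 1-Wasserstein distance between the two empirical measures
print(wasserstein_distance(mu_samples, nu_samples))
```

In higher dimensions, evaluating W1 exactly requires solving the linear program (5), which is one reason why the computationally cheaper discrepancies of section 2.2 are used in the numerical part.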

2.4. Invertible neural networks

Throughout this paper, an INN $T:{\mathbb{R}}^{d}\to {\mathbb{R}}^{n}{\times}{\mathbb{R}}^{k}$, k = d − n, is constructed as composition $T={T}_{L}{\circ}{P}_{L}{\circ}\cdots {\circ}{T}_{1}{\circ}{P}_{1}$, where the ${({P}_{l})}_{l=1}^{L}$ are random (but fixed) permutation matrices and the Tl are defined by

${T}_{l}(x)=({v}_{1},{v}_{2}),\qquad {v}_{1}={x}_{1}\odot \mathrm{exp}\left({s}_{l,2}({x}_{2})\right)+{t}_{l,2}({x}_{2}),\qquad {v}_{2}={x}_{2}\odot \mathrm{exp}\left({s}_{l,1}({v}_{1})\right)+{t}_{l,1}({v}_{1}),\qquad (8)$

where x is split arbitrarily into x1 and x2 (for simplicity of notation we assume that ${x}_{1}\in {\mathbb{R}}^{n}$ and ${x}_{2}\in {\mathbb{R}}^{k}$) and ⊙ denotes the Hadamard product, see [4]. Note that there is a computational precedence in this construction, i.e., the computation of v2 requires the knowledge of v1. The continuous functions ${s}_{l,1}:{\mathbb{R}}^{n}\to {\mathbb{R}}^{k}$, ${s}_{l,2}:{\mathbb{R}}^{k}\to {\mathbb{R}}^{n}$, ${t}_{l,1}:{\mathbb{R}}^{n}\to {\mathbb{R}}^{k}$ and ${t}_{l,2}:{\mathbb{R}}^{k}\to {\mathbb{R}}^{n}$ are NNs and do not need to be invertible themselves. In order to emphasize the dependence of the INN on the parameters u of these NNs, we use the notation T(⋅; u). The inverse of Tl is explicitly given by

${T}_{l}^{-1}(v)=({x}_{1},{x}_{2}),\qquad {x}_{2}=\left({v}_{2}-{t}_{l,1}({v}_{1})\right)\oslash \mathrm{exp}\left({s}_{l,1}({v}_{1})\right),\qquad {x}_{1}=\left({v}_{1}-{t}_{l,2}({x}_{2})\right)\oslash \mathrm{exp}\left({s}_{l,2}({x}_{2})\right),$
where ⊘ denotes the point-wise quotient. Hence, the whole INN is invertible. Clearly, T and T−1 are both continuous maps.
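
To make the construction concrete, here is a minimal PyTorch sketch of one block of the form (8) together with its explicit inverse. It assumes the exponential parameterization of the scaling as in (8) and small illustrative subnetworks; the architectures and split sizes used in the paper's experiments differ.

```python
# Minimal PyTorch sketch of one coupling block of the form (8) and its inverse.
# The subnetworks are small illustrative MLPs; the split sizes n, k are placeholders.
import torch
import torch.nn as nn

def mlp(d_in, d_out, width=64):
    return nn.Sequential(nn.Linear(d_in, width), nn.ReLU(), nn.Linear(width, d_out))

class CouplingBlock(nn.Module):
    def __init__(self, n=2, k=2):
        super().__init__()
        self.n = n
        self.s1, self.t1 = mlp(n, k), mlp(n, k)   # maps from the first part (dim n)
        self.s2, self.t2 = mlp(k, n), mlp(k, n)   # maps from the second part (dim k)

    def forward(self, x):
        x1, x2 = x[:, :self.n], x[:, self.n:]
        v1 = x1 * torch.exp(self.s2(x2)) + self.t2(x2)   # depends only on x2
        v2 = x2 * torch.exp(self.s1(v1)) + self.t1(v1)   # depends on v1 (precedence)
        return torch.cat([v1, v2], dim=1)

    def inverse(self, v):
        v1, v2 = v[:, :self.n], v[:, self.n:]
        x2 = (v2 - self.t1(v1)) * torch.exp(-self.s1(v1))   # point-wise quotient
        x1 = (v1 - self.t2(x2)) * torch.exp(-self.s2(x2))
        return torch.cat([x1, x2], dim=1)

# sanity check: the backward pass exactly undoes the forward pass
block = CouplingBlock()
x = torch.randn(8, 4)
assert torch.allclose(block.inverse(block(x)), x, atol=1e-4)
```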

Other architectures for constructing INNs include the real NVP architecture introduced in [14], where the layers have the simplified form

${v}_{1}={x}_{1},\qquad {v}_{2}={x}_{2}\odot \mathrm{exp}\left(s({x}_{1})\right)+t({x}_{1}),$
or an approach that builds on making residual NNs invertible, see [7]. Compared to real NVP layers, the chosen approach allows more representational freedom in the first component. Unfortunately, this comes at the cost of harder Jacobian evaluations, which are not needed in this paper though. Invertible residual NNs have the advantage that they come with sharp Lipschitz bounds. In particular, they rely on spectral normalisation in each layer, effectively bounding the Lipschitz constant. Further, there are neural ODEs [10], where a black-box ODE solver is used in order to simulate the output of a continuous-depth residual NN, and the GLOW [32] architecture, where the Pl are no longer fixed as for real NVP and hence become learnable parameters.

3. Analyzing the continuous optimization problem

Given a random vector $(X,Y):{\Omega}\to {\mathbb{R}}^{d}{\times}{\mathbb{R}}^{n}$ with realizations ${({x}_{i},{y}_{i})}_{i=1}^{N}$, our aim is to recover the regular conditional distribution P(X|Y)(y, ⋅) for arbitrary y based on the samples. For this purpose, we want to find a homeomorphism $T:{\mathbb{R}}^{d}\to {\mathbb{R}}^{n}{\times}{\mathbb{R}}^{k}$ such that for some fixed $Z:{\Omega}\to {\mathbb{R}}^{k}$, k = d − n, and any realization y of Y, the expression T−1(y, Z) is approximately P(X|Y=y)-distributed, see [4]. Note that the authors of [4] assume $Z\sim \mathcal{N}(0,{I}_{k})$, but in fact the approach does not rely on this particular choice. Here, we model T as a homeomorphism with parameters u, for example an INN, and choose the best candidate by optimizing over the parameters. To quantify the reconstruction quality in dependence on u, three loss functions are constructed according to the following two principles:

  • (a)  
    First, the data should be matched, i.e., Ty (X; u) ≈ Y, where Ty denotes the output part corresponding to Y. For inverse problems, this is equivalent to learning the forward map. Here, the fit between Ty (X; u) and Y is quantified by the squared L2-norm ${L}_{y}(u){:=}\mathbb{E}\left(\vert {T}_{y}(X;u)-Y{\vert }^{2}\right)$.
  • (b)  
    Second, the distributions of X and (Y, Z) should be close to the respective distributions T−1(Y, Z; u) and T(X; u). To quantify this, we use a metric D on the space of probability measures and end up with the loss functions
    ${L}_{z}(u){:=}D\left({P}_{(Y,Z)},{P}_{T(X;u)}\right)\quad \text{and}\quad {L}_{x}(u){:=}D\left({P}_{X},{P}_{{T}^{-1}(Y,Z;u)}\right).\qquad (9)$
    Actual choices for D are discussed in section 4.

In practice, we usually minimize a conic combination of these objective functions, i.e., for some $\lambda =({\lambda }_{1},{\lambda }_{2})\in {\mathbb{R}}_{{\geqslant}0}^{2}$ we solve

$\underset{u}{\mathrm{min}}\enspace {L}_{y}(u)+{\lambda }_{1}{L}_{z}(u)+{\lambda }_{2}{L}_{x}(u).$
First, note that Ly (u) = 0 ⇔ Ty (X; u) = Y a.e. and, since D is a metric, ${L}_{z}(u)=0\enspace {\Leftrightarrow}\enspace T(X;u)\overset{d}{=}(Y,Z)$ as well as ${L}_{x}(u)=0\enspace {\Leftrightarrow}\enspace {T}^{-1}(Y,Z;u)\overset{d}{=}X$.
Further, the following remark indicates that controlling the Lipschitz constant of T−1 also enables us to estimate the Lx loss in terms of Lz .

Remark 3. Note that it holds ${P}_{{T}^{-1}(Y,Z;u)}={{T}^{-1}}_{\#}{P}_{(Y,Z)}$ and PT(X;u) = T# PX . Let $\hat{\pi }$ denote an OT plan for W1(P(Y,Z), PT(X;u)). If T−1 is Lipschitz continuous, we can estimate

${W}_{1}\left({P}_{X},{P}_{{T}^{-1}(Y,Z;u)}\right){\leqslant}{\int }_{{\mathbb{R}}^{n+k}{\times}{\mathbb{R}}^{n+k}}\vert {T}^{-1}(v)-{T}^{-1}(w)\vert \enspace \mathrm{d}\hat{\pi }(v,w){\leqslant}\mathrm{Lip}({T}^{-1}){\int }_{{\mathbb{R}}^{n+k}{\times}{\mathbb{R}}^{n+k}}\vert v-w\vert \enspace \mathrm{d}\hat{\pi }(v,w)=\mathrm{Lip}({T}^{-1}){W}_{1}\left({P}_{(Y,Z)},{P}_{T(X;u)}\right),$
i.e., we get Lx (u) ⩽ Lip(T−1)Lz (u) when choosing D = W1.

Under the assumption that all three loss functions are zero, the authors of [4] proved that T−1(y, Z) ∼ P(X|Y)(y, ⋅). In other words, T can be used to efficiently sample from the true posterior distribution P(X|Y)(y, ⋅). However, their proof requires that the distributions of the random vectors X and Z have densities and that Z is independent of Y. As we are going to see in section 4, it is actually beneficial to choose Z dependent on Y. We overcome the mentioned limitations in the next theorem and provide a rigorous proof of this important result.

Theorem 1. Let $X:{\Omega}\to {\mathbb{R}}^{d}$, $Y:{\Omega}\to {\mathbb{R}}^{n}$ and $Z:{\Omega}\to {\mathbb{R}}^{k}$ be random vectors. Further, let $T=({T}_{y},{T}_{z}):{\mathbb{R}}^{d}\to {\mathbb{R}}^{n}{\times}{\mathbb{R}}^{k}$ be a homeomorphism such that $T{\circ}X\overset{d}{=}(Y,Z)$ and Ty ∘ X = Y a.e. Then, it holds

${P}_{(X\vert Y)}(y,B)=\left({\delta }_{y}\otimes {P}_{(Z\vert Y)}(y,\cdot )\right)\left(T(B)\right)$
for all $B\in \mathcal{B}({\mathbb{R}}^{d})$ and PY -a.e. $y\in {\mathbb{R}}^{n}$.

Proof. As T is a homeomorphism, it suffices to show P(X|Y)(y, T−1(B)) = (δy P(Z|Y)(y, ⋅))(B) for all $B\in \mathcal{B}({\mathbb{R}}^{n+k})$ and PY -a.e. $y\in {\mathbb{R}}^{n}$. Using (3) and integrating over y, we can equivalently show that for all $A\in \mathcal{B}({\mathbb{R}}^{n})$ and all $B\in \mathcal{B}({\mathbb{R}}^{n+k})$ it holds

As both ∫A P(TX|Y)(y, ⋅)dPY (y) and ∫A δy P(Z|Y)(y, ⋅)(⋅)dPY (y) are finite measures, we only need to consider sets of the form B = B1 × B2 with ${B}_{1}\in \mathcal{B}({\mathbb{R}}^{n})$ and ${B}_{2}\in \mathcal{B}({\mathbb{R}}^{k})$, see [34, lemma 1.42].

By (1) and since Ty X = Y a.e., we conclude

For all $C\in \sigma (Y)={Y}^{-1}(\mathcal{B}({\mathbb{R}}^{n}))$ it holds that ω ∈ C if and only if Y(ω) ∈ Y(C). As $T{\circ}X\overset{d}{=}(Y,Z)$ and Ty ∘X = Y a.e., we obtain

In particular, this implies that $E({1}_{{B}_{1}}(Y){1}_{{B}_{2}}({T}_{z}{\circ}X)\vert Y)=E({1}_{{B}_{1}}(Y){1}_{{B}_{2}}(Z)\vert Y)$ and hence also $E({1}_{{B}_{1}}(Y){1}_{{B}_{2}}({T}_{z}{\circ}X)\vert Y=y)=E({1}_{{B}_{1}}(Y){1}_{{B}_{2}}(Z)\vert Y=y)$ for PY -a.e. y. By (1) and (2), we get

which is precisely the claim. □

Under some additional assumptions, we are able to show the following error estimate based on the TV norm ${\Vert}\mu {{\Vert}}_{\text{TV}}{:=}{\mathrm{sup}}_{B\in \mathcal{B}({\mathbb{R}}^{n})}\vert \mu (B)\vert $. Note that the restrictions and conditions are still rather strong and do not entirely fit our problem. However, to the best of our knowledge, this is a first attempt in this direction.

Theorem 2. Let $X:{\Omega}\to {\mathbb{R}}^{d}$, $Y:{\Omega}\to {\mathbb{R}}^{n}$ and $Z:{\Omega}\to {\mathbb{R}}^{k}$ be random vectors such that |(Y, Z)| ⩽ r and |(Y, X)| ⩽ r almost everywhere. Further, let T be a homeomorphism with E(|Ty (X; u) − Y|2) ⩽ ε and ${\Vert}{P}_{(Y,Z)}-{P}_{(Y,{T}_{z}(X))}{{\Vert}}_{\text{TV}}{\leqslant}\delta {< }1$. Then it holds for all A with PY (A) ≠ 0 that

Proof. Due to the bijectivity of T and the triangle inequality, we may estimate

Note that {Y ∈ A} ⊂ Ω can be equipped with the measure Q := P/PY (A). Then, the random variables T(X) and (Y, Z) restricted to {Y ∈ A} have distributions QT(X) = P(T(X)|Y ∈ A) and ${Q}_{(Y,{T}_{z}(X))}={P}_{((Y,{T}_{z}(X))\vert Y\in A)}$, respectively. Hence, we may estimate by Hölder's inequality

As |(Y, Z)| ⩽ r and $\vert (Y,{T}_{z}(X))\vert {\leqslant}4r\enspace \mathrm{max}(\mathrm{Lip}(T),1)$ almost everywhere, we can use [19, theorem 4] to control the Wasserstein distance by the total variation distance and estimate

Combining the estimates implies the result. □

3.1. Instability with standard normal distributions

For numerical purposes, it is important that both T and T−1 have reasonable Lipschitz constants. Here, we specify a case where using $Z\sim \mathcal{N}(0,{I}_{k})$ as latent distribution leads to exploding Lipschitz constants of T−1. Note that our results are in a similar spirit as [12, 27], where also the Lipschitz constants of INNs are investigated. Due to its useful theoretical properties, we choose W1 as metric D throughout this paragraph. Clearly, the established results have similar implications for other metrics that induce the same topology. The following general proposition is a first step toward our desired result.

Proposition 1. Let $\nu =\mathcal{N}(0,{I}_{n})\in \mathcal{P}({\mathbb{R}}^{n})$ and $\mu \in \mathcal{P}({\mathbb{R}}^{m})$ be arbitrary. Then, there exists a constant C such that for any measurable function $T:{\mathbb{R}}^{m}\to {\mathbb{R}}^{n}$ satisfying W1(T# μ, ν) ⩽ ε and disjoint sets Q, R with $\mathrm{min}\left\{\mu (Q),\mu (R)\right\}{\geqslant}\frac{1}{2}-\frac{\delta }{2}$ it holds for δ, ε ⩾ 0 small enough that

Proof. If T(Q) ∩ T(R) ≠ ∅, the claim follows immediately. In the following, we denote the volume of ${B}_{1}(0)\subset {\mathbb{R}}^{n}$ by V1 and the normalizing constant for ν by Cν . Provided that δ, ε < 1/4, we conclude for the ball Br (0) with ν(Br (0)) > 3/4 by (7) that 1/4 > dist(T(Q), Br (0))/8 and similarly for T(R). Hence, there exists a constant s independent of δ, ε and T such that dist(T(Q), 0) ⩽ s and dist(T(R), 0) ⩽ s. Possibly increasing s such that s > max(1, ln(Cν V1)), we choose the constant C as

Assume for fixed δ, ε that

If ε and δ are small enough, it holds r ⩽ s. As dist(T(Q) ∩ Bs (0), T(R) ∩ Bs (0)) > 2r, there exists $x\in {\mathbb{R}}^{n}$ such that the open ball Br (x) satisfies Br (x) ∩ T(Q) = Br (x) ∩ T(R) = ∅ and Br (x) ⊂ B2s (0). Further, we estimate

as well as dist(T(Q), Br/2(x)) ⩾ r/2 and dist(T(R), Br/2(x)) ⩾ r/2. Using (7), we obtain ${W}_{1}({T}_{\#}\mu ,\nu ){\geqslant}\frac{r}{2}{\varepsilon }^{n/(n+1)}$, leading to the contradiction

Note that the previous result is not formulated in its most general form, i.e., the mass on Q and R can be different. Further, the measure ν can be exchanged as long as the mass is concentrated around zero. Based on proposition 1, we directly obtain the following corollary.

Corollary 1. Let $\mu \in \mathcal{P}({\mathbb{R}}^{m})$ be arbitrary. Then, there exists a constant C such that for any measurable, invertible function $T:{\mathbb{R}}^{m}\to {\mathbb{R}}^{n}$ satisfying ${W}_{1}({T}_{\#}\mu ,\mathcal{N}(0,{I}_{n})){\leqslant}{\epsilon}$ and disjoint sets Q, R with $\mathrm{min}\left\{\mu (Q),\mu (R)\right\}{\geqslant}\frac{1}{2}-\frac{\delta }{2}$ it holds for δ, ε ⩾ 0 small enough that

Proof. Due to proposition 1, it holds

Now, we are finally able to state the desired result for INNs with $Z\sim \mathcal{N}(0,{I}_{k})$. In particular, the following result also holds if we condition only on a single measurement Y = y occurring with positive probability.

Corollary 2. For $A\subset {\mathbb{R}}^{n}$ with PY (A) ≠ 0 and diam(A) = ρ, set μ := ∫A PX|Y (y, ⋅)dPY (y)/PY (A). Further, choose disjoint sets Q, R with $\mathrm{min}\left\{\mu (Q),\mu (R)\right\}{\geqslant}\frac{1}{2}-\frac{\delta }{2}$. If $T:{\mathbb{R}}^{d}\to {\mathbb{R}}^{n+k}$ is a homeomorphism such that ${W}_{1}({{T}_{z}}_{\#}\mu ,\mathcal{N}(0,{I}_{k})){\leqslant}{\epsilon}$ and Ty (Q ∪ R) ⊂ A, we get for δ, ε ⩾ 0 small enough that the Lipschitz constant of T−1 satisfies

Proof. As before, Lip(T−1) can be estimated by

Equation (10)

Due to proposition 1 and since Ty (Q ∪ R) ⊂ A, we get

Inserting this into (10) implies the claim. □

Loosely speaking, the condition Ty (Q ∪ R) ⊂ A is fulfilled if the predicted labels are close to the actual labels. Furthermore, the condition that ${W}_{1}({{T}_{z}}_{\#}\mu ,\mathcal{N}(0,{I}_{k}))$ is small is enforced via the Ly and Lz losses. Consequently, we can expect that Lip(T−1) explodes during training if the true data distribution given measurements in A is multimodal.

3.2. A remedy using Gaussian mixture models

For a multimodal input data distribution, the Lipschitz constant of T−1 explodes due to the fact that a splitting of the standard normal distribution is necessary. As a remedy, we propose to replace the latent variable $Z\sim \mathcal{N}(0,{I}_{k})$ with a GMM depending on y. Using a flexible number of modes in the latent space, we can avoid the necessity to split mass when recovering a multimodal distribution in the input data space.

Formally, we choose the random vector Z (depending on y) as $Z\sim {\sum }_{i=1}^{r}\enspace {p}_{i}(y)\mathcal{N}({\mu }_{i},{\sigma }^{2}{I}_{k})$, where r is the maximal number of modes in the latent space and pi (y) are the probabilities of the different modes. These probabilities pi are realised by an NN, which we denote with $w:{\mathbb{R}}^{n}\to {{\Delta}}_{r}$. Usually, we choose σ and μi such that the modes are nicely separated. Note that w gives our model the flexibility to decide how many modes it wants to use given the data y. Clearly, we could also incorporate covariance matrices, but for simplicity we stick with scaled identity matrices. Denoting the parameters of the NN w with u2, we now wish to solve the following optimization problem

$\underset{u=({u}_{1},{u}_{2})}{\mathrm{min}}\enspace {L}_{y}(u)+{\lambda }_{1}{L}_{z}(u)+{\lambda }_{2}{L}_{x}(u).\qquad (11)$

Next, we provide an example where choosing Z as GMM indeed reduces Lip(T−1).

Example 1. Let ${X}_{\sigma }:{\Omega}\to \mathbb{R}$ and $Z:{\Omega}\to \mathbb{R}$ be random vectors. As proven in corollary 1, we get for ${X}_{\sigma }\sim 0.5\mathcal{N}(-1,{\sigma }^{2})+0.5\mathcal{N}(1,{\sigma }^{2})$ and $Z\sim \mathcal{N}(0,1)$ that for any C > 0 and ${\epsilon}$ small enough, there exists ${\sigma }_{0}^{2}{ >}0$ such that for any ${\sigma }^{2}{< }{\sigma }_{0}^{2}$ and any $T:\mathbb{R}\to \left\{0\right\}{\times}\mathbb{R}$ with ${W}_{1}({P}_{{T}^{-1}{\circ}Z},{P}_{{X}_{\sigma }}){< }{\epsilon}$ it holds that Lip(T−1) > C.

On the other hand, if we allow Z to be a GMM, then there exists for any σ2 > 0 a GMM Z and a mapping $T:\mathbb{R}\to \left\{0\right\}{\times}\mathbb{R}$ such that ${P}_{{T}^{-1}{\circ}Z}={P}_{{X}_{\sigma }}$ and $\mathrm{Lip}({T}^{-1})=1$, namely Z = Xσ and T = I. Clearly, we can also fix $Z\sim 0.5\mathcal{N}(-1,0.1)+0.5\mathcal{N}(1,0.1)$. In this case, we can find T such that ${P}_{{T}^{-1}(Z)}={P}_{{X}_{\sigma }}$ and Lip(T−1) remains bounded as σ → 0.
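
The effect in example 1 can also be checked numerically. The following sketch (an illustration, not part of the paper) considers the monotone map ${T}^{-1}={F}_{{X}_{\sigma }}^{-1}{\circ}{\Phi}$, which pushes $\mathcal{N}(0,1)$ exactly to Xσ; its derivative at z = 0 equals φ(0)/ρXσ(0) = σ exp(1/(2σ2)), so the Lipschitz constant of this particular inverse map explodes as σ → 0, in line with corollary 1.

```python
# Illustration of example 1 (not from the paper): derivative at z = 0 of the
# monotone map T^{-1} = F_{X_sigma}^{-1} o Phi, which pushes N(0, 1) to X_sigma.
# This derivative equals sigma * exp(1 / (2 sigma^2)) and blows up as sigma -> 0.
import numpy as np
from scipy.stats import norm

def monotone_map_derivative_at_zero(sigma):
    # density of X_sigma = 0.5 N(-1, sigma^2) + 0.5 N(1, sigma^2) at x = 0
    rho0 = 0.5 * norm.pdf(0.0, loc=-1.0, scale=sigma) \
         + 0.5 * norm.pdf(0.0, loc=1.0, scale=sigma)
    return norm.pdf(0.0) / rho0

for sigma in [0.5, 0.3, 0.2, 0.1]:
    print(f"sigma = {sigma:.1f}: Lip(T^-1) >= {monotone_map_derivative_at_zero(sigma):.2e}")
```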

4. Learning INNs with multimodal latent distributions

In this section, we discuss the training procedure if T is chosen as INN together with a multimodal latent distribution. For this purpose, we have to specify and discretize our generic continuous problem (11).

4.1. Model specification and discretization

First, we fix the metric D in the loss functions (9). Due to their computational efficiency, we use discrepancies ${\mathcal{D}}_{K}$ with a multiscale version of the inverse quadric kernel, i.e.,

where ri ∈ {0.05, 0.2, 0.9}. The different scales ensure that both local and global features are captured. Clearly, computationally more involved choices such as the Wasserstein metric W1 or the sliced Wasserstein distance [29] could be used as well. The relation between Wp and ${\mathcal{D}}_{K}$ is discussed in remark 2.
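
As an illustration, the following sketch evaluates an empirical squared discrepancy of the form (4) for two sample sets. The code assumes a multiscale inverse-quadric kernel of the form K(x, y) = Σi 1/(1 + |x − y|2/ri); the exact normalization used in the implementation may differ.

```python
# Sketch of an empirical squared discrepancy as in (4), using an assumed multiscale
# inverse-quadric kernel K(x, y) = sum_i 1 / (1 + |x - y|^2 / r_i). This is the
# biased estimator; an unbiased variant would drop the diagonal terms, see [22].
import torch

def discrepancy2(x, y, scales=(0.05, 0.2, 0.9)):
    """Squared discrepancy between the empirical measures of x and y, both (m, d)."""
    def k(a, b):
        d2 = torch.cdist(a, b) ** 2
        return sum(1.0 / (1.0 + d2 / r) for r in scales)
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

x = torch.randn(500, 2)
y = torch.randn(500, 2) + 2.0
print(discrepancy2(x, y).item())   # clearly positive: the samples are far apart
```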

In the previous section, we analyzed the problem of modeling multimodal distributions with an INN in a very abstract setting by considering the data as a distribution. In practice, however, we have to sample from the random vectors X, Y and Z. The obtained samples ${({x}_{i},{y}_{i},{z}_{i})}_{i=1}^{m}$ induce empirical distributions of the form ${P}_{X}^{m}=\frac{1}{m}{\sum }_{i=1}^{m}\enspace \delta (\cdot -{x}_{i})$, which are then inserted into the loss (11). To this end, we replace ${P}_{{T}^{-1}(Y,Z;u)}$ by ${{T}^{-1}}_{\#}{P}_{(Y,Z)}^{m}$ and PT(X;u) by ${T}_{\#}{P}_{X}^{m}$, resulting in the overall problem

Equation (12)

where T depends on u1 and Z on u2. Here, the discrepancies can be easily evaluated based on (4), which is just a finite sum. Note that there are also other discretization approaches, which may have more desirable statistical properties, e.g., the unbiased version of the discrepancy outlined in [22]. For solving (12), we want to use the gradient descent based Adam optimizer [31]. This approach leads to a sequence ${\left\{{T}^{m}\right\}}_{m\in \mathbb{N}}$ of INNs that approximate the true solution of the problem. Deriving the gradients with respect to u1 is a standard task for NNs. In order to derive gradients with respect to u2, we need a sampling procedure for $Z\sim {\sum }_{i=1}^{r}\enspace {p}_{i}(y)\mathcal{N}({\mu }_{i},{\sigma }^{2}{I}_{k})$ such that the samples are differentiable with respect to u2. Without the differentiability requirement, we would just draw a number l from a random vector γ with range {1, ..., r} and P(γ = i) = pi and then sample from $\mathcal{N}({\mu }_{l},{\sigma }^{2}{I}_{k})$. In the next paragraph, the Gumbel softmax trick is utilized to achieve differentiability.
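
For concreteness, the following sketch assembles one evaluation of the discretized objective (12) for a generic model T exposing forward and inverse methods (such as the coupling block sketched in section 2.4). The discrepancy repeats the kernel sketch from above; the weights λ1, λ2 and all shapes are illustrative and not the values used in the experiments.

```python
# Sketch of one evaluation of the discretized objective (12) for a generic invertible
# model T with methods forward and inverse. The discrepancy repeats the multiscale
# kernel sketch from section 4.1 so that this snippet is self-contained.
import torch

def discrepancy2(a, b, scales=(0.05, 0.2, 0.9)):
    k = lambda d2: sum(1.0 / (1.0 + d2 / r) for r in scales)
    return (k(torch.cdist(a, a) ** 2).mean() + k(torch.cdist(b, b) ** 2).mean()
            - 2.0 * k(torch.cdist(a, b) ** 2).mean())

def inn_loss(T, x, y, z, n, lambda1=1.0, lambda2=1.0):
    """x: data samples, y: measurements, z: latent samples from the (y-dependent) GMM."""
    out = T.forward(x)                                   # T(x) = (T_y(x), T_z(x))
    yz = torch.cat([y, z], dim=1)
    loss_y = ((out[:, :n] - y) ** 2).sum(dim=1).mean()   # data fit L_y
    loss_z = discrepancy2(out, yz)                       # latent fit L_z
    loss_x = discrepancy2(T.inverse(yz), x)              # backward fit L_x
    return loss_y + lambda1 * loss_z + lambda2 * loss_x
```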

4.2. Differentiable sampling according to the GMM

Recall that the distribution function of Gumbel(0, 1) random vectors is given by F(x) = exp(− exp(−x)), which allows for efficient sampling. The Gumbel reparameterization trick [28, 39] relies on the observation that if we add Gumbel(0, 1) random vectors to the logits log(pi ), then the arg max of the resulting vector has the same distribution as γ. More precisely, for independent Gumbel(0, 1) random vectors G1, ..., Gr it holds

$\underset{i=1,\dots ,r}{\mathrm{arg}\enspace \mathrm{max}}\left(\mathrm{log}({p}_{i})+{G}_{i}\right)\overset{d}{=}\gamma .$
Unfortunately, arg max leads to a non-differentiable expression. To circumvent this issue, we replace arg max with a numerically more robust version of ${\text{softmax}}_{t}:{\mathbb{R}}^{r}\to {\mathbb{R}}^{r}$ given by ${\text{softmax}}_{t}{(p)}_{i}=\mathrm{exp}(({p}_{i}-m)/t)/{\sum }_{j=1}^{r}\enspace \mathrm{exp}(({p}_{j}-m)/t)$, where m = maxi pi . This results in a probability vector that converges to a corner of the probability simplex as the temperature t converges to zero. Instead of sampling from a single component, we then sample from all components and use the corresponding linear combination to obtain a sample from Z. This procedure for differentiable sampling from the GMM is summarized in algorithm 1. In our implementation, we use t ≈ 0.1 and decrease t during training.

Algorithm 1. Differentiable sampling according to the GMM.

Input: NN w, $y\in {\mathbb{R}}^{n}$, means μi , standard deviation σ and temperature t
Output: Vector $z\in {\mathbb{R}}^{k}$ that (approximately) follows the distribution PZ|Y (y, ⋅)
1: p = w(y)
2: Generate g = g1, ..., gr independent Gumbel(0, 1) samples
3: pi = log(pi ) + gi
4: m = maxi pi
5: ${s}_{i}={\text{softmax}}_{t}{(p)}_{i}=\mathrm{exp}(({p}_{i}-m)/t)/{\sum }_{j=1}^{r}\enspace \mathrm{exp}(({p}_{j}-m)/t)$
6: $z={\sum }_{i=1}^{r}{s}_{i}{\mu }_{i}+\sigma \mathcal{N}(0,{I}_{k})$
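
A PyTorch sketch of algorithm 1 could look as follows. The probability network w, the means μi and the noise scale σ are placeholders; torch.softmax already subtracts the maximum internally, which covers the stabilization in steps 4 and 5.

```python
# PyTorch sketch of algorithm 1: differentiable sampling from the y-dependent GMM.
# The probability network w, the means and sigma are illustrative placeholders.
import torch

def sample_gmm_gumbel(w, y, means, sigma, t=0.1):
    """w maps y to r probabilities on the simplex; means has shape (r, k)."""
    p = w(y)                                                    # shape (batch, r)
    u = torch.rand_like(p).clamp(1e-9, 1.0 - 1e-9)
    g = -torch.log(-torch.log(u))                               # Gumbel(0, 1) samples
    s = torch.softmax((torch.log(p + 1e-12) + g) / t, dim=1)    # relaxed one-hot weights
    return s @ means + sigma * torch.randn(y.shape[0], means.shape[1])

# usage sketch with r = 4 modes in a k = 2 dimensional latent space
r, k = 4, 2
w = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.ReLU(),
                        torch.nn.Linear(32, r), torch.nn.Softmax(dim=1))
means = torch.tensor([[-4., -4.], [-4., 4.], [4., 4.], [4., -4.]])
z = sample_gmm_gumbel(w, torch.randn(8, 4), means, sigma=1.0)
```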

4.3. Enforcing the Lipschitz constraint

Fitting discrete and continuous distributions in a coupled process has proven to be difficult in many cases. Gaujac et al [18] encounter the problem that both models learn with different speeds and hence tend to get stuck in local minima. To avoid such problems, we enforce additional network properties, ensuring that the latent space is used in the correct way. In this subsection, we argue that an L2-penalty on the subnetwork weights of the INN prevents the Lipschitz constants from exploding. This possibly reduces the chance of reaching undesirable local minima. Further, we want modalities to be preserved, i.e., both the latent and the input data distribution should have an equal number of modes. If this is not the case, modes have to be either split or merged and hence the Lipschitz constant of the forward or backward map explodes, respectively. Consequently, if we use a GMM for Z, controlling the Lipschitz constants is a natural way of forcing the INN to use the multimodality of Z.

Fortunately, following the techniques in [8], we can simultaneously control the Lipschitz constants of INN blocks and their inverse given by

via bounds on the Lipschitz constants of the NNs s and t. To prevent exploding values of the exponential, s is clamped, i.e., we restrict the range of the output components si to [−α, α] by using ${s}_{\text{clamp}}(x)=\frac{2\alpha }{\pi }\enspace \mathrm{arctan}(s(x)/\alpha )$ instead of s, see [5, equation (7)]. Similarly, we clamp t. Note that this does not affect the invertibility of the INN. Then, we can locally estimate on any box [a, b]d that

Equation (13)

where the constant c is an upper bound of g and 1/g on the clamped range of s, c1 depends on a, b and α, and c2 depends on α. Provided that the activation is one-Lipschitz, Lip(s) showing up in (13) is bounded in terms of the respective weight matrices Ai of s by

$\mathrm{Lip}(s){\leqslant}\prod _{i}{\Vert}{A}_{i}{{\Vert}}_{2},$
which follows directly from the chain rule. This requirement is fulfilled for most activations and in particular for the often applied ReLU activation. Note that the same reasoning is applicable for Lip(t).

Now, we want to extend (13) to our network structure. To this end, note that the layer Tl defined in (8) can be decomposed as Tl = S1,l ∘ S2,l with

Incorporating the estimate (13), we can bound the Lipschitz constants for Tl by

Then, Lip(T) and Lip(T−1) of the INN T are bounded by the respective products of the bounds on the individual blocks.

To enforce the Lipschitz continuity, we add the L2-penalty ${L}_{\text{reg}}=\frac{{\lambda }_{\text{reg}}}{2}{\sum }_{i=1}^{m}{\Vert}{A}_{i}{{\Vert}}^{2}$, where λreg is the regularization parameter and ${({A}_{i})}_{i=1}^{m}$ are the weight matrices of the subnetworks ${s}_{l,i}$ and ${t}_{l,i}$ of the INN T. This is a rather soft way to enforce Lipschitz continuity of the subnetworks. Other ways are described in [23] and include weight clipping and gradient penalties. Even with the additional L2-penalty, our model can get stuck in bad local minima, and it might be worthwhile to investigate other (stronger) ways to prevent mode merging. Further, the derivation indicates that clamping the outputs of the subnetworks helps to control the Lipschitz constant. Another way to prevent merging of modes is to introduce a sparsity-promoting loss on the probabilities, namely ${({\sum }_{i}{p}_{i}^{1/2})}^{2}$.
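
The two ingredients of this subsection can be sketched in a few lines: the arctan clamping of a subnetwork output and the L2-penalty over all subnetwork weight matrices. The values of α and λreg below are illustrative and not taken from the experiments.

```python
# Sketch of the stabilization ingredients: soft clamping of a subnetwork output and
# the L2-penalty on the subnetwork weight matrices; alpha and lambda_reg are
# illustrative values.
import math
import torch

def clamp_output(s_out, alpha=1.9):
    # s_clamp = (2 alpha / pi) * arctan(s / alpha), so the output stays in (-alpha, alpha)
    return (2.0 * alpha / math.pi) * torch.atan(s_out / alpha)

def l2_penalty(subnetworks, lambda_reg=1e-4):
    # (lambda_reg / 2) * sum of squared Frobenius norms of all weight matrices A_i
    reg = sum((m.weight ** 2).sum() for net in subnetworks
              for m in net.modules() if isinstance(m, torch.nn.Linear))
    return 0.5 * lambda_reg * reg
```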

4.4. Padding

In practice, the dimension d is often different from the intrinsic dimension of the data modeled by $X:{\Omega}\to {\mathbb{R}}^{d}$. As $Z:{\Omega}\to {\mathbb{R}}^{k}$ is supposed to describe the lost information during the forward process, it appears sensible to choose k < d − n in such cases. Similarly, if $Y:{\Omega}\to {\mathbb{R}}^{n}$ has discrete range, we usually assign each label to a unit vector ei , effectively increasing the overall dimension of the problem. In both cases, d ≠ n + k and we cannot apply our framework directly. To resolve this issue, we enlarge the samples xi or (yi , zi ) with random Gaussian noise of small variance. Note that we could also use zero padding, but then we would lose some control on the invertibility of T as the training data would not span the complete space. Clearly, padding can also be used to increase the overall problem size, adding more freedom for learning certain structures. This is particularly useful for small-scale problems, where the capacity of the NNs is relatively small.
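
A minimal sketch of the noise padding described above; the target dimension and the noise scale are illustrative choices.

```python
# Sketch of the random padding: samples are enlarged with small Gaussian noise so
# that the padded training data still span the complete space.
import torch

def pad_with_noise(samples, d_pad, noise_scale=0.01):
    """Pad samples of shape (m, d) to shape (m, d_pad) with N(0, noise_scale^2) entries."""
    m, d = samples.shape
    return torch.cat([samples, noise_scale * torch.randn(m, d_pad - d)], dim=1)

x_padded = pad_with_noise(torch.randn(8, 2), d_pad=16)   # e.g., pad 2d inputs to dimension 16
```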

As they are irrelevant for the problem that we want to model, the padded components are not incorporated into our proposed loss terms. Hence, we need to ensure that the model does not use the introduced randomness in the padding components of x = (xdata, xpad) and z = (zdata, zpad) for training. To overcome this problem, we propose to use the additional loss

Here, the first term ensures that the padding part of T stays close to zero, the second one helps to learn the correct backward mapping independent of zpad and the third one penalizes the use of padding in the x-part. Note that the first two terms were already used in [4]. Clearly, we can compute gradients of the proposed loss with respect to the parameters u2 of w in the GMM as the loss depends on zi in a differentiable manner.

5. Experiments

In this section, we underline our theoretical findings with some numerical experiments. For all experiments, the networks are chosen such that they are comparable in capacity and then trained using a similar amount of time or until they stagnate. The exact architectures and obtained parameters are available at https://github.com/PaulLyonel/StabilizingInvNetsUsingMixtureModels. Our PyTorch implementation builds on the freely available FrEIA package.

5.1. Eight modes problem

For this problem, $X:{\Omega}\to {\mathbb{R}}^{2}$ is chosen as a GMM with eight modes, i.e.,

All modes are assigned one of the four colors red, blue, green and purple. Then, the task is to learn the mapping from position to color and, more interestingly, the inverse map where the task is to reconstruct the conditional distribution P(X|Y)(y, ⋅) given a color y. Two versions of the data set are visualized in figure 1, where the data points x are colored according to their label. Here, the first set is based on the color labels from [4] and the second one uses a more challenging set of labels, where modes of the same color are further away from each other. For this problem, we are going to use a two dimensional latent distribution. Hence, the problem dimensions are d = 2, n = 4 and k = 2.

Figure 1. Positions of the eight modes experiment with two different color label sets.

As d < k + n, we have to use random padding. Here, we increase the dimension to 16, i.e., both input and output are padded to increase capacity. Unfortunately, as the padding is not part of the loss terms, the argumentation in remark 3 is not applicable anymore, i.e., training on Ly and Lz is not sufficient. This observation is supported by the fact that training without Lx and Lpad yields undesirable results for the eight modes problem: the samples in figure 2(c) are created by propagating the padded data samples through the trained INN. As enforced by Lz , we obtain the standard normal distribution. However, if we replace the padded components by zero, we obtain the samples in figure 2(d), which clearly indicates that the INN is using the random padding to fit the latent distribution.

Figure 2. Effect of L2-regularization and of padding on sampling quality.

Next, we illustrate the effect of the additional L2-penalty introduced in section 4. In figure 2(b), we provide a sampled data distribution from an INN with $Z\sim \mathcal{N}(0,{I}_{2})$ trained with large λreg. The learned model is not able to capture the multimodalities, since the L2-penalty controls Lip(T−1). This confirms our theoretical findings in section 3, i.e., that splitting up the standard normal distribution in order to match the true data distribution leads to exploding Lip(T−1).

Finally, we investigate how a GMM consisting of four Gaussians with standard deviation σ = 1 and means {(−4, −4), (−4, 4), (4, 4), (4, −4)} compares against $\mathcal{N}(0,{I}_{2})$ as latent distribution. In our experiments, we investigate the predicted labels, the backward samples and the forward latent fit for the two different label sets. Our obtained results are depicted in figures 3(a)–(d), respectively. The left column in each figure contains the label prediction of the INN, which is quite accurate in all cases. More interesting are the reconstructed input data distributions in the middle columns. As indicated by our theory, the standard INN has difficulties separating the modes. Especially in figure 3(b), many red points are reconstructed between the actual modes by the standard INN. Taking a closer look, we observe that the sampled red points still form one connected component for both examples, which can be explained by the Lipschitz continuity of the inverse INN. The results based on our multimodal INN in figures 3(c) and (d) are much better, i.e., the modes are separated correctly. Additionally, the number of learned discrete probabilities larger than zero matches the number of modes of the corresponding color. Note that our model learned the correct probabilities almost perfectly, see table 1. In the right column of every figure, we included the latent samples produced by propagating the data samples through the INN. Here, we observe that the standard INN is able to perfectly fit the standard normal distribution in both cases. The latent fits for the multimodal INN also capture the GMM well, although the sizes of the modes vary a bit.

Figure 3. Predicted labels, backward samples and latent samples for standard INNs and multimodal INNs.

Table 1. Learned probabilities for the results in figure 3(c) (left) and 3(d) (right).

5.2. MNIST image inpainting

To show that our method also works for higher dimensional problems, we apply it for image inpainting. For this purpose, we restrict the MNIST dataset [37] to the digits 0 and 8. Further, the forward mapping for an image x ∈ [0, 1]28×28 is defined as f : [0, 1]28×28 → [0, 1]6×28 with x ↦ (x)22⩽i⩽27,0⩽j⩽27, i.e., the upper part of the image is removed. Note that the problem is chosen in such a way that it is unclear whether the lower part of the image should be completed to a zero or an eight. However, for certain types of inputs, it is of course more likely that they belong to one or the other. For this example, the problem dimensions are chosen as d = 784, n = 164 and k = 14, i.e., we use a 14-dimensional latent distribution. Further, the latent distribution is chosen as a GMM with two modes and σ = 0.6. Unfortunately, forcing our model to learn a good separation of the modes is non-trivial and we use the following heuristic for choosing the means of the two mixture components. We initialize the INN with random noise (of small scale) and choose the means as a scalar multiple of the forward propagations of the INN in the Z-space, i.e., such that they are well-separated. This initialization turned out to be necessary for learning a clean splitting between zeros and eights. Furthermore, it is beneficial to add noise to both the x-data and y-data to avoid artifacts in the generation process, since MNIST as a dataset itself is quite flat, see also [5]. Although not required, we found it beneficial to replace the second term in the padding loss by ${\left\vert {x}_{i}-{T}^{-1}\left({y}_{i},{z}_{i,\text{pad}},{T}_{{z}_{\text{data}}}({x}_{i})\right)\right\vert }^{2}$, which serves as a backward MSE and makes the images sharper.

Our obtained results are provided in figure 4. The learned probabilities for the displayed inputs are (0.83, 0.17), (0.46, 0.54) and (0.27, 0.73), respectively, where the ith number is the probability of sampling from the ith mean in the latent space. Taking a closer look at the latent space, we observe that the first mode mainly corresponds to sampling a zero, whereas the second one corresponds to sampling an eight. A nice benefit is that the splitting between the two modes provides a probability estimate whether a sample y belongs to a zero or an eight. In order to understand if this estimate is reasonable, we perform the following sanity check. We pick 1000 samples ${({y}_{i})}_{i=1}^{1000}$ from the test set and evaluate the probabilities p(yi ) that those samples correspond to a zero. For every sample yi , we perform algorithm 2 to get an estimate e(yi ) for the likeliness of being a zero. Essentially, the algorithm performs a weighted sum over all samples with label 0, where the weights are determined by the closeness to the sample yi based on a Gaussian kernel. Then, we calculate the mean of |p(yi ) − e(yi )| over 1000 random samples, resulting in an average error of around 0.10–0.11. Given that the mean of |0.5 − e(yi )| is around 0.25, we conclude that the estimates from our probability net are at least somewhat meaningful and better than an uneducated guess. Intriguingly, for the images in figure 4 the estimates are (0.82, 0.18), (0.51, 0.49) and (0.25, 0.75), respectively, which nicely match the probabilities predicted by the INN.

Figure 4. Samples created by a multimodal INN. Condition y in upper left corner.

Algorithm 2. Probability estimate.

Input: sample y, comparison samples ${({y}_{i})}_{i=1}^{m}$ with labels li ∈ {0, 8}
Output: estimate for the probability of y belonging to a zero
1: e = 0, s = 0
2: for i = 1, ..., m do
3:  if li = 0 then
4:    e = e + exp(−|yyi |2)
5:  end if
6:  s = s + exp(−|yyi |2)
7: end for
8: return $\frac{e}{s}$
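
A vectorized NumPy sketch of algorithm 2; it assumes flattened measurement vectors and returns the kernel-weighted fraction of zero-labeled comparison samples.

```python
# Vectorized NumPy sketch of algorithm 2: kernel-weighted estimate of the probability
# that a measurement y belongs to a zero, given labeled comparison samples.
import numpy as np

def probability_estimate(y, ys, labels):
    """y: flattened query of shape (n,), ys: comparison samples (m, n), labels in {0, 8}."""
    weights = np.exp(-np.sum((ys - y) ** 2, axis=1))   # Gaussian kernel weights
    return weights[labels == 0].sum() / weights.sum()
```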

5.3. Inverse kinematics

In [35] the authors propose to benchmark invertible architectures on a problem inspired by inverse kinematics. More precisely, a vertically movable robot arm with three segments connected by joints is modeled mathematically as

Equation (14)

where the parameter h represents the vertical height, θ1, θ2, θ3 are the joint angles, l1, l2, l3 are the arm segment lengths and y = (y1, y2) denotes the modeled end positions. Given the end position y, our aim is to reconstruct the arm parameters x = (h, θ1, θ2, θ3). For this problem, the dimensions are chosen as d = 4, n = 2 and k = 2, i.e., no padding has to be applied. This is an interesting benchmark problem as the conditional distributions can be multimodal, the parameter space is large but still interpretable and the forward model (14), denoted by f, is more complex compared to the preceding tasks.

As a baseline, we employ approximate Bayesian computation (ABC) to get x samples satisfying y = f(x), see algorithm 3. For this purpose, we impose the priors $h\sim \frac{1}{2}\mathcal{N}(1,\frac{1}{64})+\frac{1}{2}\mathcal{N}(-1,\frac{1}{64})$, ${\theta }_{1},{\theta }_{2},{\theta }_{3}\sim \mathcal{N}(0,\frac{1}{4})$ and set the segment lengths to l = (0.5, 0.5, 1). Additionally, we sample using an INN with $Z\sim \mathcal{N}(0,{I}_{2})$ and a multimodal INN with $Z\sim {p}_{1}(y)\mathcal{N}((-2,0),0.09{I}_{2})+{p}_{2}(y)\mathcal{N}((2,0),0.09{I}_{2})$. At this point, we want to remark that computing accurate samples using ABC is very time-consuming and not competitive against INNs. In particular, the ABC sampling procedure only yields samples for one fixed y, whereas the INN approach recovers the posteriors for all possible y.

Algorithm 3. Approximate Bayesian computation.

Input: prior p, some vector y, forward operator f, threshold ɛ
Output: m samples from the approximate posterior p(⋅|y)
1: for i = 1, ..., m do
2:  repeat
3:    sample xi according to the prior p and accept if |yf(xi )|2 < ɛ
4:  until sample xi is accepted
5: end for
6: return samples ${({x}_{i})}_{i=1}^{m}$
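
A NumPy sketch of algorithm 3, with the prior sampler and the forward operator f passed in as callables; note that the acceptance loop can become very slow for small ε, which is the computational drawback mentioned above.

```python
# NumPy sketch of algorithm 3 (ABC rejection sampling). The prior sampler and the
# forward operator f are passed as callables; eps is the acceptance threshold.
import numpy as np

def abc_rejection(prior_sample, f, y, eps, m):
    """Return m approximate posterior samples x with |y - f(x)|^2 < eps."""
    samples = []
    while len(samples) < m:
        x = prior_sample()
        if np.sum((y - f(x)) ** 2) < eps:
            samples.append(x)
    return np.array(samples)
```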

The predicted posteriors for y = (0, 1.5) and $\tilde {y}=(0.5,1.5)$ are visualized in figure 5, where we plot the four line segments determined by the forward model (14). Note that the learned probabilities p(y) = (0.52, 0.48) and $p(\tilde {y})=(0,1)$ nicely fit the true modality of the data. As an approximation quality measure, we first calculate the discrepancy ${\mathcal{D}}_{K}({\tilde {x}}_{i},{x}_{i})$, where ${({\tilde {x}}_{i})}_{i=1}^{m}$, m = 2000, are samples generated by algorithm 3 and ${({x}_{i})}_{i=1}^{m}$ are generated by the respective INN methods, i.e., by sampling from T−1(y, Z). This quantity is then averaged over five runs with a single set of rejection samples. Based on the computed samples and the forward model, we also calculate the resimulation error $R(y)=\frac{1}{m}{\sum }_{i}\vert f({x}_{i})-y{\vert }^{2}$. Both quality measures are computed five times and the average of the obtained results is provided in table 2. Note that we are evaluating the posterior only for two different y. As the loss function is based on the continuous distribution, results are likely to be slightly noisy.

Figure 5. Sampled arm configurations for fixed positions y and $\tilde {y}$ using different methods.

Table 2. Evaluation of INN and multimodal INN with samples ${\tilde {x}}_{i,y}$ created by algorithm 3 and samples xi,y from T−1(y, Z).

Method | ${\mathcal{D}}_{K}({x}_{i,y},{\tilde {x}}_{i,y})$ | R(y) | ${\mathcal{D}}_{K}({x}_{i,\tilde {y}},{\tilde {x}}_{i,\tilde {y}})$ | $R(\tilde {y})$
INN | 0.033 | 0.009 | 0.013 | 0.0003
Multimodal INN | 0.030 | 0.0003 | 0.010 | 0.0011

Overall, this experiment confirms that our method excels at modeling multimodal distributions, whereas for unimodal distributions a standard INN suffices as well. Further, we observe that our method is flexible enough to model both types of distributions even in the case when y is not discrete.

6. Conclusions

In this paper, we investigated the latent space of INNs for inverse problems. Our theoretical results indicate that a proper choice of Z is crucial for obtaining stable INNs with reasonable Lipschitz constants. To this end, we parameterized the latent distribution Z in dependence on the labels y in order to model properties of the input data. This resulted in significantly improved behavior of both the INN T and T−1. In some sense, we can interpret this as adding more explicit knowledge to our problem. Hence, the INN only needs to fit a simpler problem compared to the original setting, which is usually possible with simpler models.

Unfortunately, establishing a result similar to theorem 1 that also incorporates approximation errors seems quite hard. A first attempt under rather strong assumptions is given in theorem 2. Relaxing the assumptions such that they fit the chosen loss terms would give a useful error estimate between the sampled posterior based on the INN and the true posterior. Further, it would be interesting to apply our model to real-world inverse problems. Possibly, our proposed framework can be used to automatically detect multimodal structures in such data sets. To this end, the application of more sophisticated controls on the Lipschitz constants [11, 21] could be beneficial. Additionally, we could also enrich the model by learning the parameters of the permutation matrices using a similar approach as in [25, 32]. Finally, it would be interesting to further investigate the role and influence of padding for the training process.

Acknowledgments

The authors want to thank Johannes Hertrich and Gabriele Steidl for fruitful discussions on the topic.

Data availability statement

All data that support the findings of this study are included within the article (and any supplementary files).
