
Stabilizing invertible neural networks using mixture models

Published 12 July 2021 © 2021 IOP Publishing Ltd
Citation: Paul Hagemann and Sebastian Neumayer 2021 Inverse Problems 37 085002, DOI 10.1088/1361-6420/abe928


Abstract

In this paper, we analyze the properties of invertible neural networks, which provide a way of solving inverse problems. Our main focus lies on investigating and controlling the Lipschitz constants of the corresponding inverse networks. Without such a control, numerical simulations are prone to errors and not much is gained against traditional approaches. Fortunately, our analysis indicates that changing the latent distribution from a standard normal one to a Gaussian mixture model resolves the issue of exploding Lipschitz constants. Indeed, numerical simulations confirm that this modification leads to significantly improved sampling quality in multimodal applications.


1. Introduction

Reconstructing parameters of physical models is an important task in science. Usually, such problems are severely underdetermined and sophisticated reconstruction techniques are necessary. Whereas classical regularisation methods focus on finding just the most desirable or most likely solution of an inverse problem, more recent methods aim at analyzing the complete distribution of possible parameters. In particular, this provides us with a way to quantify how trustworthy the obtained solution is. Among the most popular methods for uncertainty quantification are Bayesian methods [17], which build on evaluating the posterior via Bayes' theorem, and Markov chain Monte Carlo (MCMC) [41]. Over the last years, these classical frameworks and questions have been combined with neural networks (NNs). To give a few examples, the Bayesian framework has been extended to NNs [1, 38], auto-encoders have been combined with MCMC [9] and generative adversarial networks have been applied for sampling from conditional distributions [40]. Another approach in this direction relies on so-called invertible neural networks (INNs) [4, 7, 14, 32] for sampling from the conditional distribution. Such models have been successfully applied for stellar parameter estimation [36], conditional image generation [5] and polyphonic music transcription [29].

Our main goal in this article is to deepen the mathematical understanding of INNs. Let us briefly describe the general framework. Assume that the random vectors $(X,Y):{\Omega}\to {\mathbb{R}}^{d}{\times}{\mathbb{R}}^{n}$ are related via f(X) ≈ Y for some forward mapping $f:{\mathbb{R}}^{d}\to {\mathbb{R}}^{n}$. Clearly, this setting includes the case of noisy measurements. In practice, we often only have access to the values of Y and want to solve the corresponding inverse problem. To this end, our aim is to find an invertible mapping $T:{\mathbb{R}}^{d}\to {\mathbb{R}}^{n}{\times}{\mathbb{R}}^{k}$ such that

  • Ty (X) ≈ Y, i.e., that Ty is an approximation of f;
  • T−1(y, Z) with $Z\sim \mathcal{N}(0,{I}_{k})$, k = d − n, is distributed according to P(X|Y)(y, ⋅).

Throughout this paper, the mapping T is modeled as an INN, i.e., as an NN with an explicitly invertible structure. Therefore, no approximations are necessary and evaluating the backward pass is as simple as the forward pass. Here, Z should account for the information loss when evaluating the forward model f. In other words, Z parameterizes the ill-posed part of the forward problem f.

Our theoretical contribution is twofold. First, we generalize the consistency result for perfectly trained INNs given in [4] toward latent distributions Z depending on the actual measurement y. Here, we want to emphasize that our proof does not require any regularity properties for the distributions. Next, we analyze the behavior of Lip(T−1) for a homeomorphism T that approximately maps a multimodal distribution to the standard normal distribution. More precisely, we obtain that Lip(T−1) explodes as the amount of mass between the modes and the approximation error decrease. In particular, this implies for INNs that using $Z\sim \mathcal{N}(0,{I}_{k})$ can lead to instabilities of the inverse map for multimodal problems. As the results are formulated in a very general setting, they could also be of interest for generative models, which include variational auto-encoders [33] and generative adversarial networks (GANs) [20]. Another interesting work that analyzes the Lipschitz continuity of different INN architectures is [8], where the effects of numerical non-invertibility are also illustrated.

The performed numerical experiments confirm our theoretical findings, i.e., that $Z\sim \mathcal{N}(0,{I}_{k})$ is unsuitable for recovering multimodal distributions. Note that this is indeed a well-known problem in generative modeling [26]. Consequently, we propose to replace $Z\sim \mathcal{N}(0,{I}_{k})$ by a Gaussian mixture model (GMM) with learnable probabilities that depend on the measurements y. In some sense, this enables the model to decide how many modes it wants to use in the latent space. Therefore, splitting or merging distributions should not be necessary anymore. Such an approach has led to promising results for GANs [24, 30] and variational auto-encoders [13]. In order to learn the probabilities corresponding to the modes, we employ the Gumbel trick [28, 39] and an L2-penalty on the network weights, enforcing the Lipschitz continuity of the INN. More precisely, the L2-penalty ensures that splitting or merging distributions is unattractive. Similar approaches have been successfully applied for generative modeling, see [13, 15].

1.1. Outline

We recall the necessary theoretical foundations in section 2. Then, we introduce the continuous optimization problem in section 3 and discuss its theoretical properties. In particular, we investigate the effect of different latent space distributions on the Lipschitz constants of an INN. The training procedure for our proposed model is discussed in section 4. For minimizing over the probabilities of the GMM, we employ the Gumbel trick. In section 5, we demonstrate the efficiency of our model on three inverse problems that exhibit a multimodal structure. Conclusions are drawn in section 6.

2. Preliminaries

2.1. Probability theory and conditional expectations

First, we give a short introduction to the required probability theoretic foundations based on [2, 6, 16, 34]. Let $({\Omega},\mathfrak{A},P)$ be a probability space and ${\mathbb{R}}^{m}$ be equipped with the Borel σ-algebra $\mathcal{B}({\mathbb{R}}^{m})$. A random vector is a measurable mapping $X:{\Omega}\to {\mathbb{R}}^{m}$ with distribution ${P}_{X}\in \mathcal{P}({\mathbb{R}}^{m})$ defined by PX (A) = P(X−1(A)) for $A\in \mathcal{B}({\mathbb{R}}^{m})$. For any random vector X, there exists a smallest σ-algebra on Ω such that X is measurable, which is denoted by σ(X). Two random vectors X and Y are called equal in distribution if PX = PY and we write $X\overset{d}{=}Y$. The space $\mathcal{P}({\mathbb{R}}^{m})$ is equipped with the topology of weak convergence, i.e., a sequence ${({\mu }_{k})}_{k}\subset \mathcal{P}({\mathbb{R}}^{m})$ converges weakly to μ, written μk ⇀ μ, if

${\int }_{{\mathbb{R}}^{m}}f\enspace \mathrm{d}{\mu }_{k}\to {\int }_{{\mathbb{R}}^{m}}f\enspace \mathrm{d}\mu \quad \text{for all}\enspace f\in {\mathcal{C}}_{b}({\mathbb{R}}^{m}),$
where ${\mathcal{C}}_{b}({\mathbb{R}}^{m})$ denotes the space of continuous and bounded functions. The support of a probability measure μ is the closed set

$\mathrm{supp}(\mu ){:=}\left\{x\in {\mathbb{R}}^{m}:\mu (U){ >}0\enspace \text{for every open neighborhood}\enspace U\enspace \text{of}\enspace x\right\}.$
If there exists ${\rho }_{\mu }:{\mathbb{R}}^{m}\to \mathbb{R}$ with μ(A) = ∫A ρμ  dx for all $A\in \mathcal{B}({\mathbb{R}}^{m})$, this ρμ is called the probability density of μ. For $\mu \in \mathcal{P}({\mathbb{R}}^{m})$ and 1 ⩽ p < ∞, let ${L}^{p}({\mathbb{R}}^{m})$ be the Banach space (of equivalence classes) of complex-valued functions with norm

${\Vert}f{{\Vert}}_{{L}^{p}}{:=}{\left({\int }_{{\mathbb{R}}^{m}}\vert f(x){\vert }^{p}\enspace \mathrm{d}\mu (x)\right)}^{1/p}.$
In a similar manner, we denote the Banach space of (equivalence classes of) random vectors X satisfying $\mathbb{E}(\vert X{\vert }^{p}){< }\infty $ by ${L}^{p}({\Omega},\mathfrak{A},P)$. The mean of a random vector X is defined as $\mathbb{E}(X)={(\mathbb{E}({X}_{1}),\dots ,\mathbb{E}({X}_{m}))}^{\mathrm{T}}$. Further, for any $X\in {L}^{2}({\Omega},\mathfrak{A},P)$ the covariance matrix of X is defined by $\mathrm{Cov}(X)={(\mathrm{Cov}({X}_{i},{X}_{j}))}_{i,j=1}^{m}$, where $\mathrm{Cov}({X}_{i},{X}_{j})=\mathbb{E}(({X}_{i}-\mathbb{E}({X}_{i}))({X}_{j}-\mathbb{E}({X}_{j})))$.

For a measurable function $T:{\mathbb{R}}^{m}\to {\mathbb{R}}^{n}$, the push-forward measure of $\mu \in \mathcal{P}({\mathbb{R}}^{m})$ by T is defined as T# μ := μ ∘ T−1. A measurable function f on ${\mathbb{R}}^{n}$ is integrable with respect to ν := T# μ if and only if the composition f ∘ T is integrable with respect to μ, i.e.,

${\int }_{{\mathbb{R}}^{n}}f\enspace \mathrm{d}{T}_{\#}\mu ={\int }_{{\mathbb{R}}^{m}}f{\circ}T\enspace \mathrm{d}\mu .$
The product measure $\mu \otimes \nu \in \mathcal{P}({\mathbb{R}}^{m}{\times}{\mathbb{R}}^{n})$ of $\mu \in \mathcal{P}({\mathbb{R}}^{m})$ and $\nu \in \mathcal{P}({\mathbb{R}}^{n})$ is the unique measure satisfying μν(A × B) = μ(A)ν(B) for all $A\in \mathcal{B}({\mathbb{R}}^{m})$ and $B\in \mathcal{B}({\mathbb{R}}^{n})$.

If $X\in {L}^{1}({\Omega},\mathfrak{A},P)$ and $Y:{\Omega}\to {\mathbb{R}}^{n}$ is a second random vector, then there exists a unique σ(Y)-measurable random vector $\mathbb{E}(X\vert Y)$ called conditional expectation of X given Y with the property that

$\mathbb{E}({1}_{C}X)=\mathbb{E}\left({1}_{C}\enspace \mathbb{E}(X\vert Y)\right)\quad \text{for all}\enspace C\in \sigma (Y).$
Further, there exists a PY -almost surely unique, measurable mapping $g:{\mathbb{R}}^{n}\to {\mathbb{R}}^{m}$ such that $\mathbb{E}(X\vert Y)=g{\circ}Y$. Using this mapping g, we define the conditional expectation of X given Y = y as $\mathbb{E}(X\vert Y=y)=g(y)$. For $A\in \mathfrak{A}$ the conditional probability of A given Y = y is defined by $P(A\vert Y=y)=\mathbb{E}({1}_{A}\vert Y=y)$. More generally, regular conditional distributions are defined in terms of Markov kernels, i.e., there exists a mapping ${P}_{(X\vert Y)}:{\mathbb{R}}^{n}{\times}\mathcal{B}({\mathbb{R}}^{m})\to \mathbb{R}$ such that

  • (a)  
    P(X|Y)(y, ⋅) is a probability measure on $\mathcal{B}({\mathbb{R}}^{m})$ for every $y\in {\mathbb{R}}^{n}$,
  • (b)  
    P(X|Y)(⋅, A) is measurable for every $A\in \mathcal{B}({\mathbb{R}}^{m})$ and for all $A\in \mathcal{B}({\mathbb{R}}^{m})$ it holds
    ${P}_{(X\vert Y)}(y,A)=P({X}^{-1}(A)\vert Y=y)\quad \text{for}\enspace {P}_{Y}\text{-a.e.}\enspace y\in {\mathbb{R}}^{n}.\qquad (1)$

For any two such kernels P(X|Y) and Q(X|Y) there exists $B\in \mathcal{B}({\mathbb{R}}^{n})$ with PY (B) = 1 such that for all yB and $A\in \mathcal{B}({\mathbb{R}}^{m})$ it holds P(X|Y)(y, A) = Q(X|Y)(y, A). Finally, let us conclude with two useful properties, which we need throughout this paper.

  • (a)  
    For random vectors $X:{\Omega}\to {\mathbb{R}}^{m}$ and $Y:{\Omega}\to {\mathbb{R}}^{n}$ and measurable $f:{\mathbb{R}}^{n}\to {\mathbb{R}}^{m}$ with $f(Y)X\in {L}^{1}({\Omega},\mathfrak{A},P)$ it holds E(f(Y)X|Y) = f(Y)E(X|Y), where the products are meant component-wise. In particular, this implies for $y\in {\mathbb{R}}^{n}$ that
    $\mathbb{E}(f(Y)X\vert Y=y)=f(y)\enspace \mathbb{E}(X\vert Y=y).\qquad (2)$
  • (b)  
    For random vectors $X:{\Omega}\to {\mathbb{R}}^{m}$, $Y:{\Omega}\to {\mathbb{R}}^{n}$ and a measurable map $T:{\mathbb{R}}^{m}\to {\mathbb{R}}^{p}$ it holds
    ${P}_{(T{\circ}X\vert Y)}(y,\cdot )={T}_{\#}{P}_{(X\vert Y)}(y,\cdot )\quad \text{for}\enspace {P}_{Y}\text{-a.e.}\enspace y\in {\mathbb{R}}^{n}.\qquad (3)$

2.2. Discrepancies

Next, we introduce a way to quantify the distance between two probability measures. To this end, choose a measurable, symmetric and bounded function $K:{\mathbb{R}}^{m}{\times}{\mathbb{R}}^{m}\to \mathbb{R}$ that is integrally strictly positive definite, i.e., for any $f\ne 0\in {L}^{2}({\mathbb{R}}^{m})$ it holds

${\int }_{{\mathbb{R}}^{m}}{\int }_{{\mathbb{R}}^{m}}K(x,y)f(x)f(y)\enspace \mathrm{d}x\enspace \mathrm{d}y{ >}0.$
Note that this property is indeed stronger than K being strictly positive definite.

Then, the corresponding discrepancy ${\mathcal{D}}_{K}(\mu ,\nu )$ is defined via

${\mathcal{D}}_{K}^{2}(\mu ,\nu )={\int }_{{\mathbb{R}}^{m}}{\int }_{{\mathbb{R}}^{m}}K(x,y)\enspace \mathrm{d}\mu (x)\mathrm{d}\mu (y)-2{\int }_{{\mathbb{R}}^{m}}{\int }_{{\mathbb{R}}^{m}}K(x,y)\enspace \mathrm{d}\mu (x)\mathrm{d}\nu (y)+{\int }_{{\mathbb{R}}^{m}}{\int }_{{\mathbb{R}}^{m}}K(x,y)\enspace \mathrm{d}\nu (x)\mathrm{d}\nu (y).\qquad (4)$

Due to our requirements on K, the discrepancy ${\mathcal{D}}_{K}$ is a metric, see [43]. In particular, it holds ${\mathcal{D}}_{K}(\mu ,\nu )=0$ if and only if μ = ν. In the following remark, we require the space ${C}_{0}({\mathbb{R}}^{m})$ of continuous functions decaying to zero at infinity.

Remark 1. Let K(x, y) = ψ(xy) with $\psi \in {C}_{0}({\mathbb{R}}^{m})\cap {L}^{1}({\mathbb{R}}^{m})$ and assume there exists an $l\in \mathbb{N}$ such that

where $\hat{\psi }$ denotes the Fourier transform of ψ. Then, the discrepancy ${\mathcal{D}}_{K}$ metrizes the weak topology on $\mathcal{P}({\mathbb{R}}^{m})$, see [43].

2.3. Optimal transport

For our theoretical investigations, Wasserstein distances turn out to be a more appropriate metric than discrepancies. This is mainly due to the fact that spatial distance is directly encoded into this metric. Actually, both concepts can be related to each other under relatively mild conditions, see also remark 1. A more detailed overview on the topic can be found in [3, 42]. Let $\mu ,\nu \in \mathcal{P}({\mathbb{R}}^{m})$ and $c\in C({\mathbb{R}}^{m}{\times}{\mathbb{R}}^{m})$ be a non-negative and continuous function. Then, the Kantorovich problem of optimal transport (OT) reads

$\mathrm{OT}(\mu ,\nu ){:=}\underset{\pi \in {\Pi}(\mu ,\nu )}{\mathrm{min}}{\int }_{{\mathbb{R}}^{m}{\times}{\mathbb{R}}^{m}}c(x,y)\enspace \mathrm{d}\pi (x,y),\qquad (5)$

where Π(μ, ν) denotes the weakly compact set of joint probability measures π on ${\mathbb{R}}^{m}{\times}{\mathbb{R}}^{m}$ with marginals μ and ν. Recall that the OT functional $\pi {\mapsto}{\int }_{{\mathbb{R}}^{m}{\times}{\mathbb{R}}^{m}}c\enspace \mathrm{d}\pi $ is weakly lower semi-continuous, so that (5) has a solution, and every such minimizer $\hat{\pi }$ is called an OT plan. Note that OT plans are in general not unique.

For $\mu ,\nu \in {\mathcal{P}}_{p}({\mathbb{R}}^{m}){:=}\left\{\mu \in \mathcal{P}({\mathbb{R}}^{m}){:\int }_{{\mathbb{R}}^{m}}\vert x{\vert }^{p}\enspace \mathrm{d}\mu (x){< }\infty \right\}$, p ∈ [1, ∞), the p-Wasserstein distance Wp is defined by

${W}_{p}(\mu ,\nu ){:=}{\left(\underset{\pi \in {\Pi}(\mu ,\nu )}{\mathrm{min}}{\int }_{{\mathbb{R}}^{m}{\times}{\mathbb{R}}^{m}}\vert x-y{\vert }^{p}\enspace \mathrm{d}\pi (x,y)\right)}^{1/p}.\qquad (6)$

Recall that this defines a metric on ${\mathcal{P}}_{p}({\mathbb{R}}^{m})$, which satisfies Wp (μm, μ) → 0 if and only if ${\int }_{{\mathbb{R}}^{m}}\vert x{\vert }^{p}\enspace \mathrm{d}{\mu }_{m}(x){\to \int }_{{\mathbb{R}}^{m}}\vert x{\vert }^{p}\enspace \mathrm{d}\mu (x)$ and μm ⇀ μ. If $\mu ,\nu \in {\mathcal{P}}_{p}({\mathbb{R}}^{m})$ and at least one of them has a density with respect to the Lebesgue measure, then (6) has a unique solution for any p ∈ (1, ∞). For $\mu ,\nu \in \mathcal{P}({\mathbb{R}}^{m})$ and sets $A,B\subset {\mathbb{R}}^{m}$, we estimate

Equation (7)

Remark 2. As mentioned in remark 1, ${\mathcal{D}}_{K}$ metrizes the weak convergence under certain conditions. Further, convergence of the pth moments always holds if all measures are supported on a compact set. In this case, weak convergence, convergence in the Wasserstein metric and convergence in discrepancy are all equivalent. As these requirements are usually fulfilled in practice, most theoretical properties of INNs based on W1 carry over to ${\mathcal{D}}_{K}$.
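
For intuition, the following minimal sketch (not taken from the paper) computes W1 between two one-dimensional empirical measures with SciPy; for multivariate samples one would resort to a dedicated OT package instead.

```python
# Minimal sketch (not from the paper): W_1 between two empirical measures on the
# real line. For multivariate samples a dedicated OT library would be used instead.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
mu_samples = rng.normal(0.0, 1.0, size=5000)                  # standard normal samples
nu_samples = np.concatenate([rng.normal(-1.0, 0.1, 2500),     # bimodal mixture samples
                             rng.normal(1.0, 0.1, 2500)])

# the 1-Wasserstein distance between the two empirical measures
print(wasserstein_distance(mu_samples, nu_samples))
```

In higher dimensions, evaluating W1 exactly requires solving the linear program (5), which is one reason why the computationally cheaper discrepancies of section 2.2 are used in the numerical part.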

2.4. Invertible neural networks

Throughout this paper, an INN $T:{\mathbb{R}}^{d}\to {\mathbb{R}}^{n}{\times}{\mathbb{R}}^{k}$, k = d − n, is constructed as composition $T={T}_{L}{\circ}{P}_{L}{\circ}\cdots {\circ}{T}_{1}{\circ}{P}_{1}$, where the ${({P}_{l})}_{l=1}^{L}$ are random (but fixed) permutation matrices and the Tl are defined by

${T}_{l}(x)=({v}_{1},{v}_{2}),\qquad {v}_{1}={x}_{1}\odot \mathrm{exp}\left({s}_{l,2}({x}_{2})\right)+{t}_{l,2}({x}_{2}),\qquad {v}_{2}={x}_{2}\odot \mathrm{exp}\left({s}_{l,1}({v}_{1})\right)+{t}_{l,1}({v}_{1}),\qquad (8)$

where x is split arbitrarily into x1 and x2 (for simplicity of notation we assume that ${x}_{1}\in {\mathbb{R}}^{n}$ and ${x}_{2}\in {\mathbb{R}}^{k}$) and ⊙ denotes the Hadamard product, see [4]. Note that there is a computational precedence in this construction, i.e., the computation of v2 requires the knowledge of v1. The continuous functions ${s}_{l,1}:{\mathbb{R}}^{n}\to {\mathbb{R}}^{k}$, ${s}_{l,2}:{\mathbb{R}}^{k}\to {\mathbb{R}}^{n}$, ${t}_{l,1}:{\mathbb{R}}^{n}\to {\mathbb{R}}^{k}$ and ${t}_{l,2}:{\mathbb{R}}^{k}\to {\mathbb{R}}^{n}$ are NNs and do not need to be invertible themselves. In order to emphasize the dependence of the INN on the parameters u of these NNs, we use the notation T(⋅; u). The inverse of Tl is explicitly given by

${T}_{l}^{-1}(v)=({x}_{1},{x}_{2}),\qquad {x}_{2}=\left({v}_{2}-{t}_{l,1}({v}_{1})\right)\oslash \mathrm{exp}\left({s}_{l,1}({v}_{1})\right),\qquad {x}_{1}=\left({v}_{1}-{t}_{l,2}({x}_{2})\right)\oslash \mathrm{exp}\left({s}_{l,2}({x}_{2})\right),$
where ⊘ denotes the point-wise quotient. Hence, the whole INN is invertible. Clearly, T and T−1 are both continuous maps.
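
To make the construction concrete, here is a minimal PyTorch sketch of one block of the form (8) together with its explicit inverse. It assumes the exponential parameterization of the scaling as in (8) and small illustrative subnetworks; the architectures and split sizes used in the paper's experiments differ.

```python
# Minimal PyTorch sketch of one coupling block of the form (8) and its inverse.
# The subnetworks are small illustrative MLPs; the split sizes n, k are placeholders.
import torch
import torch.nn as nn

def mlp(d_in, d_out, width=64):
    return nn.Sequential(nn.Linear(d_in, width), nn.ReLU(), nn.Linear(width, d_out))

class CouplingBlock(nn.Module):
    def __init__(self, n=2, k=2):
        super().__init__()
        self.n = n
        self.s1, self.t1 = mlp(n, k), mlp(n, k)   # maps from the first part (dim n)
        self.s2, self.t2 = mlp(k, n), mlp(k, n)   # maps from the second part (dim k)

    def forward(self, x):
        x1, x2 = x[:, :self.n], x[:, self.n:]
        v1 = x1 * torch.exp(self.s2(x2)) + self.t2(x2)   # depends only on x2
        v2 = x2 * torch.exp(self.s1(v1)) + self.t1(v1)   # depends on v1 (precedence)
        return torch.cat([v1, v2], dim=1)

    def inverse(self, v):
        v1, v2 = v[:, :self.n], v[:, self.n:]
        x2 = (v2 - self.t1(v1)) * torch.exp(-self.s1(v1))   # point-wise quotient
        x1 = (v1 - self.t2(x2)) * torch.exp(-self.s2(x2))
        return torch.cat([x1, x2], dim=1)

# sanity check: the backward pass exactly undoes the forward pass
block = CouplingBlock()
x = torch.randn(8, 4)
assert torch.allclose(block.inverse(block(x)), x, atol=1e-4)
```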

Other architectures for constructing INNs include the real NVP architecture introduced in [14], where the layers have the simplified form

${v}_{1}={x}_{1},\qquad {v}_{2}={x}_{2}\odot \mathrm{exp}\left(s({x}_{1})\right)+t({x}_{1}),$
or an approach that builds on making residual NNs invertible, see [7]. Compared to real NVP layers, the chosen approach allows more representational freedom in the first component. Unfortunately, this comes at the cost of harder Jacobian evaluations, which are not needed in this paper though. Invertible residual NNs have the advantage that they come with sharp Lipschitz bounds. In particular, they rely on spectral normalisation in each layer, effectively bounding the Lipschitz constant. Further, there are neural ODEs [10], where a black-box ODE solver is used in order to simulate the output of a continuous-depth residual NN, and the GLOW [32] architecture, where the Pl are no longer fixed as for real NVP and hence become learnable parameters.

3. Analyzing the continuous optimization problem

Given a random vector $(X,Y):{\Omega}\to {\mathbb{R}}^{d}{\times}{\mathbb{R}}^{n}$ with realizations ${({x}_{i},{y}_{i})}_{i=1}^{N}$, our aim is to recover the regular conditional distribution P(X|Y)(y, ⋅) for arbitrary y based on the samples. For this purpose, we want to find a homeomorphism $T:{\mathbb{R}}^{d}\to {\mathbb{R}}^{n}{\times}{\mathbb{R}}^{k}$ such that for some fixed $Z:{\Omega}\to {\mathbb{R}}^{k}$, k = d − n, and any realization y of Y, the expression T−1(y, Z) is approximately P(X|Y=y)-distributed, see [4]. Note that the authors of [4] assume $Z\sim \mathcal{N}(0,{I}_{k})$, but in fact the approach does not rely on this particular choice. Here, we model T as a homeomorphism with parameters u, for example an INN, and choose the best candidate by optimizing over the parameters. To quantify the reconstruction quality in dependence on u, three loss functions are constructed according to the following two principles:

  • (a)  
    First, the data should be matched, i.e., Ty (X; u) ≈ Y, where Ty denotes the output part corresponding to Y. For inverse problems, this is equivalent to learning the forward map. Here, the fit between Ty (X; u) and Y is quantified by the squared L2-norm ${L}_{y}(u){:=}\mathbb{E}\left(\vert {T}_{y}(X;u)-Y{\vert }^{2}\right)$.
  • (b)  
    Second, the distributions of X and (Y, Z) should be close to the respective distributions T−1(Y, Z; u) and T(X; u). To quantify this, we use a metric D on the space of probability measures and end up with the loss functions
    ${L}_{z}(u){:=}D\left({P}_{(Y,Z)},{P}_{T(X;u)}\right)\quad \text{and}\quad {L}_{x}(u){:=}D\left({P}_{X},{P}_{{T}^{-1}(Y,Z;u)}\right).\qquad (9)$
    Actual choices for D are discussed in section 4.

In practice, we usually minimize a conic combination of these objective functions, i.e., for some $\lambda =({\lambda }_{1},{\lambda }_{2})\in {\mathbb{R}}_{{\geqslant}0}^{2}$ we solve

$\underset{u}{\mathrm{min}}\enspace {L}_{y}(u)+{\lambda }_{1}{L}_{z}(u)+{\lambda }_{2}{L}_{x}(u).$
First, note that Ly (u) = 0 ⇔ Ty (X; u) = Y a.e. and, since D is a metric, ${L}_{z}(u)=0\enspace {\Leftrightarrow}\enspace T(X;u)\overset{d}{=}(Y,Z)$ as well as ${L}_{x}(u)=0\enspace {\Leftrightarrow}\enspace {T}^{-1}(Y,Z;u)\overset{d}{=}X$.
Further, the following remark indicates that controlling the Lipschitz constant of T−1 also enables us to estimate the Lx loss in terms of Lz .

Remark 3. Note that it holds ${P}_{{T}^{-1}(Y,Z;u)}={{T}^{-1}}_{\#}{P}_{(Y,Z)}$ and PT(X;u) = T# PX . Let $\hat{\pi }$ denote an OT plan for W1(P(Y,Z), PT(X;u)). If T−1 is Lipschitz continuous, we can estimate

${W}_{1}\left({P}_{X},{P}_{{T}^{-1}(Y,Z;u)}\right){\leqslant}{\int }_{{\mathbb{R}}^{n+k}{\times}{\mathbb{R}}^{n+k}}\vert {T}^{-1}(v)-{T}^{-1}(w)\vert \enspace \mathrm{d}\hat{\pi }(v,w){\leqslant}\mathrm{Lip}({T}^{-1}){\int }_{{\mathbb{R}}^{n+k}{\times}{\mathbb{R}}^{n+k}}\vert v-w\vert \enspace \mathrm{d}\hat{\pi }(v,w)=\mathrm{Lip}({T}^{-1}){W}_{1}\left({P}_{(Y,Z)},{P}_{T(X;u)}\right),$
i.e., we get Lx (u) ⩽ Lip(T−1)Lz (u) when choosing D = W1.

Under the assumption that all three loss functions are zero, the authors of [4] proved that T−1(y, Z) ∼ P(X|Y)(y, ⋅). In other words, T can be used to efficiently sample from the true posterior distribution P(X|Y)(y, ⋅). However, their proof requires that the distributions of the random vectors X and Z have densities and that Z is independent of Y. As we are going to see in section 4, it is actually beneficial to choose Z dependent on Y. We overcome the mentioned limitations in the next theorem and provide a rigorous proof of this important result.

Theorem 1. Let $X:{\Omega}\to {\mathbb{R}}^{d}$, $Y:{\Omega}\to {\mathbb{R}}^{n}$ and $Z:{\Omega}\to {\mathbb{R}}^{k}$ be random vectors. Further, let $T=({T}_{y},{T}_{z}):{\mathbb{R}}^{d}\to {\mathbb{R}}^{n}{\times}{\mathbb{R}}^{k}$ be a homeomorphism such that $T{\circ}X\overset{d}{=}(Y,Z)$ and Ty ∘ X = Y a.e. Then, it holds

${P}_{(X\vert Y)}(y,B)=\left({\delta }_{y}\otimes {P}_{(Z\vert Y)}(y,\cdot )\right)\left(T(B)\right)$
for all $B\in \mathcal{B}({\mathbb{R}}^{d})$ and PY -a.e. $y\in {\mathbb{R}}^{n}$.

Proof. As T is a homeomorphism, it suffices to show P(X|Y)(y, T−1(B)) = (δy P(Z|Y)(y, ⋅))(B) for all $B\in \mathcal{B}({\mathbb{R}}^{n+k})$ and PY -a.e. $y\in {\mathbb{R}}^{n}$. Using (3) and integrating over y, we can equivalently show that for all $A\in \mathcal{B}({\mathbb{R}}^{n})$ and all $B\in \mathcal{B}({\mathbb{R}}^{n+k})$ it holds

As both ∫A P(TX|Y)(y, ⋅)dPY (y) and ∫A δy P(Z|Y)(y, ⋅)(⋅)dPY (y) are finite measures, we only need to consider sets of the form B = B1 × B2 with ${B}_{1}\in \mathcal{B}({\mathbb{R}}^{n})$ and ${B}_{2}\in \mathcal{B}({\mathbb{R}}^{k})$, see [34, lemma 1.42].

By (1) and since Ty X = Y a.e., we conclude

For all $C\in \sigma (Y)={Y}^{-1}(\mathcal{B}({\mathbb{R}}^{n}))$ it holds that ω ∈ C if and only if Y(ω) ∈ Y(C). As $T{\circ}X\overset{d}{=}(Y,Z)$ and Ty ∘X = Y a.e., we obtain

In particular, this implies that $E({1}_{{B}_{1}}(Y){1}_{{B}_{2}}({T}_{z}{\circ}X)\vert Y)=E({1}_{{B}_{1}}(Y){1}_{{B}_{2}}(Z)\vert Y)$ and hence also $E({1}_{{B}_{1}}(Y){1}_{{B}_{2}}({T}_{z}{\circ}X)\vert Y=y)=E({1}_{{B}_{1}}(Y){1}_{{B}_{2}}(Z)\vert Y=y)$ for PY -a.e. y. By (1) and (2), we get

which is precisely the claim. □

Under some additional assumptions, we are able to show the following error estimate based on the TV norm ${\Vert}\mu {{\Vert}}_{\text{TV}}{:=}{\mathrm{sup}}_{B\in \mathcal{B}({\mathbb{R}}^{n})}\vert \mu (B)\vert $. Note that the restrictions and conditions are still rather strong and do not entirely fit our problem. However, to the best of our knowledge, this is a first attempt in this direction.

Theorem 2. Let $X:{\Omega}\to {\mathbb{R}}^{d}$, $Y:{\Omega}\to {\mathbb{R}}^{n}$ and $Z:{\Omega}\to {\mathbb{R}}^{k}$ be random vectors such that |(Y, Z)| ⩽ r and |(Y, X)| ⩽ r almost everywhere. Further, let T be a homeomorphism with E(|Ty (X; u) − Y|2) ⩽ ε and ${\Vert}{P}_{(Y,Z)}-{P}_{(Y,{T}_{z}(X))}{{\Vert}}_{\text{TV}}{\leqslant}\delta {< }1$. Then it holds for all A with PY (A) ≠ 0 that

Proof. Due to the bijectivity of T and the triangle inequality, we may estimate

Note that {Y ∈ A} ⊂ Ω can be equipped with the measure Q := P/PY (A). Then, the random variables T(X) and (Y, Z) restricted to {Y ∈ A} have distributions QT(X) = P(T(X)|Y ∈ A) and ${Q}_{(Y,{T}_{z}(X))}={P}_{((Y,{T}_{z}(X))\vert Y\in A)}$, respectively. Hence, we may estimate by Hölder's inequality

As |(Y, Z)| ⩽ r and $\vert (Y,{T}_{z}(X))\vert {\leqslant}4r\enspace \mathrm{max}(\mathrm{Lip}(T),1)$ almost everywhere, we can use [19, theorem 4] to control the Wasserstein distance by the total variation distance and estimate

Combining the estimates implies the result. □

3.1. Instability with standard normal distributions

For numerical purposes, it is important that both T and T−1 have reasonable Lipschitz constants. Here, we specify a case where using $Z\sim \mathcal{N}(0,{I}_{k})$ as latent distribution leads to exploding Lipschitz constants of T−1. Note that our results are in a similar spirit as [12, 27], where also the Lipschitz constants of INNs are investigated. Due to its useful theoretical properties, we choose W1 as metric D throughout this paragraph. Clearly, the established results have similar implications for other metrics that induce the same topology. The following general proposition is a first step toward our desired result.

Proposition 1. Let $\nu =\mathcal{N}(0,{I}_{n})\in \mathcal{P}({\mathbb{R}}^{n})$ and $\mu \in \mathcal{P}({\mathbb{R}}^{m})$ be arbitrary. Then, there exists a constant C such that for any measurable function $T:{\mathbb{R}}^{m}\to {\mathbb{R}}^{n}$ satisfying W1(T# μ, ν) ⩽ ε and disjoint sets Q, R with $\mathrm{min}\left\{\mu (Q),\mu (R)\right\}{\geqslant}\frac{1}{2}-\frac{\delta }{2}$ it holds for δ, ε ⩾ 0 small enough that

Proof. If T(Q) ∩ T(R) ≠ ∅, the claim follows immediately. In the following, we denote the volume of ${B}_{1}(0)\subset {\mathbb{R}}^{n}$ by V1 and the normalizing constant for ν by Cν . Provided that δ, ε < 1/4, we conclude for the ball Br (0) with ν(Br (0)) > 3/4 by (7) that 1/4 > dist(T(Q), Br (0))/8 and similarly for T(R). Hence, there exists a constant s independent of δ, ε and T such that dist(T(Q), 0) ⩽ s and dist(T(R), 0) ⩽ s. Possibly increasing s such that s > max(1, ln(Cν V1)), we choose the constant C as

Assume for fixed δ, ε that

If ε and δ are small enough, it holds r ⩽ s. As dist(T(Q) ∩ Bs (0), T(R) ∩ Bs (0)) > 2r, there exists $x\in {\mathbb{R}}^{n}$ such that the open ball Br (x) satisfies Br (x) ∩ T(Q) = Br (x) ∩ T(R) = ∅ and Br (x) ⊂ B2s (0). Further, we estimate

as well as dist(T(Q), Br/2(x)) ⩾ r/2 and dist(T(R), Br/2(x)) ⩾ r/2. Using (7), we obtain ${W}_{1}({T}_{\#}\mu ,\nu ){\geqslant}\frac{r}{2}{\varepsilon }^{n/(n+1)}$, leading to the contradiction

Note that the previous result is not formulated in its most general form, i.e., the mass on Q and R can be different. Further, the measure ν can be exchanged as long as the mass is concentrated around zero. Based on proposition 1, we directly obtain the following corollary.

Corollary 1. Let $\mu \in \mathcal{P}({\mathbb{R}}^{m})$ be arbitrary. Then, there exists a constant C such that for any measurable, invertible function $T:{\mathbb{R}}^{m}\to {\mathbb{R}}^{n}$ satisfying ${W}_{1}({T}_{\#}\mu ,\mathcal{N}(0,{I}_{n})){\leqslant}{\epsilon}$ and disjoint sets Q, R with $\mathrm{min}\left\{\mu (Q),\mu (R)\right\}{\geqslant}\frac{1}{2}-\frac{\delta }{2}$ it holds for δ, ε ⩾ 0 small enough that

Proof. Due to proposition 1, it holds

Now, we are finally able to state the desired result for INNs with $Z\sim \mathcal{N}(0,{I}_{k})$. In particular, the following result also holds if we condition only on a single measurement Y = y occurring with positive probability.

Corollary 2. For $A\subset {\mathbb{R}}^{n}$ with PY (A) ≠ 0 and diam(A) = ρ, set μ := ∫A PX|Y (y, ⋅)dPY (y)/PY (A). Further, choose disjoint sets Q, R with $\mathrm{min}\left\{\mu (Q),\mu (R)\right\}{\geqslant}\frac{1}{2}-\frac{\delta }{2}$. If $T:{\mathbb{R}}^{d}\to {\mathbb{R}}^{n+k}$ is a homeomorphism such that ${W}_{1}({{T}_{z}}_{\#}\mu ,\mathcal{N}(0,{I}_{k})){\leqslant}{\epsilon}$ and Ty (Q ∪ R) ⊂ A, we get for δ, ε ⩾ 0 small enough that the Lipschitz constant of T−1 satisfies

Proof. As before, Lip(T−1) can be estimated by

Equation (10)

Due to proposition 1 and since Ty (Q ∪ R) ⊂ A, we get

Inserting this into (10) implies the claim. □

Loosely speaking, the condition Ty (Q ∪ R) ⊂ A is fulfilled if the predicted labels are close to the actual labels. Furthermore, the condition that ${W}_{1}({{T}_{z}}_{\#}\mu ,\mathcal{N}(0,{I}_{k}))$ is small is enforced via the Ly and Lz losses. Consequently, we can expect that Lip(T−1) explodes during training if the true data distribution given measurements in A is multimodal.

3.2. A remedy using Gaussian mixture models

For a multimodal input data distribution, the Lipschitz constant of T−1 explodes due to the fact that a splitting of the standard normal distribution is necessary. As a remedy, we propose to replace the latent variable $Z\sim \mathcal{N}(0,{I}_{k})$ with a GMM depending on y. Using a flexible number of modes in the latent space, we can avoid the necessity to split mass when recovering a multimodal distribution in the input data space.

Formally, we choose the random vector Z (depending on y) as $Z\sim {\sum }_{i=1}^{r}\enspace {p}_{i}(y)\mathcal{N}({\mu }_{i},{\sigma }^{2}{I}_{k})$, where r is the maximal number of modes in the latent space and pi (y) are the probabilities of the different modes. These probabilities pi are realised by an NN, which we denote with $w:{\mathbb{R}}^{n}\to {{\Delta}}_{r}$. Usually, we choose σ and μi such that the modes are nicely separated. Note that w gives our model the flexibility to decide how many modes it wants to use given the data y. Clearly, we could also incorporate covariance matrices, but for simplicity we stick with scaled identity matrices. Denoting the parameters of the NN w with u2, we now wish to solve the following optimization problem

$\underset{u=({u}_{1},{u}_{2})}{\mathrm{min}}\enspace {L}_{y}(u)+{\lambda }_{1}{L}_{z}(u)+{\lambda }_{2}{L}_{x}(u).\qquad (11)$

Next, we provide an example where choosing Z as GMM indeed reduces Lip(T−1).

Example 1. Let ${X}_{\sigma }:{\Omega}\to \mathbb{R}$ and $Z:{\Omega}\to \mathbb{R}$ be random vectors. As proven in corollary 1, we get for ${X}_{\sigma }\sim 0.5\mathcal{N}(-1,{\sigma }^{2})+0.5\mathcal{N}(1,{\sigma }^{2})$ and $Z\sim \mathcal{N}(0,1)$ that for any C > 0 and ${\epsilon}$ small enough, there exists ${\sigma }_{0}^{2}{ >}0$ such that for any ${\sigma }^{2}{< }{\sigma }_{0}^{2}$ and any $T:\mathbb{R}\to \left\{0\right\}{\times}\mathbb{R}$ with ${W}_{1}({P}_{{T}^{-1}{\circ}Z},{P}_{{X}_{\sigma }}){< }{\epsilon}$ it holds that Lip(T−1) > C.

On the other hand, if we allow Z to be a GMM, then there exists for any σ2 > 0 a GMM Z and a mapping $T:\mathbb{R}\to \left\{0\right\}{\times}\mathbb{R}$ such that ${P}_{{T}^{-1}{\circ}Z}={P}_{{X}_{\sigma }}$ and $\mathrm{Lip}({T}^{-1})=1$, namely Z = Xσ and T = I. Clearly, we can also fix $Z\sim 0.5\mathcal{N}(-1,0.1)+0.5\mathcal{N}(1,0.1)$. In this case, we can find T such that ${P}_{{T}^{-1}(Z)}={P}_{{X}_{\sigma }}$ and Lip(T−1) remains bounded as σ → 0.
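
The effect in example 1 can also be checked numerically. The following sketch (an illustration, not part of the paper) considers the monotone map ${T}^{-1}={F}_{{X}_{\sigma }}^{-1}{\circ}{\Phi}$, which pushes $\mathcal{N}(0,1)$ exactly to Xσ; its derivative at z = 0 equals φ(0)/ρXσ(0) = σ exp(1/(2σ2)), so the Lipschitz constant of this particular inverse map explodes as σ → 0, in line with corollary 1.

```python
# Illustration of example 1 (not from the paper): derivative at z = 0 of the
# monotone map T^{-1} = F_{X_sigma}^{-1} o Phi, which pushes N(0, 1) to X_sigma.
# This derivative equals sigma * exp(1 / (2 sigma^2)) and blows up as sigma -> 0.
import numpy as np
from scipy.stats import norm

def monotone_map_derivative_at_zero(sigma):
    # density of X_sigma = 0.5 N(-1, sigma^2) + 0.5 N(1, sigma^2) at x = 0
    rho0 = 0.5 * norm.pdf(0.0, loc=-1.0, scale=sigma) \
         + 0.5 * norm.pdf(0.0, loc=1.0, scale=sigma)
    return norm.pdf(0.0) / rho0

for sigma in [0.5, 0.3, 0.2, 0.1]:
    print(f"sigma = {sigma:.1f}: Lip(T^-1) >= {monotone_map_derivative_at_zero(sigma):.2e}")
```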

4. Learning INNs with multimodal latent distributions

In this section, we discuss the training procedure if T is chosen as INN together with a multimodal latent distribution. For this purpose, we have to specify and discretize our generic continuous problem (11).

4.1. Model specification and discretization

First, we fix the metric D in the loss functions (9). Due to their computational efficiency, we use discrepancies ${\mathcal{D}}_{K}$ with a multiscale version of the inverse quadric kernel, i.e.,

where ri ∈ {0.05, 0.2, 0.9}. The different scales ensure that both local and global features are captured. Clearly, computationally more involved choices such as the Wasserstein metric W1 or the sliced Wasserstein distance [29] could be used as well. The relation between Wp and ${\mathcal{D}}_{K}$ is discussed in remark 2.
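
As an illustration, the following sketch evaluates an empirical squared discrepancy of the form (4) for two sample sets. The code assumes a multiscale inverse-quadric kernel of the form K(x, y) = Σi 1/(1 + |x − y|2/ri); the exact normalization used in the implementation may differ.

```python
# Sketch of an empirical squared discrepancy as in (4), using an assumed multiscale
# inverse-quadric kernel K(x, y) = sum_i 1 / (1 + |x - y|^2 / r_i). This is the
# biased estimator; an unbiased variant would drop the diagonal terms, see [22].
import torch

def discrepancy2(x, y, scales=(0.05, 0.2, 0.9)):
    """Squared discrepancy between the empirical measures of x and y, both (m, d)."""
    def k(a, b):
        d2 = torch.cdist(a, b) ** 2
        return sum(1.0 / (1.0 + d2 / r) for r in scales)
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

x = torch.randn(500, 2)
y = torch.randn(500, 2) + 2.0
print(discrepancy2(x, y).item())   # clearly positive: the samples are far apart
```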

In the previous section, we analyzed the problem of modeling multimodal distributions with an INN in a very abstract setting by considering the data as a distribution. In practice, however, we have to sample from the random vectors X, Y and Z. The obtained samples ${({x}_{i},{y}_{i},{z}_{i})}_{i=1}^{m}$ induce empirical distributions of the form ${P}_{X}^{m}=\frac{1}{m}{\sum }_{i=1}^{m}\enspace \delta (\cdot -{x}_{i})$, which are then inserted into the loss (11). To this end, we replace ${P}_{{T}^{-1}(Y,Z;u)}$ by ${{T}^{-1}}_{\#}{P}_{(Y,Z)}^{m}$ and PT(X;u) by ${T}_{\#}{P}_{X}^{m}$, resulting in the overall problem

Equation (12)

where T depends on u1 and Z on u2. Here, the discrepancies can be easily evaluated based on (4), which is just a finite sum. Note that there are also other discretization approaches, which may have more desirable statistical properties, e.g., the unbiased version of the discrepancy outlined in [22]. For solving (12), we want to use the gradient descent based Adam optimizer [31]. This approach leads to a sequence ${\left\{{T}^{m}\right\}}_{m\in \mathbb{N}}$ of INNs that approximate the true solution of the problem. Deriving the gradients with respect to u1 is a standard task for NNs. In order to derive gradients with respect to u2, we need a sampling procedure for $Z\sim {\sum }_{i=1}^{r}\enspace {p}_{i}(y)\mathcal{N}({\mu }_{i},{\sigma }^{2}{I}_{k})$ such that the samples are differentiable with respect to u2. Without the differentiability requirement, we would just draw a number l from a random vector γ with range {1, ..., r} and P(γ = i) = pi and then sample from $\mathcal{N}({\mu }_{l},{\sigma }^{2}{I}_{k})$. In the next paragraph, the Gumbel softmax trick is utilized to achieve differentiability.
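
For concreteness, the following sketch assembles one evaluation of the discretized objective (12) for a generic model T exposing forward and inverse methods (such as the coupling block sketched in section 2.4). The discrepancy repeats the kernel sketch from above; the weights λ1, λ2 and all shapes are illustrative and not the values used in the experiments.

```python
# Sketch of one evaluation of the discretized objective (12) for a generic invertible
# model T with methods forward and inverse. The discrepancy repeats the multiscale
# kernel sketch from section 4.1 so that this snippet is self-contained.
import torch

def discrepancy2(a, b, scales=(0.05, 0.2, 0.9)):
    k = lambda d2: sum(1.0 / (1.0 + d2 / r) for r in scales)
    return (k(torch.cdist(a, a) ** 2).mean() + k(torch.cdist(b, b) ** 2).mean()
            - 2.0 * k(torch.cdist(a, b) ** 2).mean())

def inn_loss(T, x, y, z, n, lambda1=1.0, lambda2=1.0):
    """x: data samples, y: measurements, z: latent samples from the (y-dependent) GMM."""
    out = T.forward(x)                                   # T(x) = (T_y(x), T_z(x))
    yz = torch.cat([y, z], dim=1)
    loss_y = ((out[:, :n] - y) ** 2).sum(dim=1).mean()   # data fit L_y
    loss_z = discrepancy2(out, yz)                       # latent fit L_z
    loss_x = discrepancy2(T.inverse(yz), x)              # backward fit L_x
    return loss_y + lambda1 * loss_z + lambda2 * loss_x
```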

4.2. Differentiable sampling according to the GMM

Recall that the distribution function of Gumbel(0, 1) random vectors is given by F(x) = exp(− exp(−x)), which allows for efficient sampling. The Gumbel reparameterization trick [28, 39] relies on the observation that if we add Gumbel(0, 1) random vectors to the logits log(pi ), then the arg max of the resulting vector has the same distribution as γ. More precisely, for independent Gumbel(0, 1) random vectors G1, ..., Gr it holds

$\underset{i=1,\dots ,r}{\mathrm{arg}\enspace \mathrm{max}}\left(\mathrm{log}({p}_{i})+{G}_{i}\right)\overset{d}{=}\gamma .$
Unfortunately, arg max leads to a non-differentiable expression. To circumvent this issue, we replace arg max with a numerically more robust version of ${\text{softmax}}_{t}:{\mathbb{R}}^{r}\to {\mathbb{R}}^{r}$ given by ${\text{softmax}}_{t}{(p)}_{i}=\mathrm{exp}(({p}_{i}-m)/t)/{\sum }_{j=1}^{r}\enspace \mathrm{exp}(({p}_{j}-m)/t)$, where m = maxi pi . This results in a probability vector that converges to a corner of the probability simplex as the temperature t converges to zero. Instead of sampling from a single component, we then sample from all components and use the corresponding linear combination to obtain a sample from Z. This procedure for differentiable sampling from the GMM is summarized in algorithm 1. In our implementation, we use t ≈ 0.1 and decrease t during training.

Algorithm 1. Differentiable sampling according to the GMM.

Input: NN w, $y\in {\mathbb{R}}^{n}$, means μi , standard deviation σ and temperature t
Output: Vector $z\in {\mathbb{R}}^{k}$ that (approximately) follows the distribution PZ|Y (y, ⋅)
1: p = w(y)
2: Generate g = g1, ..., gr independent Gumbel(0, 1) samples
3: pi = log(pi ) + gi
4: m = maxi pi
5: ${s}_{i}={\text{softmax}}_{t}{(p)}_{i}=\mathrm{exp}(({p}_{i}-m)/t)/{\sum }_{j=1}^{r}\enspace \mathrm{exp}(({p}_{j}-m)/t)$
6: $z={\sum }_{i=1}^{r}{s}_{i}{\mu }_{i}+\sigma \mathcal{N}(0,{I}_{k})$
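
A PyTorch sketch of algorithm 1 could look as follows. The probability network w, the means μi and the noise scale σ are placeholders; torch.softmax already subtracts the maximum internally, which covers the stabilization in steps 4 and 5.

```python
# PyTorch sketch of algorithm 1: differentiable sampling from the y-dependent GMM.
# The probability network w, the means and sigma are illustrative placeholders.
import torch

def sample_gmm_gumbel(w, y, means, sigma, t=0.1):
    """w maps y to r probabilities on the simplex; means has shape (r, k)."""
    p = w(y)                                                    # shape (batch, r)
    u = torch.rand_like(p).clamp(1e-9, 1.0 - 1e-9)
    g = -torch.log(-torch.log(u))                               # Gumbel(0, 1) samples
    s = torch.softmax((torch.log(p + 1e-12) + g) / t, dim=1)    # relaxed one-hot weights
    return s @ means + sigma * torch.randn(y.shape[0], means.shape[1])

# usage sketch with r = 4 modes in a k = 2 dimensional latent space
r, k = 4, 2
w = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.ReLU(),
                        torch.nn.Linear(32, r), torch.nn.Softmax(dim=1))
means = torch.tensor([[-4., -4.], [-4., 4.], [4., 4.], [4., -4.]])
z = sample_gmm_gumbel(w, torch.randn(8, 4), means, sigma=1.0)
```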

4.3. Enforcing the Lipschitz constraint

Fitting discrete and continuous distributions in a coupled process has proven to be difficult in many cases. Gaujac et al [18] encounter the problem that both models learn with different speeds and hence tend to get stuck in local minima. To avoid such problems, we enforce additional network properties, ensuring that the latent space is used in the correct way. In this subsection, we argue that an L2-penalty on the subnetwork weights of the INN prevents the Lipschitz constants from exploding. This possibly reduces the chance of reaching undesirable local minima. Further, we want modalities to be preserved, i.e., both the latent and the input data distribution should have an equal number of modes. If this is not the case, modes have to be either split or merged and hence the Lipschitz constant of the forward or backward map explodes, respectively. Consequently, if we use a GMM for Z, controlling the Lipschitz constants is a natural way of forcing the INN to use the multimodality of Z.

Fortunately, following the techniques in [8], we can simultaneously control the Lipschitz constants of INN blocks and their inverse given by

via bounds on the Lipschitz constants of the NNs s and t. To prevent exploding values of the exponential, s is clamped, i.e., we restrict the range of the output components si to [−α, α] by using ${s}_{\text{clamp}}(x)=\frac{2\alpha }{\pi }\enspace \mathrm{arctan}(s(x)/\alpha )$ instead of s, see [5, equation (7)]. Similarly, we clamp t. Note that this does not affect the invertibility of the INN. Then, we can locally estimate on any box [a, b]d that

Equation (13)

where the constant c is an upper bound of g and 1/g on the clamped range of s, c1 depends on a, b and α, and c2 depends on α. Provided that the activation is one-Lipschitz, Lip(s) showing up in (13) is bounded in terms of the respective weight matrices Ai of s by

$\mathrm{Lip}(s){\leqslant}\prod _{i}{\Vert}{A}_{i}{{\Vert}}_{2},$
which follows directly from the chain rule. This requirement is fulfilled for most activations and in particular for the often applied ReLU activation. Note that the same reasoning is applicable for Lip(t).

Now, we want to extend (13) to our network structure. To this end, note that the layer Tl defined in (8) can be decomposed as Tl = S1,l ∘ S2,l with

Incorporating the estimate (13), we can bound the Lipschitz constants for Tl by

Then, Lip(T) and Lip(T−1) of the INN T are bounded by the respective products of the bounds on the individual blocks.

To enforce the Lipschitz continuity, we add the L2-penalty ${L}_{\text{reg}}=\frac{{\lambda }_{\text{reg}}}{2}{\sum }_{i=1}^{m}{\Vert}{A}_{i}{{\Vert}}^{2}$, where λreg is the regularization parameter and ${({A}_{i})}_{i=1}^{m}$ are the weight matrices of the subnetworks ${s}_{l,i}$ and ${t}_{l,i}$ of the INN T. This is a rather soft way to enforce Lipschitz continuity of the subnetworks. Other ways are described in [23] and include weight clipping and gradient penalties. Even with the additional L2-penalty, our model can get stuck in bad local minima, and it might be worthwhile to investigate other (stronger) ways to prevent mode merging. Further, the derivation indicates that clamping the outputs of the subnetworks helps to control the Lipschitz constant. Another way to prevent merging of modes is to introduce a sparsity-promoting loss on the probabilities, namely ${({\sum }_{i}{p}_{i}^{1/2})}^{2}$.
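
The two ingredients of this subsection can be sketched in a few lines: the arctan clamping of a subnetwork output and the L2-penalty over all subnetwork weight matrices. The values of α and λreg below are illustrative and not taken from the experiments.

```python
# Sketch of the stabilization ingredients: soft clamping of a subnetwork output and
# the L2-penalty on the subnetwork weight matrices; alpha and lambda_reg are
# illustrative values.
import math
import torch

def clamp_output(s_out, alpha=1.9):
    # s_clamp = (2 alpha / pi) * arctan(s / alpha), so the output stays in (-alpha, alpha)
    return (2.0 * alpha / math.pi) * torch.atan(s_out / alpha)

def l2_penalty(subnetworks, lambda_reg=1e-4):
    # (lambda_reg / 2) * sum of squared Frobenius norms of all weight matrices A_i
    reg = sum((m.weight ** 2).sum() for net in subnetworks
              for m in net.modules() if isinstance(m, torch.nn.Linear))
    return 0.5 * lambda_reg * reg
```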

4.4. Padding

In practice, the dimension d is often different from the intrinsic dimension of the data modeled by $X:{\Omega}\to {\mathbb{R}}^{d}$. As $Z:{\Omega}\to {\mathbb{R}}^{k}$ is supposed to describe the lost information during the forward process, it appears sensible to choose k < d − n in such cases. Similarly, if $Y:{\Omega}\to {\mathbb{R}}^{n}$ has discrete range, we usually assign each label to a unit vector ei , effectively increasing the overall dimension of the problem. In both cases, d ≠ n + k and we cannot apply our framework directly. To resolve this issue, we enlarge the samples xi or (yi , zi ) with random Gaussian noise of small variance. Note that we could also use zero padding, but then we would lose some control on the invertibility of T as the training data would not span the complete space. Clearly, padding can also be used to increase the overall problem size, adding more freedom for learning certain structures. This is particularly useful for small-scale problems, where the capacity of the NNs is relatively small.
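
A minimal sketch of the noise padding described above; the target dimension and the noise scale are illustrative choices.

```python
# Sketch of the random padding: samples are enlarged with small Gaussian noise so
# that the padded training data still span the complete space.
import torch

def pad_with_noise(samples, d_pad, noise_scale=0.01):
    """Pad samples of shape (m, d) to shape (m, d_pad) with N(0, noise_scale^2) entries."""
    m, d = samples.shape
    return torch.cat([samples, noise_scale * torch.randn(m, d_pad - d)], dim=1)

x_padded = pad_with_noise(torch.randn(8, 2), d_pad=16)   # e.g., pad 2d inputs to dimension 16
```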

As they are irrelevant for the problem that we want to model, the padded components are not incorporated into our proposed loss terms. Hence, we need to ensure that the model does not use the introduced randomness in the padding components of x = (xdata, xpad) and z = (zdata, zpad) for training. To overcome this problem, we propose to use the additional loss

Here, the first term ensures that the padding part of T stays close to zero, the second one helps to learn the correct backward mapping independent of zpad and the third one penalizes the use of padding in the x-part. Note that the first two terms were already used in [4]. Clearly, we can compute gradients of the proposed loss with respect to the parameters u2 of w in the GMM as the loss depends on zi in a differentiable manner.

5. Experiments

In this section, we underline our theoretical findings with some numerical experiments. For all experiments, the networks are chosen such that they are comparable in capacity and then trained using a similar amount of time or until they stagnate. The exact architectures and obtained parameters are available at https://github.com/PaulLyonel/StabilizingInvNetsUsingMixtureModels. Our PyTorch implementation builds on the freely available FrEIA package.

5.1. Eight modes problem

For this problem, $X:{\Omega}\to {\mathbb{R}}^{2}$ is chosen as a GMM with eight modes, i.e.,

All modes are assigned one of the four colors red, blue, green and purple. Then, the task is to learn the mapping from position to color and, more interestingly, the inverse map where the task is to reconstruct the conditional distribution P(X|Y)(y, ⋅) given a color y. Two versions of the data set are visualized in figure 1, where the data points x are colored according to their label. Here, the first set is based on the color labels from [4] and the second one uses a more challenging set of labels, where modes of the same color are further away from each other. For this problem, we are going to use a two dimensional latent distribution. Hence, the problem dimensions are d = 2, n = 4 and k = 2.

Figure 1. Positions of the eight modes experiment with two different color label sets.

As d < k + n, we have to use random padding. Here, we increase the dimension to 16, i.e., both input and output are padded to increase capacity. Unfortunately, as the padding is not part of the loss terms, the argumentation in remark 3 is not applicable anymore, i.e., training on Ly and Lz is not sufficient. This observation is supported by the fact that training without Lx and Lpad yields undesirable results for the eight modes problem: the samples in figure 2(c) are created by propagating the padded data samples through the trained INN. As enforced by Lz , we obtain the standard normal distribution. However, if we replace the padded components by zero, we obtain the samples in figure 2(d), which clearly indicates that the INN is using the random padding to fit the latent distribution.

Figure 2. Effect of L2-regularization and of padding on sampling quality.

Next, we illustrate the effect of the additional L2-penalty introduced in section 4. In figure 2(b), we provide a sampled data distribution from an INN with $Z\sim \mathcal{N}(0,{I}_{2})$ trained with large λreg. The learned model is not able to capture the multimodalities, since the L2-penalty controls Lip(T−1). This confirms our theoretical findings in section 3, i.e., that splitting up the standard normal distribution in order to match the true data distribution leads to exploding Lip(T−1).

Finally, we investigate how a GMM consisting of four Gaussians with standard deviation σ = 1 and means {(−4, −4), (−4, 4), (4, 4), (4, −4)} compares against $\mathcal{N}(0,{I}_{2})$ as latent distribution. In our experiments, we investigate the predicted labels, the backward samples and the forward latent fit for the two different label sets. Our obtained results are depicted in figures 3(a)–(d), respectively. The left column in each figure contains the label prediction of the INN, which is quite accurate in all cases. More interesting are the reconstructed input data distributions in the middle columns. As indicated by our theory, the standard INN has difficulties separating the modes. Especially in figure 3(b), many red points are reconstructed between the actual modes by the standard INN. Taking a closer look, we observe that the sampled red points still form one connected component for both examples, which can be explained by the Lipschitz continuity of the inverse INN. The results based on our multimodal INN in figures 3(c) and (d) are much better, i.e., the modes are separated correctly. Additionally, the number of learned discrete probabilities larger than zero matches the number of modes of the corresponding color. Note that our model learned the correct probabilities almost perfectly, see table 1. In the right column of every figure, we included the latent samples produced by propagating the data samples through the INN. Here, we observe that the standard INN is able to perfectly fit the standard normal distribution in both cases. The latent fits for the multimodal INN also capture the GMM well, although the sizes of the modes vary a bit.

Figure 3. Predicted labels, backward samples and latent samples for standard INNs and multimodal INNs.

Table 1. Learned probabilities for the results in figure 3(c) (left) and 3(d) (right).

5.2. MNIST image inpainting

To show that our method also works for higher dimensional problems, we apply it for image inpainting. For this purpose, we restrict the MNIST dataset [37] to the digits 0 and 8. Further, the forward mapping for an image x ∈ [0, 1]28×28 is defined as f : [0, 1]28×28 → [0, 1]6×28 with x ↦ (x)22⩽i⩽27,0⩽j⩽27, i.e., the upper part of the image is removed. Note that the problem is chosen in such a way that it is unclear whether the lower part of the image should be completed to a zero or an eight. However, for certain types of inputs, it is of course more likely that they belong to one or the other. For this example, the problem dimensions are chosen as d = 784, n = 164 and k = 14, i.e., we use a 14-dimensional latent distribution. Further, the latent distribution is chosen as a GMM with two modes and σ = 0.6. Unfortunately, forcing our model to learn a good separation of the modes is non-trivial and we use the following heuristic for choosing the means of the two mixture components. We initialize the INN with random noise (of small scale) and choose the means as a scalar multiple of the forward propagations of the INN in the Z-space, i.e., such that they are well-separated. This initialization turned out to be necessary for learning a clean splitting between zeros and eights. Furthermore, it is beneficial to add noise to both the x-data and y-data to avoid artifacts in the generation process, since MNIST as a dataset itself is quite flat, see also [5]. Although not required, we found it beneficial to replace the second term in the padding loss by ${\left\vert {x}_{i}-{T}^{-1}\left({y}_{i},{z}_{i,\text{pad}},{T}_{{z}_{\text{data}}}({x}_{i})\right)\right\vert }^{2}$, which serves as a backward MSE and makes the images sharper.

Our obtained results are provided in figure 4. The learned probabilities for the displayed inputs are (0.83, 0.17), (0.46, 0.54) and (0.27, 0.73), respectively, where the ith number is the probability of sampling from the ith mean in the latent space. Taking a closer look at the latent space, we observe that the first mode mainly corresponds to sampling a zero, whereas the second one corresponds to sampling an eight. A nice benefit is that the splitting between the two modes provides a probability estimate whether a sample y belongs to a zero or an eight. In order to understand if this estimate is reasonable, we perform the following sanity check. We pick 1000 samples ${({y}_{i})}_{i=1}^{1000}$ from the test set and evaluate the probabilities p(yi ) that those samples correspond to a zero. For every sample yi , we perform algorithm 2 to get an estimate e(yi ) for the likeliness of being a zero. Essentially, the algorithm performs a weighted sum over all samples with label 0, where the weights are determined by the closeness to the sample yi based on a Gaussian kernel. Then, we calculate the mean of |p(yi ) − e(yi )| over 1000 random samples, resulting in an average error of around 0.10–0.11. Given that the mean of |0.5 − e(yi )| is around 0.25, we conclude that the estimates from our probability net are at least somewhat meaningful and better than an uneducated guess. Intriguingly, for the images in figure 4 the estimates are (0.82, 0.18), (0.51, 0.49) and (0.25, 0.75), respectively, which nicely match the probabilities predicted by the INN.

Figure 4. Samples created by a multimodal INN. Condition y in upper left corner.

Algorithm 2. Probability estimate.

Input: sample y, comparison samples ${({y}_{i})}_{i=1}^{m}$ with labels li ∈ {0, 8}
Output: estimate for the probability of y belonging to a zero
1: e = 0, s = 0
2: for i = 1, ..., m do
3:  if li = 0 then
4:    e = e + exp(−|yyi |2)
5:  end if
6:  s = s + exp(−|yyi |2)
7: end for
8: return $\frac{e}{s}$
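
A vectorized NumPy sketch of algorithm 2; it assumes flattened measurement vectors and returns the kernel-weighted fraction of zero-labeled comparison samples.

```python
# Vectorized NumPy sketch of algorithm 2: kernel-weighted estimate of the probability
# that a measurement y belongs to a zero, given labeled comparison samples.
import numpy as np

def probability_estimate(y, ys, labels):
    """y: flattened query of shape (n,), ys: comparison samples (m, n), labels in {0, 8}."""
    weights = np.exp(-np.sum((ys - y) ** 2, axis=1))   # Gaussian kernel weights
    return weights[labels == 0].sum() / weights.sum()
```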

5.3. Inverse kinematics

In [35] the authors propose to benchmark invertible architectures on a problem inspired by inverse kinematics. More precisely, a vertically movable robot arm with three segments connected by joints is modeled mathematically as

Equation (14)

where the parameter h represents the vertical height, θ1, θ2, θ3 are the joint angles, l1, l2, l3 are the arm segment lengths and y = (y1, y2) denotes the modeled end positions. Given the end position y, our aim is to reconstruct the arm parameters x = (h, θ1, θ2, θ3). For this problem, the dimensions are chosen as d = 4, n = 2 and k = 2, i.e., no padding has to be applied. This is an interesting benchmark problem as the conditional distributions can be multimodal, the parameter space is large but still interpretable and the forward model (14), denoted by f, is more complex compared to the preceding tasks.

As a baseline, we employ approximate Bayesian computation (ABC) to get x samples satisfying y = f(x), see algorithm 3. For this purpose, we impose the priors $h\sim \frac{1}{2}\mathcal{N}(1,\frac{1}{64})+\frac{1}{2}\mathcal{N}(-1,\frac{1}{64})$, ${\theta }_{1},{\theta }_{2},{\theta }_{3}\sim \mathcal{N}(0,\frac{1}{4})$ and set the segment lengths to l = (0.5, 0.5, 1). Additionally, we sample using an INN with $Z\sim \mathcal{N}(0,{I}_{2})$ and a multimodal INN with $Z\sim {p}_{1}(y)\mathcal{N}((-2,0),0.09{I}_{2})+{p}_{2}(y)\mathcal{N}((2,0),0.09{I}_{2})$. At this point, we want to remark that computing accurate samples using ABC is very time-consuming and not competitive against INNs. In particular, the ABC sampling procedure only yields samples for one fixed y, whereas the INN approach recovers the posteriors for all possible y.

Algorithm 3. Approximate Bayesian computation.

Input: prior p, some vector y, forward operator f, threshold ɛ
Output: m samples from the approximate posterior p(⋅|y)
1: for i = 1, ..., m do
2:  repeat
3:    sample xi according to the prior p and accept if |yf(xi )|2 < ɛ
4:  until sample xi is accepted
5: end for
6: return samples ${({x}_{i})}_{i=1}^{m}$
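
A NumPy sketch of algorithm 3, with the prior sampler and the forward operator f passed in as callables; note that the acceptance loop can become very slow for small ε, which is the computational drawback mentioned above.

```python
# NumPy sketch of algorithm 3 (ABC rejection sampling). The prior sampler and the
# forward operator f are passed as callables; eps is the acceptance threshold.
import numpy as np

def abc_rejection(prior_sample, f, y, eps, m):
    """Return m approximate posterior samples x with |y - f(x)|^2 < eps."""
    samples = []
    while len(samples) < m:
        x = prior_sample()
        if np.sum((y - f(x)) ** 2) < eps:
            samples.append(x)
    return np.array(samples)
```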

The predicted posteriors for y = (0, 1.5) and $\tilde {y}=(0.5,1.5)$ are visualized in figure 5, where we plot the four line segments determined by the forward model (14). Note that the learned probabilities p(y) = (0.52, 0.48) and $p(\tilde {y})=(0,1)$ nicely fit the true modality of the data. As an approximation quality measure, we first calculate the discrepancy ${\mathcal{D}}_{K}({\tilde {x}}_{i},{x}_{i})$, where ${({\tilde {x}}_{i})}_{i=1}^{m}$, m = 2000, are samples generated by algorithm 3 and ${({x}_{i})}_{i=1}^{m}$ are generated by the respective INN methods, i.e., by sampling from T−1(y, Z). This quantity is then averaged over five runs with a single set of rejection samples. Based on the computed samples and the forward model, we also calculate the resimulation error $R(y)=\frac{1}{m}{\sum }_{i}\vert f({x}_{i})-y{\vert }^{2}$. Both quality measures are computed five times and the average of the obtained results is provided in table 2. Note that we are evaluating the posterior only for two different y. As the loss function is based on the continuous distribution, results are likely to be slightly noisy.

Figure 5. Sampled arm configurations for fixed positions y and $\tilde {y}$ using different methods.

Table 2. Evaluation of INN and multimodal INN with samples ${\tilde {x}}_{i,y}$ created by algorithm 3 and samples xi,y from T−1(y, Z).

Method | ${\mathcal{D}}_{K}({x}_{i,y},{\tilde {x}}_{i,y})$ | R(y) | ${\mathcal{D}}_{K}({x}_{i,\tilde {y}},{\tilde {x}}_{i,\tilde {y}})$ | $R(\tilde {y})$
INN | 0.033 | 0.009 | 0.013 | 0.0003
Multimodal INN | 0.030 | 0.0003 | 0.010 | 0.0011

Overall, this experiment confirms that our method excels at modeling multimodal distributions, whereas for unimodal distributions a standard INN suffices as well. Further, we observe that our method is flexible enough to model both types of distributions even in the case when y is not discrete.

6. Conclusions

In this paper, we investigated the latent space of INNs for inverse problems. Our theoretical results indicate that a proper choice of Z is crucial for obtaining stable INNs with reasonable Lipschitz constants. To this end, we parameterized the latent distribution Z in dependence on the labels y in order to model properties of the input data. This resulted in significantly improved behavior of both the INN T and T−1. In some sense, we can interpret this as adding more explicit knowledge to our problem. Hence, the INN only needs to fit a simpler problem compared to the original setting, which is usually possible with simpler models.

Unfortunately, establishing a result similar to theorem 1 that also incorporates approximation errors seems quite hard. A first attempt under rather strong assumptions is given in theorem 2. Relaxing the assumptions such that they fit the chosen loss terms would give a useful error estimate between the sampled posterior based on the INN and the true posterior. Further, it would be interesting to apply our model to real-world inverse problems. Possibly, our proposed framework can be used to automatically detect multimodal structures in such data sets. To this end, the application of more sophisticated controls on the Lipschitz constants [11, 21] could be beneficial. Additionally, we could also enrich the model by learning the parameters of the permutation matrices using a similar approach as in [25, 32]. Finally, it would be interesting to further investigate the role and influence of padding for the training process.

Acknowledgments

The authors want to thank Johannes Hertrich and Gabriele Steidl for fruitful discussions on the topic.

Data availability statement

All data that support the findings of this study are included within the article (and any supplementary files).
