InClass nets: independent classifier networks for nonparametric estimation of conditional independence mixture models and unsupervised classification

Konstantin T Matchev; Prasanth Shyamsundar

doi:10.1088/2632-2153/ac6483

1. Introduction

In this section we shall introduce the general problem of nonparametric estimation of conditional independence mixture models (CIMMs), discuss related work, and briefly describe our machine learning based estimation technique. In section 1.1 we define CIMMs and discuss their different estimation paradigms, namely parametric, semi-parametric and non-parametric. In sections 1.2 and 1.3 we review related ideas in the literature, and in section 1.4 we provide a high-level overview of our technique, before filling in the technical details in the subsequent sections.

1.1. Conditional independence mixture models

In many fields of science one encounters multivariate models which consist of several distinct sub-populations or components, say C in number. Each component $i\in\{1,\dots, C\}$ has its own characteristic probability density function $f\,^{(i)}(\mathcal{X})$ of the relevant multi-dimensional variable ${\mathcal{X}}$ . Such models are referred to as multivariate finite mixture models [1], and the probability density of ${\mathcal{X}}$ under such a model is given by

$\begin{equation} {\mathcal{P}}({\mathcal{X}}) = \sum_{i = 1}^C~w_i\,f\,^{(i)}({\mathcal{X}})\,,\quad \textrm{with} \quad \sum_{i = 1}^C~w_i = 1\,,\quad \textrm{and} \quad w_i \geqslant 0~~~\forall i\in\{1,\dots, C\}\,, \end{equation} \tag{ 1 }$

where the non-negative weights w_i parameterize the mixing proportions of the individual components. An important special case of these finite mixture models is that of the conditional independence multivariate finite mixture models³ [3, 4], which for brevity we will simply refer to as CIMMs. Under this special case, the variable ${\mathcal{X}}$ is parameterized using V 'variates' as ${\mathcal{X}} \equiv (x_1,\dots,x_V)$ such that for each component i, the density function $f\,^{(i)}$ factorizes into a product of distributions for the individual variates as

$\begin{equation} f\,^{(i)}({\mathcal{X}}) = \prod_{v = 1}^V~f\,^{(i)}_v(x_v)\,,\qquad \forall i\in\{1,\dots, C\}\,, \end{equation} \tag{ 2 }$

so that (1) becomes

$\begin{equation} {\mathcal{P}}({\mathcal{X}}) = \sum_{i = 1}^C~w_i~\prod_{v = 1}^V~f\,^{(i)}_v(x_v)\,. \end{equation} \tag{ 3 }$

Here $f\,^{(i)}_v(x_v)$ is the unit-normalized probability density of x_v within component i—the top index (i) denotes the component and the bottom index v denotes the variate. In our treatment, the individual variates x_v are themselves allowed to be multi-dimensional. In other words, V is not the dimensionality of the data, but rather the number of groups the attributes in ${\mathcal{X}}$ can be partitioned into so that they are (conditionally) independent of each other within the given component i a datapoint belongs to. This is similar to the treatment, for example, in [3, 5, 6]. In particular, the technique we develop below will be applicable in situations where the variates $x_\mathrm v$ are high-dimensional (dim( $x_\mathrm v$ ) $\gg 1$ ). Unless otherwise stated, henceforth a 'mixture model' shall refer to the CIMM of (3).

1.1.1. Applications of conditional independence mixture models

CIMMs have applications in situations where the correlations and dependence between different variables in the data are explained in terms of a latent or hidden confounding variable which influences the observed variables. This is referred to as Latent Structure Analysis (LSA) [7, 8]. In particular, when the confounding variable is discrete or categorical, it can be interpreted as representing the class a given datapoint belongs to. Such models are referred to as latent class models and their study and analysis is referred to as latent class analysis (LCA) [9].

The connection between CIMMs and LCMs can be seen in a straightforward manner as follows. We can sample a datapoint as per the mixture model in (3) by first generating the component index $i\in\{1,\dots, C\}$ as per the multinomial probability distribution induced by the weights w_i, and then sampling $(x_1,\dots,x_V)$ as per the distribution $f\,^{(i)}({\mathcal{X}})$ within component i. Now, the component index i can be interpreted as the latent variable that explains the dependence between the random variates $\{x_1,\dots,x_V\}$ in the mixture.

CIMMs, LCA, and mixture models in general have applications in a wide range of fields, including econometrics [10, 11], social sciences [12–15], bioinformatics [16], astronomy and astrophysics [17–23], high energy physics [24–26], and many others.

1.1.2. Nonparametric estimation of conditional independence mixture models

Estimation of a CIMM is simply the process of estimating the weights w_i and functions $f\,^{(i)}_v$ (assuming the number of components C is known) from a dataset sampled from the joint distribution ${\mathcal{P}}({\mathcal{X}})$ of the variates under the mixture model, see (3).

Under 'parametric' estimation of mixture models (conditionally independent or otherwise), one assumes that each of the distributions $f\,^{(i)}({\mathcal{X}})$ is from an appropriately chosen parametric class of distributions, e.g. multivariate Gaussians. The choice of the class of functions assumed to contain the true $f\,^{(i)}({\mathcal{X}})$ is informed by practitioner's prior knowledge of the problem at hand. This reduces the problem of estimating the mixture to the more tractable problem of estimating the weights w_i and the parameter values corresponding to the true $f\,^{(i)}({\mathcal{X}})$ . This weight and parameter estimation from the data is typically approached as a maximum likelihood estimation problem [27] (often tackled using the expectation–maximization algorithm [28]) or within a Bayesian approach [29].

'Semi-parametric' estimation of mixture models has been studied in many works, including [3, 30–32]. In this paper, however, we are interested in the 'nonparametric' estimation of CIMMs, i.e. no parametric forms will be assumed for the functions $f\,^{(i)}_v(x_v)$ . Nonparametric estimation has been addressed by several works recently [4–6, 31, 33–38]. In this paper, we introduce a novel machine-learning-based-approach, called the InClass nets technique, to split the dataset into its different components in a nonparametric way. This splitting naturally leads to the estimation of the mixture model. In order to perform the splitting, the InClass nets technique directly exploits the fact that the variates $x_\mathrm v$ are mutually independent of each other within each component. Earlier known approaches nonparametric CIMM estimation either (a) discretize the data-space into bins and estimate the representative value of the component functions $f\,_v^{(i)}$ in those bins [33], (b) estimate each component distribution using a smoothed kernel-based approach [4–6, 31, 34–36], or (c) express each component distribution in terms of a set of basis functions [38]. Due to the curse of dimensionality, these approaches are applicable only when the individual variates x_v are low-dimensional. Due to neural networks (NNs) ability to handle high dimensional data, our technique can tackle situations where the individual variates $x_\mathrm v$ are high-dimensional—this is the biggest advantage offered by our machine-learning-based-technique over existing approaches. This opens up the possibility of using CIMMs for hitherto unfeasible applications.

The estimation of mixture models is closely tied to the concept of 'identifiability' of mixture models. A statistical model is said to be identifiable if it is theoretically possible to estimate the model (i.e. uniquely identify the parameters and functions that describe it) based on an infinite dataset sampled from it. A model will not be identifiable if two or more parameterizations of the model are observationally indistinguishable even with an infinite dataset.

In the situations where the CIMM is identifiable, our technique estimates the true w_i and $f\,^{(i)}_v$ . On the other hand, when the model is not identifiable, our technique will yield one of the parameterizations that best fits the available data. In section 4.1, we provide some new results on the (nonparametric) identifiability of bivariate (V = 2) CIMMs, to supplement existing results on nonparametric mixture model identifiability [2, 5, 33, 39–45].

1.2. Unsupervised classification in machine learning

In this paper, we approach the estimation of CIMMs as a classification problem—classifying the datapoints in a given dataset into the different categories will lead to the estimation of the mixture model in a straightforward way.

This classification needs to be performed in an unsupervised manner since the dataset being analyzed does not contain labels for the component each datapoint belongs to. In this way, our method has connections to unsupervised clustering techniques like k-means clustering [46] and other density-based clustering techniques [47]. However, our approach does not rely on the different components being spatially clustered to perform the classification.

The intuition behind our method can be understood as follows: In supervised classification, the target class labels associated with the training datapoints serve as the supervisory signal for training the classifier. In the absence of target labels, a quantity that is dependent on or shares mutual information with, the (unavailable) target label can be used as the supervisory signal. Now, lets say we are training a classifier that bases its decision or output only on the first variate x₁. The other variates $\{x_2,\dots,x_V\}$ can serve as the supervisory signal, since they contain information regarding the component i the datapoint belongs to. Our approach can be thought of as training V classifiers, one for each of the V variates, with each classifier relying on the other V − 1 variates to act as the supervisory signal for training.

There are a few ways of interpreting and actualizing this intuition [48–50]. For example, in [50], NNs were trained without supervision to classify images, using a training dataset consisting of pairs of images, where the images in a given pair are from the same category. In the InClass nets approach developed in this paper, the NN architecture and training cost functions we develop will primarily be geared towards estimating CIMMs, which, as shown below in section 3.3, can nevertheless be used for classification similar to the technique of [50]. In section 5.2, we also discuss a straightforward extension of the technique in [50] to handle n-tuples of data, where the components of the n-tuple could be from different sample spaces (as opposed to pairs of images from the same sample space of images).

The idea of using a quantity that shares information with the true labels as the supervisory signal has been employed previously in weakly supervised classification techniques like 'Learning from Label Proportions' (LLPs) [51–53] and 'Classification Without Labels' (CWoLa) [54]. LLP and CWoLa learn to distinguish between different classes of datapoints, using multiple mixed datasets which differ in the mixing proportions of the classes—the identity of the mixed dataset a given datapoint belongs to serves as the supervisory signal, since it contains information about the class the datapoint belongs to (due to the mixing proportions in different mixtures being different). While CWoLa and LLP are not fully unsupervised techniques (since they still require a label indicating which mixture a training datapoint belongs to), they are applicable even in situations where the distribution of the feature ${\mathcal{X}}$ within a given class i does not factorize as $\displaystyle\prod_{v = 1}^V\,f\,^{(i)}_v$ .

1.3. Other related work

The idea of separating a mixture into its components using a mutual-information-based technique is similar in spirit to Independent Component Analysis (ICA) [55]. However, ICA solves a signal separation problem where multiple mixtures with different mixing weights for the components are provided—this is different from the problem of separating data from a single CIMM into its components.

Bayesian nonparametric methods have been applied in the context of mixture models to select the number of components C in the mixture model using the data itself [56]. In this technique, the individual components of the mixture themselves are parameterized. In contrast, our technique assumes that the number of components C is a priori known (we briefly discuss how C could be estimated in section 4.3), but the individual component distributions are left nonparameterized.

Mixture models have applications in data analysis in high energy physics [24–26], where datasets are mixtures of 'events' (datapoints) produced under different 'processes' (categories). The ${}_s\mathcal{P}lot$ technique [57], which is popular in data analysis in high energy physics, is used to analyze bivariate CIMMs where the distribution of one of the variables (referred to as the discriminating variable) is known a priori. In such situations, the ${}_s\mathcal{P}lot$ technique can estimate the distribution of the other variable, referred to as the control variable. On the other hand, the InClass nets approach introduced in this paper is capable of estimating the mixture model 'without any knowledge of the distributions of any of the variables'. In section 4.5, we describe how the InClass nets approach can be modified to incorporate additional information about the distributions of some of the variates.

1.4. Synopsis

1.4.1. Review of parametric model estimation using maximum likelihood estimation

Our technique for nonparametric mixture model estimation is closely related to the estimation of parametric models using maximum likelihood estimation (MLE). Under MLE, we have a parametric class of probability distributions $\{{\mathcal{P}}_{\boldsymbol{\theta}}\,:\,\boldsymbol{\theta}\in\Theta\}$ , parameterized by θ (possibly multi-dimensional). We are provided a dataset $\{{\mathcal{X}}_1,\dots, {\mathcal{X}}_N\}$ of size N sampled from an unknown data-distribution ${\mathcal{P}}_{\boldsymbol{\theta}^\ast}$ , which is known to belong to the parametric class. The goal is to determine the value of $\boldsymbol{\theta}^\ast$ (assuming the model is identifiable). The MLE estimator for $\boldsymbol{\theta}^\ast$ is given by

$\begin{align} \hat{\boldsymbol{\theta}}_\mathrm{MLE} = \mathop{\textrm{arg}\,\textrm{max}}_{\boldsymbol{\theta}\in\Theta}\left[\frac{1}{N}\sum_{a = 1}^N~\ln {\mathcal{P}}_{\boldsymbol{\theta}}({\mathcal{X}}_a)\right] = \mathop{\textrm{arg}\,\textrm{min}}_{\boldsymbol{\theta}\in\Theta}\left[-\frac{1}{N}\sum_{a = 1}^N~\ln {\mathcal{P}}_{\boldsymbol{\theta}}({\mathcal{X}}_a)\right]. \end{align} \tag{ 4 }$

The asymptotic consistency of this estimator follows from noting that

$\begin{align} \mathop{\textrm{plim}}_{N\rightarrow \infty}~\hat{\boldsymbol{\theta}}_\mathrm{MLE} & = \mathop{\textrm{arg}\,\textrm{min}}_{\boldsymbol{\theta}\in\Theta}\Big[E_{{\mathcal{P}}_{\boldsymbol{\theta}^\ast}}\big[-\ln {\mathcal{P}}_{\boldsymbol{\theta}}({\mathcal{X}})\big]\Big] \end{align} \tag{ 5a }$

$\begin{align} & = \mathop{\textrm{arg}\,\textrm{min}}_{\boldsymbol{\theta}\in\Theta}\Big[E_{{\mathcal{P}}_{\boldsymbol{\theta}^\ast}}\big[\ln {\mathcal{P}}_{\boldsymbol{\theta}^\ast}({\mathcal{X}}) - \ln {\mathcal{P}}_{\boldsymbol{\theta}}({\mathcal{X}})\big]\Big] \end{align} \tag{ 5b }$

$\begin{align} & = \mathop{\textrm{arg}\,\textrm{min}}_{\boldsymbol{\theta}\in\Theta}\Big[\mathrm{KL}\Big[{\mathcal{P}}_{\boldsymbol{\theta}^\ast}~\big|\big|~{\mathcal{P}}_{\boldsymbol{\theta}} \Big]\Big]\, , \end{align} \tag{ 5c }$

where $\mathop{\textrm{plim}}$ denotes convergence in probability, and $\mathrm{KL}\Big[{\mathcal{P}}_{\boldsymbol{\theta}^\ast}~\big|\big|~{\mathcal{P}}_{\boldsymbol{\theta}} \Big]$ is the Kullback–Leibler divergence from ${\mathcal{P}}_{\boldsymbol{\theta}}$ to ${\mathcal{P}}_{\boldsymbol{\theta}^\ast}$ , which is minimized when ${\mathcal{P}}_{\boldsymbol{\theta}}$ equals ${\mathcal{P}}_{\boldsymbol{\theta}^\ast}$ almost everywhere.

1.4.2. Blueprint for nonparametric estimation of CIMMs

The preceding review of MLE suggests the following approach to estimating CIMMs nonparametrically using a dataset sampled from the data-distribution ${\mathcal{P}}^\ast$ .

(a)
Search through the space of CIMMs ${\mathcal{P}}$ of the form given in (3), and
(b)
Minimize an appropriate objective function which depends on ${\mathcal{P}}$ and the available data. Asymptotically, minimizing the objective function should be equivalent to minimizing the Kullback–Leibler divergence from ${\mathcal{P}}$ to the data distribution ${\mathcal{P}}^\ast$ .

We will use NNs to parameterize CIMMs. More specifically, we will use V different NN-based classifiers (one for each variate), along with the marginal distributions of the individual variates in the data, to parameterize CIMMs. We refer to this as the Independent Pseudo Classifiers (IPCs) representation of CIMMs (section 2.1.1). We will show that every CIMM has a (non-unique) IPC representation. Thus the task of searching through the space of CIMMs has been converted into the task of searching through the space of classifiers, i.e, training the NN-based classifiers using an appropriate cost function.

Next, in section 2.1.4, we will derive an expression for the KL divergence from the distribution ${\mathcal{P}}$ (in the IPC representation) to the data distribution ${\mathcal{P}}^\ast$ (up to a constant term independent of ${\mathcal{P}}$ ). This is the quantity to be minimized in order to estimate ${\mathcal{P}}^\ast$ . After deriving the expression for $\mathrm{KL}\Big[{\mathcal{P}}^\ast~\big|\big|~{\mathcal{P}}\Big]$ , one can find a data-sample-based estimator for the same by replacing expectations over ${\mathcal{P}}^\ast$ with sample means. This leads to a cost function which only depends on the NN-classifier 'outputs' for the various 'input' datapoints in the sample. Minimizing this cost function is asymptotically equivalent to minimizing $\mathrm{KL}\Big[{\mathcal{P}}^\ast~\big|\big|~{\mathcal{P}}\Big]$ . In section 2.1, we will also describe how the weights w_i and component distributions $f\,^{(i)}_v$ of the estimated mixture model can be extracted from the classifier networks.

The rest of the paper is organized as follows. In section 2.2 we use the bivariate case as a simple case study to summarize the results from section 2.1, in the order in which they will be used in a typical analysis. We provide a public implementation of InClass nets as a Python package called RainDancesVI and use it to validate our InClass nets technique with several worked out examples in section 3. In section 4.1 we then derive some new results on the nonparametric identifiability of bivariate CIMMs, in the form of a necessary and a (different) sufficient condition for a bivariate CIMM to be identifiable. In sections 4 and 5 we discuss our technique in the context of science applications, and provide possible variations and extensions, before finally summarizing in section 6.

2. Methodology

2.1. Independent classifier networks (InClass nets)

For the purpose of nonparametric estimation of CIMMs, we introduce a new NN architecture which we shall call 'Independent Classifier networks' or 'InClass nets' for short. Under InClass nets, the V variates $\{x_1,\dots,x_V\}$ of the input ${\mathcal{X}}$ are fed into V independent NNs—one variate for each independent network. Each of the V networks returns a multi-class classifier output. More explicitly, for each $v\in \{1,\dots, V\}$ , the vth classifier network returns a vector $\left(\eta^{(1)}_v(x_v),\dots,\eta^{(C)}_v(x_v)\right)$ , whose ith component can 'roughly' be interpreted as the probability that a datapoint belongs to category i, conditional only on its x_v value.

The outputs of the independent classifiers are constrained to obey

$\begin{align} \eta^{(i)}_v(x_v) \geqslant 0\,,&\qquad \forall (i,v)\in\{1,\dots, C\}\times\{1,\dots, V\}\,, \end{align} \tag{ 6a }$

$\begin{align} \sum_{i = 1}^C~\eta^{(i)}_v(x_v) = 1\,,&\qquad \forall v\in\{1,\dots, V\}\,, \end{align} \tag{ 6b }$

possibly using the softmax output layer⁴ [58] as follows:

$\begin{equation} \eta^{(i)}_v(x_v) = \texttt{softmax}^{(i)}\left(z^{(1)}_v,\dots,z^{(C)}_v\right) \,,\qquad \forall (i,v)\in\{1,\dots, C\}\times\{1,\dots, V\}\,, \end{equation} \tag{ 7 }$

where the $z^{(i)}$ -s are the inputs to the final output layer (which performs the softmax operation) of the corresponding network. Alternatively, if softmax is used as an activation function of the final layer, then the $z^{(i)}$ -s represent the outputs of the layer before applying the activation function. The $\texttt{softmax}$ function is defined as

$\begin{equation} \texttt{softmax}^{(i)}\left(z^{(1)}_v,\dots,z^{(C)}_v\right) \equiv \frac{\exp\!{\left(z^{(i)}_v\right)}}{\sum\limits_{j = 1}^C~\displaystyle \exp\!{\left(z^{(j)}_v\right)}}\,. \end{equation} \tag{ 8 }$

Figure 1 illustrates the basic architecture of InClass nets. In the next few sections, we will build the framework for estimating CIMMs using InClass nets.

**Figure 1.** Basic architecture of independent classifier networks (InClass nets). The V variates $\{x_1,\dots,x_V\}$ of the input ${\mathcal{X}}$ are fed into V independent NNs, each of which returns a multi-class classifier output $\eta^{(i)}_v(x_v)$ for $v\in \{1,\dots, V\}$ .
Download figure:
Standard image High-resolution image

Recall that the variates x_v can be multi-dimensional. In particular, InClass nets can handle high-dimensional data types like images. The choice of architecture for the individual classifier networks can be influenced by the nature of the input data the classifier will handle.

For the purposes of this paper, we have restricted the output dimensionality of the individual classifiers to be the same (equal to C). We have also restricted the inputs $\{x_1,\dots,x_V\}$ of the individual classifiers to form a non-overlapping partition of the features or attributes in ${\mathcal{X}}$ , which is in line with the structure of CIMMs. However, InClass nets can have applications outside mixture model estimation as well, and for those purposes it may be appropriate to lift the above restrictions. For example, InClass nets can be used to perform unsupervised 'multi-label' classification, where the outputs of different classifiers correspond to different labels. In this case, the different classifiers can have different output dimensionalities and the inputs to these networks can also potentially have overlapping features. In section 5.2 we will briefly indicate how the multi-label variant of InClass nets can be trained to perform unsupervised classification by maximizing the mutual information between the classifier outputs.

2.1.1. Parameterizing mixture models with InClass nets

In this section we will show how CIMMs can be parametrized using InClass nets. The parametrization will be done using the IPCs representation of mixture models which will be introduced in section 2.1.1. But as a useful lead-up, let us first introduce the Constrained Independent Classifiers (CICs) representation.

2.1.1.1. Constrained independent classifiers representation

The mixture model in (3) is completely specified by the mixture weights w_i and the distributions $f\,^{(i)}_v$ . Recall that they satisfy

$\begin{align} w_i\geqslant 0\,, ~~f\,^{(i)}_v(x_v) \geqslant 0\,,&\qquad\forall (i,v)\in \{1,\dots, C\}\times\{1,\dots, V\}\,, \end{align} \tag{ 9a }$

$\begin{align} \sum_{i = 1}^C~w_i = 1\,,& \end{align} \tag{ 9b }$

$\begin{align} \int dx_v~f\,^{(i)}_v(x_v) = 1\,,&\qquad \forall (i,v)\in \{1,\dots, C\}\times\{1,\dots, V\}\,.\end{align} \tag{ 9c }$

The goal of this paper is to develop a machine learning technique to fit a mixture model to the given data in an agnostic, nonparametric, manner. In other words, we will estimate the weights w_i and the distributions $f\,^{(i)}_v$ , without assuming, a priori, any parameterized forms (like Gaussians, exponentials, etc) for $f\,^{(i)}_v$ . We will approach this as an unsupervised multi-class classification problem—classifying the data into different components will automatically result in an estimation of the mixture model⁵ . To this end, let us rewrite the mixture model distribution in terms of the marginal distributions ${\mathcal{P}}_{\!v}(x_v)$ of the individual variates and multi-class classifiers $\alpha^{(i)}_v(x_v)$ given by

$\begin{align} {\mathcal{P}}_{\!v}(x_v) & = \int dx_{1}\,\dots\int dx_{v-1}\int dx_{v+1}\,\dots\int dx_{V}~{\mathcal{P}}({\mathcal{X}}) \end{align} \tag{ 10a }$

$\begin{align} & = \sum_{i = 1}^C~w_i\,f\,^{(i)}_v(x_v)\,,\qquad \forall v\in\{1,\dots, V\}\,, \end{align} \tag{ 10b }$

$\begin{align} \alpha^{(i)}_v(x_v) & = \frac{w_i\,\,f\,^{(i)}_v(x_v)}{{\mathcal{P}}_{\!v}(x_v)}\,,\qquad \forall (i,v)\in\{1,\dots, C\}\times\{1,\dots, V\}\,. \end{align} \tag{ 10c }$

${\mathcal{P}}_{\!v}$ is the probability density of the vth variate in the full mixture and can be directly accessed from a dataset sampled from ${\mathcal{P}}$ . $\alpha^{(i)}_v(x_v)$ can be interpreted as the probability that an observed datapoint is from component i conditional on the value of x_v. The vector function $\left(\alpha^{(1)}_v(x_v),\dots,\alpha^{(C)}_v(x_v)\right)$ can be interpreted as the output of a multi-class 'classifier' that returns the probability of a datapoint ${\mathcal{X}}$ to have come from the different components based only on the vth variate. At this point, one might already notice an emerging connection with InClass nets, which we shall crucially exploit below. The marginals density functions ${\mathcal{P}}_{\!v}$ and the multi-class classifiers $\alpha^{(i)}_v$ satisfy

$\begin{align} {\mathcal{P}}_{\!v}(x_v)\geqslant 0\,,~~\alpha^{(i)}_v(x_v) &\geqslant 0\,,\quad\forall(i,v)\in \{1,\dots, C\}\times\{1,\dots, V\}\,, \end{align} \tag{ 11a }$

$\begin{align} \int dx_v~{\mathcal{P}}_{\!v}(x_v) & = 1\,,\quad\forall v\in \{1,\dots, V\}\,, \end{align} \tag{ 11b }$

$\begin{align} \sum_{i = 1}^C \alpha^{(i)}_v(x_v) & = 1\,,\quad\forall v\in \{1,\dots, V\}\,, \end{align} \tag{ 11c }$

$\begin{align} \int dx_v~{\mathcal{P}}_{\!v}(x_v)\,\alpha^{(i)}_v(x_v) & = \int dx_{u}~{\mathcal{P}}_{\!u}(x_{u})\,\alpha^{(i)}_{u}(x_{u})\,,~~\forall(i,v,u)\in \{1,\dots, C\}\times\{1,\dots, V\}^2\,, \end{align} \tag{ 11d }$

where the integrals in (11d ) are simply equal to the weight w_i of the ith component. There is a one-to-one map⁶ from the description of the mixture model in terms of the w_i-s and $f\,^{(i)}_v$ -s satisfying (9) to the description in terms of ${\mathcal{P}}_{\!v}$ -s and $\alpha^{(i)}_v$ -s satisfying (11). This can be seen from the existence of the inverse transform shown below:

$\begin{align} w_i & = \int dx_v~{\mathcal{P}}_{\!v}(x_v)\,\alpha^{(i)}_v(x_v) \equiv E_{{\mathcal{P}}}\left[\alpha^{(i)}_v\right]\,,&&\qquad \forall (i,v)\in \{1,\dots, C\}\times\{1,\dots, V\}, \end{align} \tag{ 12a }$

$\begin{align} f\,^{(i)}_v(x_v) & = \frac{{\mathcal{P}}_{\!v}(x_v)~\alpha^{(i)}_v(x_v)}{w_i}\,,&&\qquad \forall (i,v)\in \{1,\dots, C\}\times\{1,\dots, V\}, \end{align} \tag{ 12b }$

where $E_{{\mathcal{P}}}[\,\cdots]$ represents the expectation value of $\,\cdots$ under the model. The probability density of ${\mathcal{X}}$ under the corresponding mixture model is given by

$\begin{align} {\mathcal{P}}({\mathcal{X}}) & = \sum_{i = 1}^C~w_i~\prod_{v = 1}^V~\frac{{\mathcal{P}}_{\!v}(x_v)~\alpha^{(i)}_v(x_v)}{w_i} \end{align} \tag{ 13a }$

$\begin{align} & = \left[\prod_{v = 1}^V {\mathcal{P}}_{\!v}(x_v)\right]~~\left[\sum_{i = 1}^C~w_i^{1-V}~\prod_{v = 1}^V~\alpha^{(i)}_v(x_v)\right]\,.\end{align} \tag{ 13b }$

As mentioned earlier, the marginal distributions ${\mathcal{P}}_{\!v}$ can be directly estimated from the data. The functions $\alpha^{(i)}_v$ can 'potentially' be modeled using InClass nets. The only hurdle is that while the outputs $\alpha^{(i)}_v$ of the V NNs can be constrained to obey (11a ) and (11c ) using the softmax output layer (as seen in (6)), the constraint in (11d ) in general will not be satisfied by independent classifiers. We will handle this difficulty next in section 2.1.1. We will refer to the description in terms of ${\mathcal{P}}_{\!v}$ -s and $\alpha^{(i)}_v$ -s satisfying (11a )–(11d ) as the CICs representation of the mixture model.

2.1.1.2. Independent pseudo classifiers representation

To accommodate the fact that independent classifiers will not obey the constraint (11d ) of the CICs representation, we introduce the IPCs representation in terms of pseudo marginals ${\mathcal{Q}}_{v}$ and pseudo classifiers $\beta^{(i)}_v$ which only satisfy the equivalents of constraints (11a –11c ):

$\begin{align} {\mathcal{Q}}_{v}(x_v)\geqslant 0\,,~~\beta^{(i)}_v(x_v) &\geqslant 0\,,&&\qquad \forall(i,v)\in \{1,\dots, C\}\times\{1,\dots, V\}\,, \end{align} \tag{ 14a }$

$\begin{align} \int dx_v~{\mathcal{Q}}_{v}(x_v) & = 1\,,&&\qquad \forall v\in \{1,\dots, V\}\,, \end{align} \tag{ 14b }$

$\begin{align} \sum_{i = 1}^C \beta^{(i)}_v(x_v) & = 1\,,&&\qquad \forall v\in \{1,\dots, V\}\,.\end{align} \tag{ 14c }$

The mixture weights under the IPC representation are given by

$\begin{equation} w_i = \frac{\tilde{w}_i}{\sum\limits_{j = 1}^C~\tilde{w}_{j}}\,,\qquad \forall i\in \{1,\dots, C\}\,, \end{equation} \tag{ 15 }$

where the unnormalized weights $\tilde{w}_i$ -s are given by

$\begin{align} \tilde{w}_i = \left[\prod_{v = 1}^V\int dx_v~{\mathcal{Q}}_{v}(x_v)\,\beta^{(i)}_v(x_v)\right]^{1/V} & = \left[\prod_{v = 1}^V~E_{{\mathcal{Q}}}\left[\beta^{(i)}_v\right]\right]^{1/V}\nonumber\\ & \equiv \left[\prod_{v = 1}^V~\varphi^{(i)}_v\right]^{1/V}\,,\qquad \forall i\in \{1,\dots, C\}\,, \end{align} \tag{ 16 }$

where $\varphi^{(i)}_v \equiv E_{{\mathcal{Q}}}\left[\beta^{(i)}_v\right]$ represents the expectation value of $\beta^{(i)}_v$ under the distribution ${\mathcal{Q}}_{v}$ . The distributions $f\,^{(i)}_v$ within the different components are given under the IPC representation by

$\begin{equation} f\,^{(i)}_v(x_v) = \frac{{\mathcal{Q}}_v(x_v)~\beta^{(i)}_v(x_v)}{E_{{\mathcal{Q}}}\left[\beta^{(i)}_v\right]} = \frac{{\mathcal{Q}}_v(x_v)~\beta^{(i)}_v(x_v)}{\varphi^{(i)}_v} \,. \end{equation} \tag{ 17 }$

In (15), we have used $\tilde{w_i}$ , which is defined in (16) as the geometric mean⁷ of $\varphi^{(i)}_v$ -s, as the actual mixture weight w_i, after an appropriate scaling to make the weights add up to 1 across all components. We will refer to $\varphi^{(i)}_v$ as the pseudo weight of component i corresponding to variate v. Using (15) and (17), we can write the probability density function for the mixture model in the IPC representation as

$\begin{align} {\mathcal{P}}({\mathcal{X}}) & = \sum_{i = 1}^C~w_i~\prod_{v = 1}^V~\frac{{\mathcal{Q}}_{v}(x_v)~\beta^{(i)}_v(x_v)}{\tilde{w}_i} \end{align} \tag{ 18a }$

$\begin{align} & = \left[\prod_{v = 1}^V {\mathcal{Q}}_{v}(x_v)\right]~~\frac{\displaystyle\sum_{i = 1}^C~\tilde{w}_i^{1-V}~\prod_{v = 1}^V~\beta^{(i)}_v(x_v)}{\displaystyle\sum_{i = 1}^C~\tilde{w}_i}\,, \end{align} \tag{ 18b }$

where the $\tilde{w}_i$ -s can be written in terms of ${\mathcal{Q}}_{v}$ -s and $\beta^{(i)}_v$ -s using (16). We will now make the following observations relevant to our goal of fitting a mixture model to data using InClass nets:

(a)
IPC describes a CIMM. Note that by construction, the mixture weights in (15) and the distributions in (17) are non-negative and normalized to 1.
(b)
The pseudo marginals and pseudo classifiers do not necessarily correspond to the true marginals and classifiers. However, the true marginals and classifiers of the CIC representation can be extracted from the IPC representation as follows
$\begin{align} {\mathcal{P}}_{\!v}(x_v) & = \frac{{\mathcal{Q}}_{v}(x_v)}{\sum\limits_{i = 1}^C~\tilde{w}_i}~\sum_{i = 1}^C~\frac{\beta^{(i)}_v(x_v)~\tilde{w}_i}{\varphi^{(i)}_v}\,, \end{align} \tag{ 19a }$

$\begin{align} \alpha^{(i)}_v(x_v) & = \left[\sum_{j = 1}^C~\frac{\beta^{(j)}_v(x_v)~\tilde{w}_{j}}{\varphi^{(j)}_v}\right]^{-1}~~\frac{\beta^{(i)}_v(x_v)~\tilde{w}_i}{\varphi^{(i)}_v}\,.\end{align} \tag{ 19b }$
These results follow from plugging in (15)–(17) in (10).
(c)
The IPC representation of a mixture model is not unique. Unlike the CIC representation, we cannot find a unique map from the weights w_i and $f\,^{(i)}_v$ to the pseudo marginals and pseudo classifiers. This is because of the additional degrees of freedom due to the removal of the constraints in (11d ).
(d)
Every mixture model has an IPC representation⁸ in which the pseudo marginals match the true marginals of the model. This can be seen from the fact that the true marginals ${\mathcal{P}}_{\!v}$ and classifiers $\alpha^{(i)}_v$ from the CIC representation of a mixture model can used as the pseudo marginals ${\mathcal{Q}}_v$ and pseudo classifiers $\beta^{(i)}_v$ under the IPC representation to get the same model.

Observation (d) means that in order to fit a mixture model to data, we can restrict ourselves to IPC representations of the mixture models with the pseudo marginals set to the marginals of the data. The only remaining unknowns in the IPC representation are the pseudo classifiers $\beta^{(i)}_v(x_v)$ which we can parameterize using an InClass net, identifying $\beta^{(i)}_v$ with the network output $\eta^{(i)}_v$ . Next, we will develop the technique to fit a mixture model parameterized with an InClass net to a given dataset.

2.1.2. Fitting mixture models to data with InClass nets

In this section we will construct a cost function which can be used to train InClass nets to fit mixture models to the given data. Let the data to which we want to fit a mixture model be sampled from the true underlying distribution ${\mathcal{P}}^\ast({\mathcal{X}})$ with true marginals ${\mathcal{P}}^\ast_{\!v}(x_v)$ . As per observation (d) in the previous section, we restrict our attention to IPC representations with ${\mathcal{Q}}_{v}\equiv {\mathcal{P}}^\ast_{\!v}$ . Using (16) and (18b ), we can write the probability density of ${\mathcal{X}}$ under this restricted class of mixture models as

$\begin{equation} {\mathcal{P}}({\mathcal{X}}) = \left[\prod_{v = 1}^V {\mathcal{P}}^\ast_{\!v}(x_v)\right]~\frac{\displaystyle\sum_{i = 1}^C~\left[\prod_{v = 1}^V~\beta^{(i)}_v(x_v)~\left(E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_v\right]\right)^{(1-V)/V}\right]}{\displaystyle\sum_{i = 1}^C~\left[\prod_{v = 1}^V\left(E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_v\right]\right)^{1/V}\right]}\,, \end{equation} \tag{ 20 }$

where $E_{{\mathcal{P}}^\ast}$ refers to the expectation value under the true distribution of the data. The best-fitting ${\mathcal{P}}$ can be estimated by minimizing the Kullback–Leibler (KL) divergence from ${\mathcal{P}}$ to ${\mathcal{P}}^\ast$ given by

$\begin{equation} \mathrm{KL}\left[{\mathcal{P}}^\ast~\big|\big|~{\mathcal{P}} \right] = \int d{\mathcal{X}}~{\mathcal{P}}^\ast({\mathcal{X}})~\log{\left[\frac{{\mathcal{P}}^\ast({\mathcal{X}})}{{\mathcal{P}}({\mathcal{X}})}\right]}\,. \end{equation} \tag{ 21 }$

As discussed in section 1.4, minimizing the KL divergence (over some class of distributions) is equivalent to, and commonly known in some disciplines as, maximizing the likelihood in the large statistics limit. Using the expression for ${\mathcal{P}}$ from (20), we can rewrite (21) as

$\begin{align} \mathrm{KL}\left[{\mathcal{P}}^\ast~\big|\big|~{\mathcal{P}} \right] & = \int d{\mathcal{X}} ~{\mathcal{P}}^\ast({\mathcal{X}})~\log{\left[\frac{{\mathcal{P}}^\ast({\mathcal{X}})}{\left[\prod\limits_{v = 1}^V {\mathcal{P}}^\ast_{\!v}(x_v)\right]}~\frac{\left[\prod\limits_{v = 1}^V {\mathcal{P}}^\ast_{\!v}(x_v)\right]}{{\mathcal{P}}({\mathcal{X}})}\right]} \end{align} \tag{ 22a }$

$\begin{align} & = \int d{\mathcal{X}} ~{\mathcal{P}}^\ast({\mathcal{X}})~\log{\left[\frac{{\mathcal{P}}^\ast({\mathcal{X}})}{\left[\prod\limits_{v = 1}^V {\mathcal{P}}^\ast_{\!v}(x_v)\right]}\right]} - \int d{\mathcal{X}} ~{\mathcal{P}}^\ast({\mathcal{X}})~\log{\left[\frac{{\mathcal{P}}({\mathcal{X}})}{\left[\prod\limits_{v = 1}^V {\mathcal{P}}^\ast_{\!v}(x_v)\right]}\right]} \end{align} \tag{ 22b }$

$\begin{align} & = C^\ast(x_1,\dots,x_V) - E_{{\mathcal{P}}^\ast}\left[~\log\left\{\frac{\displaystyle\sum_{i = 1}^C~\left[\prod_{v = 1}^V~\beta^{(i)}_v~\left(E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_v\right]\right)^{(1-V)/V}\right]}{\displaystyle\sum_{i = 1}^C~\left[\prod_{v = 1}^V\left(E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_v\right]\right)^{1/V}\right]}\right\}~\right], \end{align} \tag{ 22c }$

where $C^\ast(x_1,\dots,x_V)$ is the total correlation [59, 60] of the V variates in the data, which is one of the generalizations of mutual information to more than two variables. It is given by the KL divergence from the product distribution $\prod\limits_v {\mathcal{P}}^\ast_{\!v}(x_v)$ to the joint distribution ${\mathcal{P}}^\ast({\mathcal{X}})$ as

$\begin{equation} C^\ast(x_1,\dots,x_V) = \int d{\mathcal{X}}~{\mathcal{P}}^\ast({\mathcal{X}})~\log{\left[\frac{{\mathcal{P}}^\ast({\mathcal{X}})}{{\mathcal{P}}^\ast_{\!1}(x_1) ~{\mathcal{P}}^\ast_{\!2}(x_2)~\dots~{\mathcal{P}}^\ast_{\!V}(x_V)}\right]}\,. \end{equation} \tag{ 23 }$

Note that the $C^\ast$ term in (22c ) is independent of the state of the InClass net under consideration. This means that the second term in (22c ) can be used as a cost function for the network to minimize in order to minimize the KL divergence, and hence fit the mixture model parameterized by the InClass net to the data. Noting the similarity between the two terms in (22b ) and drawing inspiration from the naming of 'cross entropy', we introduce the 'negative cross total correlation' cost function (neg_ctc_cost) defined as

$\begin{align} \texttt{neg}\_\texttt{ctc}\_\texttt{cost} & = \mathrm{KL}\left[{\mathcal{P}}^\ast~\big|\big|~{\mathcal{P}} \right] - C^\ast(x_1,\dots,x_V) \end{align} \tag{ 24a }$

$\begin{align} & = - E_{{\mathcal{P}}^\ast}\left[~\log\left\{\frac{\displaystyle\sum_{i = 1}^C~\left[\prod_{v = 1}^V~\beta^{(i)}_v~\left(E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_v\right]\right)^{(1-V)/V}\right]}{\displaystyle\sum_{i = 1}^C~\left[\prod_{v = 1}^V\left(E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_v\right]\right)^{1/V}\right]}\right\}~\right]\,.\end{align} \tag{ 24b }$

Note that $\beta^{(i)}_v$ are functions of the corresponding input variate x_v. Despite the complicated appearance, this cost function provides a viable approach to learning the underlying mixture model from data. Let us make the following observations in the context of training InClass nets using this cost function, with outputs $\eta^{(i)}_v$ of the network identified with $\beta^{(i)}_v$ .

(a)
The cost function depends only on the outputs $\beta^{(i)}_v$ of the network. More precisely, the cost function depends on the distribution of the network output. It does not need the input data to be labelled to learn the mixture model, and the only supervisory signal exploited by the training process is the joint-distribution of the input data.
(b)
The cost function for a given state of the InClass net can be estimated using a (mini-)batch of training samples by approximating the expectation values $E_{{\mathcal{P}}^\ast}[\,\cdots]$ with sample means, as shown below:
$\begin{align} & \texttt{neg}\_\texttt{ctc}\_\texttt{cost}\_\texttt{from}\_\texttt{data}\nonumber\\ & \quad = - \frac{1}{N_\mathrm{bat}}\sum_{a = 1}^{N_\mathrm{bat}}\left[~\log\left\{\frac{\displaystyle\sum_{i = 1}^C~\left[\prod_{v = 1}^V~\beta^{(i)}_v(x_{a,v})~\left(\frac{1}{N_\mathrm{bat}}\sum_{b = 1}^{N_\mathrm{bat}}\left[\beta^{(i)}_v(x_{b,v})\right]\right)^{(1-V)/V}\right]}{\displaystyle\sum_{i = 1}^C~\left[\prod_{v = 1}^V\left(\frac{1}{N_\mathrm{bat}}\sum_{b = 1}^{N_\mathrm{bat}}\left[\beta^{(i)}_v(x_{b,v})\right]\right)^{1/V}\right]}\right\}~\right]\,, \end{align} \tag{ 25 }$
where a and b are sample indices, and $x_{a,v}$ is the vth variate of the ath datapoint. The batch size $N_\mathrm{bat}$ should be large enough to perform a good estimation of $E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_v\right]$ .

These observations allow us to train the NNs. The computation of the cost function for a batch of events is illustrated in figure 2. After training the InClass net, (15)–(18) and (17) can be used to extract the fitted model, with the pseudo marginals ${\mathcal{Q}}_v$ set to the true marginals ${\mathcal{P}}^\ast_v$ . The classifiers $\alpha^{(i)}_v(x_v)$ can be extracted from the pseudo classifiers $\beta^{(i)}_v(x_v)$ using (19b ). If one is interested in classifying the individual datapoints based on the full information ${\mathcal{X}}$ , an aggregate classifier can be constructed, based on (18b ), as

$\begin{equation} \alpha^{(i)}_\mathrm{aggregate}({\mathcal{X}}) = \frac{\displaystyle\tilde{w}_i^{1-V}~\prod_{v = 1}^V~\beta^{(i)}_v(x_v)}{\displaystyle\sum_{j = 1}^C~\tilde{w}_j^{1-V}~\prod_{v = 1}^V~\beta^{(j)}_v(x_v)}\,. \end{equation} \tag{ 26 }$

Note that if there is a mismatch between the model learned by the InClass net and the true distribution the data is sampled from, then classifying the data using the aggregate classifier will not necessarily lead to components within which the x_v-s are independent.

**Figure 2.** A flowchart illustrating the training of InClass nets. The indices $a\in\{1,\dots,N_\mathrm{bat}\}$ , $v\in\{1,\dots, V\}$ , and $i\in\{1,\dots, C\}$ correspond to samples, variates, and components, respectively. The diagram shows how the cost function in (25) is computed for a batch of $N_\mathrm{bat}$ datapoints, after they are sent through the InClass net. The output η of the InClass net is identified with the pseudo classifier β in the cost function.
Download figure:
Standard image High-resolution image

**Figure 2.** A flowchart illustrating the training of InClass nets. The indices $a\in\{1,\dots,N_\mathrm{bat}\}$ , $v\in\{1,\dots, V\}$ , and $i\in\{1,\dots, C\}$ correspond to samples, variates, and components, respectively. The diagram shows how the cost function in (25) is computed for a batch of $N_\mathrm{bat}$ datapoints, after they are sent through the InClass net. The output η of the InClass net is identified with the pseudo classifier β in the cost function.
Download figure:
Standard image High-resolution image

2.2. Bivariate case

When analyzing real data with CIMMs, a common difficulty is the identification of a suitable partitioning of the attributes of ${\mathcal{X}}$ into variates x_v so that the distribution within each component would factorize to a good approximation. In this sense, a higher number of (conditionally independent) variates represents stronger assumptions about the underlying model. This makes the bivariate case (V = 2) extremely important. The bivariate case is also difficult from an identifiability point of view—data distributed according to a conditional independence bivariate mixture model, in general, will not uniquely identify the model, since several different mixture models can lead to the same overall probability density ${\mathcal{P}}({\mathcal{X}})$ . In section 4.1, we will present some new results on the identifiability of conditional independence bivariate mixture models. In particular, we will provide the conditions under which bivariate mixture models are identifiable.

Despite being the most difficult case in terms of identifiability, the bivariate case lets us gain some useful intuition, as demonstrated with several examples in section 3 below. But first, in preparation for section 3, let us summarize the results from the previous sections for the bivariate case, in the order in which a typical analysis might use them.

2.2.1. Notation

The expressions from the previous sections become easier to follow if we explicitly write out the two variates, thus avoiding the product notation. To this end, let us simplify the notation by giving names x and y to our two variates x₁ and x₂, resulting in

$\begin{align} x \equiv x_1\,,\quad {\mathcal{P}}_{\!x} \equiv {\mathcal{P}}_{\!1}\,,\quad {\mathcal{Q}}_{x} \equiv {\mathcal{Q}}_{1}\,,\quad {\mathcal{P}}^\ast_{\!x} \equiv {\mathcal{P}}^\ast_{\!1}\,,\quad \alpha^{(i)}_x \equiv \alpha^{(i)}_1\,,\quad \beta^{(i)}_x \equiv \beta^{(i)}_1\,,\quad f\,^{(i)}_x \equiv f\,^{(i)}_1\,, \end{align} \tag{ 27a }$

$\begin{align}y \equiv x_2\,,\quad {\mathcal{P}}_{\!y} \equiv {\mathcal{P}}_{\!2}\,,\quad {\mathcal{Q}}_{y} \equiv {\mathcal{Q}}_{2}\,,\quad {\mathcal{P}}^\ast_{\!y} \equiv {\mathcal{P}}^\ast_{\!2}\,,\quad \alpha^{(i)}_y \equiv \alpha^{(i)}_2\,,\quad \beta^{(i)}_y \equiv \beta^{(i)}_2\,,\quad f\,^{(i)}_y \equiv f\,^{(i)}_2\,.\end{align} \tag{ 27b }$

Under this notation, the CIMM of (3) becomes simply

$\begin{equation}{\mathcal{P}}(x, y) = \sum_{i = 1}^C~w_i~f\,^{(i)}_x(x)~f\,^{(i)}_y(y)\,.\end{equation} \tag{ 28 }$

2.2.2. Cost function

Noting that total correlation is a generalization of mutual information for more than two variables, we will refer to the negative cross total correlation cost function of (24b ) in the bivariate special case as the 'negative cross mutual information' cost function (neg_cmi_cost). Under our new notation, it is given by

$\begin{equation} \texttt{neg}\_\texttt{cmi}\_\texttt{cost} = - E_{{\mathcal{P}}^\ast}\left[~\log\left\{\frac{\displaystyle\sum_{i = 1}^C~\frac{\beta^{(i)}_x\,\beta^{(i)}_y}{\sqrt{\displaystyle E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_x\right]\,E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_y\right]}}}{\displaystyle\sum_{i = 1}^C~\sqrt{\displaystyle E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_x\right]\,E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_y\right]}}\right\}~\right]\,, \end{equation} \tag{ 29 }$

where, as before, $E_{{\mathcal{P}}^\ast}$ represents the expectation over the true distribution ${\mathcal{P}}^\ast(x, y)$ from which the data is sampled. As in (25), the expectations over ${\mathcal{P}}^\ast$ can be estimated using sample means to train the NNs using this cost function.

2.2.3. Extracting the learned mixture model from the trained network

After training the InClass net, the trained $\beta^{(i)}_x$ and $\beta^{(i)}_y$ cannot directly be interpreted as classifiers based on x and y since they may correspond to different mixture weights. In order to extract the learned mixture model (and the corresponding classifiers), we can first estimate the pseudo weights $\varphi^{(i)}_x$ and $\varphi^{(i)}_y$ from the data as

$\begin{equation} \varphi^{(i)}_x = E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_x\right]\,,\qquad\qquad \varphi^{(i)}_y = E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_y\right]\,. \end{equation} \tag{ 30 }$

Now, using (16) and (19b ), the marginals and classifiers for the model represented by the InClass net can be constructed as

$\begin{align} {\mathcal{P}}_{\!x}(x) = {\mathcal{P}}^\ast_{\!x}(x)~\frac{\displaystyle\sum_{i = 1}^C~\beta^{(i)}_x(x)~\displaystyle\sqrt{\displaystyle\frac{\varphi^{(i)}_y}{\displaystyle \varphi^{(i)}_x}}}{\displaystyle\sum_{i = 1}^C~\sqrt{\displaystyle \varphi^{(i)}_x\,\varphi^{(i)}_y}}\,,\qquad\qquad \alpha^{(i)}_x(x) = \frac{\beta^{(i)}_x(x)~\displaystyle\sqrt{\displaystyle\frac{\varphi^{(i)}_y}{\displaystyle \varphi^{(i)}_x}}}{\displaystyle\sum_{j = 1}^C~\beta^{(j)}_x(x)~\displaystyle\sqrt{\displaystyle\frac{\varphi^{(j)}_y}{\displaystyle \varphi^{(j)}_x}}}\,, \end{align} \tag{ 31a }$

$\begin{align} {\mathcal{P}}_{\!y}(y) = {\mathcal{P}}^\ast_{\!y}(y)~\frac{\displaystyle\sum_{i = 1}^C~\beta^{(i)}_y(y)~\displaystyle\sqrt{\displaystyle\frac{\varphi^{(i)}_x}{\displaystyle \varphi^{(i)}_y}}}{\displaystyle\sum_{i = 1}^C~\sqrt{\displaystyle \varphi^{(i)}_x\,\varphi^{(i)}_y}}\,,\qquad\qquad \alpha^{(i)}_y(y) = \frac{\beta^{(i)}_y(y)~\displaystyle\sqrt{\displaystyle\frac{\varphi^{(i)}_x}{\displaystyle \varphi^{(i)}_y}}}{\displaystyle\sum_{j = 1}^C~\beta^{(j)}_y(y)~\displaystyle\sqrt{\displaystyle\frac{\varphi^{(j)}_x}{\displaystyle \varphi^{(j)}_y}}}\,.\end{align} \tag{ 31b }$

From (15) and (16), the component weights of the learned model are given by

$\begin{equation} w_i = \frac{\displaystyle\sqrt{\displaystyle \varphi^{(i)}_x\,\varphi^{(i)}_y}}{\displaystyle\sum_{j = 1}^C~\sqrt{\displaystyle \varphi^{(j)}_x\,\varphi^{(j)}_y}}\end{equation} \tag{ 32 }$

and from (17), the distributions $f\,^{(i)}_x$ and $f\,^{(i)}_y$ within each component are given by

$\begin{equation} f\,^{(i)}_x(x) = \frac{{\mathcal{P}}^\ast_{\!x}(x)~\beta^{(i)}_x(x)}{\displaystyle \varphi^{(i)}_x}\,,\qquad\qquad f\,^{(i)}_y(y) = \frac{{\mathcal{P}}^\ast_{\!y}(y)~\beta^{(i)}_y(y)}{\displaystyle \varphi^{(i)}_y}\,. \end{equation} \tag{ 33 }$

The corresponding joint distribution is given by

$\begin{equation} {\mathcal{P}}(x, y) = {\mathcal{P}}^\ast_{\!x}(x)\,{\mathcal{P}}^\ast_{\!y}(y)~\frac{\displaystyle\sum_{i = 1}^C~\frac{\beta^{(i)}_x(x)\,\beta^{(i)}_y(y)}{\sqrt{\displaystyle \varphi^{(i)}_x\,\varphi^{(i)}_y}}}{\displaystyle\sum_{i = 1}^C~\sqrt{\displaystyle \varphi^{(i)}_x\,\varphi^{(i)}_y}}\,. \end{equation} \tag{ 34 }$

Note that after estimating ${\mathcal{P}}^\ast_{\!x}$ , ${\mathcal{P}}^\ast_{\!y}$ , $\varphi^{(i)}_x$ , and $\varphi^{(i)}_y$ from the dataset, the mixture model can be read off directly from the InClass net using (32) and (33).

2.2.4. Aggregate classifier

From (26), the aggregate classifier that classifies the individual datapoints based on the full information (x, y) is given by

$\begin{equation} \alpha^{(i)}_\mathrm{aggregate}(x, y) = \frac{\displaystyle\frac{\beta^{(i)}_x(x)\,\beta^{(i)}_y(y)}{\sqrt{\displaystyle \varphi^{(i)}_x\,\varphi^{(i)}_y}}}{\displaystyle\sum_{j = 1}^C~\frac{\beta^{(j)}_x(x)\,\beta^{(j)}_y(y)}{\sqrt{\displaystyle \varphi^{(j)}_x\,\varphi^{(j)}_y}}}\,. \end{equation} \tag{ 35 }$

3. Results

We provide a public, tensorflow-based [61], implementation of InClass nets as a Python 3 package called RainDancesVI [62]. The package provides a) routines for wrapping the classifier networks of individual variates into InClass nets, and b) cost functions for training them. It also provides utilities for extracting the model learned by the network post-training. In this section we will demonstrate the working of InClass nets using several toy examples [63] analyzed using RainDancesVI. In each case, we assume that the number of components C in the mixture is known a priori. The examples considered below are meant for illustration purposes, and were deliberately chosen to require no domain knowledge. At the same time, there are many potential applications of the method to real experimental data, e.g. in astro-particle physics for studying dark matter kinematic substructure in the Milky Way [22], which we are currently pursuing in a separate project.

3.1. Mixture of two independent bivariate Gaussians $(V = 2, C = 2)$

In the first example, we consider the mixture of two independent bivariate Gaussians. In the first component, x and y are both (independently) normally distributed with mean −1 and standard deviation 1.5. The second component is identical, except x and y both have mean +1. The mixture weights are taken to be $w_1 = 0.4, w_2 = 0.6$ . Table 1 summarizes the mixture model specification and figure 3 shows the normalized joint distributions of (x, y) under each of the two components as heatmaps. Figure 4 shows the normalized joint distribution of (x, y) under the mixture model and our InClass net will estimate the mixture model based on data generated as per this distribution.

**Figure 3.** Heatmaps of the normalized joint distributions of (x, y) under component 1 (left panel) and component 2 (right panel) for the example considered in section 3.1.
Download figure:
Standard image High-resolution image

**Figure 4.** Heatmap of the normalized joint distribution of (x, y) under the mixture model defined in table 1.
Download figure:
Standard image High-resolution image

Table 1. The mixture model specification for the example considered in section 3.1.

i	w_i	$f\,^{(i)}_x$	$f\,^{(i)}_y$
1	0.4	$\mathcal{N}(\textrm{mean} = -1, \textrm{SD} = 1.5)$	$\mathcal{N}(\textrm{mean} = -1, \textrm{SD} = 1.5)$
2	0.6	$\mathcal{N}(\textrm{mean} = +1, \textrm{SD} = 1.5)$	$\mathcal{N}(\textrm{mean} = +1, \textrm{SD} = 1.5)$

The classifier networks $\beta^{(i)}_x$ and $\beta^{(i)}_y$ were constructed using keras with the tensorflow backend. The NNs are distinct, but have identical architectures. The networks are fairly simple, consisting of three sequential dense layers of 32 nodes using the rectified linear unit (ReLU) [64] activation function. The output layer is a dense layer with two nodes (since C = 2), with the softmax activation function. The individual classifier networks were then wrapped into an InClass net using the RainDancesVI package. The resulting network has a total of 4484 trainable parameters.

We trained the InClass net to minimize the negative cross mutual information cost function (29), using 100 000 datapoints sampled from the distribution depicted in figure 4. The optimization was performed for 15 epochs with the Adam [65] optimizer (with default hyperparameters) using a batch size of 50. After training the network, we used the same dataset to estimate the pseudo weights $\varphi^{(i)}_{x}$ and $\varphi^{(i)}_{y}$ and the mixture weights w_i using (30) and (32). Note that the estimation of mixture models can only be performed up to permutations of the components indexed by i. For clarity of the presentation, unless otherwise stated, the components of the true mixture model will be matched with the respective closest candidates from the machine-learned components. The results of the estimation of the mixture weights are summarized in table 2, which demonstrates an excellent agreement between the true and estimated values.

Table 2. Results of the estimation of the mixture weights for the example considered in section 3.1.

i	Estimated $\varphi^{(i)}_x$	Estimated $\varphi^{(i)}_y$	Estimated w_i	True w_i
1	0.4055	0.4048	0.4051	0.4
2	0.5945	0.5952	0.5949	0.6

The solid red curves in figure 5 depict the classifiers $\alpha^{(i)}_{x}(x)$ (left panel) and $\alpha^{(i)}_{y}(y)$ (right panel) learned by the network—they are extracted from $\beta^{(i)}_{x}$ and $\beta^{(i)}_{y}$ with the help of (31). For comparison, the true classifiers based on the exact functional forms of the component distributions are also shown as green dash-dot curves. The red solid lines and the green dash-dot lines almost coincide, which validates our method.

Next, we used (33) to estimate the distributions $f\,^{(i)}_x$ and $f\,^{(i)}_y$ . The resulting distributions are shown with red solid lines in the left and right panels of figure 6, respectively. In applying (33), for simplicity we used the exact expressions for the marginal distributions of x and y in the mixture. In a typical example, the exact expressions for the marginals will not be available, but can be easily estimated from the data, say using a histogram or kernel density estimation [66, 67]. In figure 6, we also show the true $f\,^{(i)}_x$ and $f\,^{(i)}_y$ as green dash-dot curves. The good agreement between the true w_i, $f\,^{(i)}_{x}$ , and $f\,^{(i)}_{y}$ and their estimates shown in table 2 and figure 6, demonstrates that the InClass net has successfully estimated the mixture model. Finally, we use (35) to estimate the aggregate classifier $\alpha^{(i)}_\textrm{aggregate}(x,y)$ which is shown as a heatmap in the left panel of figure 7. For comparison, in the right panel we show the true aggregate classifier based on the exact functional forms of the component distributions $f\,^{(i)}(x,y)$ . As expected, the two heatmaps are in very good agreement.

**Figure 6.** The distributions $f\,^{(i)}_x$ (left panel) and $f\,^{(i)}_y$ (right panel). The estimated (true) distributions are shown with red solid (green dash-dot) lines.
Download figure:
Standard image High-resolution image

**Figure 6.** The distributions $f\,^{(i)}_x$ (left panel) and $f\,^{(i)}_y$ (right panel). The estimated (true) distributions are shown with red solid (green dash-dot) lines.
Download figure:
Standard image High-resolution image

**Figure 7.** The estimated aggregate classifier $\alpha^{(i)}_\textrm{aggregate}(x,y)$ from (35) (left panel) and the true aggregate classifier (right panel).
Download figure:
Standard image High-resolution image

**Figure 7.** The estimated aggregate classifier $\alpha^{(i)}_\textrm{aggregate}(x,y)$ from (35) (left panel) and the true aggregate classifier (right panel).
Download figure:
Standard image High-resolution image

3.2. The checkerboard mixture $(V = 2, C = 2)$

Now we will look at an artificial toy example which was instrumental in the conception and development of the InClass nets technique, see figures 8 and 9. Figure 8 shows the joint distribution of (x, y) for a 'checkerboard' mixture under which the datapoints are uniformly distributed on the bright squares of a $4 \times 4$ checkerboard spanning the region $0\leqslant x, y \lt 4$ , while the dark squares have zero density. For concreteness, the vertical (horizontal) boundaries between cells are assigned to the cell on the right (top). It is easy to see that x and y are, individually, uniformly distributed between 0 and 4. It can also be seen that x and y, despite being uncorrelated, are not mutually independent in the mixture, since x lies within $[0, 1) \cup [2, 3)$ if and only if y does as well.

**Figure 8.** Heatmap illustrating the joint distribution of (x, y) for the 'checkerboard' mixture example considered in section 3.2. The datapoints are uniformly distributed on the bright squares of a $4\times4$ checkerboard spanning the region $0\leqslant x, y \lt 4$ , while the dark squares have zero density.
Download figure:
Standard image High-resolution image

**Figure 8.** Heatmap illustrating the joint distribution of (x, y) for the 'checkerboard' mixture example considered in section 3.2. The datapoints are uniformly distributed on the bright squares of a $4\times4$ checkerboard spanning the region $0\leqslant x, y \lt 4$ , while the dark squares have zero density.
Download figure:
Standard image High-resolution image

As shown in figure 9, the checkerboard mixture can be separated into two equally weighted components within which x and y are mutually independent. Under the first component, x and y both lie within $[0, 1) \cup [2, 3)$ , and under the second component x and y both lie within $[1, 2) \cup [3, 4)$ . Note that each of these components has four spatially disconnected regions—the classification cannot be achieved using spatial clustering techniques. This example also naturally evokes the intuition of the variates x and y serving as each other's supervisory signal, since the value of either x or y uniquely determines the component the datapoint belongs to.

Let us now analyze this toy example using an InClass net. All the details of the network training process are identical to the analysis of the example in section 3.1, including the network architectures, the size of the training dataset, the choice of optimizer, batch size and epoch count. The estimated mixture weights are $w_1 = 0.501, w_2 = 0.499$ , which is in excellent agreement with their true values of $w_1 = w_2 = 0.5$ . Figure 10 shows, in solid red curves, the distributions of x (left panel) and y (right panel) under the first component learned by the network, using the same procedure as in section 3.1. We only show the first component in this figure for the sake of clarity—the second component fills the gaps in the univariate distributions of x and y so that $w_1\,f\,^{(1)}_{x} + w_2\,f\,^{(2)}_{x}$ and $w_1\,f\,^{(1)}_{y} + w_2\,f\,^{(2)}_{y}$ are constant. For comparison, the true 'rectangular wave' distributions are also shown as green dash-dot curves, which are also seen to agree with the estimates.

**Figure 10.** The distributions $f\,^{(1)}_x$ (left panel) and $f\,^{(1)}_y$ (right panel) for the 'checkerboard' example considered in section 3.2. The estimated (true) distributions are shown with red solid (green dash-dot) lines. We only show the first component in this figure for the sake of visual clarity (see text).
Download figure:
Standard image High-resolution image

**Figure 10.** The distributions $f\,^{(1)}_x$ (left panel) and $f\,^{(1)}_y$ (right panel) for the 'checkerboard' example considered in section 3.2. The estimated (true) distributions are shown with red solid (green dash-dot) lines. We only show the first component in this figure for the sake of visual clarity (see text).
Download figure:
Standard image High-resolution image

3.3. Semi-supervised training on MNIST data $(V = 2, C = 10)$

The biggest advantage offered by a machine learning based technique over existing non-machine-learning techniques for nonparametric mixture model estimation is the possibility of tackling high-dimensional data. As a proof of concept, in this section we will train an InClass net to classify images of handwritten digits from the MNIST database [68], with the classes corresponding to the digits $0{\text{-}}9$ . With this example, we will focus more on the data classification aspect of this paper than the mixture model estimation.

The MNIST dataset contains $28\textrm{px} \times 28\textrm{px}$ grayscale images of handwritten digits. Each image also has an associated label indicating the digit contained in the image. We will construct a bivariate mixture model out of the MNIST dataset, where each datapoint is a pair of images. A single datapoint of the dataset will be sampled by first choosing a class between 0 and 9 uniformly at random, and then sampling two images⁹ containing that digit uniformly from the MNIST dataset (with replacement). This gives us a bivariate CIMM with ten classes of equal mixture weights—note that within each component (or class), the two images are mutually independent of each other. Figure 11 illustrates the kind of data the InClass net will see, with five randomly chosen datapoints from the mixture model (one in each column).

**Figure 11.** Five representative datapoints from the dataset used to train the InClass net in the example considered in section 3.3. Each datapoint (x, y) is a pair of images containing the same digit.
Download figure:
Standard image High-resolution image

For analyzing this dataset, instead of creating two different classifiers for the variates, we use the same NN for classifying both x and y. Viewed differently, the networks classifying x and y are identical in architecture and share their weights as well, and only differ in the input (output) they receive (return). The network uses a sequential architecture and contains, in order, a layer to flatten the $28 \times 28$ image data, three dense layers each with the ReLU activation function and 32 nodes, and finally a dense layer with the softmax activation function and ten nodes (since C = 10). This network has a total of 27 562 trainable parameters.

In principle, an InClass net can be trained without supervision to distinguish the digits. However, considering the large number of input dimensions and classes, without supervision, our network is expected to have difficulties 'discovering' new classes in the data, and will end up in bad local minima of the cost function. We will discuss some ways of overcoming this difficulty in appendix B.

In this example, we addressed this issue by taking a semi-supervised approach: We 'seeded' the classes (digits) in the network by performing a supervised training over a small dataset with noisy labels. For this purpose, we used a training dataset containing 2000 images. The noisy label associated with each image matches its true label with probability 0.6, and matches one of the other nine incorrect labels (chosen uniformly) with probability 0.4. The network was trained using the categorical cross-entropy loss function with the Adam optimizer for 30 epochs (batch size 20).

After this pre-training, we trained the network further using our neg_ctc_cost function on 100 000 pairs of images from our mixture model¹⁰ . Ten percent of the 100 000 datapoints were set aside as a validation dataset to monitor the evolution of the network performance, though no hyperparameter optimization was actively performed using the validation data. The training was done using the Adam optimizer with a batch size of 100 for 20 epochs.

Finally, we evaluated the performance of the classifier on a testing dataset of 10 000 single images unseen by the network (either during training or during validation). The performance is illustrated as a confusion matrix in the left panel of figure 12. Each row of the confusion matrix shows the output of the network averaged over test images containing a given digit (true label), both as a heatmap and as numerical values within each cell of the matrix. Recall that our network output, for each image, is 10-dimensional and can be interpreted as the probabilities assigned by the network to the different classes. Because the classes were pre-seeded into the network in a supervised manner, they matched with the true classes without requiring any manual reassignment.

For completeness, in the middle panel of figure 12, we show the confusion matrix of the network after the supervised pre-training performed on the noisily labelled data. Note that the training that resulted in the performance improvement from the middle panel to the left panel was completely unsupervised. We will discuss this semi-supervised training approach in the context of real world applications in section 5.3.

For comparison, in the right panel of figure 12, we show the confusion matrix of a classifier network with an identical architecture that was trained in a fully supervised manner, using noise-free labels (training was performed until the performance on the validation dataset saturated). As can be seen, the network trained using our unsupervised technique (left panel) achieves a comparable performance to the fully supervised classifier (right panel). Their prediction accuracies are $95.07\%$ and $95.79\%$ , respectively¹¹ .

3.4. Mixture of four independent trivariate Gaussians $(V = 3, C = 4)$

In this example, we will demonstrate that the InClass nets technique works for the estimation of mixture models with more than two variates as well. We consider the mixture of four independent trivariate Gaussians, with the third variate denoted by z. Table 3 summarizes the mixture model specification. The classifier networks $\beta^{(i)}_x$ , $\beta^{(i)}_y$ , and $\beta^{(i)}_z$ have a similar architecture to the classifier architectures used in section 3.1, except that the output layer has four nodes, since C = 4. The InClass net constructed out of the classifiers has a total of 6924 trainable parameters. We trained the InClass net with the neg_ctc_cost function, using 1000 000 datapoints for 15 epochs and a batch size of 500 (the other details of the training process remained the same as in section 3.1), and estimated the mixture model. The estimated mixture weights, shown in the last column of table 3, are in good agreement with the true weights of the components (second column in table 3). The estimated distributions $f\,^{(i)}_x$ , $f\,^{(i)}_y$ , and $f\,^{(i)}_z$ of the variates x, y, and z, respectively, are shown in figure 13 as red solid curves, along with the true distributions depicted as green dash-dot curves. In all twelve cases (4 components × 3 variates) we observe good agreement between the true and estimated distribution.

**Figure 13.** The distributions $f\,^{(i)}_x$ (top-left panel), $f\,^{(i)}_y$ (top-right panel), and $f\,^{(i)}_z$ (bottom-left panel) for the example considered in section 3.4. The estimated (true) distributions are shown with red solid (green dash-dot) lines.
Download figure:
Standard image High-resolution image

**Figure 13.** The distributions $f\,^{(i)}_x$ (top-left panel), $f\,^{(i)}_y$ (top-right panel), and $f\,^{(i)}_z$ (bottom-left panel) for the example considered in section 3.4. The estimated (true) distributions are shown with red solid (green dash-dot) lines.
Download figure:
Standard image High-resolution image

Table 3. The mixture model specification for the example considered in section 3.4. The last column shows the mixture weights estimated by the InClass net technique.

i	w_i	$f\,^{(i)}_x$	$f\,^{(i)}_y$	$f\,^{(i)}_z$	Estimated w_i
1	0.22	$\mathcal{N}(\textrm{mean} = -1, \textrm{SD} = 1.5)$	$\mathcal{N}(\textrm{mean} = -1, \textrm{SD} = 1.5)$	$\mathcal{N}(\textrm{mean} = -1, \textrm{SD} = 1.5)$	0.228
2	0.28	$\mathcal{N}(\textrm{mean} = +1, \textrm{SD} = 1.5)$	$\mathcal{N}(\textrm{mean} = +1, \textrm{SD} = 1.5)$	$\mathcal{N}(\textrm{mean} = 0, \textrm{SD} = 1.5)$	0.268
3	0.18	$\mathcal{N}(\textrm{mean} = -1.5, \textrm{SD} = 1.5)$	$\mathcal{N}(\textrm{mean} = +1.5, \textrm{SD} = 1.5)$	$\mathcal{N}(\textrm{mean} = +1, \textrm{SD} = 1.5)$	0.187
4	0.32	$\mathcal{N}(\textrm{mean} = +1.5, \textrm{SD} = 1.5)$	$\mathcal{N}(\textrm{mean} = -1.5, \textrm{SD} = 1.5)$	$\mathcal{N}(\textrm{mean} = +2, \textrm{SD} = 2.5)$	0.318

4. Discussion

In this section, we will discuss some considerations which might be relevant in the context of science applications of the InClass nets technique introduced in this paper.

4.1. Identifiability of conditional independence bivariate mixture models

Identifiability of a statistical model is concerned with whether the parameters and functions that describe the model are uniquely identifiable from an infinite sample of datapoints produced from the model. The identifiability of mixture models is an important concept, especially in the context of using the techniques introduced in this paper for science applications.

For a given statistical model, the definition of what it means to estimate the model is usually chosen to be practically useful. In the context of estimating nonparametric CIMMs, the definition allows for the following 'leniencies':

The number of components C is assumed to be known a priori. Otherwise, any CIMM will be unidentifiable since a given component i can always be split into several new components which share the same distribution $f\,^{(i)}({\mathcal{X}})$ (with weights adding up to the weight of the 'parent' component). Similarly, zero weight components can always be added without affecting the data distribution.
The model only needs to be (and can only ever be) estimated up to permutations of the component indices.
The distribution $f\,^{(i)}_v$ is considered to be the same as the distribution $g^{(i)}_v$ if $f\,^{(i)}_v(x_v) = g^{(i)}_v(x_v)$ 'almost surely'. In other words, $f\,^{(i)}_v$ and $g^{(i)}_v$ are allowed to be different over a set of probability measure 0. This is required for the nonparametric case which allows arbitrary $f\,^{(i)}_v$ -s.

The $(V = 1, C\geqslant 2)$ case (univariate) is always unidentifiable nonparametrically. For the $(V\geqslant 3, C = 2)$ case, [33] provided certain regularity conditions under which instances of CIMMs are identifiable, and [42] generalized the result to the $(V\geqslant 3, C\geqslant 2)$ case. The result from [42] states that an instance of a CIMM with $V\geqslant 3$ and $C\geqslant 2$ is identifiable if the functions $\left\{f\,^{(1)}_v,\dots, f\,^{(C)}_v\right\}$ are linearly independent, for all $v = 1,\dots, V$ .

This leaves the bivariate case $(V = 2)$ , which is the main focus of this section. reference [33] showed that in the $(V = 2, C = 2)$ case, instances of nonparametric CIMMs are not identifiable in general. In particular it was shown that for any instance of a two component bivariate nonparametric CIMM, there exists a two-parameter family of instances which leads to the same distribution of the observed variables (x, y). The authors also noted that non-negativity conditions will introduce constraints on the allowed values for the two parameters. Extending this result from [33], we derive the following two theorems which provide a sufficient and a (different) necessary condition for instances of nonparametric CIMMs with $(V = 2, C\geqslant 2)$ to be identifiable. The two conditions coincide for the C = 2 case. We relegate the proof of the theorems to appendix A. Necessary condition

Theorem 1. A nonparametric conditional independence bivariate (V = 2) mixture model with $C \geqslant 2$ components of the form given in (28) is uniquely identifiable up to permutations of the component-identities only if the following necessary condition is satisfied:

$\begin{align} \mathop{{\mathrm{ess}}\,{\mathrm{sup}}}\left[\frac{w_i f\,^{(i)}_t(t)}{w_i f\,^{(i)}_t(t) + w_j f\,^{(j)}_t(t)}\right] & \equiv \mathop{{\mathrm{ess}}\,{\mathrm{sup}}}\left[\frac{\alpha^{(i)}_t(t)}{\alpha^{(i)}_t(t) + \alpha^{(j)}_t(t)}\right] = 1\,, \nonumber\\ & \forall (i,j) \in \{1,\dots, C\}^2~:~i\neq j\,,~~\forall t\in \{x, y\}\,, \end{align} \tag{ 36 }$

where $\mathop{\textrm{ess}\,\textrm{sup}}[\texttt{func}(t)]$ represents the essential supremum of $\texttt{func}(t)$ . $\square$

Sufficient condition

Theorem 2. A nonparametric conditional independence bivariate (V = 2) mixture model with $C \geqslant 2$ components of the form given in (28) is uniquely identifiable up to permutations of the component-identities if the following sufficient condition is satisfied:

$\begin{equation} \mathop{{\mathrm{ess}}\,{\mathrm{sup}}}\left[\frac{w_i f\,^{(i)}_t(t)}{{\mathcal{P}}_{\!t}(t)}\right] \equiv \mathop{{\mathrm{ess}}\,{\mathrm{sup}}}\left[\alpha^{(i)}_t(t)\right] = 1\,,\qquad\qquad \forall i\in \{1,\dots, C\}\,, \forall t\in \{x, y\}\,, \end{equation} \tag{ 37 }$

where, as before, $\mathop{\textrm{ess}\,\textrm{sup}}[\texttt{func}(t)]$ represents the essential supremum of $\texttt{func}(t)$ . $\square$

The essential supremum can be thought of as an adaptation of the notion of supremum of a function, allowing for ignoring the behaviour of the function over regions with a total probability measure¹² of zero. These conditions can 'roughly' be interpreted as follows: The sufficient condition (37) will be satisfied if, for every component i and variate x or y, there exists some region in the phase space of the variate where component i completely dominates the mixture, i.e. all the datapoints in that region are from component i. The necessary condition (36) will be satisfied if, for every pair of components i ≠ j and variate x or y, there exists some region in the phase space of the variate where component i 'completely' dominates the mixture of components i and j.

Let us now revisit the examples considered earlier in section 3 from the point of view of identifiablilty. Considering the successful estimation of mixture models and/or classifier training in those examples, we can expect them to be identifiable. For the mixture of two independent bivariate Gaussians, figure 5 shows how, for both x and y, the true and reconstructed classifier output for the first (second) component approaches 1 for increasingly negative (positive) values. This ensures that the sufficient condition for identifiability (37) is satisfied. Similarly, for the checkerboard mixture, by construction, there are regions in x and y which contain points from only component 1 or only component 2, see figure 9.

Recall that in our treatment, the individual variates x and y are themselves allowed to be multi-dimensional. In the special case of one-dimensional variates x and y, typically a component will only dominate the mixture in either the left tail or the right tail of the other components. This means that for most natural examples with one-dimensional x and y, it is unlikely for the mixture to be identifiable for more than two components. On the other hand, this limitation does not apply to higher dimensional variates x and y which our InClass nets specialize in. For instance, the sufficient condition (37) for the mixture model constructed out of the MNIST dataset becomes: 'For every digit d, there must exist some region in the space of images, within which the images look unmistakably like the digit d^'. This condition is naturally expected to be satisfied, considering the reliability of good handwritten communication.

4.1.1. Reduced identifiability due to limited statistics

The unique estimation of mixture model instances guaranteed by theorem 2 can only be achieved with an infinite dataset. There will always be an uncertainty associated with estimation performed using finite datasets [69]. The level of this uncertainty is related to (among other things) how close the conditions (36) and (37) are to being satisfied within the region of sample space covered sufficiently by the finite dataset at hand. In this sense, the result in theorem 2 is useful from a practical point of view. To illustrate this, we repeated the two Gaussians example from section 3.1, with the same setup, but with much fewer datapoints, namely 5000 instead of 100 000. With fewer datapoints, the dataset is less likely to probe the tails of the x and y distributions, where a single component dominates. As expected, this time the estimation of the weights is slightly worse—we obtained $w_1 = 0.44$ and $w_2 = 0.56$ , to be compared with the true values of $w_1 = 0.4$ and $w_2 = 0.6$ . The results for the classifiers $\alpha^{(i)}_{x}(x)$ and $\alpha^{(i)}_{y}(y)$ and for the component distributions $f\,^{(i)}_x$ and $f\,^{(i)}_y$ are shown in figures 14 and 15, respectively. Comparing to the analogous high statistics figures 5 and 6, we see that the estimation has generally succeeded (after all, the model was identifiable), but is not perfect and suffers from statistical uncertainties.

**Figure 14.** The same as figure 5, but using only 5000 events for the estimation of the mixture model considered in section 3.1.
Download figure:
Standard image High-resolution image

**Figure 15.** The same as figure 6, but using only 5000 events for the estimation of the mixture model considered in section 3.1.
Download figure:
Standard image High-resolution image

4.1.2. Unidentifiable situations

When a CIMM instance is not identifiable, our technique will yield one of the parameterizations (weights and functions) that best fits the available data. Note that the unidentifiability of an instance of a nonparametric CIMM is not a weakness of our InClass nets approach, but rather a statement on the impossibility of the task of unique estimation.

4.2. Uncertainty quantification

An important aspect of estimating a model (or equivalently, fitting a model to the available data), is providing an uncertainty on the estimate. Uncertainties in parametric estimation are conceptually straightforward—they correspond to the (possibly correlated) uncertainties in the estimated values of the parameters. The corresponding approach in the context of nonparametric models (which allow arbitrary functions), would be to treat either the NN outputs $\beta^{(i)}_v$ or the estimated distributions $f\,^{(i)}_v$ as Gaussian processes [70]. For a Gaussian process $G(\texttt{input})$ , the value of G at any finite set of $\texttt{input}$ values is taken to be randomly distributed according to a multivariate normal distribution. This allows us to assign uncertainty estimates on the value of G at individual $\texttt{input}$ points, while also accounting for the correlations in the uncertainties between the values at different $\texttt{input}$ points. It has been shown that Gaussian processes can be modeled using Bayesian NNs with wide layers [71–73]. Using wide Bayesian NNs as the individual classifiers of the InClass net, one can obtain robust uncertainties on the estimated mixture model. In many scenarios, one is simply interested in visualizing a band of uncertainty around the estimated $\beta^{(i)}_v$ -s, $\alpha^{(i)}_v$ -s, or $f\,^{(i)}_v$ -s and even narrow Bayesian NNs may be sufficient for this purpose.

Note that the Gaussian process approach will work when the instance of the CIMM at hand is identifiable, and the uncertainty in the estimated model arises only from the finiteness of the dataset being analyzed. It is presently unclear whether Bayesian NNs can capture the degree(s) of freedom in the model specification which are introduced by the unidentifiability of the CIMM instance.

4.3. Estimating the number of components C

In many applications, one does not a priori know the number of components in the CIMM [37, 74]. In other situations, the assumption that the distribution of the data can be written as a CIMM may not necessarily be valid. In such situations, by training different InClass nets with different values of C, one may be able to a) verify the validity of the conditional independence assumption, and b) estimate C.

Note that increasing the number of components increases the fitting ability of a CIMM. More concretely, every CIMM instance with C components can be thought of as a CIMM instance with $C^{^{\prime}}\gt C$ components (with $C^{^{\prime}}-C$ additional zero-weight components). As a result, an InClass net with more components should strictly perform better (in terms of the minimum cost value achieved), up to network training deficiencies and statistical fluctuations due to the finiteness of the training dataset. However, the improvement (in the minimum cost achieved) resulting from increasing C is expected to diminish beyond a certain point.

In particular, if the true probability distribution of the data ${\mathcal{P}}^\ast({\mathcal{X}})$ can be modeled as a CIMM, then there exists a minimum number of components $C_\textrm{min}$ required to express ${\mathcal{P}}^\ast({\mathcal{X}})$ in the form

$\begin{equation} {\mathcal{P}}^\ast({\mathcal{X}}) = \sum_{i = 1}^{C_\textrm{min}}~w_i~\prod_{v = 1}^V~f\,^{(i)}_v(x_v)\,. \end{equation} \tag{ 38 }$

Increasing C from 1 to $C_\textrm{min}$ will show an improvement in $\texttt{neg}\_\texttt{ctc}\_\texttt{cost}$ value, but beyond $C_\textrm{min}$ , the performance is expected to saturate. This feature, if observed, can simultaneously a) confirm that the data is consistent with the conditional independence assumption, and b) provide an estimate of $C_\textrm{min}$ . Note that the $C_\textrm{min}$ value identified in this way is only an estimate—inferring the presence of a component with a small mixing weight, or the presence of two components with very similar distributions $f\,^{(i)}({\mathcal{X}})$ may be statistically limited by the amount of data available. If the actual number of components is a priori unknown, then the $C_\textrm{min}$ estimate can serve as an Occam's razor estimate of C.

On the other hand, if such a sharp saturation of network performance is not observed at a particular value of C, and the saturation is more gradual, this could be a sign of a Latent Factor Model—the underlying latent variable that explains the dependence of the different variates could be continuous instead of being the discrete category label i.

4.4. Minimum possible value of `neg_ctc_cost`

When estimating $C_\textrm{min}$ using the method described in section 4.3, one relies on observing a saturation in the value of the minimum cost achieved. However, such a saturation could also result from deficiencies in the architecture and/or training of the network. It it therefore useful to have an estimate of the minimum possible $\texttt{neg}\_\texttt{ctc}\_\texttt{cost}$ achievable by the best fitting model. Recall from (24a ), that

$\begin{equation} \texttt{neg}\_\texttt{ctc}\_\texttt{cost} = \mathrm{KL}\left[{\mathcal{P}}^\ast~\big|\big|~{\mathcal{P}} \right] - C^\ast(x_1,\dots,x_V)\,, \end{equation} \tag{ 39 }$

where $\mathrm{KL}\left[{\mathcal{P}}^\ast~\big|\big|~{\mathcal{P}} \right]$ is the KL divergence from the distribution represented by the InClass net ${\mathcal{P}}$ to the true distribution ${\mathcal{P}}^\ast$ and $C^\ast(x_1,\dots,x_V)$ is the total correlation of the variates under the true distribution. Since, the KL divergence is manifestly non-negative and equals 0 only when ${\mathcal{P}}^\ast$ is equivalent to ${\mathcal{P}}$ , we have the following inequality

$\begin{equation} \texttt{neg}\_\texttt{ctc}\_\texttt{cost} \geqslant -C^\ast(x_1,\dots,x_V)\,, \end{equation} \tag{ 40 }$

where the equality is achieved when ${\mathcal{P}}^\ast$ matches ${\mathcal{P}}$ almost surely. Thus, the negative total correlation $-C^\ast$ provides a (theoretically achievable) lower-bound on the negative cross total correlation $\texttt{neg}\_\texttt{ctc}\_\texttt{cost}$ . From (23) the total correlation $C^\ast$ is given by

$\begin{equation} C^\ast(x_1,\dots,x_V) = \int d{\mathcal{X}}~{\mathcal{P}}^\ast({\mathcal{X}})~\log{\left[\frac{{\mathcal{P}}^\ast({\mathcal{X}})}{{\mathcal{P}}^\ast_{\!1}(x_1) ~{\mathcal{P}}^\ast_{\!2}(x_2)~\dots~{\mathcal{P}}^\ast_{\!V}(x_V)}\right]}\,, \end{equation} \tag{ 41 }$

where ${\mathcal{P}}_{\!v}^\ast$ represents the marginal distribution of x_v in the data. For low dimensional data, $C^\ast$ can be estimated directly using this formula, after first estimating the distributions ${\mathcal{P}}^\ast({\mathcal{X}})$ and ${\mathcal{P}}_{\!v}^\ast(x_v)$ .

Alternatively, for both low and high dimensional data, one can estimate $C^\ast$ using supervised machine learning as follows. Let the distribution ${\mathcal{Q}}^\ast({\mathcal{X}})$ be defined as

$\begin{equation} {\mathcal{Q}}^\ast({\mathcal{X}}) \equiv \prod_{v = 1}^V~{\mathcal{P}}^\ast_{\!v}(x_v). \end{equation} \tag{ 42 }$

Note that $C^\ast$ is simply the Kullback–Leibler divergence $\mathrm{KL}\left[{\mathcal{P}}^\ast~\big|\big|~{\mathcal{Q}}^\ast \right]$ from ${\mathcal{Q}}^\ast$ to ${\mathcal{P}}^\ast$ . One can produce datapoints as per the distribution ${\mathcal{Q}}^\ast({\mathcal{X}})$ by independently sampling the variates $x_1,\dots,x_V$ from the available dataset. This gives us two datasets: the original one distributed as per ${\mathcal{P}}^\ast$ and a resampled one distributed as per ${\mathcal{Q}}^\ast$ . One can train a machine in a supervised manner to distinguish between these two datasets and estimate the KL divergence, and hence $C^\ast$ , from the trained classifier.

4.5. Incorporating prior knowledge

In some applications, one may have additional prior knowledge about the mixture model, beyond the conditional independence assumption. It may be possible to incorporate this knowledge into the InClass net directly. For example, in the MNIST image classification example considered in section 3.3, we used the information that the variates x and y are both images of digits to use the same classifier NN for both variates.

As a different example, if the distribution of a given variate x_v is known under a given component i, then the value of $\beta^{(i)}_v$ can be set to $f\,^{(i)}_v(x_v) / {\mathcal{P}}_{\!v}^\ast(x_v)$ up to a multiplicative weight factor which will constitute a single, trainable parameter. For the special case where the distribution of a given variate x_v is known under every component, the classifier for the vth variate can be parameterized by only the mixture weights of the components—in this way the InClass nets technique can be applied in the situations where the ${}_s\mathcal{P}lots$ technique is currently being used in high energy physics.

As yet another example, consider the case where the weights of the different components are a priori known (but not the distributions of the variates within the components). Then an extra term can be added to the cost function to force the mixture weights w_i estimated by the InClass net towards the true known weights $w_i^\textrm{true}$ . One possible form of the extra term is inspired by the cross entropy:

$\begin{equation} -\lambda \sum_{i = 1}^C~w_i^\textrm{true}~\log\left(w_i\right)\,, \end{equation} \tag{ 43 }$

where λ is a parameter that controls the relative importance of the new term in the cost function. The additional term could be added either at the beginning, or after training the network for a few epochs (and identifying the map from the true component indices to the learned component indices). The additional term may be particularly useful in estimating unidentifiable CIMMs, where the additional knowledge of the mixture weights could help rule out the observationally indistinguishable 'fake' CIMM instances.

5. Possible variations and extensions

In this section we will discuss some potential variations and extensions of the InClass nets technique introduced in this paper whose detailed exploration is beyond the scope of this work. Additionally, in appendix C, we provide some surrogate cost functions for training CIMMs, which are already implemented in the RainDancesVI package.

5.1. Regularizers

Recall that the dataset at hand could be consistent with multiple CIMM instances, either due to the unidentifiability of the instance or due to the finiteness of the dataset. In such situations, one can impose additional conditions for the learned model to satisfy. For example, depending on the application at hand, one might be interested in roughly evenly weighted components. This can be encouraged by adding (to the cost function) additional regularization terms like

$\begin{align} \texttt{tikhonov}\_\texttt{reg} & = \lambda~\sum_{i = 1}^C w_i^2, \end{align} \tag{ 44a }$

$\begin{align} \texttt{neg}\_\texttt{shannon}\_\texttt{reg} & = \lambda~\sum_{i = 1}^C w_i \log w_i, \end{align} \tag{ 44b }$

where λ is a positive constant. If one is interested in more lopsided weight distributions for the components (possibly suppressing the weights of some components), the same regularizer terms can be used with λ set to a negative value.

5.2. Unsupervised classification with multi-label InClass nets

The InClass nets architecture introduced in this paper can have more general data mining applications beyond the estimation of CIMMs. If the datapoints in a dataset are comprised of the (possibly multi-dimensional) variates $x_1,\dots,x_V$ , the joint distribution of these variates may be understandable in terms of the classes of datapoints within the dataset, even if it does not fall under a CIMM. Furthermore, the set of classes corresponding to one variate need not necessarily be the same as the set of classes corresponding to another. In the literature, the existence of different sets of classes within the dataset falls under the realm of 'multi-label classification'.

For example, consider a dataset containing paired data: each datapoint contains the identities of a book and a movie liked by a person. The working assumption could be that there exists a classification of books and a (different) classification of movies, such that the class of books liked by a person is related to the class of movies liked by the same person. In such cases, it may be possible to simultaneously train a book and a movie classifier using the InClass nets architecture, by simply maximizing the mutual information between the classes predicted by the network.

To this end, we define the 'negative total correlation function' $\texttt{neg}\_\texttt{tc}\_\texttt{cost}$ and its bivariate special case 'negative mutual information' cost function $\texttt{neg}\_\texttt{mi}\_\texttt{cost}$ as

$\begin{align} \texttt{neg}\_\texttt{tc}\_\texttt{cost} & = -\sum_{i_1 = 1}^{C_1}~\sum_{i_2 = 1}^{C_2}~\cdots~\sum_{i_V = 1}^{C_V}~E_{{\mathcal{P}}^\ast}\left[\prod_{v = 1}^V~\alpha^{(i_v)}_v\right]~\log{\left[\frac{\displaystyle E_{{\mathcal{P}}^\ast}\left[\prod_{v = 1}^V~\alpha^{(i_v)}_v\right]}{\displaystyle \prod_{v = 1}^V~E_{{\mathcal{P}}^\ast}\left[\alpha^{(i_v)}_v\right]}\right]}, \end{align} \tag{ 45a }$

$\begin{align} \texttt{neg}\_\texttt{mi}\_\texttt{cost} & = -\sum_{i = 1}^{C_x}~\sum_{j = 1}^{C_y}~E_{{\mathcal{P}}^\ast}\left[\alpha^{(i)}_x\,\alpha^{(j)}_y\right]~\log{\left[\frac{\displaystyle E_{{\mathcal{P}}^\ast}\left[\alpha^{(i)}_x\,\alpha^{(j)}_y\right]}{\displaystyle E_{{\mathcal{P}}^\ast}\left[\alpha^{(i)}_x\right]\,E_{{\mathcal{P}}^\ast}\left[\alpha^{(j)}_y\right]}\right]}, \end{align} \tag{ 45b }$

where the outputs of the InClass net $\eta^{(i)}_v$ are directly interpreted as the classifier output $\alpha^{(i)}_v$ , and C_v is the number of classes for the classifier corresponding to the vth variate—note that the C_v-s need not all be equal. We point out that the neg_tc_cost of (45a ) is a generalization of the cost function used in [50] for the case where a) there can be more than two variates in the data, b) the classifiers $\alpha^{(i)}_v$ are not necessarily the same for different variates, and c) the number of classes C_v could be different for different variates. Also, in this formulation, it is not required that the inputs x_v to the different classifiers have only non-overlapping attributes of the datapoint.

5.3. Semi-supervised classification with InClass nets

In the MNIST image classification example considered in section 3.3, we seeded the categories into the classifier network via supervised learning using a small, noisily labeled dataset. After the categories were seeded in, we used the unsupervised training of the InClass nets technique to further train the network.

This strategy has straightforward applications in semi-supervised learning scenarios where only a subset of the datapoints in the training dataset are labeled. For example, in the training of NNs to perform medical diagnosis [75], generating labeled datasets requires manual annotation by experts, and only a small number of labeled samples may be available. On the other hand, a large number of unlabeled samples are typically available for training purposes. If, say, two different aspects (or variates) of the medical records are expected to only be weakly dependent on each other, but a confounding factor like the presence or absence of a disease can influence both variates, then we can train a NN to perform the diagnosis leveraging both the labeled and unlabeled datasets. A hybrid cost function that incorporates a supervised classification cost function (for the labeled datapoints), as well an unsupervised cost function introduced in this paper (for the unlabeled datapoints) may be appropriate for the task.

Note that the medical diagnosis example considered here will not strictly be a CIMM. For instance, in addition to the presence or absence of the disease, the severity of a particular case is also likely to influence the medical record. It may be possible to accommodate this particular effect by having multiple labels for different severity levels. Despite not strictly being an example of conditional indepedence mixture model, training using the $\texttt{neg}\_\texttt{ctc}\_\texttt{cost}$ or the $\texttt{neg}\_\texttt{tc}\_\texttt{cost}$ can still potentially yield useful diagnostic tools.

6. Summary

In this paper we introduced a novel approach for the 'nonparametric' estimation of CIMMs defined by (3). In this approach, the estimation of a CIMM is treated as a multi-class classification problem, which we solve with machine-learning methods. The main results of the paper are as follows.

We develop a specific machine-learning technique which we call the InClass nets technique. The basic architecture of InClass nets is illustrated in figure 1 and consists of a number of classifiers (one for each variate), which are realized as artificial NNs.
In section 2.1, we show how CIMMs can be represented using InClass nets. The ability of NNs to approximate arbitrary functions allows for the 'nonparametric' modeling of the CIMM.
We recast the problem of estimation of a CIMM as a classification problem, and construct suitable cost functions for training the individual NNs without supervision. We also provide the prescription for extracting the learned CIMM from the trained InClass nets. The efficacy of our procedure is demonstrated with several toy examples in section 3, including a high-dimensional image classification problem.
For easy adoption of the InClass nets technique, we provide a public implementation of our method as a Python package called RainDancesVI [62].
In section 4.1 we derive some new results on the nonparametric identifiability of bivariate CIMMs, in the form of a necessary and a (different) sufficient condition for a bivariate CIMM to be identifiable. The proofs of the theorems can be found in appendix A.

While performing comparative studies with previously existing (non-machine-learning) estimation techniques in the literature is beyond the scope of this work, we note that (for a given C and V) a machine-learning technique like InClass nets can potentially handle higher dimensional variates than non-machine-learning techniques. As discussed in sections 4 and 5, the InClass nets technique has many potential applications beyond the narrow focus of CIMM. Specifically, the use of machine learning opens new avenues for addressing old-standing problems in nonparametric statistics.

Data availability statement

The data that support the findings of this study are openly available at the following URL/DOI: https://gitlab.com/prasanthcakewalk/code-and-data-availability/.

Acknowledgments

The authors would like to thank M Lisanti and A Roman for useful discussions. The work of P S was supported in part by the University of Florida CLAS Dissertation Fellowship (funded by the Charles Vincent and Heidi Cole McLaughlin Endowment) and the Institute of Fundamental Theory Fellowship. This work was supported in part by the United States Department of Energy under Grant No. DE-SC0010296.

Appendix A.: Proof of theorems 1 and 2

Here we will prove theorems 1 and 2. Let us begin by noting that:

A nonparametric CIMM can be identifiable only if all the mixture weights are non-zero—if one of the mixture components has zero weight, it can be removed from the mixture and different component can be split into two.
A nonparametric CIMM with V = 2 cannot be identifiable if there exists a pair of components $i,j$ for which $f\,^{(i)}_x(x) = f\,^{(j)}_x(x)$ almost surely. Otherwise, the sub-mixture of the components i and j, $w_i\,f\,^{(i)}_x(x)\,f\,^{(i)}_y(y) + w_j\,f\,^{(j)}_x(x)\,f\,^{(j)}_y(y)$ , can be rewritten as a different combination of two components of total weight $w_i + w_j$ which have the same distribution of the variate x as the original components, but different mixture weights and distributions of the variate y.
Similarly, a nonparametric CIMM with V = 2 cannot be identifiable if there exists a pair of components $i,j$ for which $f\,^{(i)}_y(y) = f\,^{(j)}_y(y)$ almost surely.

Congruently, neither the necessary nor the sufficient condition from theorems 1 and 2 can be satisfied if one of the mixing weights is zero, or if $\exists (i, j, t): f\,^{(i)}_t(t) = f\,^{(j)}_t(t)$ almost surely. Henceforth, we will only consider instances of nonparametric bivariate CIMMs for which

$\begin{align} w_i \gt 0\,,\qquad &\forall i\in\{1,\dots, C\}\,, \end{align} \tag{ A.1a }$

$\begin{align} \Big(f\,^{(i)}_x - f\,^{(j)}_x = 0~\textit{almost surely} \Big)~~ &\Rightarrow~~\big(i = j\big)\,, \end{align} \tag{ A.1b }$

$\begin{align} \Big(f\,^{(i)}_y - f\,^{(j)}_y = 0~\textit{almost surely} \Big)~~ &\Rightarrow~~\big(i = j\big)\,.\end{align} \tag{ A.1c }$

A.1. Two component case $(C = 2)$

Let us first tackle the C = 2 case of theorems 1 and 2. Throughout this section, equality of distributions will refer to their equality almost surely. From theorems 4.1 and 4.2 of [33], for every instance of parametric bivariate CIMM of the form given in (28), all the instances with the same distribution of observed data form a two-parameter family. This family of instances identified in [33] can be parameterized in terms of $\gamma\in\mathbb{R}$ and $0\leqslant w^{^{\prime}}_1\leqslant 1$ , and can be written as

$\begin{equation} \sum_{i = 1}^2~w^{^{\prime}}_i\,g^{(i)}_x(x)\,g^{(i)}_y(y) = \sum_{i = 1}^2~w_i\,f\,^{(i)}_x(x)\,f\,^{(i)}_y(y)\,, \end{equation} \tag{ A.2 }$

where

$\begin{align} w^{^{\prime}}_2 & = 1 - w^{^{\prime}}_1\,, \end{align} \tag{ A.3a }$

$\begin{align} g^{(1)}_x(x) & = {\mathcal{P}}_x(x) + \gamma w^{^{\prime}}_2 \sqrt{\frac{w_1~w_2}{w^{^{\prime}}_1~w^{^{\prime}}_2}}~\left(f\,^{(1)}_x(x) - f\,^{(2)}_x(x)\right)\,, \end{align} \tag{ A.3b }$

$\begin{align} g^{(2)}_x(x) & = {\mathcal{P}}_x(x) - \gamma w^{^{\prime}}_1 \sqrt{\frac{w_1~w_2}{w^{^{\prime}}_1~w^{^{\prime}}_2}}~\left(f\,^{(1)}_x(x) - f\,^{(2)}_x(x)\right)\,, \end{align} \tag{ A.3c }$

$\begin{align} g^{(1)}_y(y) & = {\mathcal{P}}_y(y) + ~\frac{w^{^{\prime}}_2}{\gamma} \sqrt{\frac{w_1~w_2}{w^{^{\prime}}_1~w^{^{\prime}}_2}}~\left(f\,^{(1)}_y(y) - f\,^{(2)}_y(y)\right)\,, \end{align} \tag{ A.3d }$

$\begin{align} g^{(2)}_y(y) & = {\mathcal{P}}_y(y) - ~\frac{w^{^{\prime}}_1}{\gamma} \sqrt{\frac{w_1~w_2}{w^{^{\prime}}_1~w^{^{\prime}}_2}}~\left(f\,^{(1)}_y(y) - f\,^{(2)}_y(y)\right)\,. \end{align} \tag{ A.3e }$

Note, that the transformation $\gamma \longleftrightarrow -\gamma, w^{^{\prime}}_1~\longleftrightarrow 1-w^{^{\prime}}_1$ is equivalent to a permutation of the component indices $(1) \longleftrightarrow (2)$ . Since, we are only interested in the identifiability of the CIMM instance upto this permutation, we can restrict γ to be non-negative. The only additional constraints on γ and $w^{^{\prime}}_1$ are provided by the non-negativity of the distribution functions $g^{(i)}_x$ and $g^{(i)}_y$ .

It can be verified that γ = 1 and $w^{^{\prime}}_1 = w_1$ corresponds to the original CIMM instance with $g^{(i)}_x = f\,^{(i)}_x$ and $g^{(i)}_y = f\,^{(i)}_y$ . Furthermore, any other set of values for γ and $w^{^{\prime}}_1$ corresponds to a different instance, since the $w_1, w_2 \gt 0$ and the differences $f\,^{(1)}_x - f\,^{(2)}_x$ and $f\,^{(1)}_y - f\,^{(2)}_y$ are not identically zero. This leads us to the following lemma: The CIMM instance will be identifiable if and only if the non-negativity constraints on $g^{(i)}_x$ and $g^{(i)}_y$ only allow γ and $w^{^{\prime}}_1$ to be 1 and w₁, respectively.

The non-negativity conditions on the functions $g^{(i)}_x$ and $g^{(i)}_y$ can be written using (A.3) as

$\begin{align} \frac{1}{\gamma}\sqrt{\frac{w^{^{\prime}}_1}{w^{^{\prime}}_2}} \geqslant \sqrt{w_1~w_2}~~\mathop{\textrm{ess}\,\textrm{sup}}\left[\frac{f\,^{(2)}_x(x) - f\,^{(1)}_x(x)}{{\mathcal{P}}_{\!x}(x)}\right] = \frac{\mu^{(2)}_x - w_2}{\sqrt{w_1~w_2}}\,, \end{align} \tag{ A.4a }$

$\begin{align} \frac{1}{\gamma}\sqrt{\frac{w^{^{\prime}}_2}{w^{^{\prime}}_1}} \geqslant \sqrt{w_1~w_2}~~\mathop{\textrm{ess}\,\textrm{sup}}\left[\frac{f\,^{(1)}_x(x) - f\,^{(2)}_x(x)}{{\mathcal{P}}_{\!x}(x)}\right] = \frac{\mu^{(1)}_x - w_1}{\sqrt{w_1~w_2}}\,, \end{align} \tag{ A.4b }$

$\begin{align} \gamma\sqrt{\frac{w^{^{\prime}}_1}{w^{^{\prime}}_2}} \geqslant \sqrt{w_1~w_2}~~\mathop{\textrm{ess}\,\textrm{sup}}\left[\frac{f\,^{(2)}_y(y) - f\,^{(1)}_y(y)}{{\mathcal{P}}_{\!y}(y)}\right] = \frac{\mu^{(2)}_y - w_2}{\sqrt{w_1~w_2}}\,, \end{align} \tag{ A.4c }$

$\begin{align} \gamma\sqrt{\frac{w^{^{\prime}}_2}{w^{^{\prime}}_1}} \geqslant \sqrt{w_1~w_2}~~\mathop{\textrm{ess}\,\textrm{sup}}\left[\frac{f\,^{(1)}_y(y) - f\,^{(2)}_y(y)}{{\mathcal{P}}_{\!y}(y)}\right] = \frac{\mu^{(1)}_y - w_1}{\sqrt{w_1~w_2}}\,, \end{align} \tag{ A.4d }$

where

$\begin{equation} \mu^{(i)}_t = \mathop{\textrm{ess}\,\textrm{sup}}\left[\frac{w_i\,f\,^{(i)}_t(t)}{w_1\,f\,^{(1)}_t(t)+w_2\,f\,^{(2)}_t(t)}\right]\,,\qquad\qquad \forall i\in\{1,2\}\,,\forall t\in\{x,y\}\,. \end{equation} \tag{ A.5 }$

It can seen from (A.4) that the $\mu^{(i)}_t$ -s satisfy the constraints $w_i \leqslant \mu^{(i)}_t \leqslant 1$ , since the essential supremum of the difference between two normalized distributions is non-zero—normalized equations have to cross or be equal almost surely. Now, multiplying (A.4a ) with (A.4c ), and (A.4b ) with (A.4d ) we get the following constraint in the ratio $w^{^{\prime}}_1/w^{^{\prime}}_2$

$\begin{equation} \frac{\left(\mu^{(2)}_x - w_2\right)\left(\mu^{(2)}_y - w_2\right)}{w_1~w_2} \leqslant \frac{w^{^{\prime}}_1}{w^{^{\prime}}_2} \leqslant \frac{w_1~w_2}{\left(\mu^{(1)}_x - w_1\right)\left(\mu^{(1)}_y - w_1\right)}\,. \end{equation} \tag{ A.6 }$

Note that all values $w^{^{\prime}}_1/w^{^{\prime}}_2$ allowed by this constraint are allowed by (A.4) and vice versa. This implies that the CIMM instance will be identifiable only if the constraint (A.6) only allows $w^{^{\prime}}_1 = w_1$ . The upper and lower bounds on the ratio $w^{^{\prime}}_1/w^{^{\prime}}_2$ from (A.6) both equal $w_1/w_2$ iff $\mu^{(1)}_x = \mu^{(2)}_x = \mu^{(1)}_y = \mu^{(2)}_y = 1$ (which would also set γ = 1 in (A.4)). This completes the proof of theorems 1 and 2 for the two component case.

A.2. Necessary condition for the C > 2 case

The necessary condition from theorem 1 for the C > 2 case can be seen as a corollary of the same theorem 1 for the C = 2 case, since a nonparametric CIMM instance with more than two components can be identifiable only if for every pair of components, the two component mixture formed by the pair (after appropriately scaling their weights to add up to 1) is identifiable.

A.3. Sufficient condition for the C > 2 case

Let $\Omega_x$ and $\Omega_y$ be the sample spaces of x and y, respectively, and let $\mathbb{P}[\,\cdots]$ represent the probability of an event. Let us consider a bivariate CIMM instance with C > 2 components which satisfies condition (37), i.e. the sufficient condition for identifiability according to theorem 2 (which is to be proved here)¹³ . Let w_i, $f\,^{(i)}_x$ , $f\,^{(i)}_y$ , $\alpha^{(i)}_x$ , and $\alpha^{(i)}_y$ have the same meanings as in the rest of the paper.

From the definition of $\mathop{\textrm{ess}\,\textrm{sup}}$ , we can see that for all $0 \lt \epsilon \lt 1$ , there exist disjoint sets $X_1, \dots, X_C \subset \Omega_x$ and disjoint sets $Y_1, \dots, Y_C \subset \Omega_y$ such that¹⁴

$\begin{align} \mathbb{P}[x\in X_i] \gt 0\,,& \qquad \forall i\in\{1,\dots, C\}\,, \end{align} \tag{ A.7a }$

$\begin{align} \mathbb{P}[y\in Y_i] \gt 0\,,& \qquad \forall i\in\{1,\dots, C\}\,, \end{align} \tag{ A.7b }$

$\begin{align} (1-\epsilon) \leqslant \alpha^{(i)}_x(x) \leqslant 1\,,& \qquad \forall x\in X_i\,, \forall i\in\{1,\dots, C\}\,, \end{align} \tag{ A.7c }$

$\begin{align} (1-\epsilon) \leqslant \alpha^{(i)}_y(y) \leqslant 1\,,& \qquad \forall y\in Y_i\,, \forall i\in\{1,\dots, C\}\,. \end{align} \tag{ A.7d }$

From (A.7c ) and (A.7d ), we can see that

$\begin{align} \alpha^{(i)}_x(x) \leqslant \epsilon\,,& \qquad \forall x\in X_j\,, ~~\forall (i,j) \in \{1,\dots, C\}^2~:~i\neq j\,, \end{align} \tag{ A.8a }$

$\begin{align} \alpha^{(i)}_y(y) \leqslant \epsilon\,,& \qquad \forall y\in Y_j\,, ~~\forall (i,j) \in \{1,\dots, C\}^2~:~i\neq j\,. \end{align} \tag{ A.8b }$

As ε is made arbitrarily small, the region $x\in X_i$ and the region $y\in Y_i$ become arbitrarily close to being populated exclusively by the component i. This induces a block diagonal structure, with the probability $\mathbb{P}_{ij}\equiv\mathbb{P}\left[(x,y)\in X_i\times Y_j\right]$ becoming arbitrarily small if i ≠ j. More concretely, from (13), we can write

$\begin{equation} \mathbb{P}_{ij} = \mathbb{P}[x\in X_i]~\mathbb{P}[y\in Y_j]~\sum_{k = 1}^C~\frac{E_{x\in X_i}\left[\alpha^{(k)}_x(x)\right]~E_{y\in Y_j}\left[\alpha^{(k)}_y(y)\right]}{w_k}\,. \end{equation} \tag{ A.9 }$

Using (A.7c ), (A.7d ), (A.8), and (A.9), we can show that

$\begin{equation} \mathbb{P}_{ij} \leqslant \epsilon~~\mathbb{P}[x\in X_i]~\mathbb{P}[y\in Y_j]~\sum_{k = 1}^C~w_k^{-1}\,,\qquad \forall i\neq j\,. \end{equation} \tag{ A.10 }$

Similarly, using (A.7c ), (A.7d ), and (A.9) we can show that

$\begin{equation} \mathbb{P}_{ii} \geqslant (1-\epsilon)^2~~\frac{\mathbb{P}[x\in X_i]~\mathbb{P}[y\in Y_i]}{w_i}\,, \qquad \forall i\,. \end{equation} \tag{ A.11 }$

Now, let us consider a different CIMM instance with weights $w^{^{\prime}}_i$ , distributions $f^{^{\prime}\,(i)}_x$ and $f^{^{\prime}\,(i)}_y$ , and classifiers $\alpha^{^{\prime}\,(i)}_x$ and $\alpha^{^{\prime}\,(i)}_y$ which has an observationally equivalent distribution ${\mathcal{P}}(x,y)$ as the original CIMM instance. We will refer to this as the 'primed CIMM instance'. We will prove that the original CIMM is identifiable by showing that the primed CIMM instance must be equivalent to the original, up to permutations of the component index i.

The key observation is that in the small ε limit, no component of primed CIMM instance can have non-vanishing contributions in the region $(x,y)\in X_i\times Y_i$ for more than one i. If some component of the primed instance, say the kth component, has non-vanishing contributions to $\mathbb{P}_{ii}$ and $\mathbb{P}_{jj}$ for i ≠ j, then the 'off-diagonal probabilities' $\mathbb{P}_{ij}$ and $\mathbb{P}_{ji}$ will also receive non-vanishing contributions (due to conditional independence), which is not allowed by (A.10). To make this argument more carefully, it can be shown, from (A.9), that for all $i,j,k\in\{1,\dots, C\},$

$\begin{equation} \begin{split} \mathbb{P}_{ij}\,\mathbb{P}_{ji} & \geqslant \mathbb{P}[x\in X_i]~\mathbb{P}[y\in Y_i]~\mathbb{P}[x\in X_j]~\mathbb{P}[y\in Y_j]\\ & \quad \times \frac{E_{x\in X_i}\left[\alpha^{^{\prime}\,(k)}_x(x)\right]\,E_{y\in Y_i}\left[\alpha^{^{\prime}\,(k)}_y(y)\right]\,E_{x\in X_j}\left[\alpha^{^{\prime}\,(k)}_x(x)\right]\,E_{y\in Y_j}\left[\alpha^{^{\prime}\,(k)}_y(y)\right]}{w^{^{\prime}\,2}_k}\,. \end{split} \end{equation} \tag{ A.12 }$

Using (A.10) and (A.12), we can show that for all $i,j,k\in\{1,\dots, C\}$ with i ≠ j,

$\begin{equation} \frac{E_{x\in X_i}\left[\alpha^{^{\prime}\,(k)}_x(x)\right]\,E_{y\in Y_i}\left[\alpha^{^{\prime}\,(k)}_y(y)\right]}{w^{^{\prime}}_k}~\frac{E_{x\in X_j}\left[\alpha^{^{\prime}\,(k)}_x(x)\right]\,E_{y\in Y_j}\left[\alpha^{^{\prime}\,(k)}_y(y)\right]}{w^{^{\prime}}_k} \leqslant \left[\epsilon \sum_{k = 1}^C~w_k^{-1}\right]^2\,. \end{equation} \tag{ A.13 }$

This equation captures the constraint that no component k of the primed CIMM instance can have non-vanishing contributions in two different regions $(x,y)\in X_i \times Y_i$ and $(x,y)\in X_j \times Y_j$ with i ≠ j. On the other hand, each of the 'diagonal regions' must receive a non-vanishing contribution from at least of the components of the primed instance. More concretely, from (A.9) and (A.11), one can see that for all $i\in \{1,\dots, C\}$ , there exists a $k\in \{1,\dots, C\}$ such that

$\begin{equation} \frac{E_{x\in X_i}\left[\alpha^{^{\prime}\,(k)}_x(x)\right]\,E_{y\in Y_i}\left[\alpha^{^{\prime}\,(k)}_y(y)\right]}{w^{^{\prime}}_k} \geqslant \frac{(1-\epsilon)^2}{C\,w_i}\,. \end{equation} \tag{ A.14 }$

From (A.13) and (A.14) and the fact that both the original and the primed CIMM instances have the same number of components, one can see that as ε is made arbitrarily small, there exists a permutation σ of the component indices such that $f\,^{(i)}_x$ -s are observationally equivalent to the corresponding $f^{^{\prime}\,\sigma(i)}_x$ -s in the region $x\in \bigcup\limits_{i = 1}^C\, X_i$ , and similarly $f\,^{(i)}_y$ -s are observationally equivalent to $f^{^{\prime}\,\sigma(i)}_y$ -s in the region $y\in \bigcup\limits_{i = 1}^C\, Y_i$ .

The equality of the weights w_i and $w^{^{\prime}}_{\sigma(i)}$ and the equivalence of the distributions $f\,^{(i)}_x$ and $f^{^{\prime}\,\sigma(i)}_x$ in the entire sample space $\Omega_x$ follows from the fact that both the original and primed CIMM instances have observationally equivalent distribution ${\mathcal{P}}(x,y)$ in the region $(x,y)\in \Omega_x\times Y_i$ —note that Y_i has a non-zero measure for all ε > 0. A symmetric argument establishes the equivalence of the distributions $f\,^{(i)}_y$ and $f^{^{\prime}\,\sigma(i)}_y$ in the entire sample space $\Omega_y$ . This completes the proof of theorem 2 for $C\geqslant 2$ components.

Appendix B.: Functional gradient of `neg_ctc_cost`

In this section we will discuss a strategy that can speed-up and improve the training of InClass nets using the $\texttt{neg}\_\texttt{ctc}\_\texttt{cost}$ of (24b )

$\begin{equation} \texttt{neg}\_\texttt{ctc}\_\texttt{cost} = - E_{{\mathcal{P}}^\ast}\left[~\log\left\{\frac{\displaystyle\sum_{i = 1}^C~\left[\prod_{v = 1}^V~\beta^{(i)}_v~\left(E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_v\right]\right)^{\frac{1-V}{V}}\right]}{\displaystyle\sum_{i = 1}^C~\left[\prod_{v = 1}^V\left(E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_v\right]\right)^{1/V}\right]}\right\}~\right]\,. \end{equation} \tag{ B.1 }$

Note that there are multiple expectations $E_{{\mathcal{P}}^\ast}$ in the expression for the cost function. The outermost expectation is similar to the one in a cost function which can be written as an expectation over a per-datapoint loss function. For such cost functions (which only have an outermost expectation), one can use stochastic or mini-batch gradient descent for faster or more efficient training of the network. However, the presence of the inner expectations $\varphi^{(i)}_v \equiv E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_v\right]$ in our cost function means that the batch size used in the training should be large enough to estimate the pseudo weights $\varphi^{(i)}_v$ well. In particular, the batch size should be large enough to pick up subtle changes in the value of $\varphi^{(i)}_v$ caused by changes to the network weights θ . The need for large batch sizes will only be exacerbated as the number of components increases.

However, we can overcome this difficulty, and facilitate the use of stochastic gradient descent to optimize the $\texttt{neg}\_\texttt{ctc}\_\texttt{cost}$ as shown below. We will begin by deriving the expression for the functional derivative of the cost function with respect to the NN outputs. For convenience, let is define N and D as

$\begin{align} N &\equiv \sum_{i = 1}^C~\left[\prod_{v = 1}^V~\beta^{(i)}_v~\left(E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_v\right]\right)^{\frac{1-V}{V}}\right]\,, \end{align} \tag{ B.2a }$

$\begin{align} D &\equiv \sum_{i = 1}^C~\left[\prod_{v = 1}^V\left(E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_v\right]\right)^{1/V}\right]\,. \end{align} \tag{ B.2b }$

This lets us write

$\begin{equation} \texttt{neg}\_\texttt{ctc}\_\texttt{cost} = -E_{{\mathcal{P}}^\ast}\left[\log\left(\frac{N}{D}\right)\right]\,. \end{equation} \tag{ B.3 }$

Taking the functional derivative with respect to $\beta^{(j)}_u(x^{^{\prime}}_u)$ , one gets

$\begin{align} &\frac{\delta\,\texttt{neg}\_\texttt{ctc}\_\texttt{cost}}{\delta\,\beta^{(j)}_u(x^{^{\prime}}_u)} = - E_{{\mathcal{P}}^\ast}\left[~\frac{1}{N}\frac{\delta\,N}{\delta\,\beta^{(j)}_u(x^{^{\prime}}_u)}~\right] + \frac{1}{D}\frac{\delta\,D}{\delta\,\beta^{(j)}_u(x^{^{\prime}}_u)} \end{align} \tag{ B.4 }$

$\begin{align} \qquad\qquad& = -~\frac{{\mathcal{P}}^\ast_u(x^{^{\prime}}_u)}{\beta^{(j)}_u(x^{^{\prime}}_u)}~E_{{\mathcal{P}}^\ast}\left[\left.~\frac{~\displaystyle\prod_{v = 1}^V~\beta^{(j)}_v(x_v)~\left(E_{{\mathcal{P}}^\ast}\left[\beta^{(j)}_v\right]\right)^{(1-V)/V}~}{~\displaystyle\sum_{i = 1}^C~\left[\prod_{v = 1}^V~\beta^{(i)}_v(x_v)~\left(E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_v\right]\right)^{(1-V)/V}\right]~} ~\right|~ x_u = x^{^{\prime}}_u ~\right] \nonumber\\ & \quad -~\frac{{\mathcal{P}}^\ast_u(x^{^{\prime}}_u)}{E_{{\mathcal{P}}^\ast}\left[\beta^{(j)}_u\right]}~E_{{\mathcal{P}}^\ast}\left[~\frac{~\displaystyle\prod_{v = 1}^V~\beta^{(j)}_v~\left(E_{{\mathcal{P}}^\ast}\left[\beta^{(j)}_v\right]\right)^{(1-V)/V}~}{~\displaystyle\sum_{i = 1}^C~\left[\prod_{v = 1}^V~\beta^{(i)}_v~\left(E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_v\right]\right)^{(1-V)/V}\right]~}~\right]\nonumber\\ & \quad +~\frac{{\mathcal{P}}^\ast_u(x^{^{\prime}}_u)}{E_{{\mathcal{P}}^\ast}\left[\beta^{(j)}_u\right]}~\frac{~\displaystyle\prod_{v = 1}^V\left(E_{{\mathcal{P}}^\ast}\left[\beta^{(j)}_v\right]\right)^{1/V}~}{~\displaystyle\sum_{i = 1}^C~\left[\prod_{v = 1}^V\left(E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_v\right]\right)^{1/V}\right]~}\,. \end{align} \tag{ B.5 }$

Using this, we can write the gradient of the cost function with respect to the NN weights θ as

$\begin{align} \nabla_{\boldsymbol{\theta}}\,&\texttt{neg}\_\texttt{ctc}\_\texttt{cost} = \sum_{j = 1}^C~\sum_{u = 1}^V~\int d x^{^{\prime}}_u~ \frac{\delta\,\texttt{neg}\_\texttt{ctc}\_\texttt{cost}}{\delta\,\beta^{(j)}_u(x^{^{\prime}}_u)}~\nabla_{\boldsymbol{\theta}}\,\beta^{(j)}_u(x^{^{\prime}}_u) \end{align} \tag{ B.6 }$

$\begin{align} & = -\sum_{j,u}~E_{{\mathcal{P}}^\ast}\left[~\frac{\nabla_{\boldsymbol{\theta}}\,\beta^{(j)}_u}{\beta^{(j)}_u}~\frac{~\displaystyle\prod_{v = 1}^V~\beta^{(j)}_v~\left(E_{{\mathcal{P}}^\ast}\left[\beta^{(j)}_v\right]\right)^{(1-V)/V}~}{~\displaystyle\sum_{i = 1}^C~\left[\prod_{v = 1}^V~\beta^{(i)}_v~\left(E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_v\right]\right)^{(1-V)/V}\right]~}~\right] \nonumber\\ & \quad -\sum_{j,u}~\frac{E_{{\mathcal{P}}^\ast}\left[\nabla_{\boldsymbol{\theta}}\,\beta^{(j)}_u\right]}{E_{{\mathcal{P}}^\ast}\left[\beta^{(j)}_u\right]}~E_{{\mathcal{P}}^\ast}\left[~\frac{~\displaystyle\prod_{v = 1}^V~\beta^{(j)}_v~\left(E_{{\mathcal{P}}^\ast}\left[\beta^{(j)}_v\right]\right)^{(1-V)/V}~}{~\displaystyle\sum_{i = 1}^C~\left[\prod_{v = 1}^V~\beta^{(i)}_v~\left(E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_v\right]\right)^{(1-V)/V}\right]~}~\right] \nonumber\\ & \quad +\sum_{j,u}~\frac{E_{{\mathcal{P}}^\ast}\left[\nabla_{\boldsymbol{\theta}}\,\beta^{(j)}_u\right]}{E_{{\mathcal{P}}^\ast}\left[\beta^{(j)}_u\right]}~\frac{~\displaystyle\prod_{v = 1}^V\left(E_{{\mathcal{P}}^\ast}\left[\beta^{(j)}_v\right]\right)^{1/V}~}{~\displaystyle\sum_{i = 1}^C~\left[\prod_{v = 1}^V\left(E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_v\right]\right)^{1/V}\right]~}\,. \end{align} \tag{ B.7 }$

This expression allows us to approximate the gradient of the cost function as

$\begin{align} \nabla_{\boldsymbol{\theta}}\,\texttt{neg}\_\texttt{ctc}\_\texttt{cost} &\approx -\sum_{j,u}~E_{{\mathcal{P}}^\ast}\left[~\frac{\nabla_{\boldsymbol{\theta}}\,\beta^{(j)}_u}{\beta^{(j)}_u}~\frac{~\displaystyle\prod_{v = 1}^V~\beta^{(j)}_v~\left(\hat{\varphi}^{(j)}_v\right)^{(1-V)/V}~}{~\displaystyle\sum_{i = 1}^C~\left[\prod_{v = 1}^V~\beta^{(i)}_v~\left(\hat{\varphi}^{(i)}_v\right)^{(1-V)/V}\right]~}~\right] \nonumber\\ & \quad -\sum_{j,u}~\frac{E_{{\mathcal{P}}^\ast}\left[\nabla_{\boldsymbol{\theta}}\,\beta^{(j)}_u\right]}{\hat{\varphi}^{(j)}_u}~\times~\texttt{aux}^{(j)} \nonumber\\ & \quad +\sum_{j,u}~\frac{E_{{\mathcal{P}}^\ast}\left[\nabla_{\boldsymbol{\theta}}\,\beta^{(j)}_u\right]}{\hat{\varphi}^{(j)}_u}~\frac{~\displaystyle\prod_{v = 1}^V\left(\hat{\varphi}^{(j)}_v\right)^{1/V}~}{~\displaystyle\sum_{i = 1}^C~\left[\prod_{v = 1}^V\left(\hat{\varphi}^{(i)}_v\right)^{1/V}\right]~}\,, \end{align} \tag{ B.8 }$

where $\hat{\varphi}^{(i)}_v$ represents a moving estimate of $E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_v\right]$ maintained throughout the network training process and $\texttt{aux}^{(j)}$ represents a moving estimate of

$\begin{equation} E_{{\mathcal{P}}^\ast}\left[~\frac{~\displaystyle\prod_{v = 1}^V~\beta^{(j)}_v~\left(E_{{\mathcal{P}}^\ast}\left[\beta^{(j)}_v\right]\right)^{(1-V)/V}~}{~\displaystyle\sum_{i = 1}^C~\left[\prod_{v = 1}^V~\beta^{(i)}_v~\left(E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_v\right]\right)^{(1-V)/V}\right]~}~\right]\, . \end{equation} \tag{ B.9 }$

Maintaining the moving estimates $\hat{\varphi}^{(i)}_v$ and $\texttt{aux}^{(j)}$ is comparable to maintaining a discriminator in the training of a Generative Adversarial Network (GAN). The discriminator can be used to evaluate (and improve) the generator using mini-batches of data, instead of evaluating the (gradient of the) statistical distance between the training dataset and the GAN-dataset from scratch at every training step. Likewise, $\hat{\varphi}^{(i)}_v$ and $\texttt{aux}^{(j)}$ faciliate the use of stochastic or mini-batch gradient descent for the $\texttt{neg}\_\texttt{ctc}\_\texttt{cost}$ using (B.8)—all the expectations in that expression are amenable to replacement with stochastic or mini-batch estimates. We note that this strategy was not needed for the studies performed in this paper.

The strategy employed in this section to facilitate the use to stochastic gradient descent for optimizing $\texttt{neg}\_\texttt{ctc}\_\texttt{cost}$ is applicable to a number of cost functions which cannot be written as expectations of per-datapoint loss functions. We will expand on this idea in future publications, and may implement it in future versions of RainDancesVI.

Appendix C.: Surrogate cost functions

For the cost function $\texttt{neg}\_\texttt{ctc}\_\texttt{cost}$ from (24b ), we can define a surrogate cost function $\texttt{neg}\_\texttt{ctc}\_\texttt{cost}$ as

$\begin{equation} \texttt{unnorm}\_\texttt{neg}\_\texttt{ctc}\_\texttt{cost} = - E_{{\mathcal{P}}^\ast}\left[~\log\left\{\displaystyle\sum_{i = 1}^C~\left[\prod_{v = 1}^V~\beta^{(i)}_v~\left(E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_v\right]\right)^{(1-V)/V}\right]\right\}~\right]\,. \end{equation} \tag{ C.1 }$

This surrogate cost function is an alternative cost function whose minimization will also lead to the minimization of the $\texttt{neg}\_\texttt{ctc}\_\texttt{cost}$ . More concretely, if the true distribution ${\mathcal{P}}^\ast$ does correspond to a CIMM, then the surrogate cost function will be minimized only when the network outputs (pseudo classifiers) $\beta^{(i)}_v$ match the classifiers $\alpha^{(i)}_v$ that correspond to a best fitting CIMM instance. This can be proved as follows: From (24b ) and (C.1), we have

$\begin{align} \texttt{neg}\_\texttt{ctc}\_\texttt{cost} & = \texttt{neg}\_\texttt{ctc}\_\texttt{cost} - E_{{\mathcal{P}}^\ast}\left[\,\log\left\{\displaystyle\sum_{i = 1}^C\left[\prod_{v = 1}^V\left(E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_v\right]\right)^{1/V}\right]\right\}\right] \end{align} \tag{ C.2a }$

$\begin{align} &\geqslant \texttt{neg}\_\texttt{ctc}\_\texttt{cost} - E_{{\mathcal{P}}^\ast}\left[~\log\left\{\displaystyle\frac{1}{V}\sum_{i = 1}^C~\sum_{v = 1}^V\left(E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_v\right]\right)\right\}~\right] \end{align} \tag{ C.2b }$

$\begin{align} & = \texttt{neg}\_\texttt{ctc}\_\texttt{cost}\,. \end{align} \tag{ C.2c }$

In (C.2b ), we have used the inequality of arithmetic and geometric means and in step (C.2c ), we have used the constraint $\displaystyle\sum_{i = 1}^C~\beta^{(i)}_v(x_v) = 1$ satisfied by the NN outputs. Note that setting the pseudo classifiers $\beta^{(i)}_v$ to be equal to the classifiers $\alpha^{(i)}$ corresponding to a best fitting CIMM instance both a) minimizes $\texttt{neg}\_\texttt{ctc}\_\texttt{cost}$ , and b) satisfies the condition for equality in (C.2b ). This completes the proof that the unnorm_neg_ctc_cost is a suggogate cost function for the neg_ctc_cost when the data does correspond to some CIMM. The bivariate special case unnorm_neg_cmi_cost which is a surrogate cost function for the neg_cmi_cost can be explictly written as

$\begin{equation} \texttt{unnorm}\_\texttt{neg}\_\texttt{cmi}\_\texttt{cost} = - E_{{\mathcal{P}}^\ast}\left[~\log\left\{\displaystyle\sum_{i = 1}^C~\frac{\beta^{(i)}_x\,\beta^{(i)}_y}{\sqrt{\displaystyle E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_x\right]\,E_{{\mathcal{P}}^\ast}\left[\beta^{(i)}_y\right]}}\right\}~\right]\,. \end{equation} \tag{ C.3 }$

These surrogate cost functions are also implemented in the RainDancesVI package.