Learning latent functions for causal discovery

Causal discovery from observational data offers unique opportunities in many scientific disciplines: reconstructing causal drivers, testing causal hypotheses, and comparing and evaluating models for optimizing targeted interventions. Recent causal discovery methods have focused on estimating the latent space of the data to get around a lack of causal sufficiency or additivity constraints. However, estimating the latent space significantly increases model complexity, compromising causal identifiability and making it hard to compare models that correspond to different causal hypotheses. We propose a kernel-based, non-parametric latent-space modelling approach and deal with the difficulty of comparing causal directions by measuring and controlling for the level of causal-assumption fulfilment. We introduce a latent noise causal inference framework that estimates latent factors associated with the hypothesized causal direction by optimizing a loss function with kernel independence criteria. We extend the framework to work with time series using an additional time-dependent kernel regularizer. We discuss the additivity assumption and model complexity, and give empirical evidence of performance on a wide range of synthetic and real causal discovery problems.


Introduction
Causal inference provides the proper mathematical framework to discover and explain the causal structure of complex systems (Peters et al 2017a, Zhang et al 2018, Runge et al 2019a). Within this framework, a variable X is the cause of another variable Y if intervening on X necessarily affects Y. Often, interventions in the system are impossible for ethical, practical or economic reasons. Then, going beyond the commonly adopted correlation approach, observational causal inference comes into play to extract cause-effect relationships from multivariate data. Causal discovery is nowadays an active field of research in many disciplines: neuroscience (Ding et al 2006, Barack et al 2022), epidemiology (Rothman and Greenland 2005), econometrics (Hünermund and Bareinboim 2019), remote sensing (Pérez-Suay and Camps-Valls 2019), ecology (Sugihara et al 2012a) and Earth and climate sciences (Runge et al 2019a, Díaz et al 2022). Causal discovery methods from observational data can broadly be classified into score-based, constraint-based, and asymmetry-based methods (Glymour et al 2019). Score-based methods generally rely on a parametric model for the density of the data and incorporate a penalty term to account for the complexity of the model. The latter two families generally assume data is generated by mechanisms governed by a set of equations (or at least mechanical sampling processes) and then study the probabilistic properties of the system under this assumption. Constraint-based methods, such as the PC algorithm (Spirtes et al 2000), aim to learn the structure of the system by extracting its conditional independence relations. Asymmetry-based methods (Mooij et al 2016, Peters et al 2017a) estimate the causal direction by assuming modularity (Pearl 2009), algorithmic independence (Janzing and Schölkopf 2010) or the independence of cause and mechanism (ICM) (Daniušis et al 2010). We here focus on the asymmetry-based approach, which has recently shown very good performance in causal-discovery tasks for both additive and non-additive data (Mitrovic et al 2018, Tagasovska et al 2020). Some of these methods circumvent the causal sufficiency assumption or non-additivity of the data by estimating a latent space (Stegle et al 2010). Such an approach is compelling, especially for causal inference, since it allows the estimation of every element of the data-generating mechanism (i.e. even latent, noise variables) instead of only the joint distribution of the observed variables. However, modelling the latent space increases the model's complexity, compromising identifiability whenever assumptions are not enforced and complicating the comparison of the models that represent the different causal hypotheses. Empirically, this has resulted in poor causal discovery performance for these methods (Tagasovska et al 2020).

Related work
Asymmetry-based methods exploit additional assumptions on the data-generating mechanism to identify the correct causal hypothesis, even among those belonging to the same Markov equivalence class. Different assumptions are used to break the asymmetry within Markov equivalent classes. Still, they all share a conception of the data-generating process as a sampling mechanism according to some program, albeit one that admits stochasticity. Modularity (Pearl 2009) refers to the assumption that the sampling is carried out one variable at a time, ancestrally ('causes before effects'), and that changing the process of any one variable does not affect the processes of the others. A stronger assumption beyond modularity is that the sampling process of the system's variables can be described by a functional causal model (FCM), i.e. a set of functional equations e_i = f(c_i, n_i) that govern how effects e_i are sampled as the interaction of the causes c_i and exogenous stochastic factors n_i. Several methods have been developed by assuming additive interactions, i.e. e_i = f(c_i) + n_i (Shimizu et al 2006, Hoyer et al 2008, Bühlmann et al 2014, Hernández-Lobato et al 2016). In Zhang and Hyvärinen (2009), the additive model assumption was somewhat weakened by allowing for an invertible function g, plausibly describing measurement error on the effect variable: g(e_i) = f(c_i) + n_i. Another group of methods makes more general assumptions to avoid making specific hypotheses on the functional form of the sampling process. These methods focus on the joint density of the observed variables and its factorization. One such assumption, made by Lemeire and Dirkx (2006), is that the algorithmic complexity of the program that samples effects given causes (nature's program) is simpler than what would be necessary to sample causes given effects. This assumption veers slightly from the modularity assumption. In Janzing and Schölkopf (2010), the minimal complexity idea is related to modularity through the concept of algorithmic independence. The two ideas are related in that work by postulating that the causal hypothesis with minimal complexity is also the one corresponding to a factorization of the joint distribution when algorithmic independence between the conditionals is satisfied. To measure algorithmic independence empirically, Daniušis et al (2010) measured the correlation between the input and the conditional distribution. To measure deviance from the equivalent minimal complexity assumption, Marx and Vreeken (2019a, 2019b) used a minimum description length (MDL) technique. A further assumption in this direction is that of independence of cause and mechanism (ICM) (Daniušis et al 2010, Chen et al 2014, Bloebaum et al 2018), by which changes in the level of the cause do not have any influence on the algorithm or program. Therefore, the conditional distribution should be algorithmically independent from the distribution of the cause. In this case, one uses the concept of algorithmic independence not to verify that minimal complexity is achieved by a certain factorization of the joint distribution but to verify the ICM. In Mitrovic et al (2018), the ICM was assumed (and verified) by checking that the algorithmic complexity of the conditionals is the same, up to the noise, for differing levels of the cause.
Many asymmetry-based methods check this criterion by modelling the data using kernel methods (Schölkopf and Smola 2002); the conditional densities and their complexity are indirectly estimated through the conditional mean embedding and its norm, respectively. Recently, in Tagasovska et al (2020), the complexity of different factorizations of the joint distribution was measured using quantile regression, with promising empirical results.
Another approach to deal with non-additivity is to estimate the latent space of the data such that the observed and estimated latent variables satisfy an additive model. Instead of focusing on the factorization of the joint density of the observed variables, one goes back to the FCM sampling mechanism and estimates the joint of an augmented but simpler (additive) set of variables. Estimating the latent space is the approach used in Stegle et al (2010), Monti et al (2020): in Stegle et al (2010) this is done through a Bayesian framework, while in Monti et al (2020) it is done through autoregressive flow models.

Contributions
In this paper, we adopt a non-parametric kernel approach to estimate the latent space and model the observed variables. We aggregate an ensemble of models across hyperparameters and causal hypotheses to obtain a causal prediction.
We propose a novel method for latent noise (LN) estimation based on kernels and introduce a bivariate causal discovery algorithm (LNc) based on this method. Using LN, we can obtain different models for the data corresponding to different causal hypotheses. However, comparing these models is not straightforward. To address this problem, we propose a weighted majority voting rule that allows us to aggregate the models into a causal prediction. Theoretically, we make the same assumptions as Stegle et al (2010), but we use a non-probabilistic kernel-method approach to do so. Besides, unlike Lopez-Paz et al (2015) and Ton et al (2020), where causal discovery is carried out in a supervised manner, our approach avoids the overfitting and explainability pitfalls of supervision by building principled descriptors of model assumption fulfilment, including causal sufficiency or additivity. We show empirically that estimating the LN becomes increasingly important as the data become less additive. We illustrate performance empirically and show that the methodology is superior to other latent-space-based methods and comparable to state-of-the-art methods on benchmark datasets and real-world problems. Finally, we extend the framework to work with time series data.

Outline
The remainder of the paper is organized as follows. Section 2 details our latent variable estimation method and the bivariate causal discovery method, which builds upon it. This section also discusses how we detect non-additive or non-causally sufficient data by using an additivity hypothesis test. In section 3, we explore the artificially generated and real-world, i.i.d. and time series data that we will use to evaluate the performance of our method. In section 4, we show empirical evidence of performance for these datasets and compare against the state-of-the-art methods, including two latent space methods. We also show performance as a function of the degree of non-additivity in the data to illustrate the effectiveness of the latent variable approach. We provide some concluding remarks in section 6.

LN causal inference
Let us assume a dataset of observed pairs D = {(x_i, y_i)}_{i=1}^n ∈ R^{2×n} generated according to a mechanism e = f(c, z), where it is unknown whether x is the cause (c) and y the effect (e), or vice versa. Owing to Reichenbach's common cause principle (Reichenbach 1991), we approach this discovery problem by performing two sequential and related tasks: (1) estimating the latent factor z, and (2) performing bivariate causal discovery, namely predicting a label from {x → y, y → x}. Since for arbitrary x and y there is generally a z and f such that y = f(x, z) (see theorem 1 in Hyvärinen and Pajunen 1999), we must make further assumptions to force an asymmetry.

Specific FCMs vs ICM
Recently there have been two approaches to arguing theoretically for causal identifiability in the bivariate setting: (1) assume a specific FCM, that is, make specific assumptions about the class F of f and the distributions of x and z, and then use probabilistic arguments to show why y = f(x, z) and x = f(y, z) are not both possible under these assumptions; (2) make the ICM (Daniušis et al 2010) modularity assumption and argue that, under this assumption, the model that implies the factorization of the joint pdf with least Kolmogorov complexity (KC) [3] is the true causal model [4]. Generally, works in bivariate causal discovery justify why causal identifiability is possible using one of the two above approaches and then implement a causal discovery algorithm that leverages the ideas or results therein. Methods that follow the first approach estimate models for the two competing factorizations of the joint distribution [5], building the functional and distributional assumptions into their algorithm, and score both alternatives; examples can be found in Hoyer et al (2009), Hyvärinen (2009) and Khemakhem et al (2020). Methods that follow the second approach implement an algorithm that estimates the complexity of the two competing factorizations of the joint distribution; examples include Marx and Vreeken (2019a) and Tagasovska et al (2020). We adopt the second approach to avoid making specific assumptions about the generating FCM.

Strengths, weaknesses and relationship
The first approach could be interpreted as a subset of the second: if the true causal model comes from a relatively simple FCM, then by the ICM principle, the anti-causal model comes from a more complex one. By using a priori information (assumptions) about the true causal model, we can prove, in certain cases, that any admissible anti-causal model does not belong in the simple class. The weakness of the second approach is that proving identifiability in practice is impossible, since it relies on KC. There is no way of computing the KC of FCMs, either when we are working with data that they generate or even when we have knowledge of the probability distributions that specify them (Li and Vitányi 2008). However, following the second approach, one could use a proxy for KC and then prove causal identifiability under specific distributional and functional assumptions, as in Tagasovska et al (2020). This kind of result still holds to the extent that the chosen proxy is a good enough estimate of KC. In bivariate causal discovery, we are usually in a situation where we assume there is no a priori knowledge of the true causal structure. While the first approach puts causal identifiability on sound theoretical footing, it is undesirable in the context of (observational) causal discovery since we need to make concrete assumptions about the data-generating process. In practice, it is arguably rare to know such specific properties of the data-generating process while having no a priori knowledge of the causal structure. On the other hand, in the ICM approach, there are no proper identifiability results. Still, if we accept this assumption and are willing to use a proxy for KC, its applicability is very general.

Our assumptions
We make the following assumptions, as in Stegle et al (2010):
1. Deterministic process. f is a deterministic function representing the mechanism through which the (possibly non-additive) interaction of the causes c and z results in the effect e. Since there generally exists an f and z such that y = f(x, z) for any x and y, it is also true that there exist f, z and ε such that y = f(x, z) + ε; we restrict the class of models to those where ε = 0.
2. Exogenous noise z. The causes c and z are independent. If we set z := y, then the function f(x, z) := z trivially satisfies y = f(x, z); by restricting z to be exogenous from (x, y), we avoid this degenerate case and restrict z further.
3. Gaussian z. The z_i are unobserved i.i.d. samples from a standard Gaussian distribution. Again, this choice further restricts the solution space for z.
4. Algorithmic independence. The causes c and z are algorithmically independent (Janzing and Schölkopf 2010) from the mechanism f.
[3] Informally, the KC of a pdf is the least amount of information needed to encode it in the most concise form possible. [4] The non-causal factorization could, at worst, have equal KC, in which case there is no causal identifiability, but never less. [5] A causal link is assumed, so the factorization p(x, y) = p(x)p(y) is ruled out a priori.
Following Stegle et al (2010), we make these assumptions to restrict the model class enough that causal identifiability is possible. We do not contribute a theoretical result proving that these assumptions lead to identifiability, but the empirical performance results in section 4 will show their usefulness. We do not enforce these assumptions strictly, and later we empirically analyse their necessity based on causal prediction performance. Not enforcing them strictly enables us to use additive noise model (ANM) techniques, which require non-zero residuals and thus violate assumption 1. The rationale is that, by enforcing these assumptions, we will have restricted the model class enough that an ANM of the form e = f(c, ẑ) + ε is only possible for one of the two considered causal directions.

An extended ICM assumption
In sections 2.1.1 and 2.1.2, we discussed and contrasted the ICM approach with the approach of making specific FCM assumptions. In section 2.1.3, we stated our assumptions (those of Stegle et al (2010)). In this section, we establish that these assumptions amount to an extended ICM assumption. Specifically, the exogenous noise and algorithmic independence assumptions are implicit in the ICM assumption, while the deterministic process and Gaussianity assumptions are additional, which is why we refer to this as an extended ICM assumption.
The ICM assumption states that if x causes y, then p(y|x) and p(x) should be algorithmically independent. This requires that f and p(x) be algorithmically independent, since p(y|x) is the pdf of the random variable f_x(z) with x fixed (in p(y|x), x is given and so not random). Additionally, the ICM assumption implies that x be statistically independent of z. To see this, take the converse view that for a given realization of z, y = f_z(x) with z fixed, so that the noise variable z selects the specific deterministic mechanism. Then, if x and z are statistically dependent, knowing the distribution of x gives information about f (cf section 2.1 of Peters et al (2017b) for further discussion) and so violates algorithmic independence. This is especially the case in non-additive regimes, where the fact that x and z interact non-additively means that, if they are dependent, more is known about p(y|x) by knowing p(x).
Additionally, we adopt the deterministic process and Gaussian assumptions. The deterministic mechanism assumption is a simplicity assumption that ensures our data model y = f(x, z) is an FCM with only one source of random noise. The Gaussianity assumption has helped in certain non-additive settings such as that dealt with in Khemakhem et al (2020) to prove identifiability. Still, as is hypothesized in that work, and as we will see empirically in this work, it is potentially not necessary.

Existing identifiability results
Three existing identifiability results are relevant for non-additive data. We reproduce their informal versions here to contextualize our problem setting better. Suppose x and z are statistically independent:
1. Post non-linear causal model (PNL) (Zhang and Hyvärinen 2009): data models of the form y = g(h(x) + z), where g is invertible and h is non-constant, are generically (excepting certain pathological conditions) causally identifiable.
2. Affine flow (AF) (Khemakhem et al 2020): data models of the form y = e^{h(x)} · z + g(x), where g is non-constant and non-linear, and x and z are Gaussian, are generically (excepting certain pathological conditions) causally identifiable.
3. Location-scale (LS) (Immer et al 2022): data models of the form y = e^{h(x)} · z + g(x), where h and g are twice differentiable, are generically (excepting certain pathological conditions) causally identifiable.
The first result means that models such as multiplicative models of the form y = f(x) · z, where f(x) and z are positive, are (generically) identifiable, since they can be re-expressed as ln(y) = ln(f(x)) + ln(z). However, relatively simple models, such as a location-scale model of the form y = h(x) · z + g(x), are not PNLs. The second result expands the class of non-additive models, since it shows that location-scale models (provided that h(x) takes positive values, g is non-linear, and x and z are Gaussian) are identifiable. The third result generalizes the second in that x and z are not restricted to be Gaussian. Although this result is already quite general, it still restricts the FCM to the form f(x, z) = h(x) · z + g(x). By adopting the more abstract assumptions of Stegle et al (2010) regarding the non-additive data model y = f(x, z), rather than those presented above, our aim is to gather empirical evidence that the ICM modularity assumption is enough to guarantee generic identifiability of non-additive models. We also contribute a method for causal discovery in the bivariate case. We acknowledge that this method is only theoretically valid to the extent that ICM is a valid causal identifying principle and that our proxy for complexity, model entropy, is a good proxy for KC.

Notation
We now fix the notation that we will use henceforth. We refer to random variables using lowercase letters x, y, z; to univariate observations of random variables using indexed lowercase x_i, y_i, z_i; and to sets of n observations using upper-case calligraphic letters X, Y, Z. For simplicity, we abuse notation by referring to the probability density function of random variable x as p(x). The upper-case letter K is reserved for Gram matrices, which in this work always correspond to the Gaussian radial basis function (RBF) kernel (Schölkopf and Smola 2002). A lowercase subscript, as in K_x, K_y and K_z, specifies the domain sample space, while a superscript, as in K^σ, denotes kernel hyperparameters.

Kernel measures of independence and normality
Consider random variables x and y, taking values in metric spaces X and Y. Let k_x and k_y be kernel functions on X and Y, with reproducing kernel Hilbert spaces (RKHSs) H_x and H_y. The cross-covariance operator is defined as the linear operator Σ_xy : H_y → H_x satisfying ⟨f, Σ_xy g⟩_{H_x} = Cov[f(x), g(y)] for all f ∈ H_x and g ∈ H_y. The Hilbert-Schmidt independence criterion (HSIC), measuring dependence between x and y, is then defined as the squared Hilbert-Schmidt norm ‖Σ_xy‖²_HS. Given a dataset (X, Y) with n pairs drawn from the joint distribution P(x, y), an empirical estimator of HSIC is (Gretton et al 2005a) HSIC(X, Y) = (1/n²) Tr(K_x H K_y H), where Tr(·) is the trace operation (the sum of the diagonal entries), K_x and K_y are the kernel matrices computed on the observations {x_i}_{i=1}^n and {y_i}_{i=1}^n using the kernel functions k_x(x, x′) and k_y(y, y′) respectively, and H = I − (1/n) 1 1^⊤ centres the data in feature space. For a broad family of kernels (including, e.g. the Gaussian RBF), the population HSIC equals 0 if and only if x and y are statistically independent (Gretton et al 2005a). Hence, non-parametric independence tests, which test for any form of dependence, can be devised using HSIC estimators with such kernels. Note that HSIC is sensitive to the scale of the marginal distributions of x and y and their units of measurement, and hence needs an appropriate normalization. This problem is well known in the literature, and a normalized version of HSIC, based on the normalized cross-covariance operator, was introduced in Fukumizu et al (2008). Since our objective does not need relative dependence comparisons across problems, we use a simpler strategy and normalize HSIC by the self-dependencies over X and Y, nHSIC(X, Y) = HSIC(X, Y)/√(HSIC(X, X) · HSIC(Y, Y)), which leads to the nHSIC score (Rojo-Álvarez et al 2018).
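For concreteness, the following is a minimal numpy sketch of these estimators (an illustration, not the paper's implementation); rbf_gram, hsic and nhsic are names chosen here, and the biased HSIC estimator and self-dependency normalization follow the formulas above.

```python
import numpy as np

def rbf_gram(x, sigma):
    # Gaussian RBF Gram matrix; x has shape (n,) or (n, d).
    x = x.reshape(len(x), -1)
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(Kx, Ky):
    # Biased empirical HSIC: (1/n^2) Tr(Kx H Ky H).
    n = Kx.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(Kx @ H @ Ky @ H) / n ** 2

def nhsic(x, y, sigma_x, sigma_y):
    # HSIC normalized by the self-dependencies of x and y.
    Kx, Ky = rbf_gram(x, sigma_x), rbf_gram(y, sigma_y)
    return hsic(Kx, Ky) / np.sqrt(hsic(Kx, Kx) * hsic(Ky, Ky))
```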
Another important measure in our algorithm is the maximum mean discrepancy (MMD) (Gretton et al 2006), which measures the distance between the kernel mean embeddings of two distributions, ‖µ_x − µ_y‖_H. For two samples of equal size n, the squared MMD can be estimated empirically as Tr(K L), where L_ij = 1/n² if x_i and x_j belong to the same (potential) cause domain and L_ij = −1/n² if x_i and x_j belong to the cross-domain. MMD tends asymptotically to zero when the two distributions P_x and P_y are the same. HSIC can be understood as an MMD (Gretton et al 2006, 2012) between the joint probability measure of x and y and the product of their marginals, and hence has been widely used in many learning frameworks (Li et al 2018, Liang et al 2018, 2019). MMD can also be used as a normality test for distributions. To do this, let us denote by SMMD the estimated MMD (Gretton et al 2006) between the distribution of X and a standard Gaussian, standardized by the variance of the MMD estimates; assuming an RBF kernel function with lengthscale σ, the MMD to a standard Gaussian has a closed-form expression (Rustamov 2019).
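As a rough illustration, the following sketch computes a Monte Carlo stand-in for SMMD: a fresh standard-normal sample replaces the closed-form Gaussian-target expression of Rustamov (2019), and the variance standardization is omitted; rbf_cross, mmd2 and smmd are names chosen here, not the paper's.

```python
def rbf_cross(x, y, sigma):
    # Cross Gram matrix k(x_i, y_j) for the Gaussian RBF kernel (1-d inputs).
    d2 = (x[:, None] - y[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(x, y, sigma):
    # Biased empirical squared MMD between two equal-length 1-d samples.
    return (rbf_cross(x, x, sigma).mean() + rbf_cross(y, y, sigma).mean()
            - 2 * rbf_cross(x, y, sigma).mean())

def smmd(z, sigma, rng=np.random.default_rng(0)):
    # Monte Carlo stand-in for SMMD: discrepancy of z to a standard Gaussian.
    return mmd2(z, rng.standard_normal(len(z)), sigma)
```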

Estimating LN with kernels
If the random variable z were observed up to a deterministic invertible functional dependency (i.e. we observe z′ where z′ = f(z)), the data-generating mechanism could be considered an ANM with zero-variance errors. Because we do not have access to the true Z but will instead estimate it, we do not expect to closely approximate this deterministic scenario. Instead, we fit a more flexible model (than the one we assume generated the data) of the form y = f(x, z) + e. Still, we aim to curtail the degree of non-determinism by penalizing large and systematic errors e (statistical dependencies of e on (x, z)). We can then apply bivariate causal discovery techniques for ANMs to determine in which causal direction this is better achieved. This will be our approach: simultaneously estimate the latent factor Z and f (from the more flexible model) such that the mechanism is additive. We approach the deterministic scenario as we arrive at a more additive model (fewer systematic errors) with smaller errors. We do not strictly enforce assumptions 1-4 of the previous section and instead explore their relative importance to our approach.
We propose to estimate the latent factor Z, associated with the hypothesized causal direction x → y, by solving the optimization problem Ẑ = argmin_Z L(Z), where the loss is defined as L(Z) = MSE(R_{x→y}) + ζ · nHSIC(R_{x→y}, (X, Z)) + η · nHSIC(X, Z) + ν · SMMD(Z). Here MSE is the mean square error computed from the model output residuals R_{x→y}, nHSIC is the normalized dependence measure of the previous section, and SMMD is the estimated MMD (Gretton et al 2006) between the distribution of Z and a standard Gaussian. We refer to this latent variable estimation method as LN.
Assuming that (1) the assumptions of section 2.1.3 ensure causal identifiability and that (2) the LN method provides a consistent way of finding f and Z in the causal direction, let us say x → y, then in this direction any dependence between the residuals R and (X, Z) will be due to errors in estimating f and Z. If LN is indeed consistent, these errors will disappear in the large sample limit. However, in the anti-causal direction, even in the large sample case, identifiability implies that there exist no f and Z that satisfy y = f(x, z) without violating assumptions 2-4. In this instance, we have a case of model misspecification. The intuition for relaxing the deterministic process assumption in order to determine the causal direction is that the impact of model misspecification will be more severe than that of estimation error, both in terms of MSE(R) (the size of the residuals) and of HSIC(R, X_a) (the dependence of the residuals).
The four terms in the loss function L(Z) relate to the first three assumptions made in the previous section; the fourth assumption is enforced by the choice of optimization technique. We now analyse how each of the four assumptions relates to the loss function terms, the choice of hyperparameters and the optimization techniques used.

Assumption 1. Deterministic process.
Since a deterministic process is assumed, there exists a Z such that r_i = y_i − f(x_i, z_i) = 0 for all i = 1, …, n, so we penalize solutions Z with large residuals using the mean square error. The exact Z will not be recovered, and so the residuals will not be zero. However, since the model is additive in the causal direction, we encourage the additivity of the model by penalizing solutions where the residuals depend on the cause variables (X, Z): the additional restrictions on Z implicit in the penalty terms are introduced to create an asymmetry between the two possible causal directions. As mentioned before, our method is not able to find a latent variable Z that makes the process deterministic. Rather, the idea is to make any residuals small and non-systematic. To estimate the dependence between the residuals and the cause variables, we use the normalized HSIC (nHSIC) (Gretton et al 2005b).

Assumption 2. Exogenous latent variable z.
A crucial part of creating the conditions for the asymmetry between the modelling directions to manifest is to ensure that the estimated variable Z brings something new relative to the variable being considered as the effect. This is made obvious by the trivial solutions Z_{x→y} = Y and Z_{y→x} = X, both of which have exactly zero mean square error and residuals independent of the cause variables. We use nHSIC to penalize solutions Z that are dependent on the variable playing the role of cause.
Assumption 3. Gaussian latent variable z. It is clear that we will only be able to estimate Z up to an unknown invertible transformation: for any f and Z such that y = f(x, z), and any invertible function g, the pair Z̃ = g(Z) and f̃(x, z̃) = f(x, g⁻¹(z̃)) fits the data equally well. Enforcing restrictions on Z, such as Gaussianity, will reduce this identifiability problem. Additionally, ensuring that Z is approximately Gaussian throughout the optimization process may have beneficial statistical properties. We include this penalty following Stegle et al (2010), and in section 4.2 and appendix C.5 we analyse its necessity from empirical results. We use MMD (Gretton et al 2006) to penalize the degree of non-Gaussianity by measuring the discrepancy to a standard normal distribution.
Assumption 4. Algorithmic independence. Assumption 4 states that the mechanism f should be algorithmically independent from the cause (c, z). This assumption relates to the modularity of any system generated from FCMs. If we intervene on the cause (c, z) by altering its distribution, the mechanism f would be unchanged. The distribution p(c, z) should contain no information about the causal mechanism f. Although we do not specifically penalize algorithmic dependence of p(c, z) and f, the stochastic gradient descent (SGD) optimization strategy that we implement will help break any possible dependence between p(c, z) and f: by sampling a subset of the data, the distribution changes somewhat at each step, disrupting the flow of information from p(c, z) to f. In appendix C.4, we use a more aggressive strategy to avoid algorithmic dependence between p(c, z) and f to explore the importance of this assumption within our framework.
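Putting the pieces together, the sketch below shows how the penalized objective could be assembled for the hypothesis x → y, reusing the helpers above and assuming f is fit by kernel ridge regression (KRR) on the augmented input with a product RBF kernel; ln_loss and the dictionary sig of lengthscales are illustrative choices, not the paper's implementation.

```python
def ln_loss(Z, X, Y, zeta, eta, nu, lam, sig):
    # Sketch of the LN objective for x -> y (cf. the loss L(Z) above).
    n = len(Y)
    K = rbf_gram(X, sig['x']) * rbf_gram(Z, sig['z'])    # product kernel on (x, z)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), Y)  # KRR fit of f
    R = Y - K @ alpha                                    # residuals y_i - f(x_i, z_i)
    XZ = np.column_stack([X, Z])
    return (np.mean(R ** 2)                              # assumption 1: small residuals
            + zeta * nhsic(R, XZ, sig['r'], sig['xz'])   # assumption 1: non-systematic residuals
            + eta * nhsic(X, Z, sig['x'], sig['z'])      # assumption 2: exogenous z
            + nu * smmd(Z, sig['z']))                    # assumption 3: Gaussian z
```

Assumption 4 is then left to the stochasticity of the optimizer, as discussed above.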
Notice that the regression framework proposed here cannot be used for prediction, since we have not parameterized the latent variable Z and so cannot estimate it for out-of-sample data. However, it is useful in this causal discovery setting and may be useful for discovering unobserved factors. Since we cannot perform prediction, regularization and hyperparameter selection cannot be done in the classical way using some form of expected risk surrogate, such as cross-validated MSE. Instead, we focus on the degree to which the assumptions are fulfilled across different choices of the penalization coefficients ζ, η and ν, the smoothing penalization coefficient λ, and the kernel parameter σ. How one should aggregate the asymmetry measures over the hyperparameter grid in both directions to obtain one decision function is not straightforward.

LN causal discovery algorithm
Before going into more detail, let us outline the general bivariate discovery method, LNc:
1. Obtain latent Z using LN. For each θ = (ζ, η, ν, λ) in a chosen parameter grid G of size q, estimate Z^θ_{x→y} and Z^θ_{y→x}.
2. Fit ANMs on the augmented input X_a. Use Z^θ_{x→y} and Z^θ_{y→x} to fit an ANM in each direction, obtaining residuals R^θ_{x→y} and R^θ_{y→x} respectively.
3. Score each model. For every grid point and direction, compute the dependence of the residuals, nHSIC(R, X), and the entropy of the model, H(X_a) + H(R).
4. Aggregate into a causal prediction. Combine the 2q scored models through weighted majority voting (detailed in appendix A).
Notice that the aggregation procedure (step 4) allows us to map 2q models, one for each hyperparameter grid point and causal direction, into one decision measure. In classical regression, cross-validation would be used to choose one of the q models in each direction. Instead, we use an ensemble approach: weighted majority voting, with the weights a function of the dependence of the residuals, nHSIC(R, X), and the votes a function of the entropy of the model, H(X_a) + H(R); a simplified sketch follows this paragraph, and the full procedure is detailed in appendix A. The pair-up and weighting scheme serves several purposes: it provides a parametrization that allows us to test the importance of the identifying assumptions, a causal classifier robust to outliers and to the hyperparameter grid selection, and an ensemble of classifiers that exploits different sources of causal signal. This is further explored in appendix A. Hoyer et al (2009) proved that, under the assumption that the data follow an ANM, the causal direction is generically identifiable (certain non-identifiable exceptions exist, such as linear-Gaussian models). Mooij et al (2016) showed that using the dependence of the residuals is a consistent method for finding the true causal direction under the ANM assumption, and that using the entropy of the model is equivalent: with the former, the causal prediction corresponds to the least dependent residuals; with the latter, to the model with least entropy. An additive model is assumed in both cases, so latent factors are not considered. We also use these measures to identify which of the augmented datasets is closer to an additive regime.
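The following is a simplified, unweighted stand-in for steps 3-4 (the actual method uses the weighted scheme of appendix A); entropy_score and majority_vote are names chosen here, and approximating the joint entropy of the augmented cause by marginal entropies is an assumption of the sketch.

```python
from scipy.stats import differential_entropy

def entropy_score(Xa, R):
    # Entropy-of-model score (Mooij et al 2016): H(X_a) + H(R), lower is better;
    # the entropy of the augmented cause is approximated marginally here.
    Xa = Xa.reshape(len(Xa), -1)
    return sum(differential_entropy(c) for c in Xa.T) + differential_entropy(R)

def majority_vote(scores_xy, scores_yx):
    # Unweighted vote over the 2q grid models: each theta votes for the
    # direction whose augmented ANM has lower entropy.
    votes = [sxy < syx for sxy, syx in zip(scores_xy, scores_yx)]
    return 'x -> y' if np.mean(votes) > 0.5 else 'y -> x'
```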

Time series extension
To explore and illustrate the use of LNc on time series, we consider a simplified setting where we assume that either time series x(t) causes y(t) or vice versa. We thus exclude the possibility that a current observation is the effect only of its own past, as well as the feedback scenario where both time series cause each other. To take into account that the data are time-ordered, we add an extra regularization term to our loss function, nHSIC(R, T), where T contains the time-step indices of the observations. This encourages Z solutions that result in residuals with low auto-dependence. We refer to this version of the algorithm as LN causality for time series (LNc-ts). As before, we used the joint kernel k((x, z), (x′, z′)) = k(x, x′)k(z, z′), with k the Gaussian RBF kernel function. We also show results with the additive kernel (LNc-ak-ts), which we hypothesize will work better for additive data. Besides, we use a modified definition of the entropy of the model, H(X_a) + H(R^{(1)}, …, R^{(L)}), where the second term is the joint entropy of L lagged copies of the residuals; a sketch of both modifications follows. We fixed the maximum number of lags in our experiments to 5 for simplicity.
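A minimal sketch of the extra term and of the lagged-residual entropy, reusing the helpers above; the exact form nHSIC(R, T) of the regularizer and the marginal approximation of the joint entropy are assumptions of this illustration.

```python
def time_regularizer(R, rho, sig_r, sig_t):
    # Penalize dependence of the residuals on the time index, a proxy for
    # residual auto-dependence in the time-ordered setting.
    T = np.arange(len(R), dtype=float)
    return rho * nhsic(R, T, sig_r, sig_t)

def lagged_entropy(R, L=5):
    # Joint entropy of L lagged copies of the residuals, approximated here
    # by the sum of marginal entropies of the lagged columns.
    cols = np.column_stack([R[L - k: len(R) - k] for k in range(L)])
    return sum(differential_entropy(c) for c in cols.T)
```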

Additivity hypothesis test
As we will see, the relative additivity will play an important role in our analysis and the final proposed method. For this reason, we define an additivity hypothesis test for bivariate datasets and an accompanying empirical p-value calculating procedure.
Under the null hypothesis of additivity, we expect the residuals r_i = y_i − f(x_i) ∈ R_{x→y} resulting from fitting an ANM to the pairs (x_i, y_i) to be independent of x_i, and thus nHSIC(X, R_{x→y}) = 0 in the large sample case. Because the true causal direction is unknown, we take as our test statistic the best case for additivity, namely S = min{nHSIC(X, R_{x→y}), nHSIC(Y, R_{y→x})}. We empirically construct the distribution under the null by calculating the statistic on randomly generated datasets from the AN model class (see section 3.1), which we know to be additive. We also tried using the standard permutation procedure (on the same data as is used to calculate the test statistic) to generate a null distribution, but we found the resulting p-value to be less discriminative. We do not use this additivity test in our proposed causal algorithm, only to illustrate how the algorithm's performance varies as a function of additivity.
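A sketch of the empirical p-value computation, reusing the earlier helpers: anm_residual_nhsic implements the ANM fit plus residual-dependence score with a KRR stand-in for the paper's regression, and sample_an_pair stands for a hypothetical generator of known-additive AN datasets (section 3.1); neither name comes from the paper.

```python
def anm_residual_nhsic(c, e, lam=1e-2):
    # Fit an ANM e = f(c) + r by KRR and return nHSIC(c, r).
    sig = np.median(np.abs(c[:, None] - c[None, :])) + 1e-12  # median heuristic
    K = rbf_gram(c, sig)
    r = e - K @ np.linalg.solve(K + lam * len(e) * np.eye(len(e)), e)
    return nhsic(c, r, sig, np.std(r) + 1e-12)

def additivity_pvalue(x, y, sample_an_pair, n_null=200):
    # Best-case-for-additivity statistic; null built from synthetic AN data.
    stat = lambda a, b: min(anm_residual_nhsic(a, b), anm_residual_nhsic(b, a))
    s_obs = stat(x, y)
    s_null = np.array([stat(*sample_an_pair(len(x))) for _ in range(n_null)])
    return float(np.mean(s_null >= s_obs))  # small p-value: reject additivity
```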

Time series
In the case of time series data, we can no longer expect the residuals r_i = y_i − f(x_i) ∈ R_{x→y} resulting from fitting an ANM to the pairs (x_i, y_i) to be independent of x_i, since the unaccounted-for auto-correlation in z_i will induce correlation in r_i. However, for the augmented pairs ((x_i, z_i), y_i), we do expect the residuals R^Z_{x→y} resulting from fitting an ANM to be independent of x_i, with nHSIC(X, R^Z_{x→y}) = 0 in the large sample case. We use the superscript Z for the residuals to emphasize that, in this case, we model y on both x and z. Although we do not observe z, we can use our estimate Z instead. As will be seen when we discuss implementation in section 4, in finding Z we use a product kernel to model f(x, z), which allows for non-additive relationships. However, to test for additivity, we use the summation kernel k((x, z), (x′, z′)) = k(x, x′) + k(z, z′) to exclude non-additive relationships from the null case. As before, we take as our test statistic the best case for additivity, namely S = min{nHSIC(X, R^{Z}_{x→y}), nHSIC(Y, R^{Z}_{y→x})}. We construct the null distribution empirically by calculating the above statistic on randomly generated datasets.

Data
We now introduce the data used for testing the methods described in the previous section.

The i.i.d. case
We use data introduced in Mooij et al (2016) and Tagasovska et al (2020) to test our method, which includes additive and non-additive data. The SIM-x datasets (SIM, SIMc, SIMG and SIMln) are bivariate datasets where the effect is generated by a Gaussian process (GP) whose covariance function depends on both the noise and the cause variable. The model is technically non-additive; however, because of the distribution from which the lengthscale is sampled, the resulting datasets are often close to additive. SIMc involves an additional confounding noise variable that affects both cause and effect. SIMG has an approximately Gaussian cause distribution, in contrast with the irregular cause distributions of the rest. SIMln is a dataset with a low-noise regime. Following Tagasovska et al (2020), we also use additive-noise (AN), location-scale (LS) and multiplicative-noise (MN-U) datasets, together with their sigmoidal variants (AN-s, LS-s). The latter functions, being injective, generate harder causal discovery tasks. All the above artificial data have 1000 observations per dataset and 100 datasets per class. Finally, for real data we use the Tübingen (Tub) benchmark (Mooij et al 2016), consisting of 108 pairs from various domains, of which we consider the 102 that are bivariate datasets. We use the weights provided by the authors of this benchmark, intended to give equal weight to different sources of data. Figure 1 (left) shows scatter plots of one example from each data class. Figure 2 (left) shows the p-value (in log scale) of the additivity test applied to all 1002 datasets. The additivity p-value separates known additive datasets (AN, AN-s) well from known non-additive datasets (LS, LS-s, MN-U). We also note that the SIM-x datasets are mixed with a bias toward additivity, while the Tub datasets are mixed with a bias toward non-additivity.

Time series data
We generate time series data using some of the model classes from Tagasovska et al (2020), namely AN-s, MN-U and LS-s, with the difference that x and z are generated not as i.i.d. data but as random noisy functions of time. We chose these classes since they represent the most extreme cases of additivity and non-additivity in our i.i.d. setting. In this case, y is an instantaneous effect of x and z under different additive (AN-s) and non-additive (LS-s, MN-U) relationships. Although this setting does not include lagged effects, the auto-correlation or auto-dependence of the time series is present and makes the time series problem difficult.
As real data, we focus on geoscience data and include time series of radiation and gross primary production (GPP). The uptake of atmospheric carbon dioxide by vegetation through photosynthesis is commonly referred to as GPP and is the largest carbon flux in the carbon cycle. As such, in the true generating physical mechanism, radiation causes GPP; however, there are other non-observed causes, such as precipitation, which interact in complex ways, meaning these datasets are a priori likely to be non-additive. Both GPP and radiation are global gridded products, collected and curated in the Earth System Data Lab (ESDL). As in Díaz et al (2022), we use the FLUXCOM GPP (remote sensing) dataset, which is the result of upscaling flux tower measurements using multiple machine learning algorithms with satellite data as input. GPP is measured in gC m⁻² d⁻¹, and the product spans 2001 to 2012, with a spatial resolution of 5 arc-minutes and a temporal resolution of 8 days. We use the incoming surface shortwave radiation data (RAD) of the Japan Aerospace eXploration Agency (JAXA) Satellite Monitoring for Environmental Studies product for the 2001-2015 period (available at ftp://suzaku.eorc.jaxa.jp/pub/GLI/glical/Global_05km/repro_v6/). Spatial and temporal averaging was conducted by converting the original 5 km grid to 0.0833° grids and the daily data to 8-day temporal resolution. Missing data in the original 5 km product were replaced by mean daily values of the available years. After harmonization, the GPP and RAD datasets share a common spatio-temporal grid of 0.25° in space and 8 days in time, spanning 11 years from 2001 to 2011. In Díaz et al (2022), the causal discovery methodology robust convergent cross mapping (rCCM) was presented, which combines information geometric causal inference (IGCI) (Janzing and Schölkopf 2010) and CCM (Sugihara et al 2012a). In that work, the method was applied globally and, in most cases, radiation was inferred to be the cause; however, there were certain locations where the reverse relationship was wrongly found. In total, we sample 300 locations globally and obtain radiation and GPP time series spanning 504 time steps (11 years of 8-day observations). The sampling was stratified so that 150 of the locations correspond to places where rCCM found radiation to be the cause (RAD-GPP-r) and 150 to places where rCCM found GPP to be the cause (RAD-GPP-g). Figure 1 (right) shows an example from each data class.
Figure 2 (right) shows the p-value (in log scale) of this additivity test applied to all 600 datasets. The additivity p-value somewhat separates known additive datasets (AN-s) from known non-additive datasets (LS-s, MN-U). The separation is not as clear as for the i.i.d. data, partly because we use the noisy estimate Z of z to calculate our test statistic, and possibly because, for certain datasets, either x or z dominates so much that the mechanism is rendered close to additive. In any case, as we will see when we look at the causal algorithm's performance as a function of additivity, this p-value will be useful in identifying which causal discovery method to use. We also note that the RAD-GPP datasets show a wide-ranging degree of additivity. The p-value distributions resulting from naively applying the i.i.d. additivity test to time series data are shown in a lighter red trace: all datasets are deemed non-additive in this case.

Experiments
This section gives empirical evidence of the performance of the proposed method in several benchmark synthetic and real datasets. Results are compared to standard methods in the literature.

Results on i.i.d. data
We compare against benchmark accuracy results for 17 state-of-the-art methods: the 15 methods available from Tagasovska et al (2020) on these benchmark datasets, plus two further relevant bivariate causal discovery methods, one because it is a latent space method and the other because it is a kernel method. We optimized the loss of section 2 using stochastic gradient descent (SGD). For each dataset and causal direction, we used 500 optimization steps, and at each step we sampled 100 points with replacement. We performed 7 repetitions of the SGD procedure. The solution Z is initialized as the target variable, i.e. as Y for x → y and as X for y → x. To reduce the number of hyperparameters, we use the product kernel k((x, z), (x′, z′)) = k(x, x′)k(z, z′), where k is the Gaussian RBF kernel. We used the median heuristic for the x-kernel lengthscale σ_x. For the z-kernel lengthscale, instead of using only the median we used 7 different quantiles, one per repetition, corresponding to the percentiles {0.1, 0.3, 0.5, 0.7, 0.9, 0.99, 1}. A hyperparameter grid of 72 points was used, corresponding to the Cartesian product of the following sets: θ = (ζ, η, ν, λ) with ζ ∈ {0.1, 1.0, 10.0}, η ∈ {0.1, 1.0, 10.0}, ν ∈ {0, 10.0} and λ ∈ {0.001, 0.01, 0.1, 1.0}. Additionally, a constraint was introduced in the optimization problem that penalizes solutions that are large and/or independent of the target variable. More details are given in appendix D.
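For reference, the 72-point grid can be enumerated directly (an illustrative sketch; the variable names are ours):

```python
from itertools import product

# 3 x 3 x 2 x 4 = 72 grid points theta = (zeta, eta, nu, lam), as in the text.
grid = [dict(zeta=z, eta=e, nu=n, lam=l)
        for z, e, n, l in product([0.1, 1.0, 10.0],
                                  [0.1, 1.0, 10.0],
                                  [0.0, 10.0],
                                  [0.001, 0.01, 0.1, 1.0])]
assert len(grid) == 72
```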
We show accuracy results here, while in appendix C we show the AUC of the ROC and PR curves. Table 1 shows accuracy results for LNc and ANMh, where we use kernel ridge regression (KRR) as our regression technique (Shawe-Taylor and Cristianini 2004). ANMh corresponds to fitting additive models as in Hoyer et al (2008) but scoring residuals with the model entropy function as in Mooij et al (2016).
As we can see, the additive method ANMh performs well on additive datasets AN and AN-s, while the non-additive method LNc performs well on non-additive data. Additive methods perform better on SIM-x data confirming our observation, based on additivity p-values, that these datasets are biased toward additivity. Figure 3 gives a more detailed description of how model performance varies depending on the degree of additivity of the data.
In general, LNc performance is higher for non-additive data and ANMh performance is higher for additive data. We also note that the ANMh method is surprisingly robust to the type of non-additivity displayed in the SIM-x models. We now define a criterion for combining additive and non-additive methods, namely improved additivity (denoted hsicx in the results tables): datasets where S_LNc = min{nHSIC(X, R^{LNc}_{x→y}), nHSIC(Y, R^{LNc}_{y→x})} < min{nHSIC(X, R^{ANMh}_{x→y}), nHSIC(Y, R^{ANMh}_{y→x})} = S_ANMh are considered non-additive and are assigned the LNc prediction, while the rest are assigned the ANMh prediction. The diagram in figure 4 shows accuracy results for the resulting combined method, LNc-ANMh. As figure 4 and table 2 show, LNc-ANMh is superior to the other latent variable methods, GPI and CAREFL, and is competitive with the state of the art on these benchmark datasets. Table 3 shows accuracy results for our LNc and LNc-ANMh methods, as well as the ANMh method, aggregated over additive and non-additive datasets. Additive sets are those where the additivity test p-value p > 0.001.

Gaussianity assumption along the additivity spectrum
In this section, we look at the role of enforcing Gaussianity on the performance of the LNc algorithm for data with different degrees of additivity. To this end, we generated 300 datasets, each with 1000 data points, from the random variable Y = αX + (1 − α)W, where X ∼ AN-s and W ∼ LS-s. Since AN-s corresponds to an additive data model and LS-s to a non-additive one, the parameter α ∈ [0, 1] modulates the level of additivity in the data (α = 1 corresponds to additivity and α = 0 to high non-additivity). The resulting additivity p-values across the 300 datasets range down to e⁻¹⁰⁰. We denote this set of datasets ANLS-s. We applied the LNc methodology in two ways: 1. LNc-gauss, using the hyperparameter grid of size 72 described in appendix D.3, where the parameter controlling the degree of Gaussianity enforcement is ν ∈ {0.0, 10.0}; 2. LNc-free, using the hyperparameter grid of size 36 described in appendix D.3, where ν = 0.0. Figure 5 shows the AUC of these two variants for different levels of additivity over these 300 datasets: when we include models where Gaussianity has been enforced (LNc-gauss), performance deteriorates as the data become more additive, at a much faster rate than when Gaussianity is not enforced (LNc-free). Not enforcing Gaussianity leads to better performance. Is this generally true? In the following, we gather more evidence for this hypothesis by performing the same analysis on the data introduced in section 3.1. Figure 6 shows the AUC performance by data model class and degree of additivity when enforcing Gaussianity (LNc-gauss) and when not (LNc-free). The LNc-free performance shows a milder deterioration as the degree of additivity increases, and for all the model classes considered, not enforcing Gaussianity leads to better performance, especially for the more additive datasets. In general, we conclude that not enforcing Gaussianity leads to better performance of the LNc method.

Importance of identifying assumptions
Not all assumptions are equally important. In the following, we describe our qualitative findings on each assumption's relative importance, providing references to the empirical evidence.
• Deterministic. We have implemented this assumption as requiring that the estimated (augmented) ANM be the one giving the least importance to the residuals; in other words, the residuals should be as small and as non-systematic as possible. The good empirical performance of the LNc method, especially for non-additive data, supports this choice. However, this seems to be due more to the non-systematic-errors component than to the small-errors component. This is shown in appendix A.2 (figures 13 and 14, top panel): nHSIC(R, X) makes a significant contribution to the accuracy of the LNc method for all data classes, while MSE(R) does not. This indicates that one may allow for additive and non-additive sources of random noise (while still assuming deterministic functions, as per the FCM paradigm). On the other hand, the relatively weaker performance on the SIM-x data classes, which are generated with both additive and non-additive noise (as well as observation noise on the cause), suggests that the method's performance deteriorates under these conditions.
• Exogenous z. Some exogeneity is needed, as the counterexample Z_{x→y} := Y and Z_{y→x} := X shows: in that case all residuals would be zero and the augmented cause variables would be (X, Y) in both directions, implying the same model entropy. Hyperparameter sensitivity (appendix C.3, figure 19); analysis of nHSIC(X, Z) for the model ensembles (appendix C.6, figure 23); analysis of pair-up and weighting using nHSIC(X, Z) as a criterion (appendix A.2, figure 11); and analysis of penalized model entropy (appendix B, figures 15-18) did not show it to be critical for non-additive data. There is evidence that for some types of additive models (Gaussian, high-variance, AN-s, AN, SIMG), forcing exogeneity or penalizing non-exogeneity helps to identify the correct model. This is seen in appendix B (figures 15-18), where we look at a version of model entropy that penalizes a lack of exogeneity. It is also supported in appendix A.2 (figure 20), where including nHSIC(X, Z) as a weighting variable in the majority voting scheme contributes significantly to better accuracy, and in appendix C.3 (figure 12), where selecting subsets of the hyperparameter grid that penalize nHSIC(X, Z) more severely also improves accuracy.
• Gaussian z. The empirical evidence in section 4.2 (figures 5 and 6) shows that enforcing Gaussianity is unnecessary and may be detrimental, especially for additive data. The sensitivity analyses of appendices A.2 and C.3 confirm this: using SMMD(Z) in the pair-up or weighting scheme did not contribute significantly to accuracy (figures 11 and 12), and enforcing Gaussianity also led to poor results for non-additive data (figure 19).
• Algorithmic independence of f and p(x). The empirical evidence in appendix C.4 (figure 21) did not show this to be a critical assumption at all, at least with respect to the particular data and learning methods used.

Method robust to small samples
A question arises about the consistency of the causal estimator and about how stable the decisions are as a function of the sample size. Figure 7 (left) shows that our LNc algorithm has good empirical consistency for the three non-additive data classes: with 250 or more observations, the algorithm achieves 95% accuracy or better across the three classes, and with as few as 50 observations it achieves 85% accuracy or better. The LNc method relies on an ensemble of models indexed by the hyperparameter vector θ. Figure 7 (right) shows, as a baseline, the proportion of models in this ensemble that correctly infer the causal direction, by data class. Since the LNc methodology uses this ensemble to obtain a causal prediction, these results are consistent with the LNc accuracy results, which are poor for additive data classes and good for non-additive ones.

On the accuracy of the latent factor estimation
We further evaluate how well the proposed method identifies/estimates the latent factor. Figure 8 shows the dependence between the true latent factor Z and the estimated one Ẑ, as measured by the normalized HSIC. The figure illustrates that there is a greater dependence between the real and estimated Z in the true direction (x → y), with the gap being greater for the non-additive data classes (LS-s and MN-U). Interestingly, when we separate nHSIC(Ẑ, Z) not only by causal direction but also by the right/wrong outcome of the entropy criterion, we see that for non-additive models the estimation of Z is good whether or not it led to a correct causal prediction, whereas for additive data a good estimate of Z leads to correct causal classification. This indicates that estimating Z is harder for additive data, which is why LNc performs poorly on this data.

Results on time series data
We compare against the non-linear (kernel-based) generalization of Granger causality (GC) (Bueso et al 2019) and against naive implementations of ANM and LNc, i.e. ones that do not take into account that the data are time-ordered. For Granger causality, we adapt the method in two ways: 1. since in our generated data y is the instantaneous effect of x and z, and GC uses lagged variables as inputs, we lag y by one time step so that the 'past' actually includes the corresponding contemporaneous observation; and 2. as our GC causal criterion, we define S^GC_{x→y} = MSE(R_x) + MSE(R_{x→y}), where R_x are the residuals resulting from modelling x on its own past and R_{x→y} are the residuals resulting from modelling y on the past of x and y. A sketch of this criterion follows.
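As an illustration, the sketch below implements this adapted criterion with a linear ridge stand-in for the kernel regression used in the paper; lagged, mse_residuals and gc_score are names chosen here, and including the contemporaneous x in the effect model is how we realize adaptation 1 above.

```python
def lagged(v, L=5, include_current=False):
    # Design matrix whose row t holds (v[t],) v[t-1], ..., v[t-L].
    start = 0 if include_current else 1
    return np.column_stack([v[L - k: len(v) - k] for k in range(start, L + 1)])

def mse_residuals(P, target, lam=1e-2):
    # Linear ridge stand-in for the kernel regression used in the paper.
    w = np.linalg.solve(P.T @ P + lam * np.eye(P.shape[1]), P.T @ target)
    return np.mean((target - P @ w) ** 2)

def gc_score(x, y, L=5):
    # S_x->y = MSE(R_x) + MSE(R_x->y); the effect's inputs include the
    # contemporaneous x (adaptation 1 above). Lower score favours x -> y.
    past_x = lagged(x, L)
    past_xy = np.column_stack([lagged(x, L, include_current=True), lagged(y, L)])
    return mse_residuals(past_x, x[L:]) + mse_residuals(past_xy, y[L:])
```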
The maximum number of lags used to represent the past is five. Apart from the extra regularizer, implementation details are the same as in section 4.1 and appendix D, except that we used only one value for the hyperparameter related to SMMD(Z) (ν = 0) to cut down on computation time. Table 5 shows accuracy results for our additive (LNc-ak-ts, additive kernel) and non-additive (LNc-ts, product kernel) time series methods. As we can see, LNc-ts performs well on the non-additive datasets, while LNc-ak-ts performs reasonably on the additive datasets. Figure 9 gives a more detailed description of how method performance varies with the degree of additivity of the data. The non-additive LNc-ts method behaves as we would expect, with performance increasing dramatically as the data become less additive. The additive LNc-ak-ts method shows more erratic behaviour; however, for certain ranges closer to the additive end of the spectrum, it does behave as expected, i.e. better performance on more additive data. We use an analogous criterion for combining additive and non-additive methods to the one used for i.i.d. data, namely improved additivity (denoted hsicx in the results tables): datasets where S_LNc-ts = min{nHSIC(X, R^{LNc-ts}_{x→y}), nHSIC(Y, R^{LNc-ts}_{y→x})} < min{nHSIC(X, R^{LNc-ak-ts}_{x→y}), nHSIC(Y, R^{LNc-ak-ts}_{y→x})} = S_LNc-ak-ts are considered non-additive. The combined method is referred to as LNc-ts-LNc-ak-ts. As figure 10 and table 6 (accuracy of LNc-ts, LNc-ak-ts, LNc-ts-LNc-ak-ts and benchmarks by data class) show, all LNc variants are superior to the benchmark methods on the non-additive datasets (LS-s-ts, MN-U-ts) and comparable on the radiation-GPP datasets (RAD-GPP-r and RAD-GPP-g).

Reproducibility and data availability
All code and data used to carry experiments are available at https://github.com/IPL-UV/latentNoise_krr.

Discussion and conclusions
Conditions for identifiability in the bivariate causal problem under the additivity assumption are quite general and well established. Additionally, methods that infer the causal direction assuming additivity have good empirical performance when the assumption is fulfilled. For non-additive data, conditions of similar generality have not been established. However, non-additivity in the data may be a sign of a lack of causal sufficiency, making it an important data class to tackle for causal discovery, since one seldom observes all meaningful variables in a system. Some methods have shown good empirical performance across a wide range of non-additive causal mechanisms; however, latent-variable estimation approaches are not among these state-of-the-art approaches. Latent-variable estimation approaches are desirable because they allow us to augment the data and achieve causal sufficiency, help in dealing with structured data such as time series and, most importantly, allow us to learn more about the underlying system.
In this work, we proposed a latent-variable estimating method and enforced non-strict assumptions that allow for identifiability. We use additivity in the augmented system as a causal signal. Other latent-variable methods may not achieve good performance because they do not focus on this signal: they either enforce the deterministic assumption strictly (GPI; Stegle et al 2010) or use loss functions that do not minimize non-additivity (CAREFL; Khemakhem et al 2021). We provided empirical evidence of the method's performance on benchmark datasets: it proved superior to other latent-estimating methods and comparable to the state of the art. We also extended the method to time series and illustrated its competitive performance in this scenario with artificial benchmark data. Finally, we showed that the proposed method allows learning latent functions and exhibits appealing empirical properties such as robustness and consistency.
We also analysed the relative importance of each identifying assumption. We found that the deterministic-mechanism assumption may be relaxed to allow for non-additive and additive sources of randomness. Exogeneity of z need not be strict, especially for non-additive data. Gaussianity of z is not necessary for either additive or non-additive data. The algorithmic independence of f and p(x) does not seem important, probably because ICM relies more on the exogeneity assumption. This suggests that by relaxing the enforcement of assumptions, particularly the deterministic mechanism and exogenous z, we can explore the space of plausible models more efficiently. Under the ICM assumption, if we enforce the assumptions and find the optimal model under each causal direction, the complexity of the causal model will be lower. However, if the optimal model in each causal direction is not found, then comparing the complexities of sub-optimal models may be spurious. Our findings suggest that by relaxing assumptions we improve estimation without compromising the level of assumption fulfilment necessary to generate the asymmetry in complexity.
There are still many facets of the method that may be further explored. Possible lines of future research include extending this framework to multivariate causal discovery in the i.i.d. case and, for time series, to a more general setting that includes self-causation and feedback mechanisms. Our results suggest that the proposed method can be used for causal discovery even when the causal-sufficiency assumption is not met. We anticipate wide use and further theoretical development of the proposed methodology.

Data availability statement
The data that support the findings of this study are openly available at the following URL/DOI: https://github.com/IPL-UV/latentNoise_krr.

A.2. Sensitivity to parameterization
In this section, we show that the majority voting scheme used in the LNc method is quite robust to different parameterizations of the pair-up and weighting procedures. Figure 11 shows that, for non-additive data, as long as we use a lengthscale of 1 or lower (a log-lengthscale of 0 or lower), the weighted majority voting algorithm is robust. Figure 12 shows the same for additive data: the algorithm is robust for lengthscale values of 1 or lower, with a drop-off in accuracy for larger values, which concentrate the weights on fewer models.

Figure 13 shows the contributions to the accuracy of the LNc method that each of the four variables considered for the pair-up or weighting procedures, nHSIC(X, Z), nHSIC(R, X), SMMD(Z) and MSE(R), would make if used as such. To measure the contribution of each variable to the pair-up procedure, we compute the accuracy of our weighted majority voting procedure while varying the set of variables used within the pair-up step, trying every subset in the power set of the four variables. The contribution to accuracy of a given variable is then defined as the median accuracy over all subsets that include the variable minus the median over all subsets that exclude it (sketched in code below). We performed an equivalent procedure to obtain each variable's contribution to accuracy in the weighting procedure. The results show that, for non-additive data, nHSIC(R, X) is the only variable with positive contributions across all data classes for the pair-up procedure. The contributions of the weighting variables are close to zero across the board, confirming the results of figure 11, which indicate that equal weighting of models (very low RBF lengthscale) leads to the best results. Despite this, we retain some weighting in our algorithm, since with other data or hyperparameter grids it may be more important to exclude bad model pairs from consideration. Figure 14 shows that, for additive data, nHSIC(R, X) is likewise the only variable with positive contributions across all data classes for the pair-up procedure. The contributions of the weighting variable nHSIC(X, Z) are significant for two of the additive classes shown, confirming the observations of section 4.2 that, for certain additive classes, penalizing a lack of exogeneity may be beneficial.
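A minimal sketch of this contribution measure; accuracy_of is a hypothetical routine that runs the weighted majority voting scheme with a given subset of variables in the pair-up step and returns the resulting accuracy.

```python
from itertools import combinations
from statistics import median

VARIABLES = ["nHSIC(X,Z)", "nHSIC(R,X)", "SMMD(Z)", "MSE(R)"]

def contributions(accuracy_of):
    """accuracy_of: maps a frozenset of variable names (used in the pair-up
    step) to the weighted-majority-voting accuracy obtained with that subset."""
    subsets = [frozenset(c) for r in range(1, len(VARIABLES) + 1)
               for c in combinations(VARIABLES, r)]
    acc = {s: accuracy_of(s) for s in subsets}
    # median accuracy over subsets including v, minus median over those excluding v
    return {v: median(a for s, a in acc.items() if v in s)
              - median(a for s, a in acc.items() if v not in s)
            for v in VARIABLES}
```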

A.3. Role of weighted majority voting
The weighted majority voting scheme described in this section serves several purposes:

1. A way to test the importance of the identifying assumptions. To start with, it is not immediately clear in this context how one should carry out hyperparameter selection. In a standard setting, one would choose the hyperparameters with the best out-of-bag loss. Here, however, it is impossible to test the model out-of-bag, because we estimate Z directly instead of parameterizing it. We therefore have an ensemble of models in each direction, which we want to compare to obtain the final causal-discovery classifier. One could simply use majority voting, but that ignores the fact that some models fulfil the identifying assumptions to a greater degree than others. Our approach is to test the impact of the different assumptions on performance by using them within the pair-up and weighted voting scheme. The pair-up part is a way to control for different levels of assumption fulfilment: if we use nHSIC(X, Z) in the pair-up function, for example, we ensure that when we compare model entropy between two models, we are comparing models with a similar level of nHSIC(X, Z). The weighting part of the scheme allows one to give more weight to model pairs with a higher level of assumption fulfilment.

2. A robust causal classifier. Out of the four measures (and their combinations) used to check the fulfilment of assumptions 1-3 (nHSIC(X, Z), nHSIC(X, R), MSE(R) and SMMD(Z)), we obtained the best results using nHSIC(X, R) for both the pair-up and the weighting (see the sketch after this list). Figures 13 and 14 show that the pair-up function provides most of the contribution to predictive power, not the weighting scheme. Indeed, the sensitivity analysis of the weighting parameter (the RBF lengthscale), shown in figures 11 and 12, shows that the best results are obtained for lower lengthscales, corresponding to more homogeneous weights. We believe that the final pair-up and weighting scheme selected lends robustness to the final classifier. Our reasoning is the following:
• Premise 1: nHSIC(X, R) and H(X_a) + H(R) (model entropy) are roughly equivalent: lemma 10 in (Mooij et al 2016) shows that using the Shannon information of cause and residuals is equivalent to using model entropy.
• Premise 2: the mean model entropy of the ensemble of models, in each direction x → y and y → x, is a good proxy for the model entropy of the KC-optimal model in each direction.
• Under premises 1 and 2, and since we are only interested in which model entropy is larger (not in its magnitude), ordering the ensembles by nHSIC(X, R) and then comparing model entropy element by element is a good approach: it is more robust than comparing the ensemble means of model entropy. We are, in effect, comparing quantiles of the two empirical model-entropy distributions, which is roughly equivalent to using the median. Why not simply use the median? Because the pair-up and weighting scheme also contributes to points 1, 3 and 4.

3. An ensemble of classifiers that exploits different sources of causal signal. It is possible that, in using nHSIC(X, R) as a proxy for I(X, R) and estimating entropy with k-nearest neighbours (see appendix D.5), we use two types of estimators that are good in different settings and complement each other. Moreover, we use nHSIC(X, R) in the majority voting scheme and not nHSIC(X_a, R), which, as a proxy for I(X_a, R), would be the actual equivalent of the model entropy H(X_a) + H(R). We also tried using nHSIC(X_a, R) but obtained slightly better results with nHSIC(X, R). This means that the pair-up and weighting criteria concentrate on non-systematic errors with respect to X, while the model entropy does so with respect to X_a. This also contributes to an ensemble of classifiers that exploit different information sources.

4. Robustness to hyperparameter-grid selection. The pair-up and weighting scheme protects against a bad hyperparameter selection in case certain values produce very bad models (for example, models that violate the assumptions to a large degree).
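A minimal sketch of the pair-up and weighting logic referred to in point 2 above, assuming per-model entropies and nHSIC(X, R) values are available for both directional ensembles; the exact weighting function is an illustrative choice, stated as an RBF kernel on the pair's assumption-fulfilment level.

```python
import numpy as np

def vote_direction(ent_xy, ent_yx, hsic_xy, hsic_yx, lengthscale=1.0):
    """Pair models across directions by their rank in nHSIC(X, R), compare
    model entropies element-wise, and weight each pair by how well it
    fulfils the independence assumption."""
    e_xy = np.asarray(ent_xy)[np.argsort(hsic_xy)]
    e_yx = np.asarray(ent_yx)[np.argsort(hsic_yx)]
    level = 0.5 * (np.sort(hsic_xy) + np.sort(hsic_yx))   # pair fulfilment level
    w = np.exp(-level**2 / (2 * lengthscale**2))          # RBF weights on that level
    votes = np.where(e_xy < e_yx, 1.0, -1.0)              # lower entropy = simpler model
    return "x -> y" if np.sum(w * votes) > 0 else "y -> x"
```

Note that a very large lengthscale concentrates weight on the most assumption-fulfilling pairs, while a small one approaches equal weighting, matching the sensitivity behaviour seen in figures 11 and 12.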

Appendix B. Penalized model entropy
In this section, we look at the role of enforcing exogeneity on the performance of the LNc algorithm for data with different degrees of additivity. To this end, we use the data presented in section 4.2. To get a better idea of why, even for the LNc-free version (Gaussianity not enforced), performance deteriorates with the degree of additivity, figure 15 shows the AUC performance for each of the components of the entropy of the model. Recall that the entropy of the model is

$H(X_a) + H(R) = H(X) + H(Z) - I(X, Z) + H(R),$

so its components are H(X), H(Z), I(X, Z) and H(R). We use nHSIC(X, Z) as a proxy for the Shannon information I(X, Z). Figure 15 shows that the deterioration in performance can be attributed to a deterioration in the performance of H(R) and −nHSIC(X, Z) (recall that the Shannon information enters the model entropy with a negative sign). If exogeneity were strictly enforced, then I(X, Z) = 0 and H(X_a) = H(X) + H(Z). Figure 15 indicates that, for more additive models, a lack of exogeneity could itself be a source of causal signal: the direction that achieved better exogeneity was usually the causal direction, as the AUC of nHSIC(X, Z) for the more additive models shows. Perhaps, then, we should penalize models that do not achieve exogeneity by calculating the entropy H(X_a) as if exogeneity held, in other words, by adding I(X, Z) to the model entropy to obtain a penalized model entropy:

$H(X) + H(Z) + H(R) = H(X_a) + H(R) + I(X, Z).$

In figure 16, we explore this possibility by comparing the performance of model entropy and penalized model entropy. The figure shows that, for the experiments considered, when we do not enforce Gaussianity, penalized model entropy achieves better performance than model entropy.
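In code, the two criteria differ only in the sign of the mutual-information term (with nHSIC(X, Z) standing in for I(X, Z), as above); a trivial sketch:

```python
def model_entropy(h_x, h_z, i_xz, h_r):
    # H(X_a) + H(R) with X_a = (X, Z): H(X) + H(Z) - I(X, Z) + H(R)
    return h_x + h_z - i_xz + h_r

def penalized_model_entropy(h_x, h_z, i_xz, h_r):
    # assume exogeneity when computing H(X_a), i.e. add I(X, Z) back in,
    # which penalizes models with dependent X and Z: H(X) + H(Z) + H(R)
    return model_entropy(h_x, h_z, i_xz, h_r) + i_xz
```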
To summarize, we have seen that penalizing a lack of exogeneity leads to better performance. Is this generally true? In the following, we gather more evidence for this hypothesis by performing the same analysis on the data introduced in section 3.1. Figure 17 shows the performance of the different model-entropy components: for these model classes, when Gaussianity is not enforced (LNc-free), the lack of exogeneity is not a source of causal signal, as the performance of nHSIC(X, Z) shows. Figure 18 shows that penalized model entropy does not improve performance for any of the data classes considered. In general, then, penalizing a lack of exogeneity in Z does not always lead to better performance.

Figure 16. Running mean of AUC performance of the LNc-free and LNc-gauss methods, for ANLS-s datasets ordered by the additivity p-value (log scale). Results are separated according to whether model entropy is penalized (blue) or not (red). The right side of the spectrum corresponds to additive data.
Figure 17. Running mean of AUC performance of the components of model entropy, for datasets ordered by the additivity p-value (log scale). Results are grouped into four subsets of i.i.d. datasets and separated by whether Gaussianity is enforced (blue) or not (red). The right side of the spectrum corresponds to additive data.
Figure 18. Running mean of AUC performance of the LNc-free and LNc-gauss methods, for datasets ordered by the additivity p-value (log scale). Results are grouped into four subsets of i.i.d. datasets and separated according to whether model entropy is penalized (blue) or not (red). The right side of the spectrum corresponds to additive data.

Appendix C. Further results
This section presents additional results for experiments on i.i.d. data. Table 7 shows the accuracy results for our LNc and LNc-ANMh methods and for other bivariate causal inference methods. Tables 8 and 9 show AUC-ROC performance of the LNc variants and selected benchmarks for individual data classes and aggregated data classes, respectively. Similarly, tables 10 and 11 show AUC-PR performance for individual and aggregated data classes, respectively.

C.2. Area under the curve (AUC)
The following tables show performance in terms of the AUC for the ROC and PR curves.

C.3. Hyperparameter sensitivity
Figures 19 and 20 show the accuracy of the LNc method when we include only a subset of the hyperparameters described in appendix D.3. For example, the top-left panel shows that if we use only models in the subset η > 0.1, the accuracy is 98%, while the accuracy for the subset of models such that η > 1.0 is 97%. For non-additive data and the hyperparameter ranges explored, the algorithm is relatively insensitive to the choice of values for all hyperparameters except ν, associated with SMMD(Z). If we include only the hyperparameter value ν = 10, penalizing non-Gaussian Z solutions, the accuracy for the LS-s datasets decreases dramatically. The LNc causal method thus appears quite robust to the particular choice of hyperparameters for non-additive data.

For additive data and the hyperparameter ranges explored, the algorithm shows much more sensitivity to the choice of values for all hyperparameters. Consistent with previous results (section 4.2 and figure 14), high values of the η parameter, associated with nHSIC(X, Z), which penalize a lack of exogeneity, yield better accuracy for the AN-s and SIMG classes. Generally, small values of λ and high quantiles for choosing σ_z lead to better results. A high quantile for σ_z corresponds to a lower lengthscale, meaning that less weight is given to z relative to x; this makes sense in additive regimes, where z is unnecessary to satisfy the primary nHSIC(X_a, R) objective.

Figure 19. Accuracy of the LNc method when only a subset of the hyperparameters is included. Results shown for non-additive data.

C.4. Algorithmic independence assumption
The algorithmic independence assumption (assumption 4) states that the mechanism f should be algorithmically independent of the cause (c, z). Although we do not specifically penalize algorithmic dependence of p(c, z) and f in the loss function L, the SGD optimization strategy should help to break possible dependence between p(c, z) and f. However, given that at each SGD step we sample 100 data points from the dataset (usually of size 1000), the empirical distribution $\hat{p}(x, z)$ of the sample may still be too close to the population distribution to disrupt the flow of information from $\hat{p}(x, z)$ to $\hat{f}$. In this section, we look at the role of algorithmic independence in the performance of the LNc method. To this end, we ran a modified version of the LNc method that emphasizes disrupting any flow of information from p(x, z) to f that may occur during learning. This modified version replaces the KRR estimate $\alpha = (K_{xz} + n\lambda I)^{-1} y$ introduced in section 2.4 with a weighted KRR estimate $\alpha_w = (K_{xz} + n\lambda D)^{-1} y$, where D is a diagonal matrix whose entries are the (re-scaled) weights assigned to each sample, $d_{ii} = w_i$ with $\sum_{i=1}^{n} w_i = n$. At each SGD learning step, we randomly generate a new set of weights to use in the weighted KRR, thus ensuring the estimate of f is not contingent on the distribution p(x, z). Figure 21 shows the accuracy results for LNc when using plain SGD (labelled SGD) and when using SGD with the weighted KRR estimate (labelled SGDw). There does not appear to be any significant gain from this modification, possibly because the KRR estimate is non-linear, so the effect of p(x, z) on $\hat{f}$ is minimal.

Figure 22 shows the degree to which fitted models adhere to the assumption that the latent variable z is Gaussian, as measured by SMMD(Z). Models are grouped according to the data class and whether they led (once paired) to correct causal conclusions. If the Gaussianity assumption were necessary, we would not expect differences in Gaussianity across model pairs to generate an exploitable asymmetry (in terms of finding the true causal direction). We see that, for non-additive data, models that led to finding the right causal direction (on the left) tend to have a more non-Gaussian variable Z for the anti-causal y → x direction: a fairly Gaussian variable Z was estimated in the causal direction and a more non-Gaussian variable Z in the anti-causal direction. In terms of the model-entropy causal criterion, this resulted in, or coincided with, very good performance. Had we enforced Gaussianity strictly, the estimated latent factor would be Gaussian for both models, and the asymmetry in model entropy would perhaps not be as marked. Allowing non-Gaussian variables does not hinder, and may encourage, a certain asymmetry. We cannot assert that allowing non-Gaussianity generated or allowed for the asymmetry in model entropy. However, the sensitivity of performance to the hyperparameter ν associated with SMMD(Z), shown in figure 19 of appendix C.3, is instructive: it shows that penalizing non-Gaussianity in all models (as opposed to also including unpenalized models, i.e. ν = 0, in the ensemble) leads to a large drop in performance for the LS-s non-additive class.

(Partial caption, figure 22: models grouped by model class and causal prediction outcome (true/false); the absolute complexity difference across causal directions is shown on the right, by model class and causal prediction outcome; models are paired and sampled according to weight using the majority voting scheme of appendix A.)

The other assumptions (deterministic process and exogenous latent variable, respectively) are analysed similarly in section C.6. Figure 23 shows the degree to which pairs of models adhere to the first three assumptions: 1. a deterministic process, measured by nHSIC(X, R) and MSE(R); 2. an exogenous latent variable z, measured by nHSIC(X, Ẑ); and 3. a Gaussian latent variable z, measured by SMMD(Z). Model pairs are grouped according to the data class and whether they led to the correct causal conclusion. On the left, the level of assumption fulfilment of the different groups of models is shown; models that adhere to the respective assumptions are closer to zero on the y-scale. On the right, we see whether the difference in the level of assumption fulfilment within model pairs generates part of the asymmetry we exploit to identify the causal direction. If these assumptions were necessary, we would expect low levels in the true x → y direction in the plots on the left-hand side, and we would not expect differences in the level of assumption fulfilment across model pairs to generate an exploitable asymmetry. As such, the second and third assumptions are not necessary for our framework. In the case of the second assumption (exogenous z), models that allow for some dependence between z and x contribute to finding the right causal direction. Additionally, for non-additive data, model pairs that led to the right causal direction tend to have a more non-Gaussian variable z for the anti-causal y → x model. These conclusions are supported by the results of appendix C.3, where performance is insensitive to enforcing independence of z and x and gets worse when we enforce Gaussianity of z.
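Returning to the weighted-KRR modification used in this section, the sketch below shows one step; the Dirichlet draw is only one possible way of generating random weights that sum to n, since the text does not fix the weight distribution.

```python
import numpy as np

def weighted_krr_alpha(K_xz, y, lam, rng):
    """alpha_w = (K_xz + n*lam*D)^{-1} y, with fresh random weights per SGD step."""
    n = len(y)
    w = rng.dirichlet(np.ones(n)) * n        # re-scaled so that sum_i w_i = n
    return np.linalg.solve(K_xz + n * lam * np.diag(w), y)

# At each SGD step, draw new weights before solving:
# alpha = weighted_krr_alpha(K_xz, y, lam, np.random.default_rng())
```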

D.1. Optimization
Using SGD, we optimize the loss of section 2. For each dataset and each causal direction, we used 500 optimization steps. At each step, we sample 100 points with replacement, evaluate the gradient $\nabla_Z L \in \mathbb{R}^{100}$, and take a step in the direction of the negative gradient with a learning rate of 0.1. Apart from helping generalization, this subsampling makes the method computationally feasible, since calculating and inverting larger kernel matrices becomes very expensive, especially over many experiments. We performed seven repetitions of the SGD procedure. The solution Z is initialized as the target variable, i.e. as Y for x → y and as X for y → x.
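A minimal sketch of this optimization loop; loss_and_grad is a hypothetical routine evaluating the loss of section 2 and its gradient with respect to the sampled entries of z.

```python
import numpy as np

def optimize_z(x, y, loss_and_grad, n_steps=500, batch=100, lr=0.1, seed=0):
    """SGD directly over the latent variable Z (Z is estimated, not parameterized)."""
    rng = np.random.default_rng(seed)
    z = y.copy()                      # initialize Z as the target variable
    for _ in range(n_steps):
        idx = rng.choice(len(x), size=batch, replace=True)
        _, g = loss_and_grad(x[idx], y[idx], z[idx])   # g has shape (batch,)
        z[idx] -= lr * g              # step on the sampled coordinates only
    return z                          # (duplicate indices are a sketch-level simplification)
```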

D.2. Choice of kernel
To reduce the number of hyperparameters, we use the kernel $k_\theta((x,z),(x',z')) = k_{\sigma_x}(x,x')\,k_{\sigma_z}(z,z')$, where $k_\sigma(x,x')$ is the RBF kernel. Instead of using the median heuristic (quantile = 0.5) to calculate the lengthscale, we sampled a different quantile for each repetition. We did this for both the σ_x and σ_z lengthscales, thus obtaining models in which X and Z have different relative importance. Quantiles were sampled uniformly from the set {0.1, 0.3, 0.5, 0.7, 0.9, 0.99, 1}.
The first term helps numerical stability by penalizing large solutions, while the second penalizes solutions without dependence on the target variable.
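Returning to the kernel of appendix D.2, here is a minimal sketch of the product kernel with quantile-based lengthscales, written for 1-D inputs for simplicity (q = 0.5 recovers the median heuristic).

```python
import numpy as np

def quantile_lengthscale(v, q):
    """Lengthscale as the q-quantile of pairwise distances."""
    d = np.abs(v[:, None] - v[None, :])
    return max(np.quantile(d[np.triu_indices_from(d, k=1)], q), 1e-12)

def product_kernel(x, z, qx=0.5, qz=0.5):
    """k((x,z),(x',z')) = k_{sigma_x}(x,x') * k_{sigma_z}(z,z') with RBF factors."""
    def rbf(v, ls):
        return np.exp(-((v[:, None] - v[None, :]) ** 2) / (2 * ls ** 2))
    return rbf(x, quantile_lengthscale(x, qx)) * rbf(z, quantile_lengthscale(z, qz))
```

Sampling qz near 1 yields a large σ_z, i.e. a flatter kernel in z, giving Z little influence relative to X, which matches the behaviour discussed in appendix C.3 for additive data.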

D.5. Entropy estimation
Since both the ANMh and LNc methods use the entropy of the model as the final causal criterion, we need a way to estimate entropy. We used k-nearest-neighbour entropy estimation, as we needed a multivariate estimation method to calculate the joint entropy of X and Z.
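A minimal sketch of such an estimator in the Kozachenko-Leonenko form; the number of neighbours k and the small constant guarding the logarithm are illustrative choices, not necessarily those used in our experiments.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def knn_entropy(x, k=3):
    """Kozachenko-Leonenko k-NN estimate of differential entropy (in nats).
    x: array of shape (n, d)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    n, d = x.shape
    # distance of each point to its k-th nearest neighbour (excluding itself)
    eps = cKDTree(x).query(x, k=k + 1)[0][:, -1]
    log_vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)  # log-volume of unit d-ball
    return digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(eps + 1e-12))

# Joint entropy H(X, Z), needed for the model-entropy criterion:
# h_xa = knn_entropy(np.column_stack([x, z]))
```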

D.6. Reproducibility
All code and data used to carry out the experiments are available at https://github.com/IPL-UV/latentNoise_krr.