To Sample or Not to Sample: Retrieving Exoplanetary Spectra with Variational Inference and Normalizing Flows

Current endeavours in exoplanet characterization rely on atmospheric retrieval to quantify crucial physical properties of remote exoplanets from observations. However, the scalability and efficiency of said technique are under strain with increasing spectroscopic resolution and forward model complexity. The situation has become more acute with the recent launch of the James Webb Space Telescope and other upcoming missions. Recent advances in machine learning provide optimization-based variational inference as an alternative approach to perform approximate Bayesian posterior inference. In this investigation we developed a normalizing-flow-based neural network, combined with our newly developed differentiable forward model, Diff-τ, to perform Bayesian inference in the context of atmospheric retrievals. Using examples from real and simulated spectroscopic data, we demonstrate the advantages of our proposed framework: (1) our neural network does not require a large precomputed training set and can be trained on a single observation; (2) it produces high-fidelity posterior distributions in excellent agreement with sampling-based retrievals; (3) it requires up to 75% fewer forward model calls to converge to the same result; and (4) this approach allows formal Bayesian model selection. We discuss the computational efficiencies of Diff-τ in relation to TauREx3's nominal forward model and provide a "lessons learned" account of developing radiative transfer models in differentiable languages. Our proposed framework contributes toward the latest development of neural network–powered atmospheric retrieval. Its flexibility and significant reduction in forward model calls required for convergence hold the potential to be an important addition to the retrieval toolbox for large and complex data sets, alongside sampling-based approaches.

At its core, a retrieval strives to find an atmospheric model that can best explain a given observation. Most contemporary retrieval frameworks formulate the inverse problem in terms of Bayesian statistics, where the free parameters of the physical model are framed as random variables. The probability densities of these random variables (θ) given the observed data (D) are collectively referred to as the posterior distribution, P(θ|D). Bayes' theorem provides a way to calculate the posterior distribution via the following relation:

P(θ|D) = P(D|θ) P(θ) / P(D),

where P(D|θ) and P(θ) represent the likelihood and the prior distributions respectively. However, the denominator, P(D), or the Evidence, is intractable in most cases.
The community has thus far relied on sampling techniques such as Markov Chain Monte Carlo (MCMC) or Nested Sampling to compute the approximate posterior distribution (see Madhusudhan 2018, for a recent review). Nevertheless, sampling-based techniques are prohibitively slow when the dimensionality of the problem and the quantity of observed data are large (Zhang et al. 2019). This issue will become increasingly pressing with the increased data volume originating from the recently launched James Webb Space Telescope (JWST, Greene et al. 2016) and other future missions such as Ariel (Tinetti et al. 2021), Twinkle (Edwards et al. 2019) and the ELTs from the ground (e.g. Udry et al. 2014). These telescopes are designed to provide hundreds of higher-resolution spectroscopic measurements with wide wavelength coverage. On the other hand, recent investigations have highlighted potential biases associated with some commonly used modelling assumptions, such as isothermal atmospheres, constant-with-altitude chemistry, or the 1D plane-parallel approximation (e.g. Rocchetto et al. 2016; Changeat et al. 2019, 2020; MacDonald et al. 2020; Pluriel et al. 2020a; Ih & Kempton 2021), prompting the need for more realistic treatments of the atmosphere (e.g. Irwin et al. 2020; Feng et al. 2020; Changeat & Al-Refaie 2020). This will inevitably increase both the computational cost and the model complexity (Changeat et al. 2021). The increase in both quantity of data and model complexity signals the need for alternative approaches to computing the posterior distribution.
Variational Inference (VI) is a widely studied and scalable approach in the field of machine learning (ML) for providing approximate posterior distributions for large and high-dimensional data sets within a reasonable amount of time (Blei et al. 2016; Zhang et al. 2019). However, VI requires a fully differentiable model. The field has recently explored different applications of differentiable physical models. Differentiable models open up the possibility of constructing "physics-aware" neural networks, a type of network that respects physical laws (Raissi et al. 2019). For instance, Kawahara et al. (2021) used Hamiltonian Monte Carlo (HMC), a gradient-informed Monte Carlo sampling algorithm (Duane et al. 1987; Hoffman & Gelman 2011), to perform atmospheric retrieval of exoplanets on high-resolution spectroscopic data. Others have also applied HMC to speed up lightcurve fitting (e.g. Foreman-Mackey et al. 2021; Agol et al. 2021).
Here we present the following contributions:

1. We present Diff-τ, a Tensorflow-based, fully differentiable atmospheric forward model based on the implementation of TauREx3 (Al-Refaie et al. 2021b,a).

2. For the first time, we introduce Variational Inference as a more efficient alternative for performing atmospheric retrieval.

3. Our framework requires only a single data instance at training time; there is no need for a large library of spectra for pre-training.

4. We show that our framework takes into account the uncertainty associated with the observation and is able to reproduce physically motivated correlations between atmospheric parameters.

Our Bayesian Neural Network is capable of producing posterior distributions on par with distributions produced by sampling processes.

OVERVIEW
In this investigation our core aim is to explore an alternative to the conventional sampling approach using modern deep learning techniques. For simplicity we will denote the conventional, sampling-based approach as NS-retrieval and our proposed approach as VI-retrieval. Our approach involves three core components: a differentiable physical model, Variational Inference and a Normalising Flow. Here we provide a top-level overview of how the three components interact with each other; see also Figure 1 for a schematic overview of the VI-retrieval.
[Figure 1 caption (right panel): Schematic of our proposed VI-retrieval. We provide a simple noise distribution ξ, the observed data and our prior bounds as inputs to the optimisation loop (Train). The Normalising Flow-based neural network (NF) is tasked with transforming ξ into a surrogate distribution q(z). We employ variational inference to approximate the typically intractable posterior distribution, where the network (NF) must learn the best transformation to optimise the ELBO (controlled by log(L) and D_KL). To calculate log(L), we use Diff-τ to transform samples from q(z) into spectra, which are compared with the observed data. Once training is complete, the trained NF is able to provide a q*(z) that best approximates p(z|x).]

Instead of relying on sampling to map the actual, underlying posterior distribution (as one normally does
with e.g. MCMC-based approaches), VI-retrieval relies on finding a best-fit surrogate distribution for the actual posterior distribution through optimisation. The use of optimisation-based techniques means that VI-retrieval can be orders of magnitude faster than sampling-based approaches, especially on high-dimensional problems. However, two implementation difficulties have prevented the widespread use of variational inference in the field of exoplanetary atmospheres: (1) gradients: many optimisation procedures demand complete knowledge of the gradient flow within a computational graph (i.e. a neural network), and contemporary atmospheric forward models break this flow as they are non-differentiable; (2) surrogate distributions: the chosen distribution (usually a multivariate Gaussian) is often too simplistic to represent the actual, underlying posterior distribution.
To complete the "circuit", we built Diff-τ with Tensorflow libraries and utilise its automatic differentiation capabilities to compute gradients of the forward model. At the same time, we implemented a normalising-flow-based neural network, a deep learning approach that can transform a simple "seed" distribution (such as a multivariate Gaussian) into arbitrarily complex distributions.
In the following sections we explain the theoretical background behind each technique, and in the latter part of the paper we demonstrate how these concepts are linked together in practice by providing retrieval examples in two different scenarios.

Variational Inference
The mathematical theory of variational inference (VI) and its application in the field of machine learning have been extensively discussed in Blei et al. (2016). There are ongoing efforts to investigate the statistical implications of using VI for parameter estimation (e.g. Pati et al. 2018; Chérief-Abdellatif & Alquier 2018). Here we provide a brief overview of the methodology.
Given an observed spectrum x defined by the transit depths (x_i) and associated uncertainties (σ_i) at each spectral bin i, the goal of an atmospheric retrieval is to estimate the posterior distribution p(z|x) of the set of (latent) variables (z) that can best describe the observation under a specific atmospheric model assumption M. Instead of approximating the (unnormalised) p(z|x) via sampling, VI approximates the distribution by finding a best-matching surrogate distribution via optimisation.
Suppose we have a family of probability distributions Q parameterised by some latent variables z. We can measure the statistical distance, e.g. the Kullback–Leibler divergence (D_KL), between q(z) and p(z|x), and find an optimal distribution q*(z) that minimises the divergence from p(z|x), i.e.

q*(z) = argmin_{q(z) ∈ Q} D_KL(q(z) ‖ p(z|x)).   (2)

However, this optimisation procedure is impossible as we do not have any knowledge of p(z|x) to begin with. To remove the dependence on the unknown true posterior distribution, we can instead optimise the Evidence Lower BOund (ELBO),

ELBO = E_{q(z)}[log(L)] − D_KL(q(z) ‖ p(z)),   (3)

where the first term is the expected value of the log-likelihood log(L) and the second term is the KL divergence between q(z) and the prior distribution p(z). This formulation can be understood, in terms of Bayesian statistics, as a 'tug of war' between the likelihood function and our prior belief on the distribution. We include a detailed discussion of the ELBO and its link to Equation (2) in Appendix A. Unlike Equation (2), the ELBO does not require any knowledge of the intractable evidence p(x) and can therefore be computed directly. We can hence find the optimal distribution q*(z) by maximising the ELBO (in practice, by minimising its negative as a loss function). This optimisation technique is well studied in the field of deep learning (Kingma & Welling 2013; Kingma et al. 2016). Most contemporary neural networks are trained to minimise a given loss function (the negative ELBO in our case) by relying on a combination of gradient descent and back-propagation algorithms. Modern deep learning libraries such as Tensorflow (Abadi et al. 2015a), PyTorch (Paszke et al. 2019) and JAX (Bradbury et al. 2018) provide easy access to model training and evaluation. Our implementation is based solely on the Tensorflow framework, though other deep learning frameworks could equally be used (Kawahara et al. 2021).
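Both terms of the ELBO are expectations under q(z), so they can be estimated by Monte Carlo from samples of the surrogate. The toy sketch below (plain NumPy, with a hypothetical 1-D "observation", a Gaussian surrogate and a uniform prior; none of these choices reflect the paper's actual implementation) makes the estimator concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D problem: uniform prior on z in [0, 1] and a Gaussian
# likelihood around an assumed "observation" (illustrative numbers only).
x_obs, sigma_obs = 0.42, 0.05

def log_likelihood(z):
    return -0.5 * ((x_obs - z) / sigma_obs) ** 2 - 0.5 * np.log(2 * np.pi * sigma_obs**2)

def log_prior(z):  # uniform prior on [0, 1]
    return np.where((z >= 0) & (z <= 1), 0.0, -np.inf)

def elbo(mu, s, n_samples=5000):
    """Monte Carlo estimate of E_q[log L] - D_KL(q(z) || p(z))."""
    z = rng.normal(mu, s, n_samples)            # samples from the surrogate q
    log_q = -0.5 * ((z - mu) / s) ** 2 - 0.5 * np.log(2 * np.pi * s**2)
    kl = np.mean(log_q - log_prior(z))          # Monte Carlo D_KL(q || p)
    return np.mean(log_likelihood(z)) - kl

# A surrogate centred on the observation scores a higher ELBO than one
# offset from it; maximising the ELBO drives q(z) towards p(z|x).
assert elbo(0.42, 0.05) > elbo(0.60, 0.05)
```

In the actual framework the surrogate is the flow-pushed distribution and the likelihood involves the forward model, but the estimator has the same structure.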

Differentiable forward model, Diff-τ
Diff-τ is an atmospheric forward model built entirely within the Tensorflow framework (Abadi et al. 2015a). We followed the forward model formulation specified in Al-Refaie et al. (2021b) to construct Diff-τ, with only very minor modifications to comply with the Tensorflow framework. As the code is largely based on TauREx3, the two forward models are in excellent agreement; see Figure 2 for an empirical comparison between the two. The advantage of building it entirely in a deep learning framework is the ability to leverage the built-in automatic differentiation functionality (Gunes Baydin et al. 2015), which can automatically differentiate (almost) any function without the need to explicitly specify the corresponding derivative form. This is immensely helpful, as atmospheric models are a mixture of different physical processes, and deriving the respective derivative forms for each can be a time-consuming and non-trivial task. As discussed above, having differentiable models opens up many interesting avenues to explore; in this investigation we aim to showcase how differentiable models enable us to perform Variational Inference to approximate the Bayesian posterior distribution.
While automatic differentiation (autograd) libraries have lessened the overhead of translating the forward model from one framework to another, the translation remains a non-trivial task. Section 6.3 will discuss the authors' experience of implementing physical models within the Tensorflow deep learning framework.

Normalising Flows
The stringent requirement of a pre-defined family of distributions, Q, presents a major limitation of VI. In many real-life scenarios, the desired posterior distributions are rarely Gaussian, nor do they follow any well-defined form. We implemented a normalising-flow-based neural network to break the limitation of normality on q*(z) in Equation (3) by transforming it into an arbitrarily complex probability distribution.
Normalising Flow (NF) describes a mechanism for "crafting" a complex, multi-modal distribution from a simple "seed" distribution (Rippel & Adams 2013; Rezende & Mohamed 2015; Kobyzev et al. 2021). The seed is usually a distribution in the exponential family (e.g. a Gaussian) or a uniform distribution.
Suppose we have an invertible function g, such that we can transform a random variable ξ ∼ p_ξ into another random variable y ∼ p_y using y = g(ξ). The probability density p_y of the random variable y can then be computed using the change-of-variables formula, i.e.

p_y(y) = p_ξ(f(y)) |det Df(y)|,

where f is the inverse of g, i.e. f ≡ g⁻¹, Df(y) is the Jacobian of f, i.e. Df(y) = ∂f/∂y, and Dg(ξ) is the Jacobian of g, Dg(ξ) = ∂g/∂ξ. In terms of generative models, the invertible function g(·) is a generator that pushes forward the seed distribution to a more complex density (the generative direction). On the other hand, the function f(·) moves in the opposite direction, transforming a complex density back into a simple, "normalised" distribution (the normalising direction). It has been shown that one can generate or craft any form of distribution p_y from any base distribution p_ξ, given that the generator g is arbitrarily complex (Bogachev et al. 2005; Medvedev 2008).
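As a quick numerical sanity check of the change-of-variables formula, the sketch below pushes a standard normal seed through an illustrative affine bijector g(ξ) = aξ + b (our own toy choice, not one of the paper's flow layers) and verifies that the resulting density matches the known N(b, a²) form:

```python
import numpy as np

# Toy affine bijector g(xi) = a*xi + b (illustrative; any invertible,
# differentiable g would do). The seed density p_xi is a standard normal.
a, b = 2.0, 1.0
f = lambda y: (y - b) / a                  # f = g^{-1}
abs_det_Df = 1.0 / abs(a)                  # |det Df(y)| = 1/|a| in 1-D

p_xi = lambda xi: np.exp(-0.5 * xi**2) / np.sqrt(2 * np.pi)
p_y = lambda y: p_xi(f(y)) * abs_det_Df    # change-of-variables formula

# The pushed-forward density should equal N(b, a^2); check at a few points
y = np.array([-1.0, 1.0, 3.0])
expected = np.exp(-0.5 * ((y - b) / a) ** 2) / (a * np.sqrt(2 * np.pi))
assert np.allclose(p_y(y), expected)
```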
Up until now we have only shifted the problem from crafting an arbitrarily complex density function to crafting an arbitrarily complex generator function. Fortunately, invertible functions (or bijections) have a convenient property: the composition of invertible functions is itself invertible, meaning that one can build a successively more complicated function by chaining non-linear invertible functions together, i.e.

g = g_N ∘ g_{N−1} ∘ ⋯ ∘ g_1,

where g_1, …, g_N is a set of N bijective functions. The composite g likewise has an inverse and, conveniently, the determinant of the Jacobian Df(y) is the product of the determinants of the individual Jacobians Df_i, i.e.

det Df(y) = ∏_{i=1}^{N} det Df_i(ϕ_i),

where we denote ϕ_i as the resultant vector of the i-th intermediate flow, i.e.

ϕ_i = (g_i ∘ g_{i−1} ∘ ⋯ ∘ g_1)(ξ),

with ϕ_0 = ξ and ϕ_N = y. However, these intermediate functions must by definition be diffeomorphic, meaning they must be bijective and differentiable (including their inverses). There has been a significant focus on constructing non-linear bijectors that satisfy these restrictions while remaining sufficiently expressive and computationally efficient even in high-dimensional problems (such as images); notable bijector architectures include MADE (Germain et al. 2015), Masked Autoregressive Flow (Papamakarios et al. 2017), NICE (Dinh et al. 2015), RealNVP (Dinh et al. 2016), FFJORD (Grathwohl et al. 2018), Glow (Kingma & Dhariwal 2018) and NSF (Durkan et al. 2019). Normalising Flows have met with critical success in a wide range of applications, including audio (Oord et al. 2018), text (Jin et al. 2019) and image generation (Kingma & Dhariwal 2018; Grathwohl et al. 2018).
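The product rule for the Jacobian determinants along a chain can be verified numerically. The two 1-D bijectors below are illustrative toys (not actual flow layers such as IAF); the check compares the summed per-layer log-Jacobians against a finite-difference derivative of the composed map:

```python
import numpy as np

# Two illustrative 1-D bijectors: an affine map and a monotone smooth map
# (its slope lies in (1, 1.5], hence it is invertible).
g1 = lambda x: 2.0 * x + 1.0
g2 = lambda x: x + 0.5 * np.tanh(x)
dg1 = lambda x: 2.0 * np.ones_like(x)
dg2 = lambda x: 1.0 + 0.5 / np.cosh(x) ** 2

xi = np.linspace(-2.0, 2.0, 5)
y = g2(g1(xi))                       # generative direction: y = (g2 ∘ g1)(xi)

# Chain rule: |D(g2 ∘ g1)(xi)| = |Dg1(xi)| * |Dg2(g1(xi))|, i.e. the
# log-Jacobians of the layers simply add up along the chain.
log_det_chain = np.log(dg1(xi)) + np.log(dg2(g1(xi)))

# Cross-check against a finite-difference derivative of the composed map
eps = 1e-6
fd = (g2(g1(xi + eps)) - g2(g1(xi - eps))) / (2 * eps)
assert np.allclose(np.exp(log_det_chain), fd, rtol=1e-5)
```

This additive bookkeeping is what makes deep flows tractable: evaluating the density of a 10-layer chain costs only the sum of 10 per-layer log-determinants.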
In the context of our investigation, we will transform our seed distribution ξ ∼ p ξ into y ∼ p y and treat it as our surrogate distribution z ∼ q(z).

ELBO Formulation
We will begin by describing our implementation of the general ELBO formulation in Equation (3). The ELBO consists of the log-likelihood term and the prior term. The latter term can be computed analytically, i.e.

D_KL(q(z) ‖ p(z)) = E_{q(z)}[log q(z) − log p(z)].

However, an analytical form of q(z) may not always be available (e.g. for a distribution crafted by a Normalising Flow). Here we have approximated the KL divergence
via Monte Carlo sampling. Consistent with the existing literature on exoplanetary retrieval, we imposed a uniform prior on the physical parameters, though any other kind of prior can equally be used. Additionally, we define our log-likelihood as an additive log-Gaussian probability density function, i.e.

log(L) = −(1/2) Σ_i [ (x_i − μ_i)² / σ_i² + log(2π σ_i²) ],

where μ = Diff-τ(z) is the atmospheric forward model spectrum generated by Diff-τ and σ_i is the observed uncertainty at bin i. This formulation is the same as the likelihood currently employed by many retrieval frameworks. Instead of optimising the ELBO objective in its original form, we adopted the weight-annealing approach from Sun et al. (2022) to optimise a modified ELBO objective, i.e.

ELBO_β = (1/β) E_{q(z)}[log(L)] − D_KL(q(z) ‖ p(z)),   (11)
where β = max(1, β_0 e^{−epoch/τ}), and β_0 and τ represent the weight constant and the decay constant respectively. This formulation prevents the neural network from converging to bad local minima at the start of the training and encourages it to explore different solutions before converging. As training progresses, the objective function slowly converges back to the original ELBO formulation.
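As a sketch, and assuming an exponential-decay reading of the schedule, β = max(1, β_0 e^{−epoch/τ}), with the β_0 = 500 and τ = 200 quoted later in the text, the annealing weight behaves as follows:

```python
import numpy as np

def beta(epoch, beta0=500.0, tau=200.0):
    # Annealing weight, assuming beta = max(1, beta0 * exp(-epoch/tau));
    # beta0 (weight constant) and tau (decay constant) as in the text.
    return max(1.0, beta0 * np.exp(-epoch / tau))

# The log-likelihood term is divided by beta, so it is strongly
# down-weighted early on; once beta has decayed to 1 the objective
# reduces to the plain ELBO.
assert beta(0) == 500.0      # start of training: prior term dominates
assert beta(3000) == 1.0     # late training: original ELBO recovered
```

Early in training the likelihood contribution is therefore suppressed by a factor of several hundred; by the end of training the modified objective and the plain ELBO coincide.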

Training Procedure
At training time, our flow-based neural network (the generator; for details on the architecture please refer to Section 4.1) is tasked with transforming a simple noise distribution ξ (we used uniform noise here) into a latent surrogate distribution q(z) that best approximates p(z|x). To calculate the ELBO loss, samples are drawn and transformed via Diff-τ into synthetic spectra. These spectra are then compared with the observed data to estimate the ELBO value. The entire process is repeated until the optimisation procedure is completed or terminated.
The role of Diff-τ can be seen simply as a deterministic transformation from physical parameters to spectra. However, as part of the optimisation process, the network must learn to adhere to the physical laws imposed by the forward model. In the end, the network will produce outputs and correlations that are physically constrained. In other words, the outputs and the correlations are no longer black boxes but are physically motivated.
Once trained, the generator is capable of transforming simple distributions into the desired posterior distribution.
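The whole loop can be condensed into a self-contained toy sketch. Everything here is a stand-in: a linear map plays the role of Diff-τ, a reparameterised Gaussian plays the role of the flow-based generator, and a finite-difference gradient substitutes for the automatic differentiation used in practice:

```python
import numpy as np

rng = np.random.default_rng(1)

x_obs, sigma_obs = 1.3, 0.3        # hypothetical observation and error bar
forward = lambda z: 2.0 * z        # toy stand-in for the Diff-tau forward model

def neg_elbo(params, eps):
    mu, log_s = params
    s = np.exp(log_s)
    z = mu + s * eps               # reparameterised samples from q(z) = N(mu, s^2)
    log_like = -0.5 * ((x_obs - forward(z)) / sigma_obs) ** 2
    # Analytic KL(q || p) for q = N(mu, s^2) against a broad N(0, 2^2) prior
    kl = np.log(2.0 / s) + (s**2 + mu**2) / (2 * 2.0**2) - 0.5
    return -(np.mean(log_like) - kl)   # loss = negative ELBO

params = np.array([0.0, -1.0])         # initial (mu, log_s)
for step in range(300):
    eps = rng.standard_normal(256)     # fix the noise within each step
    grad = np.zeros(2)
    for i in range(2):                 # finite-difference gradient; autodiff
        d = np.zeros(2)                # (e.g. Tensorflow) replaces this in practice
        d[i] = 1e-4
        grad[i] = (neg_elbo(params + d, eps) - neg_elbo(params - d, eps)) / 2e-4
    params -= 0.01 * grad              # gradient-descent update of the generator

# q(z) should centre near the best-fit solution z = x_obs / 2 = 0.65
assert abs(params[0] - 0.65) < 0.05
```

Despite the stand-ins, the control flow mirrors the description above: sample from q(z), push the samples through the forward model, score the ELBO against the observation, and update the generator's parameters.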

APPLICATION
In this section we will perform atmospheric retrieval with NS-retrieval and VI-retrieval. We will demonstrate the technique through two cases:
1. Real Hubble/WFC3 (HST/WFC3) observation of the hot Jupiter HD209458 b.
2. Simulated Ariel Tier-2 observation of the same planet.
Note that, while we demonstrate our results using HD209458 b, VI-retrieval is not confined to any particular planet or planet class.

Normalising Flow model setup
We implemented the Inverse Autoregressive Flow (IAF, Kingma et al. 2016) as the default bijector unit in our Normalising Flow-based neural network. To perform the transformation, we chained 10 bijector units together, each controlled by a 2-layer densely connected neural network with 64 hidden units and ReLU activation (He et al. 2015) in each layer. To stabilise the network, we followed Kingma et al. (2016) and added a Batch Normalisation layer (Ioffe & Szegedy 2015) after each bijector unit. We used Adam as our optimiser with the learning rate kept at 0.001 and the remaining settings kept at their defaults. We used a multi-dimensional uniform distribution as our "seed" distribution. At each iteration, we sample 5 times from the surrogate distribution and compute the average (modified) ELBO objective, Equation (11). In both cases we set β_0 and τ to 500 and 200 respectively.

Atmospheric model setup
We have taken values from Tsiaras et al. (2016a) as our reference values for the HD209458 system. In both cases, we assume blackbody emission for the host star at T_eff = 6065 K and a stellar radius R* = 1.155 R_⊙. We also assume a primary atmosphere (dominated by H and He, ratio = 0.175) at solar abundance. We divide the atmosphere into 70 atmospheric layers over a pressure range of 10⁻⁵ Pa to 10⁶ Pa, and furthermore assume an isothermal T-P profile and an iso-abundance chemical profile. Table 1 shows the fitted input parameters for each case and their corresponding ground truths, prior bounds and formalism/linelist references.

Case I: HST/WFC3 observation of HD209458 b

Case I aims to demonstrate our method's applicability to actual data. The transmission spectrum is observed with the HST/WFC3 G141 grism and processed by Iraclis (Tsiaras et al. 2016a; Tsiaras et al. 2019). The detrending process is described in detail in Tsiaras et al. (2016a). The basic atmospheric setup follows Section 4.2. For this case we assume a hydrogen/helium-dominated atmosphere, with trace water and absorption from Mie-scattering clouds¹ (Lee et al. 2013). We ran the optimisation procedure for 1500 epochs.

Case II: Ariel Tier-2 observation on HD209458 b
In the second case we would like to assess the ability of our proposed framework to retrieve the ground-truth values, and to showcase the flexibility of our framework in switching to different atmospheric assumptions and spectral resolutions. We used ArielRad (Mugnai et al. 2020) to simulate the expected noise level in each wavelength channel for HD209458 b at Ariel Tier-2 resolution. In terms of atmospheric chemistry, we adhered to the same setup as described in Section 4.2 and include five trace gases: H2O, CH4, CO, CO2 and NH3. We chose this set of molecules due to their expected contribution in the wavelength range considered, and because they have been successfully detected in hot-Jupiter atmospheres. We ran the optimisation procedure for 3000 epochs.

RESULTS
In both cases (see Figure 3 and Figure 4), the VI-retrieval (in yellow) is able to converge to results similar to those of NS-retrieval (in red). The accompanying table compares the quartile values returned from both retrievals. We can see very good agreement between the two methods, with the median values within 1-σ of each other. As for the shape of the posterior distribution, the surrogate distribution is able both to reproduce peaked distributions whenever there is sufficient constraint from the observation (e.g. the radius of the planet at 1 bar pressure) and to produce a uniform-like distribution (with an upper limit) when the free parameters are unconstrained by the observation, aligning with our uniform prior. The surrogate distribution also displayed similar covariances between the free parameters; this is most notable in Case II, where the high S/N of the simulated Ariel observation allows better constraints on the free model parameters.

¹ We note that the alternative and commonly used flat cloud model (i.e. a constant-in-pressure opacity cut-off) is inherently non-differentiable and not physically viable; we therefore chose to include the differentiable Mie scattering formalism in Diff-τ.
There are also notable differences between the two retrievals; for instance, VI-retrieval is less adept at capturing "cliff"-like distributions (e.g. log10(χ_lee_mie)), where there is a sharp transition from the presence to the absence of a physical phenomenon.
Our objective function (ELBO) is a balance between the log-likelihood term and the prior term: improving one term will always come at the expense of the other. The situation is worsened by the fact that the variability of the log-likelihood term is orders of magnitude higher than that of the prior term, which means that training will always be driven by the former and produce sharply peaked distributions. Our modified objective function (Equation (11)) aims to alleviate the imbalance between the two terms by lowering the contribution from the log-likelihood term at the start of the training. However, the underdispersion is likely to return as the formulation slowly converges back to the plain ELBO, as seen in both cases. An alternative remedy is to permanently increase the contribution from the prior term, which will inevitably increase the estimated uncertainty.

[Figure 3 caption: Distribution of HD209458 b WFC3/G141 observation obtained from NS-retrieval (red) and VI-retrieval (yellow). The top right corner shows an empirical comparison between the two best-fit spectra obtained by the two methods. The shaded area shows the 1-σ spread of the respective retrieval approach.]

A Flexible and Interpretable Framework
Despite numerous recent developments in applying ML techniques to infer atmospheric properties from spectroscopic data, the applicability of modern deep learning models in atmospheric retrieval remains limited. This is mainly due to two limiting factors: 1) their generalisability and training requirements, and 2) their interpretability.
Models are currently trained on large data sets encapsulating a fixed set of model assumptions and predefined input grids; any changes to the spectral range, observed data and/or model assumptions require the model to be re-trained. The method presented here offers an alternative approach to training a deep learning model. Instead of generating a big data set for model training, our network is specific (or, in other words, overfitted) to the observed data and, similarly to Markovian sampling algorithms, it is not generalisable and must be re-run for any changes in observed data and/or model assumptions. However, using an optimisation algorithm (i.e. VI) instead of a sampling one drastically reduces the number of forward model computations required, resulting in significant run-time savings when forward models are computationally expensive. Table 6.1 shows a comparison of the number of forward model calls between NS- and VI-retrieval on the same atmospheric model (as in Case I) at two different spectral resolutions. By dropping the goal of training a generalisable model, we forego the need to create vast training sets and to carry out expensive pre-training of our model. It furthermore affords us the flexibility of easily changing our input data and forward models. This flexibility has allowed us to explore the performance of our model at different spectral resolutions, wavelength ranges, observational uncertainties and model assumptions with relative ease (as demonstrated above). Generalisable models, on the other hand, must be re-trained for any such change.
The second major limitation of ML, and in particular of deep models, is their inherently uninterpretable architecture. For instance, conventional Variational Autoencoders (made up of a neural-network-based encoder and decoder, Kingma & Welling 2013) do not produce naturally interpretable latent variables. Our implementation, on the other hand, naturally produces interpretable latent variables by using a physical, deterministic forward model as our "decoder". Here we constrain the neural network to produce physically viable solutions; in other words, the network is constrained explicitly by the physics. This is demonstrated through our examples: the surrogate distributions are able to provide physically plausible correlations, aligned with the correlations produced by standard Bayesian sampling retrievals (NS-retrieval, red contours).

Objective Function
The objective (loss) function is a crucial factor that governs the learning behaviour of a deep learning model. Deep learning models in the literature are usually trained to best match the respective ground-truth values of the physical parameters (the parameter values used to generate the forward model). Here we opt to align our objective with that of our retrieval counterpart; in other words, we are explicitly asking our model to look for solutions that can best explain our observation.
In an ideal world, both approaches would agree with each other. However, this is not the case for inverse problems, where our observation is inherently corrupted² and we may never be able to recover the ground truth in some cases, owing to the loss of information and the inverse process being ill-defined. The conventional approach will therefore ask the neural network to pursue ground truth(s) that may no longer be retrievable, which can cause the neural network to exhibit fictitious behaviour if it results in a lowering of the loss function (Yip et al. 2020). Our proposed objective, on the other hand, naturally takes these observational uncertainties into account and is inherently constrained by our deterministic physical model.
In terms of model development, our framework bypasses the need to train the network with a large library of synthetic spectra before applying it to actual data, as we are training the network directly on actual observations. This avoids the problem of data shift (Quionero-Candela et al. 2009), where the training distribution differs from the test distribution.
The approach proposed in this paper grants us better interpretability of the results compared to other machine learning retrieval architectures, owing to the inclusion of the physical forward model Diff-τ. However, it does not, of course, guard against the intrinsic retrieval biases induced by the atmospheric forward model itself (Rocchetto et al. 2016; Feng et al. 2020; MacDonald et al. 2020).
These biases can only be alleviated through increasing the complexity of the atmospheric forward model to better represent the physical/chemical processes leading to the observed spectra.

Model Selection
Model selection is a key part of the model evaluation cycle. So far, the machine learning retrieval literature has largely omitted the issue by training networks under one or several fixed atmospheric assumptions. Our flexible framework and our objective function's similarity to that of NS-retrieval mean that we can, for the first time, utilise some of the tools frequently used by sampling-based retrievals to compare the retrieval results from different models.
Given that our surrogate distribution is a good approximation of the underlying posterior distribution, our ELBO, despite being a lower bound, should closely approximate the (log-)Bayesian evidence. We can therefore use the ELBO as a proxy for the evidence and compare our models by estimating the Bayes factor,

B = P(x|M_1) P(M_1) / [P(x|M_2) P(M_2)],

where P(x|M_k) represents the Bayesian evidence attained from model M_k and P(M_k) represents our prior belief in a particular model. As an empirical example, we took Case II and performed VI-retrieval and NS-retrieval with different atmospheric assumptions: a flat-line model, an incomplete model (without methane), a complete model (as specified in Case II) and an overspecified model (Case II + TiO and VO). Table 6.2.1 compares the corresponding ELBO from VI-retrieval and Bayesian evidence from NS-retrieval. The ELBO retrieved for each model closely follows, but is always smaller than, the corresponding Bayesian evidence, as required by the definition of the ELBO. The Bayes factors displayed in Table 6.2.1 allow us to differentiate between models. Following Jeffreys' guideline scale (Jeffreys 1998; Hobson et al. 2002; Padilla et al. 2019), the Bayes factor strongly favours our complete model over the other competing models. For more discussion on using the Bayesian evidence as a model selection tool, please refer to the Appendix of Changeat et al. (2021).
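Schematically, the comparison step reduces to differencing (log-)evidence proxies under equal model priors. The sketch below uses invented numbers purely to illustrate the bookkeeping; it does not reproduce the paper's table:

```python
# Hypothetical (log-)evidence proxies for four competing models; the
# numbers are made up for illustration only.
log_evidence = {
    "flat_line": -120.4,
    "no_CH4": -95.1,
    "complete": -61.8,
    "overspecified": -62.9,
}

best = max(log_evidence, key=log_evidence.get)

# With equal model priors P(M_k), the log Bayes factor reduces to the
# difference of log evidences: ln B = ln P(x|M_best) - ln P(x|M_k)
ln_B = {k: log_evidence[best] - v for k, v in log_evidence.items()}

# ln B > ~5 is commonly quoted as "strong" evidence on Jeffreys' scale
assert best == "complete"
assert ln_B["flat_line"] > 5 and ln_B["no_CH4"] > 5
assert ln_B["overspecified"] < 5   # overspecified model only mildly disfavoured
```

In practice, the per-model ELBOs (VI-retrieval) or nested-sampling evidences (NS-retrieval) would take the place of the invented values.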

A word on differentiable frameworks
The ability to differentiate and provide gradients with respect to chosen quantities is central to modern deep learning algorithms. The growing popularity of machine learning in recent years has accelerated the development of differentiable frameworks; among them, Tensorflow, PyTorch and JAX have amassed substantial user bases.
There have been several recent attempts within the field of exoplanets to develop differentiable physical models within these frameworks. Initial results show that these differentiable models may hold the key to overcoming the curse of dimensionality brought on by our increasingly complex models (Kawahara et al. 2021), and may one day enable us to perform population studies without placing significant demands on computational resources.
However, these frameworks are not without issues. Here we would like to share our "user experience" for readers who are interested in implementing their own models.
1. Significant overhead: Current practice in differentiable programming mandates that any implementation be written entirely within a chosen framework. While it has become relatively straightforward to translate most operations from one framework to another, the process is neither trivial nor error-free. These frameworks come with their own programming restrictions and conventions, and users are expected to adhere to them or risk losing the ability to differentiate.
2. Differentiability is not guaranteed: Not all operations are naturally differentiable. A good example is the grey cloud model, which assumes the planet becomes completely opaque below a certain pressure level/altitude. This model, while easy to implement in any computational language, is not differentiable with respect to the cloud-top pressure, since the output changes discontinuously at the cut-off.
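A minimal sketch of the problem, together with one common workaround that replaces the hard cut-off with a sigmoid relaxation. The function names and the sharpness parameter k are our own illustrative choices, not part of Diff-τ:

```python
import math

def hard_cloud_opacity(p, p_cloud):
    """Grey cloud as a hard cut-off: fully opaque below the cloud-top
    pressure, fully transparent above. Piecewise constant in p_cloud,
    so its derivative w.r.t. p_cloud is zero almost everywhere."""
    return 1.0 if p > p_cloud else 0.0

def smooth_cloud_opacity(p, p_cloud, k=50.0):
    """Sigmoid relaxation of the same cut-off in log-pressure
    (illustrative smoothing; k sets the sharpness of the transition).
    Differentiable in p_cloud everywhere."""
    x = k * (math.log10(p) - math.log10(p_cloud))
    return 1.0 / (1.0 + math.exp(-x))

def finite_diff(f, x, h=1e-6):
    """Central finite difference, standing in for autodiff here."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

p = 0.12  # pressure level being probed (bar)
g_hard = finite_diff(lambda pc: hard_cloud_opacity(p, pc), 0.1)
g_smooth = finite_diff(lambda pc: smooth_cloud_opacity(p, pc), 0.1)
print(g_hard, g_smooth)  # 0.0 versus a finite, non-zero gradient
```

The hard cut-off has zero gradient with respect to the cloud-top pressure almost everywhere, so a gradient-based optimiser receives no signal from it; the smoothed version restores a usable gradient at the cost of a modelling approximation.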
3. Sub-optimal performance: Many modern differentiable languages are built and designed around deep-learning applications and are not optimised for complex physical models. In other words, a differentiable model may suffer reduced performance compared to its non-differentiable counterpart (Hu et al. 2019, 2020).
4. Rapidly evolving language: Differentiable languages are under constant development, with rapid release cycles in response to the latest research. Some of these developments may not be backward-compatible and may impact the long-term sustainability of the developed model.

CONCLUSION
In this paper we have introduced the differentiable forward model (Diff-τ) based on TauREx3 and implemented in TensorFlow. We further introduced a variational inference neural network built around a density-transforming (Normalising Flow) approach. We showed that it is possible to compute the approximate Bayesian posterior distribution without resorting to computationally more expensive sampling-based techniques. Through our examples we have demonstrated three advantages of our framework:

1. Speed: Our VI-retrieval requires only 10% or fewer forward model calls than NS-retrieval to converge.
2. Flexibility: Unlike many deep-learning-based frameworks, which rely on a large pre-computed library of spectra and somewhat opaque loss objectives, our proposed framework more closely resembles a traditional retrieval set-up by explicitly including the physical forward model in the deep learning architecture. Consequently, it retains many of the advantages of traditional retrieval codes, such as the freedom to choose the model assumptions, number of free parameters, spectral wavelength range and observational uncertainty, without having to produce a separate training data set each time.

3. Model selection: We demonstrated, for the first time, that we can compare the adequacy of our neural network model for a given observation by computing the Bayesian evidence and the Bayes factor.
Our proposed framework presents a major step towards the wider adoption of neural-network-powered atmospheric retrievals. With significantly higher spectral

From the above expression we can see that the KL divergence is determined by the interplay between the ELBO and the Bayesian evidence term. While the former depends on the quality of the surrogate distribution q(z), the latter is a constant, as it depends only on the data x. Combined with the fact that D_KL ≥ 0, the ELBO is inherently bounded by log p(x), i.e.

ELBO ≤ log p(x).    (A9)

Hence the term is known as the Evidence Lower BOund (ELBO). We can therefore use the ELBO as our objective function, since maximising the ELBO is equivalent to minimising the KL divergence. In our implementation, the objective function is set to minimise the negative ELBO.
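The relation between the evidence, the ELBO and the KL divergence is the standard variational-inference identity; written out in full, in the same notation as the text:

```latex
\log p(x)
  \;=\;
  \underbrace{\mathbb{E}_{q(z)}\!\left[\log p(x, z) - \log q(z)\right]}_{\mathrm{ELBO}(q)}
  \;+\;
  D_{\mathrm{KL}}\!\left(q(z)\,\|\,p(z \mid x)\right).
```

Since $D_{\mathrm{KL}} \geq 0$ and $\log p(x)$ is fixed by the data, maximising the ELBO over $q$ necessarily drives the KL divergence down, which is why the two optimisation views are equivalent.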

Figure 1 .
Figure 1. Overview of our investigation. The left panel shows the structure of Diff-τ: it takes in a set of physical parameters and outputs the corresponding theoretical spectrum. The right panel shows the schematic of our proposed VI-retrieval. We provide a simple noise distribution ξ, the observed data and our prior bounds as inputs to the optimisation loop (Train). The Normalising-Flow-based neural network (NF) is tasked with transforming ξ into a surrogate distribution q(z). We employ variational inference to approximate the typically intractable posterior distribution, where the network (NF) must learn the best transformation to optimise the ELBO (controlled by log(L) and D_KL). To calculate log(L), we use Diff-τ to transform samples from q(z) into spectra and compare them with the observed data. Once training is complete, the trained NF is able to provide a q*(z) that best approximates p(z|x).

Figure 2 .
Figure 2. Top panel: comparison of outputs from Diff-τ in native (pink) and binned (Ariel) resolution (orange), and TauREx3 binned to the same resolution (blue dots and dashed line). The lower panel shows the residual between the two binned spectra. The grey lines represent the mean (solid line) and the 1-σ standard deviation (dashed lines).

Figure 4 .
Figure 4. Posterior distribution of HD 209458 b in Ariel Tier 2 resolution obtained via NS-retrieval (red) and VI-retrieval (yellow). The top right corner shows an empirical comparison between the two best-fit spectra as obtained by the two methods. The shaded area shows the 1-σ uncertainty of the respective retrieval methods.

et al. 2022; Ardevol Martinez et al. 2022). The method presented here is an alternative approach to training a deep learning model. Instead of generating a large data set for model training, our network is specific (in other words, overfitted) to the observed data; similar to Markovian sampling algorithms, it is not generalisable and must be re-run for any change in the observed data and/or model assumptions. However, using an optimisation algorithm (i.e. VI) rather than a sampling one drastically reduces the number of forward model computations required, resulting in significant run-time savings when forward models are computationally expensive. Table 2 shows a comparison of the number of forward model calls between NS- and VI-retrieval on the same atmospheric model (as Case I) in two different spectral

Table 1 .
Fitted parameters in each scenario, along with their corresponding ground truth, prior bounds, scales and molecular line list references for each trace gas species. The checkmark (✓) indicates whether the parameter is included in Cases I and II. The ground truth is only available for Case II.

Table 2 .
Comparing the number of forward model calls required by each method for the same atmospheric models (as in Case I) at WFC3/G141 and Ariel spectral resolution. In both cases, the number of forward model calls made by VI-retrieval is significantly lower than by NS-retrieval.