On the Application of Bayesian Leave-one-out Cross-validation to Exoplanet Atmospheric Analysis

Over the last decade exoplanetary transmission spectra have yielded an unprecedented understanding about the physical and chemical nature of planets outside our solar system. Physical and chemical knowledge is mainly extracted via fitting competing models to spectroscopic data, based on some goodness-of-fit metric. However, current employed metrics shed little light on how exactly a given model is failing at the individual data point level and where it could be improved. As the quality of our data and complexity of our models increases, there is a need to better understand which observations are driving our model interpretations. Here we present the application of Bayesian leave-one-out cross-validation to assess the performance of exoplanet atmospheric models and compute the expected log pointwise predictive density (elpdLOO). elpdLOO estimates the out-of-sample predictive accuracy of an atmospheric model at data-point resolution, providing interpretable model criticism. We introduce and demonstrate this method on synthetic Hubble Space Telescope transmission spectra of a hot Jupiter. We apply elpdLOO to interpret current observations of HAT-P-41 b and assess the reliability of recent inferences of H− in its atmosphere. We find that previous detections of H− are dependent solely on a single data point. This new metric for exoplanetary retrievals complements and expands our repertoire of tools to better understand the limits of our models and data. elpdLOO provides the means to interrogate models at the single-data-point level, which will help in robustly interpreting the imminent wealth of spectroscopic information coming from JWST.


INTRODUCTION
Our understanding of exoplanet atmospheres has fundamentally changed in the last decade due to the plethora of space-and ground-based spectroscopic observations of transiting exoplanets from facilities such as the Hubble Space Telescope (HST, e.g., Deming et al. 2013;McCullough et al. 2014;Sing et al. 2016), the Spitzer Space Telescope (e.g., Désert et al. 2011;Pont et al. 2013;Stevenson et al. 2017), the Very Large Telescope (VLT, e.g., Nikolov et al. 2016;Sedaghati et al. 2017), among others (e.g., Chen et al. 2017; Rackham luis.welbanks@asu.edu* NHFP Sagan Fellow et al. 2017; Chen et al. 2018;Espinoza et al. 2019;von Essen et al. 2019).In particular, the transmission spectra of exoplanets, which encode the apparent change in planetary size as a function of wavelength, have provided us with important constraints on the chemical composition of exoplanet atmospheres (e.g., Charbonneau et al. 2002;Deming et al. 2013;Sing et al. 2016), information regarding the prevalence of clouds and hazes (e.g., Pont et al. 2008;Nikolov et al. 2016;Kreidberg et al. 2014a;Benneke et al. 2019), and other atmospheric properties at the day-night terminator of the planet (see e.g., Madhusudhan 2019, for a review).
The interpretation of these observations is routinely performed by means of model fitting and comparison techniques, i.e., atmospheric retrievals.Currently, atmospheric retrievals using Bayesian inference methods are widely employed to obtain parameter estimates from transit spectra (see e.g., Madhusudhan 2018, for a review).Through comparing competing atmospheric models, multiple chemical species have been identified and proposed as the cause of observed absorption features (e.g., Nikolov et al. 2016;Sedaghati et al. 2017;Chen et al. 2017).The retrieved constraints have enabled initial results on the abundances of these chemical species (e.g., Kreidberg et al. 2014b;Barstow et al. 2017;MacDonald & Madhusudhan 2017), as well as comparative studies searching for trends in the composition of exoplanet atmospheres (e.g., Madhusudhan et al. 2014;Pinhas et al. 2018;Fisher & Heng 2018;Welbanks et al. 2019).Fundamentally, the conclusions drawn from exoplanetary atmospheric retrievals always depend on a comparison between competing models.
However, the inferences and conclusions made using these techniques can be conflicting.For instance, while a chemical species may be inferred using a specific data set and modeling strategy (e.g., Sedaghati et al. 2017;Chen et al. 2017), observations of the same planet obtained with different facilities and interpreted with different models, may not lead to the same conclusions (e.g., Espinoza et al. 2019;Spake et al. 2021).Likewise, the resulting best-fit solutions can sometimes be incompatible with the observations according to some goodness-of-fit metrics (e.g., MacDonald & Madhusudhan 2019; Colón et al. 2020).Interpretations of the same observations using different models and model fitting techniques have also led to opposing conclusions (e.g., Sing et al. 2016;Barstow et al. 2017;Pinhas et al. 2019).Overall, these apparent disagreements highlight the need for a thorough understanding of the limits of our models and the sensitivity of our inferences on the spectroscopic observations used.
It is desirable to have a method to assess the robustness of model inferences and their sensitivity to individual observations.However, existing model assessment metrics such as the Bayesian evidence, χ 2 , p-value, or Bayesian Information Criterion (BIC), provide a single global summary value for the performance of the model conditioned on the entire dataset (e.g., Madhusudhan & Seager 2009;Benneke & Seager 2013;Colón et al. 2020;Alderson et al. 2022), making it difficult to shed light on which specific observations affect the model performance.Furthermore, these commonly used metrics can be hard to interpret as their assessment of the model performance may be at odds with each other (e.g., Colón et al. 2020;Lewis et al. 2020;Sheppard et al. 2021).Besides these metrics, some studies have used an information content based approach (i.e., quantifying the change in the state of knowledge by making a measurement, e.g., Line et al. 2012;Batalha & Line 2017) to estimate the wavelengths and the precisions required to better inform the parameters in atmospheric models of transmission spectra.However, these efforts have been limited to simple atmospheric models with a small number of parameters.Additionally, the information content based approach has yet to be successfully implemented as a tool for model comparison and model assessment in atmospheric retrievals of transmission spectra.
There is a need for complementary model performance metrics that can assess performance both at the dataset and individual data point level.These metrics must take into account the sources of uncertainty in our models, and be generally applicable to non-linear atmospheric models of increasing complexity.Crucially, for their successful application to atmospheric retrievals, these metrics must be computationally inexpensive.The advent of such tools is paramount upon the imminent data release from the James Webb Space Telescope (JWST) at wavelengths inaccessible by existing facilities and at unprecedented precisions (e.g., Greene et al. 2016).Such a model performance metric is also key to ensure reliable and robust inferences and conclusions from current and upcoming observations.
In this paper, we explore the application of the expected log pointwise predictive density estimated by Bayesian leave-one-out cross-validation (elpd LOO ) as an interpretable model selection, comparison, and criticism tool.While elpd LOO is starting to be applied in some areas of astronomy (e.g., Morris et al. 2021;Meier Valdés et al. 2022;McGill et al. 2022;Neil et al. 2022), has yet to be used in the context of atmospheric retrievals of transmission spectra.elpd LOO (e.g., Vehtari & Ojanen 2012) is a metric that quantifies the out of sample predictive accuracy of a model, e.g., how well does the model predict unseen data.This Bayesian metric can provide an indication of model performance at the data point or dataset level.Besides single-model assessment, elpd LOO is a powerful tool for model comparison as it can highlight the relative performance of two models at the data point resolution.Using the Pareto Smoothed Importance Sampling (PSIS; Vehtari et al. 2015) approximation for elpd LOO we demonstrate that this metric is computationally feasible for exoplanet atmospheric models.
We begin by summarizing the method of atmospheric retrievals as well as the advantages and disadvantages of the statistical metrics currently employed for model selection in Section 2. In Section 3 we introduce the concept of Bayesian cross validation and the algorithm to compute elpd LOO .We explain our atmospheric re-trieval setup in Section 4 and in Section 5 we demonstrate the power of elpd LOO in atmospheric retrievals of transmission spectra by testing elpd LOO on synthetic observations of a hot Jupiter.Then, we use elpd LOO to assess the robustness of recent inferences of the hydrogen anion H − in the atmosphere of the hot Jupiter HAT-P-41b in Section 6.We conclude by summarizing our findings and exploring future research directions in Section 7.

MODEL ASSESSMENT IN EXOPLANET ATMOSPHERIC RETRIEVALS
Atmospheric retrieval frameworks are powerful tools to infer the atmospheric properties of an exoplanet from its spectrum.At their core these tools couple an atmospheric model describing the physical state of the planetary atmosphere, a prior distribution over the possible values of the model parameters, and a parameter estimation scheme.Although a wide range of parameter estimation schemes have been used (see e.g., Madhusudhan 2018, for a review), nested sampling (e.g., Skilling 2006;Feroz et al. 2009;Buchner et al. 2014) prevails in the literature.

Frequentist Hypothesis Testing
Fit quality metrics like χ 2 and corresponding p-values are commonly used in retrieval studies to assess whether or not a model adequately explains the data (e.g., Colón et al. 2020;Lewis et al. 2020;Sheppard et al. 2021;Alderson et al. 2022).Studies generally compute the reduced χ 2 , χ 2 red = χ 2 M /ν where ν is the degrees of freedom, and values greater than one are considered "poor" fits, while values smaller than one are considered an overfit (Andrae et al. 2010).Similarly, the p-value is interpreted as the probability that random noise can result in the observed data given that the model being tested is true, with some studies considering p-values < 10 −3 indicative of "poor" fits (e.g., Colón et al. 2020).
However, these fit quality metrics have two main pitfalls within the context of atmospheric retrievals.First, the atmospheric models used in exoplanetary retrievals are not linear in their parameters.As a result, estimating the true number of degrees of freedom of a given model to compute χ 2 red is not possible.Second, current studies ignore the uncertainty in the estimate of χ 2 which in practice can be difficult to estimate.These limitations affect estimates of the p-value as this metric relies on the assumption that χ 2 statistics are appropriate for the model being tested.Furthermore, the pvalue penalizes complex models with a large number of parameters, likely resulting in a preference for simpler and possibly incomplete models.

Bayesian Hypothesis Testing
More recently, model comparison/selection using the Bayesian Evidence1 (e.g., Benneke & Seager 2013;Sedaghati et al. 2017;Welbanks & Madhusudhan 2021a) has become customary due to nested sampling techniques which permit its computation (e.g., Skilling 2006;Feroz et al. 2009;Buchner et al. 2014).This metric is computed by taking the ratio of evidences (also know as the Bayes' factor) between two models.Additionally, the difference in evidence between models has been mapped to a commonly used 'sigma' scale and introduced to the field of exoplanetary retrievals by Benneke & Seager (2013).Besides its straightforward computation, Bayesian model evidence has the appealing property of penalizing model complexity that is not supported by the data effectively applying Occam's razor to model selection.
Bayesian metrics allow the incorporation of prior information about a parameter when calculating the probability distribution of interest.Under pathological circumstances, this key feature could be misused by a practitioner.For instance, a different choice of prior, e.g., to one more or less informative, will result in a different Bayesian evidence due to the associated change in prior space, and may lead to a different interpretation regarding the model preference.This presents a challenge in the context of exoplanet atmosphere because commonly, the priors for the model parameters (e.g., chemical abundances, P-T profile parameters, etc.) have somewhat arbitrary boundaries and vary significantly different between studies.For instance, changing the priors on a given chemical species (e.g., H 2 O) from uniform in log-space (e.g., -1 to -12, see e.g., Welbanks & Madhusudhan 2019) to a prior motivated by the species' thermo-chemical equilibrium expectations (e.g., logarithmic from -2 to -4) would significantly reduce the prior space and could result in a very different model evidence and associated interpretation, without necessarily changing any posterior inferences.Therefore, improper use of these Bayesian metrics through e.g., inappropriate prior choices, can propagate through to the Bayes factor, and potentially affect the reported detection significance.

LEAVE-ONE-OUT CROSS-VALIDATION
The model assessment metrics described in Section 2 provide a single value shedding little light on exactly where and how each model is failing or performing well.From this single number, it is non-trivial to estimate whether the models would perform better or worse at other wavelengths, or to estimate which parts of the data are driving our inferences.To overcome these challenges, we propose the application of Bayesian Leave-one-out cross-validation to compute the expected log pointwise predictive density (elpd LOO ) to aid interpretation of atmospheric retrievals.In the rest of this section we closely follow Vehtari et al. (2017), and outline the theory behind elpd LOO .
Consider an atmospheric model M, which is being fit to a spectroscopic data set D.
is the set of N spectroscopic data points where, D i = {∆ i , σ ∆i } comprised of a transit depth (∆ i ), and transit depth error (σ ∆i ) pair, at a wavelength λ i .Then, using Bayes' theorem the posterior distribution for each of the model parameters ( ⃗ θ) for a given model (M) is . (1) Here, p(D| ⃗ θ, M) is the likelihood and represents the probability of observing the data (D) given a specific set of model parameters ( ⃗ θ).The likelihood is multiplied by the prior distribution p( ⃗ θ|M) and divided by the Bayesian model evidence p(D|M).A retrieval uses the likelihood and prior to compute estimates for the Bayesian model evidence and posterior distributions for the model's parameters.
To complement existing model selection and criticism, we introduce the leave-one-out expected log posterior predictive density elpd LOO obtained by Bayesian crossvalidation.Bayesian leave-one-out cross-validation is one way to estimate the out of sample predictive accuracy of a model (i.e., the expected log pointwise predictive density, see e.g., Gelman et al. 2014, for a review).Bayesian leave-one-out cross-validation works by fitting a given model M to D −i , the full data set (D) with the ith data point, D i , left out.The posterior distribution of the fit it used to assess how well the left out data point is predicted.This procedure is then repeated for each data point in turn.Ultimately, this allows each data point to be scored on how well it is predicted by a given model conditioned on the rest of the dataset.These data point scores can then be compared pairwise between competing models or totaled over the data set to give an indication of overall model performance.The elpd LOO for D i is the expected log posterior predictive density, elpd LOO,i,M = log p(D i |D −i , M). (2) The posterior predictive density can be calculated by taking the expectation of the likelihood of the D i with respect to the posterior of the model fitted to The elpd LOO score quantifies how well the model fit to the data (i.e., trained model under the chosen prior), performs on unseen data where each data point is left out in turn.On the other hand, the Bayesian evidence quantifies the probability of the data agreeing with the specified model under the chosen prior.Therefore, they are fundamentally different quantities that when used together can enrich our understanding of our models and their ability to explain the data.
In practice, if we have S samples from the posterior distribution obtained by fitting the model to ), we can estimate the posterior predicted density of D i with, All data point scores can be totaled over the full data set to give a indication of overall model performance, Two competing model's (M 1 , M 2 ) overall performance can be computed by taking the difference in elpd score with, ∆elpd LOO,M1,M2 = elpd LOO,M1 − elpd LOO,M2 .(6) Here, a positive ∆elpd LOO,M1,M2 would mean that M 1 has the better out of sample predictive performance.
In principle, calculating the exact elpd LOO score would require fitting the model to infinite data.Therefore, the assumption here is that the N data points used to calculate the elpd LOO score come from a larger population making it possible to calculate their standard error (SE).The SE of the estimated elpd LOO score is while the SE for the difference between two scores is given by, Here, V is the variance operator.
Naive computation of all elpd LOO,i,M terms for a given model would be computationally expensive.This is because each term requires a full Bayesian refit of the model to obtain posterior samples.Overall, N full Bayesian fits would be required.In the case of exoplanet atmospheric modeling, N ≈ 100 − 1000 and one Bayesian refit, typically performed with nested sampling, can take ≳ 27 Central Processing Unit (CPU) hours (e.g., Zhang et al. 2020).This required computation is clearly prohibitive, therefore, we have to turn to the Pareto Smoothed Importance Sampling (PSIS; Vehtari et al. 2015) approximation outlined in (Vehtari et al. 2017) to calculate elpd LOO,i,M2 .We defer a detailed explanation of the PSIS approximation to Appendix A.1, but briefly outline the method here following Vehtari et al. (2015) and Vehtari et al. (2017).PSIS allows us to approximately compute all elpd LOO,i,M for a model without having to refit the model N times.When using the PSIS approximation, the model is fit to the whole data once (e.g., one retrieval) and the posterior samples from this single inference are re-weighted using importance sampling to approximate the effect of leaving each data point out in turn.Instead of raw importance weights being using in the approximation, the distribution of importance weights is fitted with a Pareto distribution (Zhang & Stephens 2009) and a set of smoothed importance weights is drawn from this distribution and used.This is called PSIS and it serves two purposes.Firstly, PSIS acts to stabilize the importance sampling approximation by removing extreme weights that can dominate and make the importance sampling approximation unreliable (Vehtari et al. 2015).Secondly, the fitted shape parameter of the Pareto distribution, k, traces the number of finite moments of the importance weights distribution and acts as diagnostics on the reliability of the importance sampling approximation.For LOO, k identifies cases when the posterior distribution changes significantly upon removing a data point and, therefore, the importance sampling approximation would be unreliable and should not be used.Vehtari et al. (2015) determined empirically that PSIS estimates were reliable for k ⪅ 0.7.In the case of LOO, the PSIS approximation is used for each data point where k < 0.7.For data points where k ≥ 0.7, the PSIS approximation is likely to be unreliable, so a full refit of the model is performed and used to calculate elpd LOO,i,M .Overall, this allows all elpd LOO,i,M terms to be computed with less than N refits of the model.For brevity, in what follows elpd i,M ≡ elpd LOO,i,M .

Limitations and other considerations
Bayesian leave-one-out cross-validation presents an additional tool for model criticism and comparison within the context of atmospheric retrievals.This tool allows us to compare two models on a per-data-point scale, but as with any other model comparison metric, it cannot be used in isolation to argue definitively for or against a detection of a spectral feature in an exoplanet atmosphere.The reliability of an inference can only be robustly established by considering multiple model comparison and criticism tools and metrics.
For instance, elpd LOO,M does not penalize for model complexity.Therefore, when trying to assess two models with varying degrees of complexity, Bayesian leaveone-out cross-validation could be used alongside metrics that have a built in Occam's razor such as the Bayesian evidence.Additionally, as is the case for existing analysis using the Bayesian evidence, when the set of possible models is large, model comparison with elpd LOO,M may result in a combinatorial explosion of possible pairwise combinations.
Additionally, the difference in elpd LOO,M score (Equation 6) between different pairs of models can be difficult to interpret and has not been commonly mapped to a 'sigma' scale as other model comparision metrics have.Moreover, the standard error for the difference between two scores (Equation 8) comes from the consideration that we are estimating the LOO predictive accuracy of a model using a finite sample of N data points (see e.g., Vehtari et al. 2017).As such, the standard error estimate may be inaccurate if the number of observations is limited (see e.g., Sivula et al. 2020, for a discussion on the uncertainty in elpd estimates).Therefore, we caution against using the difference of elpd LOO,M scores in units of standard errors as a proxy for sigma significance, but it can be useful to visualize the relative performance of different models.We further explore this and other future considerations in Section 7.2.

ATMOSPHERIC RETRIEVAL SETUP
First, we demonstrate the use of elpd on synthetic HST STIS and HST WFC3 observations, the common observational setup in the pre-JWST era, to obtain intuition for the information that this metric can provide as well as demonstrate its conformance with prior expectations.Then, we apply this tool to the interpretation of current observations of the transiting hot Jupiter HAT-P-41b (Hartman et al. 2012) obtained with HST WFC3 UVIS G280, HST WFC3 G141, and Spitzer IRAC from Wakeford et al. (2020) and Lewis et al. (2020) spanning the wavelength range 0.2 to 5.0 µm.Particularly, we assess the previous detection of H − absorption in the planet's atmosphere -a result in significant discrepancy with thermo-chemical expectations.

Synthetic Observations
We generate synthetic HST-STIS and HST-WFC3 synthetic observations assuming spectral resolutions and precisions comparable to existing observations (e.g., Sing et al. 2016), generally following the procedure described in Welbanks & Madhusudhan (2021b).First, we generate spectra at a higher resolution of R ∼ 13, 000 on average, from 0.3-2.0µm sampled line-by-line, and solving radiative transfer in transmission geometry.The resulting spectra is used to produce a binned spectrum following the methods outlined in Pinhas et al. (2018).The resulting binned spectra has a resolution and precision of R HST−STIS = 20 and 100 ppm, respectively, for HST-STIS, and R HST−WFC3 = 60 and 50 ppm for HST-WFC3.The synthetic observations include Gaussian scatter.
The synthetic observations are generated for a planet with the same system parameters as HD 209458b (e.g., R p = 1.359RJ , M p = 0.685M J , and R star = 1.155R ⊙ Torres et al. 2008).The model generating the synthetic data assumes a clear, H 2 -rich, isothermal atmosphere at the equilibrium temperature of the planet T eq = 1450 K, with solar abundances of H 2 O, Na, and K (i.e., log(X H2O ) = −3.3,log(X Na ) = −5.76,log(X K ) = −6.97,e.g., Asplund et al. 2009).We assume a reference pressure for the planetary radius of 10 bar.
First, to test the implementation of elpd, we use a similar model to the one that generated the synthetic observations.Namely, the atmospheric model considers absorption due to H 2 O, Na, and K, with the volume mixing ratio of each species as a free parameter.The Pressure-Temperature (P-T) profile is assumed to be isothermal, with 1 free parameter for the atmospheric temperature and 1 free parameter for the reference pressure (P ref. ) at the assumed planet radius of R p = 1.359RJ .Instead of assuming a clear atmosphere, the model considers the presence of inhomogeneous clouds and hazes following the generalized parameterization in Welbanks & Madhusudhan (2021a) using one fraction for the combined effects of clouds and hazes.In total, this model has 9 free parameters: 3 chemical species, 1 isothermal temperature, 1 reference pressure, and 4 parameters for inhomogeneous clouds and hazes.
Second, we implement the same atmospheric model used by Lewis et al. (2020) to interpret the HAT-P-41b observations and used to infer a ≳ 2.6σ detection of H − (i.e., their "minimal" model setup).This atmospheric model has an isothermal, clear atmosphere with absorption due to H 2 O, Na, CrH, AlO, VO, and H − .Following Lewis et al. (2020), we retrieve the planetary radius for a reference pressure at 10 bar.In total, this model has 8 parameters: 6 chemical species, 1 for the isothermal temperature of the atmosphere, and 1 for R p,ref. .We follow the retrieval setup from Lewis et al. (2020) to enable a direct comparison with their results suggesting the presence of significant H − opacity in the atmosphere of HAT-P-41b.As commonly performed in the exoplanetary atmospheric retrieval literature, we compare a 'reference model' with the highest degree of complexity, to a series of less complex models for which one or more parameters, or model components, have been removed.Here, our 'reference model' is described in Section 4.2 and is similar to the model that generated the synthetic observations.Namely, our reference model considers absorption due to H 2 O, Na, K, an isothermal P-T profile, and the possibility of inhomogeneous clouds and hazes.Then, we compare this reference model to four other models: 1) same as reference model but without H 2 O absorption, 2) same as reference model but without Na absorption, 3) same as reference model but without K absorption, and 4) same as reference model but without the presence of inhomogenous clouds and hazes.
Performing the model comparison using the difference of the Bayesian evidence and converting it to a 'detection significance' as commonly performed in the literature (e.Despite the three chemical components being present in the true model generating the data, their associated 'detection significances' are substantially different.This may be due to the fact that a large number of data points cover the broad H 2 O absorption features while the main features due to Na and K are only encompassed by a few spectral bins.However, this information is not provided by the model comparison metric by itself.Additionally, the interpretation of a 'detection significance' higher than 10σ (i.e., a probability of 1-7×10 −24 , e.g., MacDonald & Madhusudhan 2019) is difficult as it represents the relative preference between two models which may be equally incorrect.Finally, the difference in the Bayesian evidence does not provide any information at the data-point (or spectral feature) resolution about the strengths or deficiencies of the model.
Detailed information about the model performance at the per-data-point level can be teased out using the elpd metric.Figure 1 shows the retrieved median model and the confidence intervals of the retrieval using the reference model on the synthetic observations.The synthetic observations have been color coded by their elpd score.The darker blue spectral points in Figure 1 have a lower elpd score than the bright yellow points, and are comparatively worse predicted by the atmospheric model, when left out of the fit.For instance, the elpd score suggests that the spectral point at 1.27µm, which has the lowest elpd score of 6.7, is the worst explained by the model as shown by the data fit, where it is far away from the model.The primary utility of elpd lies in model comparison.As such, Figure 2 shows the per-data-point difference in elpd score (∆elpd score) between the reference model and the less complex models used3 .Additionally, Figure 2 shows the retrieved confidence interval for each of the models compared.The top left panel, showing the comparison of the models including and not including H 2 O absorption, has the highest difference in elpd scores.The magnitude of the difference is then followed by the models comparing the presence of Na, then K, and finally the presence of inhomogeneous clouds and hazes.
In agreement with intuition, the top left panel of Figure 2 shows that the spectral signatures of H 2 O in the near-infrared are responsible for the better performance of our model in the HST-WFC3 region (1.1 − 1.7µm) of our synthetic observations.Similarly, the top right and bottom left panels of Figure 2 show that the models with Na and K absorption improve the model performance near 0.59µm and 0.77µm, respectively.Here, and unlike the H 2 O case for which the largest difference in elpd scores closely follow the influence of the broadband molecular features, the largest difference in elpd score follows the points near the sharp Na and K doublets.Finally, the presence of inhomogeneous clouds and hazes do not improve the model performance significantly and have the smallest difference in elpd scores per data point of all models considered.
The difference in elpd scores as shown in Figure 2 allows us to diagnose the performance of the models on a  6) over the entire dataset divided by their standard error (SE, Equation 8) see Table 1.The ∆elpd scores are computed by the difference in elpd score between the reference model and models without H2O (blue), Na (orange), K (green), and inhomogeneous clouds and hazes (red).Positive values indicate an increase in the predictive performance of the reference model due to the addition of the specific model component.On the other hand, negative values indicate a decreased predictive performance of the model due to the inclusion of the specific model component.
per-data-point level and understand the regions of the data where our model may be under-performing relative to other models.A smaller difference in elpd scores indicates a similar performance between the models considered.Additionally, this metric allows us to infer which subset of the data are responsible for the preference of one model over the other, e.g., infrared HST-WFC3 observations are responsible for the preference of models with H 2 O absorption.
Finally, we can compare the difference in elpd scores between two or more pairs of models.Figure 3 shows the total difference in elpd scores between the reference model and the less complex models in units of their standard error (i.e., Equations 6 normalized by Equation 8). Figure 3 shows that the model component responsible for the largest increase in the predictive performance of our reference model is H 2 O absorption.Relative to no improvement in model performance (i.e., a elpd difference of zero), the inclusion of H 2 O absorption improves the out of sample predictive performance of the model by ∼ 5 standard errors.This improvement is followed by Na (∼ 2 standard errors) and K (∼ 2 standard errors).On the other hand, the inclusion of inhomogeneous clouds and hazes decreased the predictive performance of the model at ∼ 1.5 standard errors.
Table 1 summarizes these results alongside other common model comparison metrics for our analysis of these synthetic observations.Despite the similarities between the Log Evidence and the elpd score, or the ∆elpd score divided by its standard error and the 'detection significance' in σ-units, we caution against interpreting them as interchangeable quantities as explained in Section 3. Importantly, we compute the standard error (i.e., the uncertainty) of the reported ∆elpd score, unlike the detection significance which is rarely quantified by its numerical uncertainty4 (see Section 3).
The information provided by the elpd metric for this atmospheric model for which we know the true input agrees with our expectations.Using this metric enables us to identify deficiencies in the models as a function of wavelength, e.g., the lack of H 2 O absorption fails to explain the observations in the near-infrared.Additionally, it provides us a way to quantify the improvement in our models by quantifying the out of sample predictive performance of each model and comparing them.This demonstration on a well understood atmospheric scenario showcases the power of this metric to provide us information at the per-data-point level and highlight the strengths and deficiencies of our models.Besides its application to well understood atmospheric scenarios, elpd can help us diagnose the impact of less understood model considerations such as planet inhomogeneities (e.g., clouds, hazes, and multi-dimensional effects), chemical detections of species with unclear spectroscopic features (e.g., metal hydrides in the optical, or H-opacity), and the effects of instrument systematics (e.g., correlated and non-Gaussian noise).

THE CASE OF H − IN HAT-P-41B
Having built intuition for the elpd metric, we proceed to implement it to the transmission spectrum of the hot Jupiter HAT-P-41b spanning the UV to infrared wavelengths from 0.2-5.0µm.HAT-P-41b is the first hot Jupiter for which significant H − absorption has been invoked as the most robust explanation for its transmission spectrum by Lewis et al. (2020).With a retrieved H − abundance several orders of magnitude larger than thermochemical equilibrium expectations (e.g., Kitzmann et al. 2018;Parmentier et al. 2018), understanding the robustness and reliability of this inference is paramount.By implementing the elpd metric to the Note-DS means Detection Significance.The DS is defined for a pair of models with a ratio of evidences (e.g., Bayes factor) ≥ 1 (e.g., Trotta 2008).N/A means that the DS is not defined since the reference model has a smaller evidence than the model being tested, e.g., B<1.The clear atmosphere model is preferred over the reference model at 2σ.The p-value for the model without H2O is too small and tends to zero.
same data and atmospheric setup used to infer the presence of H − , we seek to identify which data points drive this possible detection.We perform a retrieval following the "minimal" model in Lewis et al. (2020), as explained in Section 4.2, as this is the setup for which H − was inferred at a ≳ 2.6σ level.We perform the retrieval on the combination of HST WFC3 G141, Spitzer IRAC observations and the HST UVIS G280 data from the jitter decorrelation or systematic marginalization treatments in Wakeford et al. (2020).We compare our retrieved H − abundances and isothermal temperatures to those from Lewis et al. (2020) for the same data and model configurations.Furthermore, we compare our derived χ 2 ν,min and difference in Bayesian evidence, translated to a 'detection significance' using the formalism in Benneke & Seager (2013) as explained in Welbanks & Madhusudhan (2021a), to those reported in Lewis et al. (2020).
When using the UVIS jitter decorrelation data, our retrieval does not find evidence of H − absorption (< 2σ).Our derived 'detection significance' of 1.6σ is lower by more than 1σ to the reported value of 2.9σ by Lewis et al. (2020).Nonetheless, the retrieved log(H − ) = −9.04+0.66  −1.17 and isothermal temperature of T= 1008 +197 −120 K are consistent with the derived properties by Lewis et al. (2020).Additionally, our χ 2 ν,min = 1.55 is very close to χ 2 ν,min = 1.50 in Lewis et al. (2020).The agreement in retrieved properties and χ 2 ν,min alongside a disagreement in associated 'detection significance' may be the result of dissimilar priors and subtle model assumptions between our work and those employed in Lewis et al. (2020).
On the other hand, when considering the systematic marginalization HST UVIS G280 observations our retrieval finds a 2.7σ model preference for H − absorption, comparable to the 2.6σ for the same model and data from Lewis et al. (2020).This configuration retrieves log(H − ) = −9.32+0.52  −0.51 for an isothermal temperature of T= 984 +205 −163 K consistent with Lewis et al. (2020).Likewise, the derived χ 2 ν,min = 1.83 is comparable to the value of χ 2 ν,min = 1.72 from Lewis et al. (2020).This configuration reproduces the weak-to-moderate evidence of H − in HAT-P-41b.
With the reproduction of the weak-to-moderate detection of H − in HAT-P-41b in mind, we proceed to calculate the approximated PSIS elpd values, as explained in Section 3, to quantify out of sample predictive performance for the reference model relative to the model without H − absorption.Figure 4 shows the retrieved transmission spectrum and confidence intervals for this retrieval, with the data points color coded by their ∆elpd score (Equation 6) between the reference model including H − absorption and the model without H − absorption.Points with a large positive value (redder points) are better explained by the reference model with H − (i.e., the reference model has a better out of sample predictive performance), while the large negative values (bluer points) are better explained by the model without H − absorption (i.e., the reference model has a worse out of sample predictive performance).
The data point with the largest ∆elpd = 1.69 in Figure 4 is the Spitzer IRAC point at 3.6 µm, while the point with the lowest ∆elpd = −1.56 is at ∼ 1.52 µm from HST WFC3 G141.Of the four points with the largest ∆elpd, two are from the HST WFC3 G141 observations (i.e., points at ∼ 1.14 µm and ∼ 1.56 µm) covering wavelengths at which the spectral contribution of H − may be expected (see e.g., Figure 7 in Lewis et al. 2020), while two fall outside of this spectral range (i.e., points at 3.6 µm and 0.385 µm).
The Spitzer IRAC point at 3.6 µm, highlighted by an arrow in Figure 4, is the point with highest ∆elpd score.To assess the robustness of the H − detection in HAT-P-41b, we perform an additional retrieval with the same "minimal" model as before but without the data point with the highest ∆elpd.We find that without the Spitzer IRAC point at 3.6 µm the reference model with H − absorption is not preferred (< 2σ) over the model without H − .This means that the 2.7σ 'detection significance' previously quoted depends on the transit depth at 3.6 µm being reliable.Unsurprisingly, if both Spitzer photometric points are removed from the retrieval (e.g., only HST UVIS G280 and HST WFC3 G141 observations are considered) we do not find a preference for H − absorption.
When the Spitzer IRAC point at 3.6 µm is not considered in the retrieval, the retrieved abundance of H − is log(H − ) = −9.80+0.57−0.83 and the retrieved isothermal temperature is T= 811.69 +278.34  −167.40K. Additionally, the derived χ 2 ν,min = 1.79 is comparable to that of Lewis et al. (2020).Overall, while the inferred properties of HAT-P-41b remain largely consistent, removing this single transit depth results in the non-detection of H − absorption in the atmosphere of HAT-P-41b using current observations.
Besides the presence of H − , absorption due to H 2 O has been previously inferred or detected in HAT-P-41b (e.g., Tsiaras et al. 2018;Fisher & Heng 2018;Lewis et al. 2020;Sheppard et al. 2021).Specifically, Lewis et al. (2020) find a ≳ 5σ detection of H 2 O for their HST WFC3/UVIS, WFC3 G141, and Spitzer IRAC observations, considered in this work, although the model configuration associated with this detection is unclear.For the same set of observations including the systematic marginalization HST UVIS G280, and the "minimal" model considered above, our retrieval finds a model preference for H 2 O absorption at 4.7σ, similar to that reported by Lewis et al. (2020).
Figure B2 in the Appendix shows the ∆elpd for the reference model including H 2 O absorption and the model without H 2 O.Following our intuition, the points with the largest positive ∆elpd scores, i.e., those better explained by the presence of H 2 O absorption, fall within the wavelength range of the HST WFC3 G141 observations (∼ 1.1-1.7µm)where H 2 O has a prominent absorption feature.The point at ∼ 1.38µm has the largest ∆elpd = 4.73 score, almost 3× the highest ∆elpd score for the H − model comparison.
We sum all ∆elpd scores over the full data set and calculate the standard error of this difference (Equation 8) to quantify the overall performance of each of the models considered.Figure 5 shows the overall ∆elpd score for each of the models considered divided by their calculated standard error.As before, a larger and positive ∆elpd means that adding the model component, e.g., H 2 O or H − absorption, increased the predictive performance of the model.We find that including H 2 O absorption increases the predictive performance of the model by 1.7 standard errors, while including H − absorption increases the predictive performance by 1.0 standard error.This ∆elpd for H − suggests that H − absorption does not significantly improve the predictive performance of the model.
Overall, implementing elpd as a model performance metric on retrievals of transmission spectrum of HAT-P-41b provides complementary per-data-point information, enabling further scrutiny on the possible detection of H − in the atmosphere of the planet.Particularly, elpd indicates that the weak/moderate (≳ 2.6σ) detection of H − is contingent on the transit depth at a single data point (i.e., 3.6 µm).Additionally, the inclusion of H − on the atmospheric models does not significantly improve the predictive performance of the model.Complementing retrieval studies with elpd analysis, as performed here for the UV to IR spectrum of HAT-P-41b, can provide key information to investigate the robustness of our findings, understand the limits of the data, and contextualize our inferences by the observations.

SUMMARY AND DISCUSSION
Understanding the limitations of our data and our models has become a key need for the robust and re-liable interpretation of spectroscopic observations of exoplanet atmospheres.Particularly, as the resolution and precision of our observations improves in the upcoming decade, and atmospheric models include previously neglected physical effects, there is the need for diagnostic metrics that can provide insights into strength and weakness of a model.To that effect we have incorporated elpd to exoplanet atmospheric retrievals.
Unlike conventional model assessment metrics employed in atmospheric retrievals such as the Bayesian evidence or χ 2 , elpd provides an assessment of the model performance at the per-data-point level as well as for the model over the entire data set.At its core, elpd provides an estimate for the ability of a model to predict (i.e., explain) each left out data point.By understanding which data points are better explained by each of the models considered, we can quantify the dependency of our inferences on the observations obtained.In other words, elpd provides the tools to contextualize the atmospheric inferences and detections as a function of the observations.We briefly summarize the main advantages of this tool: • elpd allows each data point to be scored on how well it is predicted by a given model when it is left out of the model fit.This gives us information about the model performance at the per-datapoint level.
• The elpd score can also be totaled over the entire dataset, giving an estimate of the overall out of sample predictive performance of the model.
• The ∆elpd score between different models is an additional metric to assess the need for using more complex models over simpler models with individual components removed.
• elpd helps assess the sensitivity of resulting inferences on specific data points in the dataset.
As with any other model comparison metric, elpd has drawbacks and areas that need to be further explored within the context of atmospheric retrievals.These include: • elpd as a metric does not penalize model complexity.Therefore, when assessing models of increasing complexity, elpd should be used alongside metrics like the Bayesian evidence which do penalize model complexity.
• elpd score comparison may result in a large number of pairs of models being examined.This could lead, in principle, to a combinatorial explosion of possible pairwise model comparisons.This problem is similar to those of other metrics such as the Bayesian evidence.
• elpd should be used alongside existing model assessment and comparison metrics.Together these metrics can provide a holistic picture of the reliability of the data-model interpretation.
7.1.Is there H − in the Atmosphere of HAT-P-41b?
As we approach the era of JWST observations, covering wavelengths never probed before with unprecedented precision, it is paramount to determine what constitutes a robust and reliable detection.To date, there is still debate as of what constitutes a clear atmospheric signature, especially for chemical species or atmospheric processes with uncertain spectroscopic signatures.For example, a source of continuum opacity such as H − does not provide strong spectral features as other prevalent species in exoplanet atmospheres such as H 2 O, Na, or K. Therefore, previous studies have resorted to calculating the difference in Bayesian evidence for models with and without a model component to quantify its presence, a common practice in the field.
Indeed, as explained in Section 6, Lewis et al. ( 2020) report the tantalizing need for significant H − absorption in the atmosphere of HAT-P-41b to best explain its transmission spectrum.If confirmed, this would suggest strong chemical disequilibrium in the atmosphere of the planet and would imply the need for new atmospheric models capable of capturing more complex physical effects such as photochemical processes.On the other hand, if this suggestion is invalidated this would highlight the challenges in understanding the limitations of our models and our data.Below we contextualize the additional information obtained by using elpd on the transmission spectrum of HAT-P-41b.
Our retrieval analysis in Section 6 can reproduce the weak-to-moderate detection (∼ 2.7σ) of H − in HAT-P-41b using the same "minimal" model and the HST WFC3/UVIS G280 with systematic marginalization, HST WFC3 G141, and Spitzer IRAC observations as Lewis et al. (2020), by means of Bayesian evidence comparison.However, when using the jitter decorrelation HST WFC3/UVIS G280 and the same "minimal" model we obtain a non-significant (1.6σ) preference for H − absorption.Our analysis suggests that the presence of H − is only supported by the combination using the systematic marginalization data.
Our elpd analysis on the retrieval configuration that supports the results from Lewis et al. (2020) provides two main insights.First, the best explained data point by the model including H − absorption is at a wavelength at which signatures of H − contribution are not strong.If the data point with the highest ∆elpd score is removed from the retrieval analysis, the presence of H − absorption is no longer preferred by our models.Second, the increase in predictive performance of the models with H − , i.e., how much better does the model with H − explain the observations relative to the model without H − , is not significant (≲ 1 SE).In comparison the presence of H 2 O absorption, the other species inferred/detected by several studies (e.g., Tsiaras et al. 2018;Fisher & Heng 2018;Lewis et al. 2020;Sheppard et al. 2021), is detected at ∼ 5σ, better explains the observations at wavelengths where H 2 O absorption is expected, and increases the predictive performance of the models more significantly (∼ 2 SE).
The sensitivity of the H − detection to the Spitzer IRAC point at 3.6 µm, may be the result of changes to the continuum level of the spectrum when H − absorption is included or not in the model.When included, the bound-free H − absorption provides a continuum in the optical and infared wavelengths (i.e.,≲ 2µm) that can help explain the HST UVIS and HST WFC3 G141 observations, but due to its decline in absorption at longer wavelengths allows for a lower transit depth in the 3.6 µm datapoint.This opacity drop in the mid-infrared is not present when species other than H − , in combination with other model components (e.g., retrieved radius at 10 bar), are used to explain the HST UVIS and HST WFC3 G141 observations, resulting in a slightly higher transit depth at 3.6 µm.The later is less compatible with the Spitzer IRAC 3.6 µm observation.This level of model criticism is achievable for the first time thanks to the implementation of Bayesian leave-one-out cross validation.
Therefore, our analysis suggests that H − is weakly detected in the atmosphere of HAT-P-41b if systematic marginalization treatment of the HST WFC3/UVIS G280 is correct, and the transit depth of the Spitzer IRAC photometric point at 3.6 µm is precise and accurate.Existing and future observations at these and complementary wavelengths can help verify this inference.For example, the retrieval analysis of Sheppard et al. (2021) using HST STIS observations (0.3-1.0 µm) at similar wavelengths does not require significant H − absorption to explain the transmission spectrum of HAT-P-41b.Future studies could perform a consistent reanalysis of these HST STIS, HST WFC3/UVIS, HST WFC3 G141, and Spitzer IRAC observations to better assess the need for H − absorption in HAT-P-41b.Finally, future JWST observations with NIRSPEC or NIRCAM could provide important insights in the near-IR where the interaction of electrons and neutral H, not considered by Lewis et al. (2020) or this work, can have significant contributions to the observed spectrum.

Future Considerations
For the work in the paper, including the PSIS elpd approximation, we have operated under the assumption of additive white Gaussian noise for all models.Recent work by Ih & Kempton (2021) suggests that the presence of more complex and correlated noise of instrument or stellar origin could impact atmospheric parameter estimation.The methods in the paper can be generalized to general correlated Gaussian noise models (Sundararajan & Keerthi 2001;Bürkner et al. 2020).A future direction of research would be to investigate correlated Gaussian noise models with PSIS-elpd using the Gaussian Process (Rasmussen & Williams 2005;Ambikasaran et al. 2015;Foreman-Mackey et al. 2017;Foreman-Mackey 2018) noise model implemented in Aurora (Welbanks & Madhusudhan 2021a).
Further cross-validation schemes specific to exoplanet atmospheric data sets could also be developed in the future (see e.g., Bürkner et al. 2020, for an example of a cross-validation scheme appropriate for time-series data).In these schemes, instead of leaving out one data point all the data points coming from a single instrument could be left out.This particular cross-validation scheme could be used to diagnose systematic instrument effects.Moreover, a cross-validation scheme leaving out several data points clustered around a spectral feature could be explored to differentiate between localized signatures of specific chemical species, and non-localized effects like clouds and hazes.
Bayesian leave-one-out cross validation can help us determine the observational requirements (e.g., resolution, wavelength coverage, signal-to-noise) required for upcoming observational campaigns (e.g., upcoming HST and JWST GO Cycles) and the next generation of large mission concept studies and probe missions (e.g., National Academies of Sciences & Medicine 2021).Particularly, computation of the elpd score using simulated observations can help us understand which observations will be decisive in detecting different atmospheric processes.Furthermore, PSIS (see Section 3 and Appendix A.1) can help us understand the sensitivity of the derived posterior distributions (e.g., the derived atmospheric abundances) to specific observations.
Finally, although we have demonstrated elpd on lowresolution transmission spectra, the methods introduced here are readily applicable to emission spectra and highresolution studies.For instance, future studies may investigate the robustness of high-resolution inferences to individual lines or parts of the spectrum.Furthermore, future studies can use elpd to assess the need for additional observations in order to better constrain or confirm previous atmospheric inferences.

Concluding Remarks
We are about to embark into journey of discovery.The imminent high-precision infrared data from JWST (Greene et al. 2016), is bound to revolutionize our understanding of exoplanet atmospheres.These long-awaited observations may help us infer previously unseen chemical species, directly ascertain the 3D properties of exoplanets, and deduce disequilibrium processes that will build towards our search for life in other worlds.These radical discoveries can only be trusted if we acquire the ability to reliably extract the atmospheric properties from an observed spectrum with a thorough understanding of the limits of the data and models.In this context, elpd opens our eyes to a new dimension where we can clearly identify the specific data points upon which our inferences hinge.This new tool helps us revisit our data analysis and check for its robustness, identify the limitations of our models, and plan for the observations required to confirm our discoveries.The diverse and complementary set of tools to assess the robustness and reliability of our discoveries will help us navigate this exploration into the unknown.portance Sampling (PSIS; Vehtari et al. 2015) approximation outlined in (Vehtari et al. 2017).The following sections closely follow Vehtari et al. (2015) and Vehtari et al. (2017).
Generally, the PSIS approximation works by instead of refitting the model N times, the model is fit to the whole data set and then the posterior samples are reweighted via importance sampling to estimate the effect of leaving each data point out.The model is only refit when the PSIS approximation is deemed poor.The procedure allows elpd i,M to be estimated with significantly less refits of the model and makes computing elpd feasible.
In general, importance sampling is a Monte Carlo method that allows the expectation of some hard to sample distribution (f ) to be computed, Since f is hard to sample from, we wish to use samples from a different distribution (g) that is easy to sample from, or has samples readily available, to evaluate Equation A1.Providing the ratio r of f and g is known up to some normalization for the samples from g ({ θs g } S s=1 ), Equation A1 can be estimated with importance sampling as,  4) against PSIS approximated elpd terms (Equation A6) for the synthetic data scenario described in Section 5.The exact terms we calculated by refitting the model N (46) times with each data point left out in turn.One-to-one correspondence is shown as a dashed line.The PSIS approximated elpd terms are in good agreement with the exact terms.Middle: Pareto k values for the PSIS approximation for each data point and for each model considered in Section 5. Reference model is shown in blue, while models without H2O, Na, K, or clouds are shown in orange, green, red, and purple, respectively.Data point number is ordered by increasing wavelength.Right: As middle panel but for the real data in Section 6 and the "minimal" model used as reference (green points), the model without H − absorption (orange points), and the model without H2O absorption (purple points).For the synthetic and real observations, all Pareto k values are < 0.7 indicating that the PSIS approximation is reliable for all left out data points.This means all elpd terms were computed with PSIS and only one model fit had to be performed.
Specifically, we use the atmospheric model explained in Section 4.2 considering 3 chemical absorbers, an isothermal atmosphere, an the possibility of inhomogeneous clouds and hazes.First we calculate the elpd score exactly by performing independent retrievals in which each individual spectroscopic data point is removed in turn and remains 'unseen' (i.e., N = 46 refits of the model or retrievals).The resulting exact elpd i terms are evaluated for each data point using Equation 4.Then, we calculate the same elpd i terms using the PSIS approximation (Equation A6) The resulting approximated PSIS and exact elpd i are shown Figure A1, along with the Pareto k values from the PSIS approximation.
Figure A1 shows the approximated PSIS elpd scores and exact elpd scores lie along a unity line and are in agreement with one another.The middle panel of Figure A1 shows the Pareto k values for the approximation, with all of them below the empirical value of < 0.7.As a result, the PSIS approximation only required one refit of the model compared to the 46 refits required to compute all the elpd terms exactly, and still provided accurate results.Overall, the PSIS approximation was accurate and saved ≈ 4, 000 CPU hours of computation in this case, i.e., the PSIS approximation took ≈ 1/N = 1/46 of the computation of the exact calculation.For completeness, we include in the right panel of Figure A1 the Pareto k values for the approximations employed in Section 6.

B. AUXILIARY RETRIEVAL INFORMATION
Table B1 shows the priors used for the retrievals in Sections 5 and 6. Figure B2 shows the retrieved transmission spectrum of HAT-P-41b on the HST UVIS/G280 systematic marginalization, HST WFC3 G141, and Spitzer IRAC observations from Wakeford et al. (2020) and Lewis et al. (2020).The data points are color coded by their ∆elpd score between the reference "minimal" model with H 2 O absorption and the model without H 2 O absorption.A larger positive value (redder point) indicates that the point is better explained by the presence of H 2 O absorption (i.e., better out of sample predictive performance by the reference model) while a larger negative value (bluer point) indicates that the datum is better explained by the model without H 2 O absorption (i.e., worse out of sample predictive performance by the reference model).As expected, the points with the largest ∆elpd score lie within the HST WFC3 observations (e.g., ∼ 1.1-1.7µm)where H 2 O has a prominent absorption feature.

Figure 1 .
Figure1.Reference model fit and simulated data as described in Section 5.Each of the data points are colored by their PSIS approximated elpd score (elpd i,Reference ).Data points with a lighter (more yellow) color have a higher elpd score and are better predicted by the model when left out.
g., Benneke & Seager 2013; MacDonald & Madhusudhan 2017; Welbanks & Madhusudhan 2021a) results in a model preference for H 2 O of 27σ, Na of 6σ, K of 2σ, while the inclusion of inhomogeneous clouds in the model is not preferred at a 2σ level.This comparison of the Bayesian evidence supports the known input of the model generating the data: the presence of H 2 O, Na, and K. Additionally, the Bayesian evidence comparison indicates that the additional model complexity associated with including the presence of inhomogeneous clouds and hazes is not supported by the data.

Figure 2 .
Figure 2. Comparison between the reference model (purple) and simpler models (green) with shaded regions showing their retrieved 2σ confidence intervals.The synthetic observations are color coded by their ∆elpd score following Equation6, between the reference model and models without H2O (top left), Na (top right), K (bottom left), and inhomogeneous clouds and hazes (bottom right).Data points with larger positive ∆elpd scores (redder points) indicate that the reference model is better at explaining them.H2O absorption preferentially explains data points at ∼ 1.1-1.7µm,Na absorption at ∼ 0.6µm, and K absorption at ∼ 0.8µm, in agreement with expectations.The increase in the predictive performance of the model (the numerical scale of the color map) is largest due to H2O, followed by Na, and K; the inclusion of inhomogeneous clouds and hazes in the model does not improve the predictive performance of the model for this input cloud free atmosphere.

Figure 3 .
Figure3.Predictive model performance quantified by the sum of the ∆elpd scores (Equation6) over the entire dataset divided by their standard error (SE, Equation8) see Table1.The ∆elpd scores are computed by the difference in elpd score between the reference model and models without H2O (blue), Na (orange), K (green), and inhomogeneous clouds and hazes (red).Positive values indicate an increase in the predictive performance of the reference model due to the addition of the specific model component.On the other hand, negative values indicate a decreased predictive performance of the model due to the inclusion of the specific model component.

Figure 4 .
Figure 4. Retrieved transmission spectrum of HAT-P-41b for the reference "minimal" model and HST UVIS G280 observations with systematic marginalization treatment, HST WFC3 G141, and Spitzer IRAC 3.6µm and 4.5µm observations from Wakeford et al. (2020) and Lewis et al. (2020).Data points are color coded by the ∆elpd score between the reference "minimal" model and the model without H − absorption.Redder data points (larger positive ∆elpd score) are better explained by the reference model with H − absorption while the bluer points (larger negative ∆elpd score) are better explained by the model without H − absorption.Orange arrow indicates the data point with the highest ∆elpd score (3.6µm), at a wavelength were bound-free H − absorption is not expected.The retrievals indicate a 2.7σ detection of H − , but removing the data point with the highest ∆elpd score (i.e., Spitzer point at 3.6µm) removes this preference.

Figure 5 .
Figure5.∆elpd scores (i.e., predictive model performance) divided by their standard error (SE) for the inclusion of H2O (blue) and H − (orange) absorption for the reference "minimal" model and HST WFC3/UVIS (systematic marginalization), HST WFC3 G141, and Spitzer IRAC observations.Adding H2O absorption into the reference model results in a larger increase in the predictive performance of the model than including H − absorption.

Figure A1 .
Figure A1.Left: Exact elpd terms (Equation4) against PSIS approximated elpd terms (EquationA6) for the synthetic data scenario described in Section 5.The exact terms we calculated by refitting the model N (46) times with each data point left out in turn.One-to-one correspondence is shown as a dashed line.The PSIS approximated elpd terms are in good agreement with the exact terms.Middle: Pareto k values for the PSIS approximation for each data point and for each model considered in Section 5. Reference model is shown in blue, while models without H2O, Na, K, or clouds are shown in orange, green, red, and purple, respectively.Data point number is ordered by increasing wavelength.Right: As middle panel but for the real data in Section 6 and the "minimal" model used as reference (green points), the model without H − absorption (orange points), and the model without H2O absorption (purple points).For the synthetic and real observations, all Pareto k values are < 0.7 indicating that the PSIS approximation is reliable for all left out data points.This means all elpd terms were computed with PSIS and only one model fit had to be performed.

Figure B2 .
Figure B2.As Figure 4 but the ∆elpd score between the reference "minimal" model and the model without H2O absorption.Redder data points are preferentially explained by H2O absorption.The data points with a larger ∆elpd score (redder data points) indicate a preference for H2O absorption, and fall at wavelengths near the ∼ 1.4µm H2O feature as expected.

Table 1 .
Summary Statistics of Retrieval on Synthetic Observations