Correcting Exoplanet Transmission Spectra for Stellar Activity with an Optimised Retrieval Framework

The chromatic contamination that arises from photospheric heterogeneities e.g. spots and faculae on the host star presents a significant noise source for exoplanet transmission spectra. If this contamination is not corrected for, it can introduce substantial bias in our analysis of the planetary atmosphere. We utilise two stellar models of differing complexity, StARPA and ASteRA, to explore the biases introduced by stellar contamination in retrieval under differing degrees of stellar activity. We use the retrieval framework TauREx3 and a grid of 27 synthetic, spot-contaminated transmission spectra to investigate potential biases and to determine how complex our stellar models must be in order to accurately extract the planetary parameters from transmission spectra. The input observation is generated using the more complex model (StARPA), in which the spot latitude is an additional, fixable parameter. This observation is then fed into a combined stellar-planetary retrieval which contains a simplified stellar model (ASteRA). Our results confirm that the inclusion of stellar activity parameters in retrieval minimises bias under all activity regimes considered. ASteRA performs very well under low to moderate activity conditions, retrieving the planetary parameters with a high degree of accuracy. For the most active cases, characterised by larger, higher temperature contrast spots, some minor residual bias remains due to ASteRA neglecting the interplay between the spot and the limb darkening effect. As a result of this, we find larger errors in retrieved planetary parameters for central spots (0 degrees) and those found close to the limb (60 degrees) than those at intermediate latitudes (30 degrees).


INTRODUCTION
With over 5000 confirmed exoplanet detections, and this number rapidly increasing with the contribution of groundbased surveys e.g.HARPS (Mayor et al. 2003) and space-based missions such as TESS (Ricker et al. 2014), we are entering a period of unprecedented potential for exoplanet characterisation.Present exoplanet observations and analyses are laying the foundations for the large-scale population studies that will be conducted with next generation space observatories such as JWST (Bean et al. 2018) and, in less than a decade, Ariel (Tinetti et al. 2021).Planets from multiple, distinct regions of the known parameter space have already been analysed in detail with transmission (e.g.Charbonneau et al. 2002;Tinetti et al. 2007;Sing et al. 2016;Tsiaras et al. 2018;Hoeijmakers et al. 2019;Pinhas et al. 2019;Tsiaras et al. 2019;Anisman et al. 2020;Pluriel et al. 2020;Skaf et al. 2020;Edwards et al. 2022;Gressier et al. 2022;Rustamkulov et al. 2022;Saba et al. 2022), emission (e.g.Swain et al. 2008;Crouzet et al. 2014;Evans et al. 2017;Mikal-Evans et al. 2020;Changeat et al. 2022) and phase curve spectroscopy (e.g.Knutson et al. 2012;Stevenson et al. 2014;Arcangeli et al. 2019;Feng et al. 2020;Irwin et al. 2020;von Essen et al. 2020;Changeat 2022;Dang et al. 2022).All of these techniques come with the caveat that the planet and star are observed as an unresolved source, with the host star providing the light source that makes these methods possible.As such, disentangling the stellar signals from the planetary ones is a challenging, but essential part of any exoplanet characterisation pipeline.The objective of this paper is to outline a simple, scalable method of disentangling these signals that can be implemented seamlessly in retrievals.The primary aim is to accurately retrieve the planetary parameters in the presence of stellar activity.We investigate what biases the simplified model assumptions could produce as a result of missing physics and how limiting these biases could be.As a secondary aim, we also explore what useful information about the host star can be extracted from a combined stellar-planetary retrieval.
Our current understanding of exoplanet host stars indicates that a substantial fraction of them will display moderate to high levels of activity.As such, stellar contamination will likely be one of the most dominant noise sources in exoplanetary observations.Evidence for this is shown in the form of activity indicators e.g.Ca H-K lines, S-index etc. (Gomes da Silva et al. 2011;Cauley et al. 2018;Klein et al. 2022), variability amplitudes from long-term photometric monitoring e.g. with Kepler (Ciardi et al. 2011;McQuillan et al. 2014) and the many spot and plage-crossing events that have been detected in light-curves of transiting exoplanets (e.g.Pont et al. 2013;Oshagh et al. 2014;Morris et al. 2017;Espinoza et al. 2019).With higher resolution observations rapidly becoming available to us it will be essential to account for this activity, ideally in a homogeneous way, in order to characterise exoplanet atmospheres with the high precision we are aiming for and subsequently to conduct larger, comparative population studies.Observations covering larger wavelength ranges will also be highly beneficial as they allow for an easier identification and subsequent correction for stellar contamination.
Some trends in activity have already been well documented in the literature, with higher levels of activity typically observed for later type K-M dwarfs (Goulding et al. 2012;Jackson & Jeffries 2013;McQuillan et al. 2014) as stars transition towards becoming fully convective (Reiners & Basri 2010) and also in the case of fast rotating, young stars or young stellar objects (YSOs) (e.g.Gully-Santiago et al. 2017;Järvinen et al. 2018;Morris 2020;Klein et al. 2022).These scenarios are particularly important as much of the pioneering exoplanetary science is now focusing on planets orbiting these stars.These smaller, later-type stars are frequently observed to host small planets (Dressing & Charbonneau 2015), which are crucial for pushing the limits of our current characterisation techniques and, ultimately for answering questions surrounding potential habitability.Contamination due to stellar activity has the potential to negatively affect observations of all types of planets, albeit in different ways.For larger planets with H-dominated, primary atmospheres, the bias introduced by not accounting for contamination could potentially result in inaccurate retrieved atmospheric parameters such as chemical abundances and temperature-pressure profiles e.g.(Saba et al. 2022).These inaccuracies may subsequently be propagated into discussions surrounding other key research areas e.g.elemental ratios ( Öberg et al. 2011;Pacetti et al. 2022) and disequilibrium chemistry (Venot et al. 2020) leading to an incorrect interpretation of the planet as a whole.For small planets, particularly those possessing secondary atmospheres, atmospheric absorption features will generally tend to be weaker, making them much more susceptible to being obscured or altered by stellar contamination (e.g.Ballerini et al. 2012;Rackham et al. 2018;Zhang et al. 2018).In recent years the number of observations of exoplanets orbiting YSOs / young main sequence stars e.g.AU Mic (Szabó et al. 2021;Klein et al. 2022), V1298 Tau (Feinstein et al. 2022;Maggio et al. 2022) and K2-33b (Thao et al. 2023), has also increased as, despite being strongly affected by stellar activity, studying these systems provides us with a unique opportunity to begin to fill in gaps in our understanding of the stages of planetary formation and evolution (Raymond & Morbidelli 2022).
In transmission spectroscopy an active star is capable of contaminating the observed spectrum in multiple ways by causing the transit chord to differ from the disk-integrated stellar spectrum.The most efficient methods of modelling and correcting for this contamination have been and are still being explored extensively (e.g.Czesla et al. 2009;Garcia et al. 2010;Silva-Valio et al. 2010;Sing et al. 2011;Ballerini et al. 2012;Oshagh et al. 2014;Micela 2015;Herrero et al. 2016;Newton et al. 2016;Rackham et al. 2018;Bixel et al. 2019;Cracchiolo et al. 2021a,b).At the spectral intervals observed for exoplanet characterisation, typically the optical and near infrared (NIR) regimes, we are observing the stellar photosphere.On the photosphere, magnetic activity manifests in the form of heterogeneities such as cooler spots and hotter faculae (Solanki 2003;Berdyugina 2005).If present, these features are the dominant sources of stellar contamination in this wavelength range.The strength of this contamination and its overall effect on the observed spectrum differs depending on what type of active regions are present and whether or not they are occulted by the transiting planet.Occulted active regions are arguably easier to identify due to the characteristic bumps they introduce in the light-curves.These bumps, whilst not always easily corrected for, can often be masked/removed from the light-curve.Unocculted features are more insidious as they affect the transit depth of the entire light-curve, without imparting any obvious identifying feature on it.In addition to this, unocculted features could potentially be far more common, as the planet only occults a small fraction of the stellar disk as it transits.
Unocculted spots, the focus of this study, reduce the average flux that originates from the regions of the star not crossed by the planet.Their presence introduces a wavelength dependent signal that deepens the transit light-curve, resulting in overestimates of the planetary radius, particularly in the optical regime.Stellar contamination is highly chromatic, with its effects appearing substantially stronger and more evident at shorter wavelengths (e.g.Ballerini et al. 2012;Rackham et al. 2018).The main concern when contamination is present but not corrected for is that it may introduce biases in the retrieved planetary parameters.It is capable of both obscuring absorption features in the case of unocculted faculae, or mimicking/strengthening them in the case of unocculted spots.An additional complication arises in that stellar contamination is temporally variable.Modulation occurs predominantly on the timescale of stellar rotation but also through spot evolution and longer timescale magnetic cycles (Ciardi et al. 2011;Bradshaw & Hartigan 2014;McQuillan et al. 2014;Zhang et al. 2018).This means that each observation of each exoplanetary system should ideally be corrected individually before they can accurately be combined or analysed simultaneously.This limitation poses a problem for small planets in particular, as we will need to stack observations from multiple visits to obtain a high enough SNR for a successful retrieval analysis.
For the above reasons, stellar activity is one of the most pressing challenges to the accurate characterisation of exoplanetary atmospheres at present.Modelling the star as a more complex astrophysical body, rather than as a homogeneous light source, is essential in tackling this issue.Despite this, increases in model complexity are not always beneficial and care needs to be taken when introducing additional parameters in retrievals as this can have detrimental effects.Using models with a higher dimensionality than is necessary may increase the risk of overfitting the data or injecting a bias through the model choice.Alternatively, if many of the parameters are degenerate, the retrieval may not be able to converge on a solution at all.More complex models will also intrinsically come with an associated computational cost.The aim of this paper is to find the middle ground by determining a stellar model that encompasses all of the essential physics of stellar activity required to remove any potential biases, but without overcomplicating the model to the detriment of retrieval reliability or computing time.To achieve this we utilise two approaches to modelling stellar activity in the form of a single, unocculted spot.The predominant difference between them is the consideration of the limb darkening effect, or the lack thereof.StARPA, the more complex of the two models, encompasses the interplay of the presence of spots and the limb darkening effect by fixing the spot position on the stellar disk.The StARPA model is then used as the forward model in a benchmarking retrieval exercise which uses ASteRA, the simpler of the two models, in which this interplay between spot contamination and limb darkening is neglected and all parameters describing the spot and the planetary atmosphere are fitted simultaneously.Stellar activity has been considered and fit for in several previous retrieval studies (Pinhas et al. 2018;Bixel et al. 2019;Espinoza et al. 2019;Edwards et al. 2021).In order to obtain accurate planetary parameters and maximise the information content from our retrievals we need to have a good understanding of what the potential biases from stellar activity are.These biases can originate both from neglecting stellar contamination or could be introduced or incompletely removed by our chosen correction process.Retrieval biases due to not correcting for stellar activity have previously been systematically explored in Iyer & Line (2020).This work presents the first investigation into the potential residual biases that are left behind as a result of neglecting the limb darkening effect in the correction process.A similar investigation was conducted in Cracchiolo et al. (2021b), albeit on a smaller scale.In Section 2.2 we introduce our grid of 27 spot-contaminated stellar disks from which we generate the contaminated transmission spectra.This grid is produced using the StARPA model where the activity is characterised by a single unocculted starspot.This spot is parameterised by its temperature contrast with respect to the quiet photosphere, its radius and the latitude of the spot centre.The StARPA model is described in detail in Section 2.3 followed by a description of the simplified ASteRA model that is used in retrievals in Section 2.4.We conduct retrievals on the grid of contaminated spectra to investigate under which conditions ASteRA is capable of accurately retrieving the planetary, and to a lesser extent the stellar parameters, in the presence of varying degrees of stellar contamination.The results of these retrievals are given in Section 3. Detailed analysis of these results alongside initial investigations into more complex, realistic cases involving multiple spots and spots that are separated into an umbra and penumbra are given in Section 4. In this section we also discuss when using a simplified stellar model, such as ASteRA, is valid at first order and under what activity conditions the assumptions within it begin to break down.Our final, concluding remarks are given in Section 5.

METHODOLOGY
This section introduces the overall experiment framework (Section 2.2) and the two modelling approaches utilised throughout the work (Sections 2.3 and 2.4) to explore the effect of model complexity on the accuracy of the retrieved planetary and, to a lesser extent, stellar parameters.The main objective of this work is to develop a stellar activity correction method that is both computationally fast and capable of retrieving planetary parameters accurately for host stars displaying different levels of activity.It is of course desirable to also be able to accurately retrieve the stellar parameters, however, we prioritise the planetary parameters as these are fundamental for the large-scale characterisation of exoplanet populations intended to be carried out with ongoing and future missions e.g.JWST and Ariel.We aim to determine under which activity conditions using a simplified model is sufficient, and, if/where it is insufficient, investigate any potential biases it may introduce.
These questions surrounding model complexity are essential questions to answer as, as our underlying physical models become more realistic, their dimensionality increases drastically, which will likely require an increase in computation time and, in the worst scenario can be detrimental to retrieval accuracy.On the contrary, by using an oversimplified model we risk neglecting underlying physical processes that are necessary in order to fully and accurately interpret our observations.As such, we are increasingly facing a compromise between complexity and the computing cost such models require, although state of the art machine learning methods could potentially mitigate this (e.g.Yip et al. 2022).
To explore the ability of the ASteRA model to accurately retrieve the planetary parameters, we generate a grid of 27 spot-contaminated transmission spectra as forward models.These are produced using the StARPA model (Section 2.3) and are then used as input observations in retrievals where the stellar contamination is accounted for in a combined stellar-planetary retrieval, using TauREx3 and the ASteRA plugin described in Section 2.4.We also introduce a case with an uncontaminated, quiet star, termed 'Case 0', which acts as a baseline for the subsequent retrieval analysis.Each contaminated stellar disk in our grid is characterised by a single, unocculted spot which is itself parameterised by its temperature contrast (∆T Spot ) i.e. how much cooler than the quiescent photosphere the spot is, the spot radius (R Spot ) normalised to the stellar radius and the latitude of the spot centre (ϕ Spot ) from which we can probe the effect of limb darkening.These 27 cases are discussed in greater detail in Section 2.2.

Simulating the Uncontaminated Planetary Transmission Spectrum for a Synthetic Star-Planet System
To determine how extreme the contamination effects are for each spot case considered and explore how well these effects can be mitigated we must first produce a synthetic star-planet system, for which the stellar and planetary parameters are known a priori.We consider a synthetic K-dwarf host star characterised by the following parameters (T ef f = 4750 K, R ⋆ = 0.8 R ⊙ , M ⋆ = 0.8 M ⊙ ) and displaying activity in the form of a single unocculted spot that we model using the methodology described in Section 2.3.This is the same methodology outlined in Cracchiolo et al. (2021b) with some subsequent, minor improvements to its efficiency.A K-dwarf was chosen as stars of later spectral types are typically more likely to be active (Berdyugina 2005;Ciardi et al. 2011;Hartman et al. 2011;McQuillan et al. 2014;Rackham et al. 2018Rackham et al. , 2019)).We decided against using an M-dwarf host for this preliminary study as this is the region of the main sequence in which stars transition to a fully convective regime (Reiners & Basri 2010), as such, stellar activity may manifest differently on these stars.We choose to focus on single spot cases for this initial benchmarking study for several reasons.Firstly, this represents the simplest spot-contaminated disk model that can be considered which makes it invaluable as a baseline for future investigations.Encompassed within this is that it reduces the dimensionality required for the input model.Having multiple spots present requires the location of each of them on the stellar disk to be defined in order to accurately compute the interplay between their properties and the limb darkening effect.Secondly, a single, larger spot also allows us to probe the extremes of this interplay in a way that multiple spots will not as, for a given filling factor, the active region is concentrated at a single latitude rather than dispersed over multiple locations on the stellar disk.Despite this, the extension to multiple spot cases is comparatively straightforward as described in Cracchiolo et al. (2021b).The results of several preliminary multiple spot cases are presented in Section 4.5 for completeness.
The transiting planet is a temperate sub-Neptune (R P = 3 R ⊕ (0.273 R Jup ), M P = 5 M ⊕ (0.0157 M Jup ) and T P = 400 K) with a primary atmosphere containing water (log(H 2 O) = -3) and H and He present as fill gases with a ratio of 0.172.Rayleigh scattering and collision induced absorption (CIA) are also included and introduce wavelength dependent contributions to the opacity.A smaller planet with a primary atmosphere is used as its scale height should result in detectable absorption features but the smaller signal to noise ratio (SNR) of such features means that they are more susceptible to being masked by stellar contamination.As such, the accuracy of our correction method, and any biases that may be inadvertently introduced, are comparatively far more important here than when considering an atmosphere with a much higher SNR, for example that of a hot Jupiter.The orbital inclination is set to 88 • so that the effects of a central unocculted spot (ϕ spot = 0 • ) can also be explored.
Using the stellar and planetary parameters described above and the retrieval code TauREx3 (Al-Refaie et al. 2021, 2022), we produce a forward model that is equivalent to the idealised, uncontaminated transmission spectrum that would be observed in the presence of a completely homogeneous host star.This synthetic spectrum has a wavelength coverage of 0.5 -9.5 µm and a resolution of 200.This wavelength range has been chosen as it is similar to the regions that are/will be covered by the JWST and Ariel instruments but with a greater coverage and resolution in the optical regime where the effects of stellar contamination are strongest.Error bars of 10 ppm are introduced for all data points regardless of wavelength.This noise level was chosen as 10 ppm is broadly consistent with the most precise observations obtained with the Hubble Space Telescope (HST) instruments(e.g.Edwards et al. 2022), albeit for larger planets further into the IR.We acknowledge that obtaining this level of precision in reality would be extremely challenging and heavily reliant on an accurate characterisation of the host star to mitigate the astrophysical noise.We rationalise our choice of using small error bars as our goal at this stage is to be limited by the model assumptions/limitations and not the instrument performance, which will be a focus of future work.Testing our correction methods on an idealised case first allows us to quantify how effective they are and ensure that they do not introduce any unknown, intrinsic bias before using them with real observations.

The Experiment Framework: 27 Spot-Contaminated Cases
Throughout this study, we consider 27 spot-contaminated cases (and an uncontaminated case as a reference frame) for the star-planet system outlined in the previous section.From this grid we investigate the potential biases that an unocculted spot may introduce and how these may vary as a function of the three spot parameters considered (∆T Spot , R Spot and ϕ Spot ).The contaminated stellar disks are produced with StARPA (Section 2.3) and the following values are considered for each spot parameter: This results in 27 unique parameter combinations in which only one spot parameter is varied at a time.The full set of spot parameters used in each case are given in Table 1.Visual representations of each case are shown in Fig 1 .Table 1.An outline of the 27 single spot scenarios considered in this study.Each case is characterised by a unique combination of three spot parameters: the temperature contrast with respect to the photosphere (∆TSpot), the spot radius normalised to the stellar radius (RSpot) and the latitude of the spot centre (ϕSpot).The spot filling factor (FSpot) is calculated assuming an elliptical projection onto the surface when the spot centre is at non-zero latitudes.An uncontaminated case is also considered to act as a control case (Case 0).

The Input Stellar Model: StARPA (Stellar Activity Removal for Planetary Atmospheres)
From the uncontaminated planetary transit spectrum produced with TauREx, the grid of 27 spot-contaminated spectra described above is constructed using the Cracchiolo et al. (2021b) model with some subsequent improvements made to the implementation.The key point of this methodology with respect to this study is that, through accounting for the limb darkening effect and defining the exact position of the spot on the stellar disk, any interactions between contamination and limb darkening can be explored.As in Cracchiolo et al. (2021b), the spots are modelled as being circular but with an elliptical 2D projection on the visible stellar disk at non zero latitudes.The out-of-transit stellar flux from the spot-contaminated star is computed using quadrature integration; the star is divided into 1000 equally spaced annuli and the fractions of each of these annuli covered by the spot are calculated.This enables us to compute both the absolute and normalised intensity profiles (Fig. 2).
To create the contaminated stellar disks we model the contribution of the quiet photosphere and the spot using the BT-Settl (Allard et al. 2012;Baraffe et al. 2015) stellar spectral model grids from the PHOENIX library (Husser et al. 2013).The spectral emission densities (SEDs) corresponding to the photosphere and the spot are governed by three fundamental stellar parameters: the stellar effective temperature (T ef f ), the stellar metallicity [M/H] and the stellar surface gravity (log g).For the purposes of this study the metallicity and gravity are fixed at [M/H]=0 (solar metallicity) and log g=4.5 respectively, in order to isolate the effects of active regions with contrasting temperatures.These are reasonable assumptions for low resolution spectroscopy but may need to be reconsidered at higher resolution.In particular, the treatment of log g may need improvement, as it has been suggested that spots may be characterised by a lower log g than that of the photosphere due to the localised increase in magnetic pressure (e.g.Solanki 2003).In total we require four SEDs, one corresponding to the photospheric temperature (T P hot = 4750 K) and three corresponding to the spot temperatures considered in this work: 3750 K, 4250 K and 4500 K, equivalent to ∆T spot of 1000 K, 500 K and 250 K respectively.
Limb darkening is a well-known phenomenon acting to reduce the flux originating from the limbs of the stellar disk with respect to its centre (Claret 2000;Howarth 2011).It also varies as a function of wavelength and as such, different bands are characterised by different intensity profiles with the strongest effects seen in the optical.To account for the limb darkening effect within our forward model we use the ExoTETHyS package, specifically the SAIL and BOATS subpackages (Morello et al. 2020a(Morello et al. ,b, 2021)), to calculate the limb darkening coefficients (LDCs) for the star using the PHOENIX 2012 13 database (Claret et al. 2012(Claret et al. , 2013) ) of BT-Settl models and the Claret four-coefficient law (Claret 2000): where λ is the wavelength/bandpass being considered, µ=cosθ (where θ is the angle between the line of sight and the normal at the stellar surface), I λ (µ) is the stellar intensity profile, I λ (1) is the intensity at the disk centre (i.e.where µ=1) and α n,λ are the LDCs.
We model the spot using the same limb darkening coefficients that have been calculated for the star which is a reasonable approximation at first order, but again, may be revised for higher resolution observations.The absolute fluxes of the star and the spot are also calculated using ExoTETHyS from which the normalised intensity profiles of the spotted star can be calculated as a function of wavelength (Fig. 2).The emission from the spot-contaminated stellar disk is calculated as in Eq. 2, where S Star,λ , S P hot,λ and S Spot,λ are the spectra of the average star, the quiescent photosphere and the spot respectively for a given wavelength (λ) and F Spot is the spot filling factor.As such the resulting spectrum for the active star is essentially a combination of the photosphere and spot SEDs weighted by their relative covering fractions (Fig. 3).The limb darkening effect, which has already been defined for the 1000 annuli considered using Eq. 1, is also accounted for in this stage.
The resulting chromatic contamination can be described as acting as a contamination factor (ε) relative to the nominal transit depth i.e. the uncontaminated spectrum that would be observed in the case of an inactive star (Eq.3).Where contamination is present only in the form of lower temperature spots, as in this study, ε λ > 1 resulting in increased transit depths at all wavelengths.Figure 3. Three SEDs that are relevant to the construction of the forward stellar model.The two solid line SEDs correspond to the two temperature components that make up the surface of the active, heterogeneous stars considered in this work, the quiet photosphere, T P hot =4750 K (yellow) and the cooler starspot, TSpot=3750 K (red).The orange, dashed line SED represents the average, disk integrated SED that would be observed in accordance with Eq. 2 due to the weighted contributions of the quiet photosphere and the spot, assuming a spot filling factor of FSpot = 0.15 (and therefore a F P hot = 0.85).We highlight that the spot filling factor is varied throughout the spot-contaminated, cases considered in this work (Table 1), as such the disk averaged SEDs will vary between cases.
From the constructed, spot-contaminated stellar disks, we then use the open source, light-curve analysis package pylightcurve (Tsiaras et al. 2016) to produce the spot-contaminated light-curves (Fig. 4) for all 200 wavelength intervals.These light-curves are then used to construct the contaminated transmission spectrum for each spot case.We highlight that for this preliminary investigation, as we are only aiming to construct the contaminated transmission spectra for use as inputs in the retrievals, we make the assumption that the orbital parameters used in pylightcurve are well constrained/known a priori and we do not introduce any Gaussian noise at the light-curve fitting stage to avoid introducing any additional bias that may arise arbitrarily.As in Section 2.1, error bars of 10 ppm are assumed for the input spectra.A comparison of the uncontaminated transmission spectrum and the spectrum with the highest degree of stellar contamination considered in this study is shown in Fig. 5.The strongly wavelength-dependent nature of the spot contamination is evident, with it imparting a steep, positive bluewards slope on the spectrum at wavelengths shortwards of ∼ 2 µm.Longwards of 2 µm, the contamination acts to introduce an almost constant positive offset equivalent to an overestimate in the transit depth on the order of 5% (> 50 ppm) for the worst case scenario.).The main benefit of using TauREx for this study is that its modular design allows for a large amount of freedom and flexibility in testing and introducing new models.Mitigating for potential contamination due to stellar activity has been conducted in previous studies in the form of combined stellar-planetary retrievals (e.g.Pinhas et al. 2018;Bixel et al. 2019;Espinoza et al. 2019) and hopefully this consideration of the host star will progressively become the new norm within the exoplanetary retrieval community.Whilst this is not the first study focused on retrieval biases as a function of the degree of stellar heterogeneity, parameterised by temperature contrast and spatial extent (Iyer & Line 2020), this is the first time that that bias has been explored as a function of three spot parameters, with the inclusion of their position on the stellar disk which governs how they interact with the limb darkening effect.
The ASteRA plugin introduces a new, heterogeneous star class to TauREx which follows a formalism similar to that used in previous stellar activity studies (e.g.Rackham et al. 2018Rackham et al. , 2019)).Instead of modelling the host star as being homogeneous and characterised by a single temperature/SED, the active star is modelled as a combination of multiple distinct temperature components, in this case two: the quiet photosphere and the spot.The spot and photosphere SEDs are homogeneous disk-integrated models as they are not spatially resolved.The observed disk-integrated stellar spectrum is then produced by combining the two SEDs, weighted by the filling factor of each component, as shown in Fig. 3 and described in Eq. 2 Using the heterogeneous star model requires the addition of two extra fitting parameters to a 'normal' retrieval; the spot temperature T Spot and its filling factor F Spot .For the purposes of this study we restrict the active regions considered to spots only, however, ASteRA is easily extended to incorporate faculae, as has been done in the analysis of HST STIS observations of the hot Jupiter WASP-17b (Saba et al. 2022).
The core difference between the stellar models of StARPA and ASteRA is the treatment of limb darkening, or lack thereof.In contrast to the forward model produced with StARPA, the ASteRA stellar model is simpler in that the interplay between the spot and the limb darkening effect is neglected, making it similar to the initial model used in Cracchiolo et al. (2021a).Conducting retrievals without the inclusion of limb darkening helps us to gauge how important its inclusion is for low resolution spectroscopy, particularly with respect to active host stars.With limb darkening neglected, the relative position of the spot on the stellar disk becomes unimportant provided that the entire spot is unocculted, reducing the dimensionality of the model by one.ASteRA requires the fundamental stellar parameters, i.e. effective temperature, metallicity and log g as inputs in order to select the correct, corresponding PHOENIX spectra and these are fixed within the retrieval, as are the orbital parameters.This is done with the assumption that for real observations, these parameters will be reasonably accurately and homogeneously constrained a priori.Indeed, this is a high priority and ongoing effort within the Ariel consortium regarding the stellar parameters (Danielski et al. 2022;Magrini et al. 2022) and the ExoClock project for the orbital parameters (Kokori et al. 2022).
For each spot case, two separate retrievals were conducted; one accounting for activity by fitting for T Spot and F Spot and one where the activity is not accounted for, despite being present.The planetary parameters R P , T P and log(H 2 O) are fit for in all retrievals and the prior bounds set for each parameter are given in Table 2.All retrievals conducted with TauREx use MultiNest (Feroz et al. 2009;Buchner et al. 2014) as the sampler with 500 live points and an evidence tolerance of 0.5.The retrievals with and without activity have a dimensionality of five and three respectively.Conducting two retrievals allows for a comparison of the Bayesian evidence to obtain the Bayes factor which quantifies how strongly the model with activity is favoured over the one without.This provides an additional, statistically motivated way of quantifying how strong the spot contamination effects are.The retrieved values are then compared to the input/ground truth values for each parameter.From this we determine how accurately the stellar and planetary parameters can be retrieved using ASteRA and if any bias is introduced as a result.The results of these retrievals are given in Section 3. -12 ; -1 log10 3. RESULTS

Retrievals Accounting for the Spot Contamination
In this section we present the results of the retrievals for the 27 spot cases outlined in Section 2.2.To account for the spot-contamination in retrieval, the spot parameters T Spot and F Spot are fitted for using the ASteRA plugin.In contrast, in Section 3.2, retrievals are conducted on the same contaminated spectra but with the incorrect assumption that the star is homogeneous.The same retrievals are also conducted on the uncontaminated transmission spectrum which we term 'Case 0'.Case 0 acts as both a frame of reference for the highest precision and accuracy obtainable with TauREx for the simulated spectra used in this study, and as a verification that the use of ASteRA does not introduce any intrinsic bias.The results of these retrievals are given in Table 3 and a visual representation of the retrieval accuracy with respect to the planetary parameters is given in Fig. 7. On inspection of the retrieved planetary parameters alone, the outlook is overall very positive, with the retrieved values falling very close to the ground truth in almost all of the cases considered.Intuitively, it becomes apparent that the largest errors in the planetary parameters are obtained for the cases where the spectra are most strongly contaminated.These cases are characterised by the largest, highest contrast spots considered in this study e.g.Cases 7, 8 and 9. Despite showing the largest errors in this study, we emphasise that these errors are not substantial enough to result in a large misinterpretation of the planet.The retrieved parameters are also substantially more accurate than those obtained when the stellar contamination is neglected in the retrieval, the results of which are presented in the following section (Section 3.2).One other thing that becomes particularly evident here is that ASteRA struggles to constrain the spot filling factor for small spots at high latitudes (Cases 3, 12 and 21) and for small, low contrast spots (Cases 19, 20 and 21).For these scenarios the retrieved F Spot is highly degenerate as the comparatively low levels of contamination can be reproduced by a larger number of spot configurations.

Retrievals When the Spot Contamination is Neglected
The retrieved parameters obtained for the 27 contaminated spectra when the spot parameters are not fit for are given in Table 4.In comparison to the retrievals presented in Section 3.1, the errors in the retrieved planetary parameters are far more significant, particularly for the highest activity cases.For the most contaminated case (Case 7), the decision to account for stellar activity, even with the simplified method used by ASteRA, can be the difference in retrieving an approximately solar level water abundance (log(H 2 O)=-3.22±0.05) or an incorrect subsolar water abundance (log(H 2 O)=-5.32±0.06)an underestimation equivalent to > 2 orders of magnitude.The planetary temperature (T P ) is also significantly underestimated with the retrieved temperature of 275 K being 125 K cooler than the ground truth (400 K), which, in the context of a temperate planet represents a very substantial error (> 30%).
Conducting two retrievals on the same contaminated spectrum allows us to determine which model is preferred by the data through the comparison of the Bayesian evidences.The Bayes factors (lnB) show a strong preference for the ASteRA retrieval for the majority of cases considered here (Fig 6).The cases in which only a weak preference for the corrected model is seen, or a preference for the retrieval in which contamination is neglected is indicated, correspond to those characterised by small, high latitude and low temperature contrast spots.As the contamination resulting from such spots is minimal, the model neglecting contamination is still able to perform reasonably well under conditions of low activity.In contrast, a penalty is applied to the model accounting for contamination when calculating the Bayesian evidence as a result of its higher dimensionality, explaining why the activity model is not conclusively preferred in the low activity regime despite a spot being present.For these low activity cases the planetary parameters are still accurately retrieved even when the presence of the spot is neglected entirely (Fig. 7).The Bayes factor shows the strongest preference for the retrieval without the activity correction for the uncontaminated case as one would expect.
Table 3. Retrieved planetary and spot parameters for each of the 27 spot cases investigated.The ground truth (GT) planetary parameters are given for comparison.'Case0' shows the parameters obtained when a retrieval using ASteRA is conducted on the uncontaminated spectrum as a frame of reference and to verify that ASteRA introduces no intrinsic bias.This reaffirms that no intrinsic bias is introduced by ASteRA and that ASteRA does not find evidence for stellar activity where there is none.In order to adequately understand and interpret the retrieval results we must first consider how each of the three spot parameters will contribute to the contamination factor and therefore the observed spectrum.The size and the temperature contrast of the spot have very intuitive impacts on the contamination spectrum.The contamination factor increases as a function of both as expected.In comparison the effect of the spots latitude/position on the stellar disk is more subtle.It also affects only the shortest wavelengths with the largest variations seen at λ ≤ 1µm.The importance of the spot latitude in this study is twofold.Firstly, the projected filling factor (defined as the percentage of the 2D stellar disk that is covered by the spot) decreases as a function of latitude for a spot of a constant radius due to decreases in its 3D geometric projection onto the surface.This reduces the contamination introduced from the spatial coverage of the spot alone.Secondly, the position of the spot dictates how it will interact with the limb darkening effect as shown by the normalised intensity profiles (Fig. 2) in Section 2.3.We subsequently term this interaction as the limb darkening-spot interplay.The limb darkening-spot interplay is an important consideration as a central spot will remove a greater amount of flux compared to a spot (with an identical filling factor) located at the limb as the intensity maximum occurs at the disk centre.

Accuracy of the Retrieved Planetary Parameters
In this section, we focus on how the use of the simplified spot model within ASteRA affects the retrieved planetary parameters (Fig. 7) and how these errors are observed to vary as a function of the spot parameters.As already stated in Section 3.1 the largest errors are seen for the most extreme activity cases.Fig. 7 allows for the visual comparison of the values retrieved when the spot is fit for (black data points) as opposed to when it is not (red data points).This reiterates the large improvement in accuracy obtained through using even a simple activity correction method over no correction at all.In the worst case scenario not correcting for stellar activity leads to an underestimation of the water mixing ratio by over two orders of magnitude.Such a large error would significantly impact our understanding of the planet as a whole.Another concerning result highlighted by Fig. 7 is that, whilst the bias introduced by not correcting for stellar activity strongly affects the retrieval accuracy, it does not affect the retrieval precision, as evidenced by the Black markers indicate that the Bayes factor is in favour of the ASteRA retrieval, where stellar activity has been accounted and corrected for.Red markers indicate a preference for the lower dimensionality retrieval in which the spot parameters are not fit for.The Jeffrey's Scale (Trotta 2008) is overplotted to show the strength of the model preference where a Bayes factor ≥ 1 is indicative of weak preference, ≥ 2.5 of moderate preference and ≥ 5 a strong preference.A Bayes factor below 1 is deemed to be inconclusive.small error bars.As such this high precision could be very misleading for real observations where a priori knowledge of the correct planetary parameters are not known.This is consistent with previous retrieval-oriented works e.g.Iyer & Line (2020) and reaffirms that caution should be taken when analysing retrieval results where stellar contamination has not been included, even if the host star is thought to be less active.
The parameter that appears to be most influential in the highest activity cases is the spot latitude which is acting as a proxy for limb darkening (Fig. 8).It is intuitive that this spot parameter should have the greatest effect on the residual bias as ASteRA has no way of fitting for the spots position.A decreasing trend as a function of increasing ϕ Spot is observed in the retrieved planetary temperature, with this being overestimated for the lower latitude spot cases (Cases 7 and 8) and subsequently underestimated at the highest latitude considered (Case 9).This underestimation can be attributed to the interplay of a high latitude spot and limb darkening.The reduced flux originating from the quiet photosphere at the limbs acts to reduce the observed spot contrast and thus there is also a reduction in the degree of contamination introduced.In contrast to this, the opposite trend is observed in the retrieved H 2 O mixing ratio with this being underestimated at the two lower latitudes and overestimated at the highest.Similar trends in T and log(H 2 O) are observed for the other cases considering large (0.4 R * ) spots (Cases 16, 17 and 18 and Cases 25, 26 and 27 respectively), however, the magnitude of the bias introduced is weaker due to the lower spot contrasts considered.
Figure 7.The retrieved planetary temperature (TP ) (left) and H2O mixing ratio (log(H2O)) (right) obtained for each spot case when the spot parameters are fit for (black data points) vs when activity is not accounted for (red data points).The plot background colours correspond to the temperature contrasts of the spot considered for each case.The ground truth for each parameter is indicated by the dashed black line.
Figure 8.The error introduced in the retrieved planetary temperature TP (left) and atmospheric H2O mixing ratio log(H2O) (right) as a function of spot latitude.Only the cases with the largest spots (0.4 R * ) are plotted as these are the cases where some minor residual contamination remains after the correction.The effect of the spot temperature contrast on the observed trend is evident with the strongest trend seen for the highest contrast spots.There is a clear anticorrelation between TP and log(H2O).

Accuracy of the Retrieved Stellar Parameters
ASteRA is less successful in recovering the spot parameters (Fig. 9).This is likely due to a combination of not accounting for limb darkening and also because many combinations of the three spot parameters are degenerate at low resolution, particularly if the spot in question has a low to moderate temperature contrast.An example of how different spot parameter combinations can result in degenerate solutions is given in Fig. 10 for clarity.The largest errors and uncertainties in the retrieved T Spot are seen for the smallest spots considered, especially when present at high latitudes.For larger spots, T Spot is generally reasonably well constrained due to the larger contamination effects they introduce.Large errors and uncertainties are also seen in the retrieved F Spot values in several cases.Large error bars point to substantial degeneracy in the small spot cases, whereas, in the case of the largest spots F Spot is often constrained to a higher precision but significantly overestimated.The results of these retrievals indicate that we should be more cautious with retrieved stellar parameters, as the degeneracies between them mean that they are constrained less confidently by the retrieval.Simultaneous, external observations e.g. at different bandpasses could help to break some of these degeneracies.

The Effect of Limb Darkening
In order to attribute the biases seen in the planetary parameters to not accounting for the effect of limb darkening, the worst-case scenarios (Cases 7, 8 and 9) were rerun.For each case the contaminated spectrum was regenerated with the limb darkening coefficients set to zero (Fig. 11) and a further retrieval conducted.When noise is taken into account the effect of including vs. excluding the limb darkening effect is really only distinguishable at the shortest wavelengths considered (λ < 1 µm), even for these worst-case scenarios.Ariel in particular will only be able to access this wavelength region through three photometric bands (Tinetti et al. 2021), as such, extracting as much information as possible about the activity of the host star from these three data points will be of utmost importance.The retrieved parameters when the limb darkening effect is removed are given in Table 5.With the exception of T Spot , which was already well constrained, all other parameters are retrieved more accurately, including F Spot now being constrained correctly.As the input observations are inherently different due to the removal of the limb darkening effect, unfortunately we cannot use the Bayes Factor as a means of model comparison here as was done in previous sections (Fig. 6).However, the fact that both the planetary and stellar parameters are retrieved more accurately when the limb darkening contribution is removed from the input observation provides compelling evidence in itself that the residual bias seen originates from the limb darkening-spot interplay.For real observations ignoring the effects of limb darkening within activity correction frameworks will unfortunately not be a viable solution as this phenomenon will always be present.As such, we believe that as a community there is a need for us to push towards developing and using stellar activity models in which the limb darkening-spot interplay is accounted for, particularly when dealing with highly active host stars.
The posterior distributions retrieved for the two input scenarios, spot-contaminated with the limb darkening contribution and spot-contaminated with the limb darkening contribution removed, are given in Appendix A. The greatest deviation between the two posteriors is seen in the case of a central spot (Fig. 13).This is intuitive as the spot masks Figure 10.An example of how spot parameters can be degenerate at low resolution, particularly in cases of moderate activity.The black line depicts the uncontaminated transmission spectrum.In blue is the contaminated spectrum obtained with a slightly larger (RSpot = 0.2R * ), central (ϕSpot = 0 • ) spot with a lower temperature contrast (∆TSpot = 500 K) which is equivalent to Case 13 in our retrieval grid.In contrast to this the pink contaminated spectrum results from a smaller (RSpot = 0.175R * ), higher latitude (ϕSpot = 30 • ), higher contrast (∆TSpot = 750 K) spot.The two spot-contaminated spectra are undifferentiable beneath the noise at all wavelengths.As such, a retrieval would not be able to confidently differentiate between these two solutions.Inset (top left) -A close up view of the two spot contaminated spectra at 0.5-0.7µmwhere they diverge the most.The differences between them are still easily lost beneath the very optimistic noise estimate considered here (10ppm).Inset (bottom right) -A visual depiction of the two degenerate spots.
Table 5.The retrieved spot and planetary parameters obtained for Cases 7,8 and 9 when the effect of limb darkening is no longer present in the input spectra i.e. the blue spectra from Fig. 11.The ground truth (GT) spot and planetary parameters are given for comparison.FSpot varies on a case by case basis, as such, the Input FSpot denotes the ground truth value for each case.the disk centre where a greater proportion of the stellar flux originates from and as such, results in the largest residual contamination due to not encompassing the limb darkening-spot interplay.As the spot is modelled at progressively higher latitudes, less residual contamination remains and the posteriors for the limb darkened, contaminated spectra (Fig. 14 and Fig. 15) converge towards the correct values.and 9).The uncontaminated spectrum assuming a quiet host star is shown in pink.The green spectrum is the spot-contaminated spectrum produced when the contribution of the limb darkening effect to the overall contamination is considered, the blue spectrum, in contrast, is the contaminated spectrum obtained when limb darkening is not accounted for.The error bars are equivalent to 10 ppm.

Preliminary Investigations into Multiple Spot Cases
In previous sections we have focused only on isolated, single spots in order to assess the interactions of the spot parameters and how they influence the observed, contaminated spectrum.In order to better understand the limb darkening-spot interplay (Section 4.1) and the bias it introduces in retrievals (Section 4.4) we conducted three preliminary multiple spot cases using the same methodology to explore how this interplay may differ when more than one spot is present.For comparability the spots have the same temperature (T Spot = 3750 K) and total filling factor (F Spot = 0.16) as the single spot Case 7. The multiple spot cases we consider are a two spot case and two cases characterised by 10 smaller spots.These 10 spots are arranged in a random configuration on the stellar disk in the first scenario, and occupy a preferred active latitude centred around ϕ = 15 • N in the second scenario.The only strict requirement for these cases is that all of the spots remain unocculted.The resultant contaminated spectra are shown in Fig. 12 alongside the uncontaminated spectrum and the Case 7 contamination spectrum, with and without the contribution of the limb darkening effect, as in Fig. 11.These spectra show that, as the number of spots increases and they are distributed over a larger range of µ values (i.e.located at different distances from the disk centre) the limb darkening-spot interplay and its contribution to the contamination spectrum are minimised.A table of the retrieved parameters (Table 6), alongside the posterior distributions (Figs. 16,17 and 18) are given in Appendix B.
These additional multiple spot experiments highlight that the large, high contrast single spots considered as the main focus of this study likely do represent the worst case scenario when considering the additional complication arising from the limb darkening effect.Observationally, the regimes in which the limb darkening-spot interplay will really start to matter will therefore be those in which the system geometries most closely replicate those in cases 7 and 8. Large high latitude and polar spots have been proposed to exist on several stars (e.g.Järvinen et al. 2018;Almenara et al. 2022;Strassmeier et al. 2023) although any biases that these could introduce will be lessened if the stars rotation axis is not significantly inclined with respect to the line of sight as the spot will appear at the limb.The worst-case scenario would occur for a system where the host star both possesses a polar spot and is inclined relative to the transiting planet and observer.In such a scenario the polar spot could manifest as a large central spot.
Examples such as the HAT-P-11 system which consists of an active K4-dwarf hosting two highly misaligned planets (e.g.Sanchis-Ojeda & Winn 2011;Morris et al. 2017;Yee et al. 2018) and the almost pole-on, solar-type star τ Ceti which is frequently included in target lists for exoplanet searches (Korolik et al. 2023) both indicate that observing a system with such geometry in the future cannot be ruled out.
Our recommendation for dealing with this worst-case, high activity regime, where it will be necessary to include the limb darkening-spot interplay in order to fully negate the bias introduced by stellar contamination, is to improve our retrieval stellar models so that they are capable of also parameterising the position of spots on the stellar disk.This is the focus of ongoing work, however, as with increasing the dimentionality of any model, we cannot ignore the risk of injecting intrinsic bias if the additional complexity is not physically motivated.For real observations there will also be the added challenge of not having any a priori knowledge of spot positions as the stellar disks are not spatially resolved.This work has shown that at first order using the simpler, two parameter spot prescription as is done with ASteRA is sufficient to remove the majority of the bias and enable a good understanding of the planets atmospheric properties.A push towards a better understanding of the limb darkening effect, whether in the presence of stellar activity or not, will also be highly beneficial.This will be especially important for later type stars where offsets due to the choice of limb darkening treatment appear to be inevitable (Patel & Espinoza 2022).

Preliminary Investigation into Spots Displaying a Radial Temperature Variation -A Separation into Umbra and Penumbra
All of the retrievals presented in the previous sections have been conducted assuming a single T Spot .In order to explore the validity of this assumption, we conducted preliminary studies into modelling the contamination introduced by spots with radial temperature variations.This section examines the resulting impact on retrieval performance.We assume the same spot geometry as was used in the highest contamination, worst case scenario (Case 7).The spot is now separated into a cooler central umbra characterised by a temperature T U mbra = 3750 K, and a surrounding penumbra characterised by a temperature T P en = 4250 K.This separation into two regions of distinct temperatures is consistent with observed sunspots and with the 3D MHD models for later type stars produced by Panja et al. (2020).We explore two different cases in which the total spot filling factor (F Spot ) is kept constant but the relative fractions of the umbra (F U mbra ) and penumbra (F P en ) are varied.Table 7 shows the input parameters for these two cases.In doing this, we explore the retrieval response to contamination effects resulting from a spot that is characterised by two different temperatures/SEDs when the retrieval model is only capable of fitting this contamination using a single T Spot value.
As expected, this further discrepancy between the forward model and the retrieval model introduces an additional source of error to the retrieved planetary parameters.The retrieved parameters are given in Table 8 alongside the equivalent single T Spot case (Case 7) for comparison.The posterior distributions for the two spot temperature variation retrievals are given in Appendix C. The retrieved spot parameters give insight into how the retrieval has attempted to correct for the dual temperature spot.In both cases, the retrieval is best able to fit for the contamination with a moderate T Spot close to T P en , and an overestimated F Spot .The overestimate in the spot filling factor accommodates the increased contamination due to the higher contrast umbra.The posteriors show an increasing correlation and degeneracy between T Spot and F Spot compared to the single T Spot cases.The introduction of radial temperature variation slightly enhances the biases in the retrieved planetary parameters that were observed for the highest activity cases.Intuitively, the greater effect is seen when considering a spot with a higher umbra-penumbra ratio, resulting in an overestimate in the retrieved planetary temperature of ∼ 47 K and an underestimate in the water mixing ratio of 0.39 magnitudes.Although this error is non-negligible, importantly it is not limiting.Retrievals conducted in the presence of such contamination would still be useful and informative.

CONCLUSION
The main objective of this paper is to determine how complex our stellar activity models should be in order to remove biases so that we can characterise the transiting planet as accurately and efficiently as possible.At the same time, we want to ensure that our retrieval analysis remains reliable.As such, it is important that we are not introducing unjustified complexity which may inject a bias intrinsically.We make use of a grid of 27 spot-contaminated stellar disks created with the more complex forward stellar model (StARPA), in which the interplay of the spot and limb darkening is accounted for, and conduct retrievals with a simpler model (ASteRA) that neglects this in order to explore under which conditions this additional complexity is necessary.We find that ASteRA performs very well in cases of weak to moderate stellar contamination, constraining the planetary parameters to a high degree of accuracy.Importantly, in all cases, ASteRA performs far better than no correction attempt at all, consistent with the findings of previous studies.The need for an activity correction is especially demonstrated by the retrieved H 2 O mixing ratios which are underestimated by over two orders of magnitude in the worst-case scenario when no correction is applied.Iyer & Line (2020) find that stellar contamination must be accounted for when spot filling factors exceed 1% in order to avoid biasing the retrieved planetary parameters.Our results substantiate this, however, we highlight that the spot temperature contrast is also important to consider alongside the filling factor.The Bayes Factor is still moderately (lnB= 4.12) to strongly (lnB=11.33) in favour of the model containing the activity correction in the cases of a small (0.1R * ), high contrast (∆T Spot =1000K) spot at latitudes of 30 • and 0 • respectively (Cases 1 and 2), despite both of these spots having filling factors ≤ 1%.The spot latitude also contributes but to a lesser degree than the temperature contrast in the small spot regime.Importantly, there is no evidence for a hard boundary or 'on-off switch' for where stellar contamination should be considered.It will therefore be beneficial to apply an activity correction to safeguard against bias even in the lower activity regimes where contamination is less dominant.
Degeneracies between spot parameters at low resolution means that the retrievals tend to be less successful for accurately characterising the host star, particularly in regimes of low to moderate activity.Nevertheless, from a planetary perspective it is an excellent tool for accounting for potential contamination bias in the fundamental planetary properties R P , T P and log(H 2 O).We believe that this is how it should be viewed and utilised by the community moving forward, albeit with the one caveat that it cannot remove any bias that is introduced as a result of the inaccurate characterisation of fundamental stellar parameters.
For the highest activity cases considered here, the bias introduced by stellar contamination cannot be fully corrected for by ASteRA due to the limb darkening-spot interplay being neglected.As such, in these scenarios, a small amount of residual bias remains resulting in a slight loss of accuracy in the retrieved planetary parameters.This subtle loss of accuracy would not be easily identified without a priori knowledge of the activity level of the star.For this reason, it may be beneficial to consider other stellar activity mitigation processes in parallel if these are available e.g.utilising the out-of-transit observations or continuous photometry, in order to better interpret the retrieval results in the context of the host star.We show that for multiple spot cases the bias introduced by the limb darkening-spot interplay is minimised under the assumption that all spots have the same temperature.As such, single large spots present the worst case scenario for real observations.Further bias is also introduced if these large spots have separated into umbral and penumbral regions characterised by different distinct temperatures, making this something we should progressively start to consider as a community.
Although the results of this study are based on idealised spectra, the spot cases investigated in this analysis will provide a good baseline from which to fully explore the impact of stellar activity on both JWST and Ariel observations as our simulations cover a similar wavelength range and spectral resolution to what is/will be obtainable with these observatories.In future work we intend to conduct similar investigations using more realistic models and observations.With this in mind, from a stellar perspective we aim to extend the flexibility of the ASteRA plugin to incorporate the interplay of spots and limb darkening, as this study has shown that there are scenarios where this cannot reasonably be neglected.This will, however, need to be done with caution to avoid unknowingly injecting bias.We also aim to extend our retrievals to investigate other spectral types, in particular M dwarfs which may be more complicated, especially as their spectra can contain molecular lines that could be incorrectly attributed to the exoplanet atmosphere.Other exciting lines of investigation that naturally follow on from this work include exploring more complex manifestations of stellar activity for example; occulted spots and the presence of both spots and faculae.With respect to the exoplanetary atmosphere, we intend to extend this analysis to more complex, realistic atmospheric compositions.A particular emphasis will be put on physical processes responsible for producing features in the optical regime, where the stellar contamination is most pronounced, such as the presence of clouds and hazes and other opacity sources e.g.absorption by the alkali metals Na and K. Finally, we intend to transition from using idealised spectra to simulated instrument observations to explore the effects of more realistic noise and a more restricted wavelength coverage in the optical before eventually using this framework to achieve our ultimate goal of accurately analysing real observations.

Figure 1 .
Figure 1.Visual representations of the 27 spotted star cases investigated in this study.The spot colour corresponds to its temperature contrast with respect to the quiescent photosphere which has a temperature T P hot = 4750 K.

Figure 2 .
Figure2.Intensity profiles, normalised to the flux emitted from the disk centre of a homogeneous star with T ef f =4750 K, for a spot-contaminated star of the same temperature possessing a 0.2R * , ∆TSpot=1000 K spot located at latitudes of 0 • (left), 30 • (centre) and 60 • (right) and viewed at wavelengths of 0.5 µm (pink), 1.5 µm (green) and 9.5 µm (blue) respectively.

Figure 5 .
Figure 5.The effect of an extreme case of stellar contamination introduced by a large, high-contrast, unocculted, central spot (Case 7) on the observed transmission spectrum.The transit depths are overestimated at all wavelengths for the contaminated spectrum (blue) in comparison to the uncontaminated spectrum (pink).Error bars for both spectra are equivalent to 10 ppm.The magnitude of this overestimation increases exponentially as a function of decreasing wavelength with the strongest contamination seen in the optical regime.
Understanding the Interaction Between Spot Contamination and the Limb Darkening Effect

Figure 6 .
Figure 6.A graphical representation of the Bayes factors for the uncontaminated and 27 spot-contaminated cases explored.Black markers indicate that the Bayes factor is in favour of the ASteRA retrieval, where stellar activity has been accounted and corrected for.Red markers indicate a preference for the lower dimensionality retrieval in which the spot parameters are not fit for.The Jeffrey's Scale(Trotta 2008) is overplotted to show the strength of the model preference where a Bayes factor ≥ 1 is indicative of weak preference, ≥ 2.5 of moderate preference and ≥ 5 a strong preference.A Bayes factor below 1 is deemed to be inconclusive.

Figure 9 .
Figure 9.The error in the retrieved spot parameters TSpot and FSpot obtained for the uncontaminated and the 27 spotcontaminated cases.Figure elements are the same as in Fig. 7.Note that FSpot is very poorly constrained for the uncontaminated case which is consistent with the absence of stellar activity.In contrast, TSpot

Figure 11 .
Figure11.Forward model transmission spectra highlighting the effect of accounting for the contribution of limb darkening when considering the spot contamination or not for the most severely contaminated cases (Cases 7, 8 and 9).The uncontaminated spectrum assuming a quiet host star is shown in pink.The green spectrum is the spot-contaminated spectrum produced when the contribution of the limb darkening effect to the overall contamination is considered, the blue spectrum, in contrast, is the contaminated spectrum obtained when limb darkening is not accounted for.The error bars are equivalent to 10 ppm.

Figure 12 .
Figure 12.Left -Visual representations of the single spot Case 7 and the three multiple spot cases investigated: a two spot case, a multiple spot case with a random spot configuration and a multiple spot case where the spots occupy a preferred active latitude centred around ϕ = 15 • .All spots have the same temperature contrast (∆TSpot = 1000K) and their combined filling factors are kept constant at FSpot = 0.16.Right -The resulting contaminated spectra for the two spot (yellow), active latitude (orange) and random configuration (purple) multiple spot cases compared to the uncontaminated spectrum (pink) and the single spot of Case 7 with (green) and without (blue) the contribution of the limb darkening effect.The 10ppm error bars have been removed here for clarity.

Figure 16 .
Figure 16.Retrieved posterior distributions for the two spot case (Section 4.5).The pink lines indicate the ground truth values for each parameter.Inset: a graphical depiction of the stellar disk.

Figure 17 .
Figure 17.Retrieved posterior distributions for the multiple spot case in which the spots have a random configuration (Section 4.5).The pink lines indicate the ground truth values for each parameter.Inset: a graphical depiction of the stellar disk.

Figure 18 .
Figure 18.Retrieved posterior distributions for the multiple spot case in which the spots occupy a preferred active latitude centred around ϕ = 15 • (Section 4.5).The pink lines indicate the ground truth values for each parameter.Inset: a graphical depiction of the stellar disk.

Figure 19 .
Figure 19.Retrieved posterior distributions for Case 1 when considering a spot with radial temperature variations (Section 4.6).The pink lines indicate the ground truth values for each parameter.The orange and purple lines depict the ground truth values for the spot penumbra and umbra respectively.Inset: a graphical depiction of the stellar disk for this case showing the separation of the spot into an umbra and penumbra and their respective parameters.

Figure 20 .
Figure 20.Retrieved posterior distributions for Case 2 when considering a spot with radial temperature variations (Section 4.6).Figure elements are the same as those in Fig. 19.

Table 2 .
Fitting parameters used within the retrievals with ASteRA and TauREx and their prior bounds.

Table 4 .
Retrieved planetary parameters obtained for the same 27 cases as in Table3) but without accounting for the presence of stellar contamination by fitting for the spot parameters.Comparison with the ground truth (GT) shows that in cases of severe activity (e.g.Cases 7, 8 and 9) the planetary parameters retrieved are highly inaccurate. No.

Table 6 .
The retrieved planetary and spot parameters for the multiple spot cases considered in Section 4.5 compared to the ground truth values (GT) and Case 7, a single spot of the same temperature and filling factor.