Detrending Exoplanetary Transit Light Curves with Long Short-term Memory Networks

The precise derivation of transit depths from transit light curves is a key component for measuring exoplanet transit spectra, and hence for the study of exoplanet atmospheres. However, it is still deeply affected by various kinds of systematic errors and noise. In this paper we propose a new detrending method that reconstructs the stellar flux baseline during transit. We train a probabilistic long short-term memory (LSTM) network to predict the next data point of the light curve in the out-of-transit portions, and use this model to reconstruct a transit-free light curve, i.e. one containing only the systematics, during the in-transit portion. By making no assumption about the instrument, and using only the transit ephemeris, this provides a general way to correct the systematics and perform a subsequent transit fit. We name the proposed model TLCD-LSTM, for transit light-curve detrending LSTM. Here we present first results on data from six transit observations of HD 189733b with the IRAC camera on board the Spitzer Space Telescope, and discuss some of its possible further applications.


INTRODUCTION
Since the first exoplanet atmosphere observation twenty years ago (Charbonneau et al. 2000), more than 3000 transiting extrasolar planets have been discovered. Transit spectroscopy, i.e. multi-wavelength transit observations, has opened the way for the characterization of the atmospheric content and properties of exoplanets. In effect, this can be done by first reconstructing the transmission or emission spectrum from transit depth measurements at various wavelengths, at a typical precision level of just a few parts per million (ppm) for hot gaseous planets. This is to be contrasted with the imprints left in the stellar light curve by various instrumental and astrophysical effects, which make the measurement of transit depths extremely challenging. Given the shift of the field towards increasingly smaller planets, the need for efficient detrending methods is ever growing. Here we present a long short-term memory (LSTM) neural network approach to effectively model and detrend instrumental and astrophysical systematics in transit light curves.

The total flux F(t) received by a detector at time t can be broken down as follows:

1. Stellar flux: F_s(t)

2. Planetary signal: δ(t) = (R_P(t)/R_S)^2 in the case of a primary transit with no limb darkening, where R_S is the stellar radius and R_P the apparent planetary radius

3. Background stars and transient events: F_b(t)

4. Noise and instrumental systematics: G(·)
The total flux received by each pixel of the detector can then be written as F(t) = G[(1 − δ(t)) F_s(t) + F_b(t)], where F_s and F_b may vary depending on the position on the detector and are thus subject to instrumental systematics. We will refer to individual pixel time-series as pixel light curves, and to the summed contribution of pixels over time as a raw light curve. The main instrumental systematics observed with both the Hubble WFC3 and the Spitzer IRAC cameras are the so-called ramp effect (Knutson et al. 2007), hypothesized to be due to charge trapping in the detector (Agol et al. 2010), and intra-pixel and inter-pixel variations, which are correlated with the position of the source on a detector whose quantum efficiency varies across different pixels.
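As a concrete illustration of the flux decomposition above, the following sketch composes a synthetic raw light curve F(t) = G[(1 − δ(t)) F_s(t) + F_b(t)]. The box-shaped transit and the exponential "ramp" systematic G are illustrative assumptions only, not the instrument's actual response:

```python
import numpy as np

def transit_depth(t, t_c=0.5, duration=0.1, depth=0.015):
    """Box transit: delta(t) = (Rp/Rs)^2 inside the transit, 0 outside."""
    in_transit = np.abs(t - t_c) < duration / 2
    return np.where(in_transit, depth, 0.0)

def raw_flux(t, f_star=1.0, f_bg=0.01):
    # astrophysical part: (1 - delta(t)) * F_s(t) + F_b(t)
    astro = (1.0 - transit_depth(t)) * f_star + f_bg
    # toy instrumental systematic G: an exponential ramp
    ramp = 1.0 - 0.05 * np.exp(-t / 0.2)
    return ramp * astro

t = np.linspace(0.0, 1.0, 500)
f = raw_flux(t)
```

The detrending task is then to recover the astrophysical part given only f and auxiliary instrument data.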
Footprints of these entangled variability sources can be found in additional instrumental data collected besides the detector raw flux. In particular, the center and scale of the stellar point spread function (PSF) can be processed to give valuable information on the systematics while being mostly uncorrelated with the planetary signal itself.
Considering the analysis of time-correlated light curves with the end goal of detrending transit light curves and extracting the transit parameters as precisely as possible, one can approach the problem in several ways. Indeed, the disentanglement of various independent signals might naturally guide one toward blind source separation techniques, which have been applied to this problem (Waldmann 2012, Morello et al. 2014, Morello et al. 2016) using the pixel light curves as correlated components. In a complementary way, signal processing techniques have also been used to denoise the raw or pixel light curves, with Gaussian processes (Gibson et al. 2012), pixel-level decorrelation (Deming et al. 2015) or wavelet analysis (Carter & Winn 2009, Thatte et al. 2010, Morello et al. 2016). Here we choose the angle of interpolation, i.e. we provide predictions for the raw light curve during the transit given the out-of-transit parts of the light curve. The interpolation method we propose is nonlinear and thus capable of capturing complex long-term dependencies in the light curve.
The use of artificial neural networks (ANNs) is burgeoning in various fields, including astronomy. In particular, Charnock & Moss (2017) presented one of the first uses of recurrent neural networks (RNNs) in astronomy, for supernova classification. Yet in the subfield of exoplanetary sciences only a few studies have used ANNs so far, among them Hinners et al. (2018), who predicted stellar and planetary parameters from Kepler light curves using RNNs and representation learning, and Zingales & Waldmann (2018). Here we make use of a long short-term memory (LSTM) neural network (Hochreiter & Schmidhuber 1997) to interpolate the flux of a raw light curve during the transit, given additional time-series data coming from the PSF centroid. The LSTM network learns to predict the next value of the light curve at each time step. Predictions over future time steps are then performed in a probabilistic manner using ancestral sampling, i.e. by injecting the current prediction as input to the subsequent prediction, and so on. We thus assume that the pre-transit and post-transit information, along with additional data such as centroid time-series, is sufficient to predict the flux that the detector would have received in the absence of a planet transit.
This paper is organised as follows: Section 2 contains background information about neural networks, Section 3 presents the interpolating model and how it can be used for transit light curve fitting, and finally Section 4 is dedicated to an application on Spitzer data.

RECURRENT NETWORKS AND LSTMS
In a typical supervised statistical learning task, the goal is to learn a model h mapping an input x to an output y = h(x), given examples of pairs (x, y), in such a way that the expected error of future predictions is minimized.
Feed-forward neural networks, or multi-layer perceptrons (MLPs), represent the simplest architecture of deep neural networks; an example of this type of architecture is shown in Figure 1. No feedback connections exist in these models. Every layer consists of a set of neurons, and the neurons of the input layer represent each of the original input variables x. The output of each neuron is a scalar value and is used as input for the neurons of the next layer. Each subsequent layer transforms a linear combination of the outputs of the previous layer's neurons using an activation function: h_{l+1} = σ_l(W_l h_l + b_l), where W_l is a matrix of multiplicative weights, b_l the bias vector, h_l the vector of units and σ_l the activation function, all at layer l. If we interchangeably write h_l for the function represented at layer l as well as its output, the full function represented by a feed-forward network of depth D can be written y = h_D(h_{D−1}(... h_1(x))). Note that the non-linearity of at least one layer's activation function is key to obtaining a non-linear predictor.

The main characteristic of recurrent neural networks (RNNs) is that they allow for recurrent connections. If we consider an input sequence {x_1, x_2, ...} of vectors, a recurrent hidden layer processes it sequentially, receiving at step t both the input x_t and the previous hidden state(s) in order to compute the current state h_t. A typical example is shown in Figure 2, where the recurrence occurs between the hidden units of the same layer: h_t = h_t(x_t, h_{t−1}). Compared to MLPs, RNNs reduce the number of parameters of the network by sharing weights between time steps while seeking temporal patterns in the data.
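The feed-forward recursion h_{l+1} = σ_l(W_l h_l + b_l) can be sketched in a few lines of numpy; the layer sizes, random weights and choice of a sigmoid activation are arbitrary examples:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, weights, biases):
    """Apply h_{l+1} = sigma(W_l h_l + b_l) layer by layer."""
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(W @ h + b)  # linear combination, then activation
    return h

rng = np.random.default_rng(0)
sizes = [4, 8, 3]  # input dim 4, one hidden layer of 8, output dim 3
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
y = mlp_forward(rng.normal(size=4), Ws, bs)
```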
In practice, several more sophisticated recurrent architectures are often more effective than basic RNNs, most of them variants of the long short-term memory (LSTM, Hochreiter & Schmidhuber 1997) architecture, whose cell is shown in Figure 3. LSTM networks have proven successful in a wide range of applications including unconstrained handwriting recognition (Graves et al. 2009), speech recognition (Graves et al. 2013) and machine translation (Sutskever et al. 2014), to cite only a few. An LSTM cell contains four different gating units (see Figure 3), allowing the network to either retain or forget information from the past of the input sequence. This enables the relevant long-term time dependencies to be picked up more easily.

Figure 3. An LSTM cell, which replaces a usual hidden unit (i.e. neuron) in a feed-forward neural network. The input, forget and output gating units enable the cell to accumulate or shut off respectively the current input, long-term dependencies and output through a sigmoidal activation function. The square indicates a delay of one time step, and the operation symbols in the circles indicate the operations involving the gates' outputs.

The main addition in LSTMs compared to basic RNNs has been to introduce self-loops, conditioned on the context and controlled by the gates. The detailed update formulae for the gates and states composing each LSTM unit are:

i_t = σ(W_xi x_t + W_hi h_{t−1} + b_i)
f_t = σ(W_xf x_t + W_hf h_{t−1} + b_f)
o_t = σ(W_xo x_t + W_ho h_{t−1} + b_o)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_xc x_t + W_hc h_{t−1} + b_c)
h_t = o_t ⊙ tanh(c_t)

where t denotes the time step, W_ab the matrix of weights connecting vector a to unit b, b_a the bias vector of unit a, ⊙ the Hadamard (i.e. entrywise) product, and σ the activation function, typically a logistic sigmoid or tanh function.
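The update equations above can be sketched directly in numpy. This is a minimal single-cell step with the gates stacked in one weight matrix; the dimensions and random initialization are arbitrary examples:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, Wx, Wh, b):
    """One LSTM step. Wx: (4H, D), Wh: (4H, H), b: (4H,),
    gates stacked in the order [i, f, o, g]."""
    H = h_prev.size
    z = Wx @ x + Wh @ h_prev + b
    i, f, o = (sigmoid(z[k * H:(k + 1) * H]) for k in range(3))
    g = np.tanh(z[3 * H:])          # candidate cell update
    c = f * c_prev + i * g          # self-loop on the cell state
    h = o * np.tanh(c)              # gated output
    return h, c

rng = np.random.default_rng(1)
D, H = 3, 5
h, c = np.zeros(H), np.zeros(H)
for t in range(10):                 # run the cell over a toy sequence
    h, c = lstm_step(rng.normal(size=D), h, c,
                     0.1 * rng.normal(size=(4 * H, D)),
                     0.1 * rng.normal(size=(4 * H, H)),
                     np.zeros(4 * H))
```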
Incidentally, these types of gated RNNs also have the advantage of being easier to train than basic RNNs, as they alleviate the well-known vanishing/exploding gradient issue 4 (Kolen & Kremer 2001).

TLCD-LSTM
Here we describe the proposed model to interpolate a time-series on a pre-defined prediction range. As the final goal of this paper is to study the transit signal contained in the interpolation range after correction of the systematic errors, we name the method Transit Light Curve Detrending LSTM (TLCD-LSTM).
The model is based on the deep auto-regressive neural network model described in Salinas et al. (2017). It assumes that temporal relations exist in the time-series and learns to predict the next step in the training range of the input time-series. It can also make use of additional data available for prediction, contained in the so-called covariate time-series, which are to be distinguished from the main time-series. In general, one can consider both the main and covariate time-series to be multivariate, i.e. to be composed of several time-series each.
TLCD-LSTM is specifically adapted for interpolation within a given range, and therefore differs from Salinas et al. (2017) mainly in that the values it predicts are not in the future (i.e. at the end of the time-series) but at time steps somewhere within it.

Model description
Let us denote by {x_1, x_2, .., x_T} (abbreviated {x_t}) the main time-series of length T that we aim to interpolate on the prediction range [t_1..t_2], with t_1 and t_2 integers in [1..T], and by {z_1, z_2, .., z_T} (abbreviated {z_t}) the time-series of covariates, which constitutes additional data available for prediction over the whole time range. Finally, let us denote by {y_1, y_2, .., y_T} (abbreviated {y_t}) the target time-series, identical to the main time-series in the training range but possibly differing in the prediction range. In the case where {x_t} is a transit light curve, {y_t} is the hypothetical light curve without any transit signal.
As sketched in Figure 4, each value of the input time-series passes through a stack of LSTM layers, the output of which branches into two distinct feed-forward layers outputting two parameters μ_t and σ_t at each time step, which are respectively the predicted mean and standard deviation of the distribution of the current value x_t. The details and hyperparameters of the architecture are presented in Appendix C. The same network is used in both the training and prediction ranges, with only the inputs differing in each case. At each time step t, the network predicts the current value x_t from all past time steps x_1, .., x_{t−1} as well as from the current covariate z_t. While the actual previous value x_{t−1} is used as input in the training ranges, in the prediction range the previous prediction μ_{t−1} is injected as input instead (see Table 1).

4 Neural networks are trained via gradient-based minimization of a loss function. In each iteration of training, each parameter of the model (weight) receives an update proportional to the partial derivative of the loss w.r.t. that weight. Gradients that grow vanishingly small or very large can cause numerical instabilities, slow down training or stop it prematurely.
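The input-switching logic of Table 1 can be sketched as follows. Here `predict_step` is a stand-in for the LSTM stack plus the two output heads; in the training range the network sees the true previous value x_{t−1}, while inside the prediction range [t1, t2] it is fed its own previous mean prediction μ_{t−1}:

```python
import numpy as np

def run_network(x, z, t1, t2, predict_step, state=None):
    """Run the network over the full series, switching inputs
    between ground truth (training ranges) and the model's own
    previous prediction (prediction range)."""
    mus, sigmas = np.empty_like(x), np.empty_like(x)
    prev = x[0]
    for t in range(1, len(x)):
        mu, sigma, state = predict_step(prev, z[t], state)
        mus[t], sigmas[t] = mu, sigma
        # next input: ground truth outside [t1, t2], prediction inside
        prev = mu if t1 <= t <= t2 else x[t]
    return mus, sigmas

# toy stand-in predictor: predicts the previous value, fixed sigma
mus, sigmas = run_network(np.arange(10.0), np.zeros(10), 4, 6,
                          lambda prev, z, s: (prev, 0.1, s))
```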

Training the model
We assume each value x_t is sampled from a normal distribution with the predicted parameters:

x_t ∼ N(μ_t, σ_t²).

The likelihood is the product of the individual Gaussian likelihoods outside the prediction range, and the training loss is its negative logarithm:

L = − Σ_{t ∉ [t_1..t_2]} log N(x_t | μ_t, σ_t²).

Note that the log-loss is only computed in the training ranges. However, the last output of the prediction range is taken as the first input of the second training range, thus providing a way to link together the outputs in the different ranges.
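The masked Gaussian negative log-likelihood above can be written in a few lines; the boolean mask selecting the training ranges is the only assumption:

```python
import numpy as np

def gaussian_nll(x, mu, sigma, train_mask):
    """Negative Gaussian log-likelihood accumulated over the
    training ranges only (train_mask is True outside [t1, t2])."""
    ll = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)
    return -np.sum(ll[train_mask])
```

A better-fitting prediction (μ closer to x, or a tighter justified σ) yields a lower loss.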

Predicting the time-series
There are several ways one can generate predictions in [t_1..t_2] once a model is trained. Since the outputs of the network are the parameters of a probability distribution, the simplest is to directly take the vector of predicted means, ŷ_t = μ_t. However, one can also generate a trace by drawing a value from the Gaussian distribution at every timestep in the prediction range, y_t ∼ N(μ_t, σ_t²), injecting each of these draws as input for the next time step. Multiple traces obtained with this process then represent the joint predicted distribution (of which they are samples) more generally than the means vector alone. To generate a single vector of predictions from multiple traces, one can, for instance, select the median or mean value at every timestep to construct the median trace or mean trace on the prediction range. In Section 4, we focus on the simplest approach, i.e. selecting the output means and standard deviations.
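Trace generation by ancestral sampling can be sketched as follows; `predict_step` again stands in for the trained network, and the toy mean-reverting predictor is purely illustrative:

```python
import numpy as np

def sample_traces(n_traces, n_steps, x0, predict_step, rng):
    """Draw n_traces by sampling N(mu_t, sigma_t^2) at each step
    and feeding each draw back as the next input."""
    traces = np.empty((n_traces, n_steps))
    for i in range(n_traces):
        prev = x0
        for t in range(n_steps):
            mu, sigma = predict_step(prev)
            prev = rng.normal(mu, sigma)  # draw and feed back
            traces[i, t] = prev
    return traces

rng = np.random.default_rng(0)
# toy predictor: mean-reverting towards 1.0 with a fixed spread
traces = sample_traces(200, 50, 1.0,
                       lambda p: (0.5 * p + 0.5, 0.01), rng)
median_trace = np.median(traces, axis=0)  # one summary of the traces
```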

Covariant Features
The covariate time-series {z_t} can consist of single-dimensional or multi-dimensional data available in both the training and prediction ranges. It is used by the network as additional information besides the target time-series. This works simply by concatenating x_t (or μ_t in prediction mode) with the covariate data z_t to construct the new input to the network at every timestep. Ideally, one wants {z_t} to be correlated with the target time-series. Several time-series might be related to the time-correlated noise we intend to correct, and can therefore be used as covariate data in the model. In the application presented in Section 4 we suggest the use of PSF-related time-series, namely the centers and widths of a 2D Gaussian fit to the instrument's point spread function (PSF) on the images at every time step. One could also think of other potentially relevant information, such as simultaneous host star activity, calibration data relative to the detector or estimates of the background flux. For ground-based applications, information about airmass, seeing and weather patterns could be included.

Application to transit light curves
Here we discuss the application of the interpolating model specifically to transit light curves.
The transit signal must be contained within the prediction range. This requires either knowing beforehand when the transit occurs, or adapting the prediction range during the first phase of training. Pre-transit and post-transit data are used for training the network, and are assumed not to be contaminated by any transit event. They can, however, contain any sort of variability coming from the star, the background or the instrument. In fact, the model aims at picking up variations due to all sources other than a transit event in order to predict the flux due to these sources alone during the transit time.
We perform a transit fit at each evaluation step even though our model does not strictly require it for training. This is done for two main reasons: 1) The transit fit can be used as a proxy to evaluate the quality of the prediction and provide a criterion for early-stopping the training of our model. The transit fit is performed on the detrended light curve normalized with respect to the star (1 − δ_t); for details, see Appendix A. 2) We can use the transit fit to adapt the prediction range [t_1..t_2] during training so that it better matches the actual transit range of the data. This can be done by extracting the fitted mid-transit time and transit duration to compute the times for the beginning and end of the transit.
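The adaptation of the prediction range from the fitted mid-transit time and duration can be sketched as follows; the function name, the optional safety margin and the toy time grid are illustrative assumptions:

```python
import numpy as np

def prediction_range(times, t_mid, duration, margin=0.1):
    """Recompute [t1, t2] as the first/last indices within the
    fitted transit window, padded by `margin` (a fraction of the
    transit duration) on each side."""
    half = duration * (0.5 + margin)
    in_transit = np.abs(times - t_mid) <= half
    idx = np.flatnonzero(in_transit)
    return idx[0], idx[-1]

times = np.linspace(0.0, 1.0, 101)  # toy time grid, cadence 0.01
t1, t2 = prediction_range(times, t_mid=0.5, duration=0.21)
```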

APPLICATION
We present an application to 6 transit observations of the planet HD 189733 b from the Spitzer/IRAC detector at 8 µm, collected in 2007 and 2008 (PI: E. Agol, program: 40238). This hot Jupiter has been extensively studied and makes a good candidate for benchmarking our method. In this wavelength channel the ramp effect can be heavily pronounced (Agol et al. 2010), while the intra-pixel variations due to pointing jitter are less important than at shorter wavelengths. A few preprocessing steps are applied to the data and detailed in Appendix B. These include outlier removal, raw light curve extraction and normalization, centroid fitting and background light estimation.
In this section, we present predictions on the pre-transit range (on intervals not used in the training of the model) as an initial evaluation of the model, and then show results on the real transit ranges, from which we derive the detrended light curves and subsequent transit fits.

Testing
As the ground truth, i.e. the stellar and instrumental flux to be predicted, is not available in the transit range, we chose to first test the interpolating model on the pre-transit range, where its predictions can be evaluated more directly. In practice, three prediction ranges are selected within the first 250 time steps of each time-series, where no transit signal is present, and the mean squared error (MSE) is used to evaluate and compare the various models. An example prediction is shown in Figure 6, where the prediction is obtained by averaging 50 sampled traces.

Hyperparameter optimisation
We perform a grid search over different types of inputs and hyperparameters. More specifically, we vary the aperture width of the sub-array used for computing the raw light curve between 5 and 7 pixels; we experiment with including and excluding covariate features, namely: 1) excluding covariate features altogether, 2) including the centroid time-series, and 3) including the centroid and PSF width time-series. Furthermore, we vary the number of layers (between 1 and 4), units per layer (powers of 2 up to 1024) and dropout rate (between 0% and 50% in steps of 1%) 6 for the LSTM block, and try a unidirectional or bidirectional network 7 . We train each model on the 6 light curves and 3 different prediction ranges, monitoring the average MSE over these 18 predictions and using it as a criterion for early stopping and for comparison between the different models.
From these tests we observe the following:

• Including the centroid information improves the quality of the prediction by a factor of ∼ 2, and including the PSF width time-series besides the centroids brings a further improvement.
6 Dropout is a common regularization technique in deep learning, consisting of randomly dropping (setting to zero) a fraction of the neurons of a given layer during training. The dropout rate refers to this fraction. 7 By 'unidirectional' network we mean one that uses only past timesteps to infer the current one; by 'bidirectional' we mean one using timesteps from both past and future to infer the current one.
• Dropping 3% of the recurrent units slightly improves the predictions, especially as the number of parameters of the network increases.
• Using a bidirectional network slightly decreases the quality of predictions.
More information on the hyperparameters and model training used is presented in Appendix C.

Performance
We present in Table 2 the results of the best model in the explored grid. As a reference for the interpolation performance, we include a baseline model: a linear combination, with a constant term, of the centroid X/Y time-series {z_t^X} and {z_t^Y}, with coefficients fitted on the training ranges. Both models are trained on the training ranges and evaluated in the prediction ranges. The metrics computed for both models are the MSE, the mean absolute error (MAE) and the mean signal-to-noise ratio (SNR), computed over the N observations of each prediction range relative to σ_noise, an estimate of the noise level obtained by taking the mean value of the running standard deviation of width 15 over each input light curve. Given its simplicity, this baseline model does a remarkably good job at interpolating {x_t}, which is why it was chosen here as a reference for the MSE. Furthermore, since the TLCD-LSTM also uses the centroid time-series, the increase in performance seen in Table 2 can be directly interpreted as the improvement brought by the LSTM's ability to identify temporal dependencies in the raw light curve.

Figure 6. Example of interpolations on 3 light curves containing no transit. The prediction range lies inside the vertical dashed lines. The raw light curve is displayed in blue, the predicted traces in grey and the median prediction in orange.
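The evaluation metrics described above can be sketched as follows. The running-standard-deviation noise estimate follows the text; the exact form of the SNR is not stated in full here, so the ratio below (RMSE relative to the noise floor, lower meaning better) is an illustrative assumption:

```python
import numpy as np

def running_std(x, width=15):
    """Standard deviation in a sliding window of the given width."""
    half = width // 2
    return np.array([np.std(x[max(0, i - half):i + half + 1])
                     for i in range(len(x))])

def metrics(y_true, y_pred, x_input):
    err = y_true - y_pred
    mse = np.mean(err**2)
    mae = np.mean(np.abs(err))
    sigma_noise = np.mean(running_std(x_input))  # noise-level estimate
    ratio = np.sqrt(mse) / sigma_noise           # illustrative SNR-style ratio
    return mse, mae, ratio

rng = np.random.default_rng(0)
x = 1.0 + 0.001 * rng.normal(size=100)
mse, mae, ratio = metrics(x, x, x)  # perfect prediction as a sanity check
```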

Prediction on real transit ranges
Using the optimised hyperparameters listed in Section 4.1 and after training the model for 3000 epochs, we extract the output of the network for the whole time range, shown in red in Figure 7. The decreasing learning rate helps the network settle to a stable solution. Visually, the model appears able to pick up the trends and variability of each time-series, while joining smoothly the pre- and post-transit ranges where the ground truth is known.
The last step is to perform the transit fit on the detrended light curve {1 − δ_t}, normalized with respect to the stellar flux F_s(t). Since the limb darkening effect is minor at 8 µm, we chose a transit model with linear limb darkening (bound between 0.05 and 0.25), and compute the best fit using a Markov Chain Monte Carlo optimization procedure (Tsiaras et al. 2016). The fitted parameters are R_p/R_s, the mid-transit time t_c, a linear limb darkening coefficient u, the orbital inclination i and the orbital semi-major axis relative to the stellar radius a/R_s. The fitted models, residuals and auto-correlation functions (ACF) are shown in Figure 8, and the fitted parameters are presented in Table 3. The higher variance in the residuals of the 5th light curve is due to a higher noise level in the input data for this light curve. We compare the retrieved transit depths with the results published in Agol et al. (2010) for the same data set and preprocessing steps (Figure 9). Although slightly smaller, the scatter of the depths is still present, with a standard deviation of 91.7 ppm instead of 144 ppm. The mean of the 6 transit depths weighted by their standard deviations is also found to be slightly smaller in our case, by 94 ppm ≈ 4σ.

Table 2. Comparison of performance on the 6 light curves for each model. Every value is averaged between prediction and actual value over 3 different ranges of length 60, starting respectively at time steps 80, 100 and 120. The three last lines show the mean performance over all light curves and ranges in terms of MSE, MAE and SNR.

Table 3. Fitted physical parameters for each of the 6 transits.

DISCUSSION AND CONCLUSIONS
We presented a deep learning model suitable for interpolating time-series, and showed how it can be used to predict the variability of stellar light curves for subsequent transit fit. This approach has the advantage of not making any assumption on the types of noise, systematics or transit shape.
The presented method is similar to the Gaussian process (GP) approach (Gibson et al. 2012; Rasmussen & Williams 2005) in that both construct highly nonlinear models, avoid explicit physical modelling of the systematics and provide probabilistic predictions. However, they differ in several aspects: 1) The neural-network light-curve interpolation approach we propose does not need any transit model, whereas the GP approach includes one in its kernel. This makes the TLCD-LSTM approach more generally applicable, as it does not depend on a pre-defined kernel function.
2) The GP approach requires fewer parameters to train and provides fully Bayesian predictions compared to our LSTM-based approach. The smaller number of free parameters may make GPs the preferred choice for short time-series. However, GP computation scales more poorly with the number of data points, preventing them from being applied to datasets of more than ≈ 1000 time steps without binning of the time-series. The proposed interpolating LSTM can, on the other hand, be applied to longer or multiple light curves, as commonly found in Kepler and TESS time-series, allowing even very long-period variability to be captured in the predictive LSTM model. This is because the computational complexity in the case of GPs depends mainly on the number of data points, while in the case of deep neural networks it depends on the architecture chosen (i.e., in our case, the number of layers, number of nodes per layer and type of layers).
While the current implementation still relies on a few preprocessing steps, such as computing the raw light curve or the centroid fits, it constitutes a first step towards the ultimate goal of developing an end-to-end detrending pipeline whose input would be the raw pixel light curves or focal plane images. Furthermore, while we trained our network on data from only six real light curves, taking advantage of a larger number of light curves, real or simulated, would allow the development of a more general detrending approach for each instrument. LSTMs allow for efficient transfer learning between data sets and instruments (e.g. Kepler to TESS). This may become important in modelling systematics common to planet-star systems observed by multiple instruments, such as stellar noise.
As we have firmly entered the era of 'big data' in planet detection (e.g. Kepler, TESS and ground-based surveys), and with upcoming characterisation missions and instruments (e.g. JWST, Ariel, CHEOPS and the ELTs), the opportunities for data detrending and modelling with scalable deep learning methods, capable of processing large volumes of high-dimensional data, will become increasingly prevalent.

A. TRANSIT FIT
To obtain a light curve normalized with respect to the star, three steps are required: transformation of the network outputs back to the original (unstandardized) units, subtraction of the background flux F_b(t), and division of the background-subtracted raw light curve by the predicted stellar flux. With the time-series notation, where x_t and ŷ_t are respectively the input and the mean prediction of the neural network in the original units:

1 − δ_t = (x_t − F_b) / (ŷ_t − F_b).

Note that during training we use a simple piecewise-linear transit model with four parameters, described in Carter et al. (2008), optimized by least-squares fitting, and neglect the contribution of the background F_b relative to x_t and ŷ_t.

B. PRE-PROCESSING
Here we describe the different preprocessing steps applied to the raw subarray data.
Outlier removal - Due to a number of causes, such as remaining cosmic rays or bad pixels, the flux on individual pixels can exhibit large fluctuations within short timescales (≈ 1 sec). These abnormal values are identified by computing the absolute difference of the pixels' flux from their corresponding median within a time window of width 5 (2 sec exposures). The values of the median-subtracted time-series greater than 4σ are then replaced by the median values, where σ is the standard deviation of the time-series.
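The outlier-removal step above can be sketched as follows, with a synthetic spike standing in for a cosmic-ray hit:

```python
import numpy as np

def clean_outliers(flux, width=5, n_sigma=4.0):
    """Replace values deviating from their running median (window of
    the given width) by more than n_sigma standard deviations of the
    median-subtracted series."""
    half = width // 2
    med = np.array([np.median(flux[max(0, i - half):i + half + 1])
                    for i in range(len(flux))])
    resid = flux - med
    bad = np.abs(resid) > n_sigma * np.std(resid)
    out = flux.copy()
    out[bad] = med[bad]
    return out

rng = np.random.default_rng(0)
flux = 1.0 + 0.001 * rng.normal(size=200)
flux[50] = 1.5  # inject a cosmic-ray-like spike
cleaned = clean_outliers(flux)
```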
Raw light curve extraction -In order to limit the influence of background light and focus on the brightest pixels of the stellar PSF, 3 × 3, 5 × 5 and 7 × 7 pixel regions are extracted around the brightest pixel. The raw light curve is then obtained by summing all the individual pixel light curves.
Centroid fitting - As mentioned earlier, the centroid position time-series are highly correlated with the flux received by the detector. In order to compute the centroids, we perform a two-dimensional Gaussian fit with offset to the data at every timestep, and hence extract four useful time-series: two monitoring the position of the center on the detector and two for the widths of the Gaussian. As discussed in Agol et al. (2010), this method provides a far better estimate of the centroids than other methods such as flux-weighted extraction.
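The centroid-fitting step can be sketched with scipy's least-squares fitter; the frame size, initial guesses and synthetic test frame are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import curve_fit

def gauss2d(coords, amp, x0, y0, sx, sy, offset):
    """2D Gaussian with a constant offset, flattened for curve_fit."""
    x, y = coords
    return (amp * np.exp(-((x - x0)**2 / (2 * sx**2)
                           + (y - y0)**2 / (2 * sy**2))) + offset).ravel()

def fit_centroid(frame):
    """Fit the PSF, returning the centroid (x0, y0) and widths (sx, sy)."""
    ny, nx = frame.shape
    x, y = np.meshgrid(np.arange(nx), np.arange(ny))
    p0 = (frame.max() - frame.min(), nx / 2, ny / 2, 1.0, 1.0, frame.min())
    popt, _ = curve_fit(gauss2d, (x, y), frame.ravel(), p0=p0)
    return popt[1], popt[2], popt[3], popt[4]

# synthetic 32x32 frame with a known PSF position and width
x, y = np.meshgrid(np.arange(32), np.arange(32))
frame = gauss2d((x, y), 100.0, 15.3, 16.7, 1.5, 1.8, 5.0).reshape(32, 32)
x0, y0, sx, sy = fit_centroid(frame)
```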
Background extraction - The background flux contribution to the total flux, although minor, increases with the aperture size used for the light curve extraction. We estimate it here by taking the median flux value of the pixels located in the four corners of each frame, the corners being delimited by the complement of a circular aperture of radius 16 pixels. It accounts for 0.67% to 1.2% of the flux in our analysis, and should therefore be taken into account. However, as the background estimation is necessarily approximate, we advocate still interpolating on the raw light curve directly, and only correcting for the background before the transit fit.
Normalization - The raw light curve and centroid time-series are all locally standardized, i.e. individually centered around a mean value of zero and rescaled to have a standard deviation of one. The preprocessed raw light curves and centroid X/Y positions are shown in Figure 5. Note the diversity of effects among them, showing varying degrees of stochastic noise, ramps and jitter.
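The local standardization is a one-liner:

```python
import numpy as np

def standardize(x):
    # center to zero mean and rescale to unit standard deviation
    return (x - np.mean(x)) / np.std(x)

s = standardize(np.arange(10.0))
```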

C. HYPERPARAMETERS
Training parameters - Training was performed using the ADAM optimizer (Kingma & Ba 2014) with parameter values β_1 = 0.9, β_2 = 0.99, ε = 10^-8. The learning rate was decreased from 0.01 to 0.0001 using a polynomial decay law with exponent 20. We train the model using a batch size of 6 (all the light curves) for faster training.
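The learning-rate schedule can be sketched as follows, using the common polynomial-decay formulation; the total step count is an assumption, as it is not specified alongside the schedule:

```python
def poly_decay_lr(step, total_steps, lr_start=0.01, lr_end=0.0001, power=20.0):
    """Polynomial decay from lr_start to lr_end over total_steps."""
    frac = min(step, total_steps) / total_steps
    return (lr_start - lr_end) * (1.0 - frac) ** power + lr_end

# learning rate sampled every 500 steps over a 3000-step run
lrs = [poly_decay_lr(s, 3000) for s in range(0, 3001, 500)]
```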