Identifying Exoplanets with Deep Learning. IV. Removing Stellar Activity Signals from Radial Velocity Measurements Using Neural Networks

Exoplanet detection with precise radial velocity (RV) observations is currently limited by spurious RV signals introduced by stellar activity. We show that machine learning techniques such as linear regression and neural networks can effectively remove the activity signals (due to starspots/faculae) from RV observations. Previous efforts focused on carefully filtering out activity signals in time using modeling techniques like Gaussian Process regression (e.g. Haywood et al. 2014). Instead, we systematically remove activity signals using only changes to the average shape of spectral lines, and no information about when the observations were collected. We trained our machine learning models on both simulated data (generated with the SOAP 2.0 software; Dumusque et al. 2014) and observations of the Sun from the HARPS-N Solar Telescope (Dumusque et al. 2015; Phillips et al. 2016; Collier Cameron et al. 2019). We find that these techniques can predict and remove stellar activity from both simulated data (improving RV scatter from 82 cm/s to 3 cm/s) and from more than 600 real observations taken nearly daily over three years with the HARPS-N Solar Telescope (improving the RV scatter from 1.753 m/s to 1.039 m/s, a factor of ~ 1.7 improvement). In the future, these or similar techniques could remove activity signals from observations of stars outside our solar system and eventually help detect habitable-zone Earth-mass exoplanets around Sun-like stars.


INTRODUCTION
The Radial Velocity (RV) method has seen tremendous improvements since the first detections of exoplanets around Sun-like stars between 1988 and 1995 (Mayor & Queloz 1995; Latham et al. 1989; Campbell et al. 1988). Currently, the primary challenge in measuring Extremely Precise Radial Velocities (EPRVs) is overcoming noise from stellar variability (National Academies of Sciences, Engineering, and Medicine and others 2018; Haywood et al. 2020). The surfaces of Sun-like stars are affected by numerous phenomena, from convective granulation to magnetic activity in the form of spots, plages, and faculae. Due to the time-evolving and sometimes periodic nature of these features, they have been mistaken for planets on several occasions (e.g. Setiawan et al. 2008; Huélamo et al. 2008; Queloz et al. 2001) and can severely complicate the interpretation of RV measurements. Currently, these forms of stellar variability commonly limit RV measurement precision to 1 m s −1 (Dumusque 2018a; Haywood et al. 2016). To detect the 10 cm s −1 signals induced by Earth-mass exoplanets in the habitable zones of Sun-like stars, our limiting RV precision must improve by an order of magnitude.
The signals that limit RV precision on stars like the Sun are caused by four main physical processes: 1. Solar-type oscillations -produced by pressure waves propagating at the surface, pressure-mode (p-mode) oscillations result in a contraction and expansion of the external envelope of the star on timescales of a few minutes (Kjeldsen & Bedding 1995;Leighton et al. 1962;Ulrich 1970;Butler et al. 2003;Arentoft et al. 2008). These oscillations can produce RV signals ranging from 10 cm s −1 to 1 m s −1 for solar-like stars (Arentoft et al. 2008). The period and amplitude vary depending on the stellar type and evolutionary stage. For our Sun, this RV variation is at the 0.5 m s −1 level at a ∼ 5 minute period (Strassmeier et al. 2018;Cegla 2019).
2. Granulation phenomena -originating from convection in the outer layers of solar-type stars, granulation and supergranulation can induce RV signals at the m s −1 level (Lefebvre et al. 2008;Dumusque et al. 2011a) on timescales of a few minutes up to 48 hours. These granulation phenomena are found throughout the photosphere except in active regions where convection is suppressed by magnetic fields (Brandt & Solanki 1990;Livingston 1982;Dravins et al. 1981).
3. Short-term stellar activity -induced by stellar rotation paired with dark spots and bright faculae on the surface of the Sun, short-term stellar activity is caused by two different physical effects.
In the first effect, the presence of strong magnetic fields in active regions suppresses the convection and thereby the convective blueshift effect (Brandt & Solanki 1990; Cavallini et al. 1985; Livingston 1982; Dravins 1982). Relative to the quiet photosphere, the active regions then appear redshifted (Cavallini et al. 1985). As the active regions come in and out of view during the rotation, they produce RV signals of ∼ 0.4 − 1.4 m s −1 for the Sun (Meunier et al. 2010). In the second effect, the temperature difference between these active regions and the quiet photosphere results in flux differences. For example, dark sunspots are ∼ 700 K cooler and thus have much lower flux than the rest of the star (Meunier et al. 2010). In this way, spots break the balance between the blueshifted approaching limb and the redshifted receding limb as they pass across the stellar disk and induce RV variations that can reach 0.4 m s −1 on the Sun at high activity (Saar & Donahue 1997; Meunier et al. 2010). In other cases, like young stars and M dwarfs, dark spots can dominate the activity signals. Both the suppression of convective blueshift effect and the flux effect produce RV variations on the timescale of the rotation period.
4. Long-term stellar activity -generated by solar-like magnetic activity cycles, long-term stellar activity variations influence RV measurements on the timescale of several years. In solar-type magnetic cycles, the filling factor of active regions increases during high-activity phases. Since the increase in magnetic field in active regions suppresses the convection (and thereby convective blueshift), these areas will be relatively redshifted (positive velocity) as the activity level rises. Thus, the activity level and RVs are positively correlated (Meunier et al. 2010;Lindegren & Dravins 2003). Dumusque et al. (2011b) found that stars other than the Sun can also have these solar-like magnetic cycles and their corresponding long-term RV variations. For our Sun, we observe an 11-year magnetic cycle during which the sunspot number varies from zero to ∼ 150-200 on the visible hemisphere (Hathaway 2015) and large, bright magnetic regions can dominate solar RVs (Milbourne et al. 2019).
On the Sun, all four of these phenomena contribute activity signals with comparable amplitudes. In RV analysis, these signals are often aggregated into a single measurement of the stellar activity. When approximating each source of RV variations as Gaussian noise, the total scatter in an RV observation, σ_tot, can be summarized as:

σ_tot ≈ (σ_phot² + σ_ins² + σ_magn² + σ_gran² + σ_p-mode²)^(1/2)    (1)

where σ_phot is photon noise, σ_ins is instrumental noise, σ_magn originates from both short- and long-term activity, σ_gran is scatter from granulation phenomena, and σ_p-mode is scatter from p-modes.
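As a sanity check on Equation (1), the quadrature sum is easy to evaluate numerically. The per-source scatter values below are illustrative placeholders, not measurements from this work:

```python
import math

# Hypothetical per-source RV scatter values in m/s (placeholders only).
sigma = {"phot": 0.3, "ins": 0.3, "magn": 1.0, "gran": 0.8, "p_mode": 0.5}

# Equation (1): independent Gaussian noise sources add in quadrature.
sigma_tot = math.sqrt(sum(s ** 2 for s in sigma.values()))
print(f"sigma_tot = {sigma_tot:.2f} m/s")
```

Note that the largest term dominates: with these placeholder values, the 1 m/s magnetic-activity term alone sets most of the ~1.44 m/s total.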
To mitigate some of these forms of stellar variability, observing strategies have been developed to average out noise from granulation phenomena and p-mode oscillations. Dumusque et al. (2011a) showed that RV signals caused by these two stellar noise sources can be averaged out with longer integration times and a higher frequency of observations throughout the night (or day, in the case of observing the Sun). Later, Medina et al. (2018) extended this strategy to evolved stars. For p-mode oscillations specifically, Chaplin et al. (2019) demonstrated that fine-tuning exposure times to stellar parameters (e.g. 5.4 minutes for the Sun) can also efficiently average out p-modes down to ∼ 10 cm s −1 .
Other methods of distinguishing planetary systems from stellar variability include tracing activity indicators such as log R HK (Noyes et al. 1984), the Bisector Inverse Slope Span (Queloz et al. 2001), or Hα (Bonfils et al. 2007; Robertson et al. 2014); using statistical methods such as Gaussian Processes (GPs; Haywood et al. 2014; Jones et al. 2017; Delisle et al. 2018), Moving Averages (Tuomi et al. 2013), or tomographic modelling (Donati et al. 2014); or measuring the RV from the stellar lines least affected by stellar activity (Dumusque 2018a; Cretignier et al. 2020) to reduce the impact of stellar activity in RV datasets. Other methods of capturing stellar activity variations include using photometry (e.g. the FF' method; Aigrain et al. 2012, and a GP framework that extends it); using simultaneous spectroscopy and photometry to disentangle the contributions of spots, plages, and network regions to the RV signal; and combining RV metrics with solar photometry to predict rotation periods (Kosiarek & Crossfield 2020). More recently, several studies have investigated how individual spectral lines are affected differently by stellar activity (Dumusque 2018b; Cretignier et al. 2021; Wise et al. 2022).
Although methods such as the FF' method and the GP frameworks have been successfully applied to numerous datasets and detected low-amplitude planetary signals, they often require and rely on high cadence and carefully timed observations. It is often difficult to obtain such timely observations given the myriad scheduling constraints involved in running astronomical observatories. Ideally, we would employ a method that successfully addresses short-term stellar activity, but does not require high sampling or timing information that is necessary for GPs and Moving Averages. In this paper, we illustrate that machine learning (ML) algorithms have the potential to resolve both these challenges by identifying changes to the average shape of spectral lines. While this technique does require a substantial set of observations for training, densely sampled observations are not necessary.
Our strategy is to use ML to identify and interpret the subtle changes to stellar spectra that are caused by stellar activity. Previously, Davis et al. (2017) used principal component analysis (PCA) to show that photospheric activity signals are clearly distinct from Keplerian RV shifts in simulated data. Beyond distinguishing the two phenomena, we want to be able to predict the RV signals induced by stellar activity so that we can remove these signals and reveal smaller Keplerian signals that were previously hidden. Some preliminary methods are now emerging to separate stellar activity signals using these spectral changes (Collier Cameron et al. 2021; Cretignier et al. 2022). In this work, we attack the problem with machine learning and train multiple models to predict and remove stellar activity RV signals from observations of the HARPS-N Solar Telescope (Dumusque et al. 2015; Phillips et al. 2016; Collier Cameron et al. 2019).
Our paper is organized as follows. In Section 2, we describe the simulated data and real observations that serve as training sets, and in Section 3, we describe how we process and prepare the data to be input to our ML models. In Section 4, we describe how the observations were divided into training, (cross-)validation, and test sets. In Sections 5 and 6, we describe the ML architectures we used, including several different neural networks, and our training procedure. In Section 7, we present our results. In Sections 8 and 9, we discuss the implications of these results and conclude.
2. DATA

ML methods require data on which to learn. We trained ML models on two datasets: one set of simulated stellar spectra from the SOAP 2.0 software (Dumusque et al. 2014) and one set of real RV observations of the Sun from the HARPS-N solar telescope (Dumusque et al. 2015; Phillips et al. 2016; Collier Cameron et al. 2019).

2.1. SOAP simulations
We first explored the problem of predicting stellar activity variations with a simulated dataset. We produced this dataset using SOAP 2.0, a software package which estimates photometric and RV variations induced by sunspots and faculae (Dumusque et al. 2014; Boisse et al. 2012). SOAP simulates a star by dividing the visible hemisphere into a grid and injecting into each grid cell an observed solar cross-correlation function (CCF; obtained by cross-correlating a solar spectrum with a binary mask, as is done when reducing HARPS-N high-resolution spectra). The CCFs are shifted to account for the star's rotation velocity at each point on the star's surface. In certain user-specified locations, SOAP 2.0 modifies the local CCFs to mimic active regions, like dark spots or faculae. Finally, SOAP 2.0 sums the CCFs over the entire visible hemisphere of the star, convolves the resulting CCF with a simulated instrumental line profile, and fits the result with a Gaussian function to derive the RV due to the activity signal.
We modified SOAP 2.0 to produce a large set of simulated observations with varying parameters chosen with a Monte Carlo technique. We generated 20,000 random starspot/faculae configurations and used SOAP 2.0 to produce a simulated CCF and activity RV measurement for each configuration. Table 1 lists the range of values that each of the stellar parameters spans.

2.2. HARPS-N Solar Telescope
The HARPS-N Solar dataset consists of 528 days of solar observations (see Section 3.2 for details) from the HARPS-N Solar Telescope spanning three years (July 2015 to July 2018). Commissioned at the Telescopio Nazionale Galileo (TNG), the HARPS-N spectrograph is a vacuum-enclosed cross-dispersed echelle spectrograph with temperature and pressure stabilization (Cosentino et al. 2012). HARPS-N spans the wavelength range from 383 to 693 nm and has a resolving power of λ/∆λ = 115,000. During the daytime, a custom-built solar telescope connected to HARPS-N continuously observes the Sun with 5-minute integration times designed to mitigate the short-term variability caused by solar 5-minute p-mode oscillations. The solar telescope and control system are further described by Phillips et al. (2016).
The solar data are reduced using the same HARPS-N Data Reduction System (DRS) as used for nighttime stellar observations. By taking calibration exposures at the end of each day of solar observations, we acquire order-by-order information on the locations of the echelle orders and the wavelength calibration scale. The instrumental drift is monitored by taking exposures of light passed through a stabilized Fabry-Perot cavity concurrently with the solar exposures. After applying optimal extraction procedures (Horne 1986; Marsh 1989), the data are calibrated in wavelength such that we obtain a 1D background-subtracted spectrum in each echelle order. Lastly, the data are cross-correlated with a digital mask (Baranne et al. 1996; Pepe et al. 2002) derived from solar observations and corrected for instrumental drift based on the Fabry-Perot exposures. The resulting CCF is used as the input representation for the ML methods. Finally, the DRS extracts the RV of each observation by fitting the CCF with a Gaussian function. Because a Gaussian function is simple and symmetric, it cannot model the small perturbations to the CCF shape induced by stellar activity, so these RVs include both center-of-mass RV shifts and stellar activity signals.
We note that the full three-year HARPS-N dataset used in our paper was recently released to the public, as described by Dumusque et al. (2021). The data products released by Dumusque et al. (2021), available at https://dace.unige.ch/sun/, were reduced with a new extraction pipeline originally built for the newly commissioned ESPRESSO instrument. This new ESPRESSO pipeline analysis resulted in more precise RV measurements than the original HARPS-N DRS reductions (Collier Cameron et al. 2019), and these data were thus used in this analysis.

3. PREPARING THE INPUT REPRESENTATIONS
For our ML models, we cannot use the data directly from the telescope or simulations. Instead, we have to pre-process these data products into a uniform format that makes capturing the features in the data easier for ML models. We outline the steps we took to prepare the input representation for the ML models both for the simulated (Section 3.1) and HARPS-N Solar Telescope Data (Section 3.2).
We design the input representations to pose the problem to our ML models as predicting the activity signal, not the actual center-of-mass velocity of the star. Essentially, we want the ML model to predict the difference between a Gaussian fit to the CCF and the true velocity shift. With these predictions in hand, we can easily subtract them from the input RVs to produce a corrected RV time series. Intuitively, RV signals due to planets translate the CCF in velocity but do not change its shape. In contrast, stellar activity primarily changes the shape of the CCF rather than translating it. Thus, we shift the measured RV of each simulated CCF to 0 so that the ML methods become primarily sensitive to these shape changes, not translations.

3.1. SOAP Input Representation
Given the CCFs generated by SOAP 2.0 and the measured RV signals (due to the simulated active regions on the star), we apply the following pre-processing steps before sending the CCFs (without any timing information) into our ML models:

1. First we take the simulated CCF from SOAP 2.0 and shift it by the velocity measured by SOAP's Gaussian fit to the CCF. We do this by creating an x axis that is shifted by −∆RV, where ∆RV is the stellar activity shift measured by SOAP, and interpolating the CCF values from the shifted x axis back onto the original x axis using scipy.interpolate.interp1d() with cubic interpolation. We tested multiple interpolation methods (linear, nearest, cubic) and found that the cubic method was optimal. We also confirmed that any systematics introduced by interpolation were smaller than the changes due to stellar activity. (Note: in the presence of Keplerian shifts, this procedure would need to be modified as described in Section 8.4.) Shifting the CCF by the velocity measured by the SOAP Gaussian fit purposefully leaves a small translational shift between the true stellar radial velocity and 0; this shift is exactly what we wish to train the model to predict.
2. We then calculate a differential ∆CCF by subtracting a template CCF generated by SOAP with no active regions (also shifted as described in the previous step). We note that choosing a template CCF with no active regions or another random template CCF does not affect the overall analysis results, but this particular choice of ∆CCF highlights the changes to the shape of the CCF introduced by the active regions.
3. Lastly, we normalize the inputs to the ML methods. In particular, since each input is an array comprising the ∆CCF, we calculate the median and standard deviation of each point in the CCF over the entire simulated dataset, and normalize by subtracting the median and dividing by the standard deviation. This helps the optimization process by making the scale of variations of each input parameter roughly equal.

Figure 2. Comparison of Residual CCFs (∆CCFs) and SDO Observations: Column 1 (A1, B1, C1, D1) are SDO/HMI intensitygrams; Column 2 (A2,...,D2) are SDO/HMI magnetograms; Column 3 (A3,...,D3) are CCFs from HARPS-N Solar Telescope observations; Column 4 (A4,...,D4) are residual CCFs from HARPS-N, where the corresponding RVs are indicated by their color (red = redshifted, blue = blueshifted). Each row is from a different day of observations and thus displays distinct surface inhomogeneities that correspond to the residual CCF line shape (A4, B4, C4, D4).

Each input into the ML methods is only this normalized ∆CCF without any timing information. The neural network is then trained to predict the stellar activity signals based only on shape differences between normalized ∆CCFs from different observations.
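The three SOAP pre-processing steps above can be sketched as follows. The function and variable names are our own illustrative choices; the cubic interpolation via scipy.interpolate.interp1d follows the choice described in Step 1:

```python
import numpy as np
from scipy.interpolate import interp1d

def preprocess_ccf(v_grid, ccf, delta_rv, template, med, std):
    """Sketch of the SOAP pre-processing steps (Section 3.1).
    v_grid: velocity axis of the CCF; delta_rv: activity RV measured by
    the Gaussian fit; template: quiet-star template CCF (already shifted);
    med, std: per-pixel normalization statistics over the whole dataset.
    All names here are illustrative, not the authors' code."""
    # Step 1: shift the CCF by -delta_rv via cubic interpolation
    # (evaluate the CCF, defined on the shifted axis, on the original axis).
    shifted = interp1d(v_grid - delta_rv, ccf, kind="cubic",
                       bounds_error=False, fill_value="extrapolate")(v_grid)
    # Step 2: differential CCF relative to the quiet-star template.
    dccf = shifted - template
    # Step 3: per-pixel normalization over the entire dataset.
    return (dccf - med) / std
```

For a CCF that is simply a translated copy of the template, this procedure returns a ∆CCF of (nearly) zero, confirming that only shape changes survive the pre-processing.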

3.2. HARPS-N Input Representation
Our pre-processing for the HARPS-N data is nearly identical to our pre-processing for the SOAP data, with only two additional steps. Our procedure is as follows:

1. Create a daily average of CCFs. During the day, HARPS-N takes repeated 5-minute-long exposures of the Sun. We average all of these exposures together to obtain a daily averaged CCF and RV measurement. To do this, we follow Collier Cameron et al. (2021). In short, we perform a signal-to-noise weighted average of the CCFs and RVs. We exclude individual observations where the probability of being cloud-free (as calculated by Collier Cameron et al. 2019) is less than 99% or where the expected differential extinction correction exceeds 10 cm s −1 .
2. Remove signals from Solar System planets. The raw RVs measured by the DRS contain both the radial motion induced by the solar system planets and stellar activity signals. The planetary signal is dominated by a sinusoid with a semi-amplitude of 12 m s −1 and a period of ∼ 13 months, the synodic period of Jupiter as observed from Earth. To remove the planetary signals, we transform both the RVs and the CCFs from the barycentric to the heliocentric reference frame using the JPL Horizons ephemeris (Giorgini et al. 1996). For the RVs, we simply subtract the Sun's barycentric motion in the direction of the TNG to derive the heliocentric RV. For the CCFs, we perform the shift with the same method as we used to shift the SOAP CCFs, by creating a shifted x axis and interpolating back onto the original x axis. The resulting velocities and CCFs contain only stellar activity shifts (and instrumental systematics), in analogy to the simulated CCFs produced by SOAP 2.0.
3. Shift each daily-averaged CCF by its measured RV, exactly as in Step 1 of our SOAP pre-processing (Section 3.1), so that only the small translational offset corresponding to the activity signal remains for the model to predict.

4. We calculate the differential ∆CCF by subtracting a reference HARPS-N observation taken when the Sun had few magnetic features on its visible hemisphere, as determined by visual inspection of images from the Solar Dynamics Observatory (SDO) Helioseismic and Magnetic Imager (HMI). The observation we used as a quiet reference is from March 29, 2016 (see Figure 1). Although choosing another random template CCF yields the same overall results, choosing a template with few magnetic features on its visible hemisphere allows us to visualize CCF shape changes as a function of activity more clearly. This step is analogous to Step 2 of our SOAP pre-processing in Section 3.1.

5. We normalize the input features in exactly the same way as described in Step 3 of our SOAP pre-processing in Section 3.1.
6. Each input into the ML methods is only this normalized ∆CCF without any timing information. The neural network is then trained to predict the stellar activity signals only based on shape differences between normalized ∆CCFs from different observations.
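The daily-averaging step (Step 1 above) can be sketched as follows. The quality cuts keep exposures that are very likely cloud-free and have a small differential-extinction correction; the exact weighting scheme (here, weights proportional to the squared signal-to-noise ratio) and the array layouts are our assumptions, not the authors' implementation:

```python
import numpy as np

def daily_average(ccfs, rvs, snr, p_clear, ext_corr):
    """Illustrative S/N-weighted daily average (Section 3.2, Step 1).
    ccfs: (n_exp, n_pix) array of 5-minute CCFs; rvs: per-exposure RVs;
    snr: per-exposure S/N; p_clear: probability each exposure is
    cloud-free; ext_corr: expected differential-extinction correction
    in m/s."""
    # Quality cuts: very likely cloud-free, small extinction correction.
    keep = (p_clear > 0.99) & (np.abs(ext_corr) < 0.10)  # 10 cm/s
    w = snr[keep] ** 2                     # assumed weighting: (S/N)^2
    ccf_day = np.average(ccfs[keep], axis=0, weights=w)
    rv_day = np.average(rvs[keep], weights=w)
    return ccf_day, rv_day
```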

3.3. Visualizing the inputs
The result of these processing steps is a residual ∆CCF for each observation in our dataset. Here, we hope to give an intuitive understanding of what these residual ∆CCFs represent and how they convey information about stellar activity. In Figure 1, we illustrate how we subtract a quiet observation (panels B and D) from the observation of interest (panels A and C) to calculate the residual CCF ((A-B) and (C-D)). We note that the residual ∆CCFs shown here have not yet been scaled by subtracting the median and dividing by the standard deviation. Figure 2 shows several example ∆CCFs taken on different dates, illustrating how different activity patterns change the ∆CCFs.

Figure 3. HARPS-N ∆CCFs before normalization: residual CCFs (∆CCFs) computed by subtracting the mean CCF, highlighting differences in features between CCFs. For training the model, the ∆CCF is the input and the RV from stellar activity is the output. The RV is indicated by its color (red = redshifted, blue = blueshifted). An animated version of this figure, with a duration of 53 seconds, is available online 4 . In the animated version, the CCF residuals are layered on top of each other in the order in which they were observed; the final frame matches the static version of the figure.

4 https://github.com/zdebeurs/exoplanet-ml/tree/master/exoplanet-ml/rv net
In Figures 3 and 4 (animated versions available online 4 ), we illustrate the differences in shape of the ∆CCFs for different RVs induced by sunspots and faculae over our entire dataset. Each ∆CCF is color-coded by the measured RV induced by stellar activity. Clear patterns emerge in these plots: similar ∆CCF shapes tend to have similar measured RVs. These are the patterns our ML methods will use to predict, and thereby correct, stellar activity signals from the ∆CCFs.

4. CREATING TRAINING, VALIDATION, AND TEST SETS
In ML, datasets are commonly randomly separated into a training, validation, and testing set. The model is initially fit on the training set, a set of examples used to fit the parameters of the model. Next, the validation set provides a measure of predictive accuracy and model fit. The validation set consists of examples that the model has not seen in the training set and allows for optimization of the architecture and hyperparameters. Lastly, after the model architecture and hyperparameters are finalized, the test set is used as one last objective test of the model accuracy and fit.
In our work, we divided our two datasets (from SOAP 2.0 and the HARPS-N Solar Telescope) into separate groups for training, validation, and testing. Since the simulated SOAP 2.0 training set is sufficiently large, we divided the dataset into training (80% of the data), validation (10%), and testing sets (10%).
However, our smaller dataset from the HARPS-N Solar Telescope required a different approach. Instead, we created a cross-validation set (80% of the dataset), a validation set (10%; evaluated with a model trained on the full cross-validation set), and a testing set (10%). We then use k-fold cross-validation to make the most of the available data when optimizing the architecture and hyperparameters. In k-fold cross-validation, the cross-validation dataset is divided into k subsets. For each round of validation, one of the k subsets is treated as a holdout sample and the model is trained on the other k−1 subsets. In this way, k-fold cross-validation significantly decreases bias (i.e. overestimation of model performance) because the majority of the data is used for fitting. The exact choice of k represents a trade-off between bias and variance (i.e. how much performance changes depending on the data chosen to train the model): a higher value of k decreases the bias but can increase the variance. We divided our cross-validation set into 10 folds, as k=10 has been shown empirically to yield test error estimates that minimize both bias and variance (James et al. 2013). We optimized the architectures by assessing the performance on both the k-fold cross-validation set and the held-out 10% validation set. To estimate how well the final model generalizes, we evaluated our best models' performances on the test set (10% of the dataset), which consists only of examples that were not used in cross-validation or validation.
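The k-fold splitting described above can be implemented in a few lines. This sketch (our own, not the authors' code) shuffles the cross-validation indices once and yields train/holdout index pairs:

```python
import numpy as np

def kfold_indices(n, k=10, seed=0):
    """Split n example indices into k folds (Section 4). Each fold
    serves once as the holdout sample while the model is trained on
    the remaining k-1 folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)          # shuffle once, then partition
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val
```

Usage: `for train_idx, val_idx in kfold_indices(len(dataset), k=10): ...`, fitting a fresh model on `train_idx` and scoring it on `val_idx` in each round.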

5. NEURAL NETWORK ARCHITECTURES
To learn to predict stellar activity RV signals from differences in shape of the ∆CCF, we trained three different ML architectures: a linear regression model, a fully connected neural network, and a convolutional neural network.

5.1. Linear Architecture
The most basic model we trained is a linear regression model, which is equivalent to a neural network with zero hidden layers. As illustrated in Figure 5a, the linear architecture takes a vector x ∈ R^n as input, where R^n is the real coordinate space of dimension n (with n the dimension of the input data). The input vector x represents the rescaled CCF residuals. From the vector x, the linear architecture predicts a scalar y as the output, which is the predicted stellar activity RV. The predicted value ŷ is then

ŷ = w^T x + b    (2)

where w ∈ R^n are the weights that determine how each feature affects the prediction, and b is a bias term. w and b are the trainable parameters of the model.
Although a convenient choice due to its simplicity, a linear architecture makes the strong assumption that a linear relationship exists between the points in the CCF and the RV activity signal. For simulated, idealized data this assumption may not pose a problem, but for more complex real data it can break down (see Section 7).

5.2. Fully Connected Architecture

Figure 5b shows a fully connected neural network (FC NN; also sometimes referred to as a multilayer perceptron or feed-forward neural network). Each layer consists of scalar-valued units called neurons, where the outputs from one layer of neurons are the inputs for the next layer. The function that produces outputs from the inputs is called an activation function. This activation function φ produces a new representation of w^T x + b through a nonlinear transformation; its output φ(w^T x + b) can be thought of as a set of features describing x.

The values of the first and last layers comprise the inputs and outputs of the network. The values (activations) of the intermediate layers are not directly observed and are therefore referred to as hidden layers. The hidden and output layer activations are defined by

a_n = φ(W_n a_{n−1} + b_n)    (3)

where n is the layer number, a_n is an i_n-long vector of activations in layer n, W_n is an i_n × i_{n−1} matrix of learned weights, b_n is an i_n-long vector of learned bias parameters, and φ is an activation function that computes the hidden layer values.
In FC NNs, the most common activation function is the rectified linear unit (ReLU; Jarrett et al. 2009; Nair & Hinton 2010; Glorot et al. 2011), defined by the element-wise operation φ(x) = max{0, x}, which maps values of x < 0 to zero and leaves other values unchanged. In this way, rectified linear units are nearly linear and preserve many of the properties that make linear models easy to optimize with gradient-based methods (Goodfellow et al. 2016). We used ReLU activation functions in our neural network layers.
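A minimal forward pass implementing Equation (3) with ReLU activations might look as follows. This is a sketch with illustrative shapes, not the trained network from this work:

```python
import numpy as np

def relu(x):
    # Element-wise ReLU: phi(x) = max{0, x}
    return np.maximum(0.0, x)

def fc_forward(x, layers):
    """Forward pass through a fully connected network, Eq. (3):
    a_n = phi(W_n a_{n-1} + b_n). `layers` is a list of (W, b) pairs;
    the final layer is left linear so the network outputs an
    unbounded scalar (here, a predicted activity RV)."""
    a = x
    for W, b in layers[:-1]:
        a = relu(W @ a + b)       # hidden layers with ReLU
    W, b = layers[-1]
    return W @ a + b              # linear output layer
```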

5.3. Convolutional Architecture
FC NNs use matrix multiplication in which the matrix has a separate parameter for the interaction between each input unit and each output unit. Every input interacts with every output, making FC NNs "agnostic" to spatial structure present in the data. For example, they treat adjacent data points exactly the same as data points that are far apart, making it inefficient to learn local features (e.g. edges and shapes) that may appear in different locations. In contrast, convolutional neural networks (CNNs) have only local (sparse) interactions, which force them to learn local features across the entire input and exploit the spatial structure (Figure 5c). Rather than learning local features separately for every input-output interaction, they only have to learn these features once. This reduces the number of parameters that the model needs to learn and decreases the number of computational operations required to predict the output.
The 1D convolutional layers depicted in Figure 5c operate on a stack of K input vectors (feature maps). The activations of the k-th feature map in layer n are given by

a_n^(k) = φ( b_n^(k) + Σ_l w_n^(k,l) * a_{n−1}^(l) )

where * is the discrete cross-correlation operation (often referred to as a "convolution" in the ML literature), w_n^(k,l) is an m_n-long vector of learned parameters called the convolution kernel or filter, b_n^(k) is an i_n-long vector of learned bias parameters, and φ specifies the activation function. By making the kernel size small (m_n ∼ 3 − 7), each feature map becomes sensitive to local features along its input.
Intuitively, we would expect the CNN to perform best because we are trying to identify shapes and relationships between adjacent points in our ∆CCF to produce the final stellar activity output.
Commonly, CNNs use both convolutional layers and pooling layers. Pooling layers typically replace the output of the neural net at a certain location with a summary metric (e.g., maximum) of the nearby outputs. Pooling generally helps to make the representation translationally invariant. However, in early iterations of our models, we found that adding pooling layers negatively affected our performance so we do not use any pooling layers in our CNN models.
After the convolutional layers in our CNN model, we finally include one (or more) fully connected layer(s). The last fully connected layer then produces the final output.
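Putting the pieces together, a minimal forward pass of such a CNN (two convolutional layers with no pooling, a flatten step, and one fully connected layer producing the scalar activity prediction) might look like the following. All shapes, filter counts, and the ReLU activation are hypothetical illustration choices, not the tuned architecture from Tables 4 and 5:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(a, w, b):
    """'Valid' cross-correlation of each input channel with each kernel,
    followed by a ReLU activation."""
    K, L = a.shape
    _, N, m = w.shape
    out = np.zeros((N, L - m + 1))
    for n in range(N):
        for k in range(K):
            out[n] += np.correlate(a[k], w[k, n], mode="valid")
    return np.maximum(out + b[:, None], 0.0)

def cnn_forward(delta_ccf, params):
    (w1, b1), (w2, b2), (W, c) = params
    h = conv1d(delta_ccf[None, :], w1, b1)  # conv layer 1
    h = conv1d(h, w2, b2)                   # conv layer 2 (no pooling)
    h = h.ravel()                           # flatten for the dense layer
    return float(W @ h + c)                 # FC layer -> scalar activity RV

# hypothetical sizes: a 49-point delta-CCF, 2 filters per layer, kernel size 5
L, F, m = 49, 2, 5
params = [
    (rng.normal(size=(1, F, m)) * 0.1, np.zeros(F)),
    (rng.normal(size=(F, F, m)) * 0.1, np.zeros(F)),
    (rng.normal(size=F * (L - 2 * (m - 1))) * 0.1, 0.0),
]
pred = cnn_forward(rng.normal(size=L), params)
```

The final dense layer collapses the learned local-feature maps into the single predicted stellar activity velocity.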
NEURAL NETWORK TRAINING

Training algorithm

Neural networks are trained to minimize a loss function, which quantifies the difference between the predictions and the true labels in the training set. For regression problems, the mean squared error (MSE) is the standard loss function, defined by

$$\mathrm{MSE}(p) = \frac{1}{M} \sum_{i=1}^{M} \left( \hat{y}_i(p) - y_i \right)^2,$$

where p is the vector of all parameters in the model, y_i are the true labels in the training set, and ŷ_i(p) are the model's predictions. The most popular neural network training algorithms use gradient descent to find the parameters p that minimize the loss function. These algorithms calculate how the parameters of the model can be changed to decrease the loss function by computing the gradient of the loss function with respect to the parameters. The model's parameters start at random values and are iteratively updated by descending along the gradient until the desirable minimum of the loss function is achieved. The step size is set by the learning rate, which is a hyperparameter that requires tuning during (cross-)validation to achieve optimal performance.
Computing the exact gradient of the loss function is unnecessary and computationally inefficient, as it requires iterating over the entire training set. Instead, each gradient step approximates the true gradient using a random batch (i.e. subset) of the training set; the algorithm is then called stochastic gradient descent (SGD). The batch size is typically determined by the available computational resources. We kept the batch size constant at 300; Shallue et al. (2019) demonstrated that the performance should be the same at any batch size, provided the other hyperparameters are well-tuned.
In practice, neural networks are often trained using variants of the basic SGD algorithm. For our FC and CNN neural networks, we used SGD with momentum (Polyak 1964) with the momentum parameter fixed at 0.9.
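A minimal version of the SGD-with-momentum update can be written in a few lines; here it is applied to a toy quadratic loss with full gradients rather than mini-batches, purely to illustrate the update rule:

```python
import numpy as np

def sgd_momentum(grad, p0, lr=0.1, momentum=0.9, steps=200):
    """Minimal SGD-with-momentum loop (Polyak 1964): the velocity v
    accumulates an exponentially decaying sum of past gradients, and
    the parameters move along v rather than the raw gradient."""
    p = np.asarray(p0, dtype=float).copy()
    v = np.zeros_like(p)
    for _ in range(steps):
        v = momentum * v - lr * grad(p)
        p = p + v
    return p

# toy loss L(p) = ||p - p_star||^2 with a known minimum p_star
p_star = np.array([2.0, -1.0])
grad = lambda p: 2.0 * (p - p_star)
p_hat = sgd_momentum(grad, np.zeros(2))
```

With momentum fixed at 0.9, as in our models, the velocity term damps oscillations across steep directions of the loss surface and speeds up progress along shallow ones.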

Overfitting and Regularization
The fundamental challenge in ML is that algorithms have to perform well on novel, previously unseen inputs, not just the data on which the model was trained (Goodfellow et al. 2016). The ability of a model to perform well on previously unseen inputs is called generalization, and can be estimated from its performance on a test set made up of data not used during training.
Overall, we can summarize the performance of a ML method by its ability to minimize both 1) the training error and 2) the gap between the training and test error. These two abilities correspond to the problems of underfitting and overfitting on training data, respectively. Techniques that seek to reduce overfitting, and therefore improve generalization, are known as regularization methods.
A common approach to regularization is to limit the complexity of the model by constraining the values of its parameters p. Two such methods are L2 regularization and weight decay regularization. L2 regularization adds a penalty term to the loss function proportional to the squared L2 norm of the parameter vector p,

$$L_{\mathrm{reg}}(p) = L(p) + \alpha \, ||p||_2^2,$$

where α is the strength of the regularization 5 and the L2 norm is defined as

$$||p||_2 = \sqrt{\sum_j p_j^2}.$$

Weight decay regularization instead shrinks the parameter vector by a constant factor on each iteration,

$$p_{i+1} = (1 - \alpha)\, p_i + (\Delta p_i)_{\mathrm{opt}},$$

where (Δp_i)_opt is the change to the parameter vector computed by the optimization algorithm at iteration i.
5 Sometimes, this regularization strength parameter is referred to as λ in the literature. We choose α here for consistency with our description of our linear model regularization procedure in Section 6.3.1.
Both of these techniques encourage smaller values of the parameters in p. In fact, L2 regularization and weight decay are equivalent when using the basic SGD training algorithm. However, they are not equivalent for all variants of SGD, in particular for SGD with momentum, which we used for our FC and CNN neural networks.
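The equivalence under plain SGD follows from matching the decay factor to the penalty strength: with an L2 penalty α||p||² and learning rate η, each L2-regularized step equals a weight-decay step with shrink factor (1 − 2ηα). A toy demonstration (the loss and constants are illustrative):

```python
import numpy as np

lr, alpha = 0.05, 0.1
g = lambda p: 2.0 * (p - 1.0)          # gradient of a toy loss (p - 1)^2

# L2 regularization: the penalty alpha * p^2 adds 2 * alpha * p to the gradient
p_l2 = 0.5
for _ in range(100):
    p_l2 -= lr * (g(p_l2) + 2.0 * alpha * p_l2)

# Weight decay: shrink p by a constant factor, then take the plain SGD step
p_wd = 0.5
for _ in range(100):
    p_wd = (1.0 - 2.0 * lr * alpha) * p_wd - lr * g(p_wd)
```

Both loops produce the same iterates under plain SGD; with momentum, the decayed parameters also feed into the velocity term, which breaks this equivalence.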
In those cases, we chose weight decay because it empirically performs better for neural networks (Loshchilov & Hutter 2017). Larger values of the weight decay parameter α discourage overfitting, but can also cause underfitting. We optimized α by exploring the parameter space during validation for the SOAP 2.0 simulated data and during cross-validation for the HARPS-N data for both our FC and CNN models.

Implementation and Training Procedure
For each of the model architectures, we tune the hyperparameters unique to that architecture. In contrast to parameters, which are learned during training, hyperparameters are set before training and control the learning process itself. The hyperparameters can significantly affect model performance.

Linear model implementation
We implemented our linear model in scikit-learn, an open-source library for ML in Python (Pedregosa et al. 2011). Specifically, we performed a ridge regression, which is equivalent to a linear regression with L2 regularization.
We performed random searches across the parameter space for α, which determines the regularization strength. The values of α spanned 0 to 800 and significantly affect the model performance, as illustrated in Figure 7. In Table 2, the best model's performance across the validation and cross-validation sets is listed.

Figure 8. Linear model input CCFs (before normalization) and corresponding vector weights. The residual CCFs that were included in the training set for our best linear model (α = 3.609) are plotted in red, white, and blue, where the color corresponds to the RV signal in the same way as Figure 4. The model input CCFs are normalized, but we plot them before normalization here for visualisation purposes. In black, the learned vector weights are plotted. The weights show correlations with their neighbors (reflecting the ≈ 3 km s −1 HARPS-N instrumental profile) and show most information comes from points within 5 km s −1 of the line center.

Each model was trained on a single CPU, and the training time took ∼1 minute for the ten runs over which we average the predictions to compute our final stellar activity predictions. For our linear model, we extracted the vector weights and plot these alongside the input CCFs in Figure 8. In the bottom panel of Figure 8, neighboring weights appear correlated and the largest weights are concentrated in the center, which is expected from visually examining the input CCFs in the top panel of Figure 8.
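As a sketch of what the fitted linear model amounts to (not the actual scikit-learn pipeline), ridge regression can be solved in closed form. The data below are synthetic stand-ins for the normalized ∆CCFs and activity RVs:

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Ridge regression weights: minimize ||X w - y||^2 + alpha * ||w||^2,
    solved in closed form as w = (X^T X + alpha * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))                 # rows: toy delta-CCF inputs
w_true = rng.normal(size=10)
y = X @ w_true + 0.01 * rng.normal(size=200)   # labels: toy activity RVs
w_hat = ridge_fit(X, y, alpha=1.0)
```

The fitted vector w_hat plays the role of the "vector weights" plotted in Figure 8: one learned weight per point of the input CCF.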

FC NN and CNN model implementation
We implemented our FC NN and CNN models in Ten-sorFlow, an open-source software library for ML algorithms (Abadi et al. 2016).
We used Stochastic Gradient Descent (SGD) with momentum to minimize the loss function over the training set. We performed random searches across the parameter space for the learning rate, weight decay, kernel size, number of filters, number of layers, and number of epochs over the (cross-)validation set, as listed in Table 3. Each model was trained on a single CPU, and the training time ranged from 5 minutes to 20 minutes depending on the complexity of the model. Across the 10 runs whose results we average, our best models took ∼5 and 8 minutes to train for the fully connected and convolutional architectures, respectively. In Table 4 and Table 5, the hyperparameters of the three best-performing models are listed for the FC NN and CNN architectures, respectively, along with the final model architectures used for both the simulated and HARPS-N datasets.

Note—To find the best hyperparameters across model architectures for both the SOAP 2.0 and HARPS-N observations, we performed random searches across the parameter space. The ranges of the parameter space explored are the same for both datasets. Convolutional layer parameters are denoted conv_parameter; fully connected layer parameters are denoted dense_parameter. For the FC NN and CNN models, we kept momentum = 0.9, which is a generally accepted value.
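The random-search procedure can be sketched as follows. The validation score here is a synthetic stand-in for actually training and scoring a model, and the learning-rate range is illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def validation_mse(log_lr):
    """Stand-in for training a model at this learning rate and scoring
    it on the validation set; a synthetic curve with an optimum near
    lr = 1e-2 for illustration."""
    return (log_lr - np.log10(1e-2)) ** 2

# draw learning rates log-uniformly and keep the best validation score
trials = [rng.uniform(np.log10(1e-5), np.log10(1.0)) for _ in range(50)]
best_log_lr = min(trials, key=validation_mse)
best_lr = 10.0 ** best_log_lr
```

In the real search, each trial draws all hyperparameters (learning rate, weight decay, kernel size, and so on) jointly and requires a full training run to score.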

Model Ensembling by Averaging
After optimizing our hyperparameters for a specific architecture, we train 10 independent copies with different random parameter initializations. We then average the 10 outputs for all predictions to compute our results in Section 7. This method of model averaging often improves performance. Across the input space, different versions of the same configuration may perform better or worse and this process averages out this difference in performance. This is especially important when the training set is small and we are at higher risk for overfitting. In addition, model averaging reduces the variance arising from randomness in parameter initialization and data ordering during training, making it easier to compare different architectures.
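A toy illustration of why averaging independent copies helps; the "models" here are stand-ins, each producing the same target plus independent noise representing randomness in initialization and training:

```python
import numpy as np

rng = np.random.default_rng(3)
truth = np.sin(np.linspace(0, 2 * np.pi, 100))

# stand-ins for 10 trained copies: same target, independent "training noise"
predictions = [truth + 0.3 * rng.normal(size=100) for _ in range(10)]

ensemble = np.mean(predictions, axis=0)     # average the 10 outputs

def rms(x):
    return np.sqrt(np.mean(x ** 2))

single_err = rms(predictions[0] - truth)    # error of one copy
ensemble_err = rms(ensemble - truth)        # error of the average
```

When the copies' errors are independent, averaging N of them shrinks the error variance by roughly a factor of N, which is the effect exploited here.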

RESULTS
Here we report the results of our ML activity predictions. First we discuss the metrics we used to evaluate the performance, and then we summarize how the different models performed on each dataset.

Performance metrics: σ_SD, σ_k·MAD, and σ_Percentile

After we finish optimizing our model parameters and hyperparameters on the training data, we evaluate our models' performance by characterizing the scatter of the "corrected" RVs, which we define as

$$\mathrm{RV}_{\mathrm{corrected}} = \mathrm{RV}_{\mathrm{raw}} - \mathrm{RV}_{\mathrm{predicted}},$$

where RV_raw are the input RVs 6 without any activity corrections, and RV_predicted are the predictions from our ML models. We introduce three metrics to characterize the scatter in the corrected RVs: σ_SD, σ_k·MAD, and σ_Percentile.
6 These are the RVs in the heliocentric frame as calculated in Step 2 of Section 3.2.

Figure 9. Linear, Fully Connected, and Convolutional Neural Network results for SOAP 2.0 (Column I) and HARPS-N data (Column II). The scatter metric is the standard σ_SD for the SOAP 2.0 simulated data and σ_Percentile for the non-Gaussian HARPS-N observations. The stellar error removed is the difference between the raw scatter and corrected scatter in quadrature. For the SOAP 2.0 simulated data, the FC NN model performs best across the test set, reducing the raw scatter from 82.0 cm s −1 to 3.1 cm s −1. For the HARPS-N Solar data, the CNN model marginally outperforms the FC NN architecture by reducing the RV scatter from 175.3 cm s −1 to 103.9 cm s −1, compared to 106.2 cm s −1 for the FC NN across the full dataset.
The first metric we calculate is the standard deviation of the corrected RVs, σ_SD, given by

$$\sigma_{SD} = \sqrt{\frac{1}{M-1} \sum_{i=1}^{M} \left( \mathrm{RV}_{\mathrm{corrected},i} - \overline{\mathrm{RV}}_{\mathrm{corrected}} \right)^2},$$

where M is the number of corrected RV observations. For well-behaved datasets like the SOAP 2.0 simulated data, using just the σ_SD metric is sufficient to characterize the scatter. On the other hand, the HARPS-N dataset is more complex, so we introduce two new metrics in addition to σ_SD: σ_k·MAD, based on the Median Absolute Deviation (MAD), and σ_Percentile. We introduce these additional two metrics for the HARPS-N data because our data does not follow a normal distribution perfectly, and the presence of a few outlier datapoints in the HARPS-N dataset made it difficult to assess the model performance using only the σ_SD metric. These new metrics are less sensitive to outliers than σ_SD. The MAD metric is defined for a set of corrected RVs as

$$\mathrm{MAD} = \mathrm{median}\left( \left| \mathrm{RV}_{\mathrm{corrected},i} - \mathrm{median}(\mathrm{RV}_{\mathrm{corrected}}) \right| \right).$$

In other words, MAD takes the median of the data's absolute deviations around the data's median. To ease comparison with our other metrics like the standard deviation, we scale MAD by a factor k such that

$$\sigma_{k \cdot MAD} = k \cdot \mathrm{MAD},$$

where k depends on the type of distribution. For a normal distribution, k ≈ 1.4826, and we approximate our distribution as normal. We call our final scatter metric σ_Percentile. We calculate this metric by computing

$$\sigma_{Percentile} = \frac{\mathrm{RV}_{84\mathrm{th}\%} - \mathrm{RV}_{16\mathrm{th}\%}}{2},$$

where RV_84th% is the 84th percentile of the corrected RVs, and RV_16th% is the 16th percentile of the corrected RVs. For a normal distribution, this is equivalent to calculating the standard deviation. However, our distribution of stellar activity signals is not perfectly normal and is skewed by some of the outliers. Thus computing σ_Percentile serves as a proxy for the standard deviation that is less sensitive to outliers.
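The three scatter metrics translate directly into NumPy; the corrected RVs below are a synthetic standard-normal sample, for which all three metrics should return approximately 1:

```python
import numpy as np

def sigma_sd(rv):
    """Sample standard deviation of the corrected RVs."""
    return np.std(rv, ddof=1)

def sigma_k_mad(rv, k=1.4826):
    """k * median(|rv - median(rv)|); k ~ 1.4826 matches the SD
    of a normal distribution."""
    return k * np.median(np.abs(rv - np.median(rv)))

def sigma_percentile(rv):
    """Half the 16th-to-84th percentile span; equals the SD for a
    normal distribution but is less sensitive to outliers."""
    return 0.5 * (np.percentile(rv, 84) - np.percentile(rv, 16))

rng = np.random.default_rng(4)
rv_corrected = rng.normal(0.0, 1.0, size=10000)   # toy corrected RVs, SD = 1
```

On heavy-tailed data, σ_SD inflates with outliers while the other two metrics stay close to the core scatter, which is why we quote σ_Percentile for the HARPS-N results.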

SOAP 2.0 Results
For the simulated data using SOAP 2.0, our best performing models were the linear model and the FC architecture, which reduce the RV scatter, σ_SD, from 82.0 cm s −1 to 3.7 cm s −1 and 3.1 cm s −1 across the test set, respectively. These results are summarized in Figure 9 and Table 6. Thus, for the idealized case of simulated data, we can predict the stellar activity signal nearly exactly based on the shape changes in the normalized ∆CCF.

Note—We computed the standard deviation across the simulated stellar activity signals before applying any corrections (raw data) and then apply stellar activity corrections using all three model architectures. Their resulting reductions in scatter are listed across the test set.

HARPS-N Results

Our results across all three architectures are summarized in Figure 9 and Table 7. Our best performing models were the FC NN and CNN (Figure 6), which reduced the RV scatter, σ_Percentile, from 175.3 cm s −1 to 106.2 cm s −1 and 103.9 cm s −1, respectively, across the full dataset. This remaining scatter is likely dominated by instrumental noise, not photon statistics. Closely following the performance of the FC NN and CNN, our linear model reached a minimum scatter of 108 cm s −1 across the full dataset. From Table 7, we note that the overall reduction in scatter varies slightly across scatter metrics and methods. Overall, these results suggest that all three model architectures match the structure of the HARPS-N Solar Data well, but the CNN model is potentially marginally more suitable. The raw RVs, CNN predicted RVs, and CNN stellar activity corrected RVs are plotted over time in Figure 10.

Periodogram: Activity Signal Peaks Disappear
We investigated the behavior of the raw and corrected HARPS-N RVs in the Fourier domain to see which signals are being removed to achieve this reduction in RMS scatter. Figure 11 shows the Lomb-Scargle periodograms (Lomb 1976; Scargle 1982) of the RVs before and after applying the activity correction. To implement a generalized Lomb-Scargle periodogram, we used the periodogram functions in astropy.timeseries; the periodograms are normalized according to the formalism in Zechmeister & Kürster (2009).
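A minimal classical Lomb-Scargle periodogram can be sketched in NumPy; note that astropy's generalized implementation (Zechmeister & Kürster 2009) additionally fits a floating mean and uses a different normalization. The signal here is a toy 25-day "rotation" signal on an uneven time grid, not the solar data:

```python
import numpy as np

def lomb_scargle(t, y, freqs):
    """Classical Lomb-Scargle power (Lomb 1976; Scargle 1982) for
    unevenly sampled data."""
    y = y - np.mean(y)
    power = np.empty(len(freqs))
    for i, f in enumerate(freqs):
        w = 2.0 * np.pi * f
        # per-frequency phase offset tau makes the power time-shift invariant
        tau = np.arctan2(np.sum(np.sin(2 * w * t)),
                         np.sum(np.cos(2 * w * t))) / (2 * w)
        c, s = np.cos(w * (t - tau)), np.sin(w * (t - tau))
        power[i] = 0.5 * ((y @ c) ** 2 / (c @ c) + (y @ s) ** 2 / (s @ s))
    return power

rng = np.random.default_rng(5)
t = np.sort(rng.uniform(0, 600, size=400))                   # uneven epochs
y = 1.5 * np.sin(2 * np.pi * t / 25.0) + 0.3 * rng.normal(size=400)
freqs = np.linspace(1 / 300.0, 0.4, 3000)
best_period = 1.0 / freqs[np.argmax(lomb_scargle(t, y, freqs))]
```

A periodic activity signal at the rotation period shows up as a peak at the corresponding frequency, which is how the panels of Figure 11 are read.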
In the top panel of Figure 11, the peaks at ∼ 25 days and ∼ 12 days correspond to the Sun's rotation period at the equator and half the rotation period, respectively. The signal beyond 900 days is the long-term magnetic cycle. Lastly, the peaks at ∼ 1 day correspond to aliases of both the rotation period signals and the long-term magnetic cycle. After applying the corrections from our CNN, the periodogram of the corrected RVs no longer has peaks corresponding to these activity signals. Thus, the CNN is able to identify and remove the quasi-periodic variability at the stellar rotation period based only on shape changes in the ∆CCF, with no information about when the observations were collected.

Using machine learning, we were able to predict and remove stellar activity signals from HARPS-N Solar Telescope observations and reduce the scatter in the measured RVs by about a factor of two, from 175.3 cm s −1 to 103.9 cm s −1 . While this improvement in RV precision is impressive and could increase our sensitivity to small planets if applied to observations of stars other than the Sun, our final scatter is still far greater than the roughly 10 cm s −1 precision necessary to detect habitable-zone Earth analogs around Sun-like stars.
What is limiting the precision of our activity-corrected HARPS-N RVs? One possibility is that our stellar activity corrections are not perfect, and the scatter in our corrected velocities is dominated by residual stellar activity signals. However, we think that this is unlikely. We see no evidence for any quasi-periodic stellar activity signals in the periodogram of our corrected RVs (see Figure 11), and our experiments with the SOAP 2.0 simulated data indicate that it is possible to achieve few cm s −1 precision after modeling and removing stellar activity signals in a similar configuration. Certainly real data will have complications and subtleties that make stellar activity harder to correct than in our idealized SOAP 2.0 simulations, but it seems unlikely that these differences would cause our limiting precision to be 20 times greater. Several other analyses of the HARPS-N Solar Data seem to agree that, even when we successfully model activity at the rotation period using a variety of different techniques, there is still some other process limiting our RV precision (Milbourne et al. 2019; Dumusque 2018a; Miklos et al. 2020). It is likely that the remaining scatter in our corrected HARPS-N data is dominated by instrumental noise. While HARPS-N is highly stabilized, the instrument does experience slow drifts and requires frequent calibrations to ensure the accuracy of its wavelength solution. The quality of these wavelength solutions limits the precision of velocities measured by HARPS-N. Dumusque et al. (2021) report that wavelength solutions generated by the version of the DRS we use in this paper tend to change by about 74 cm s −1 on day-to-day timescales, which could explain almost all of the scatter we see in our corrected HARPS-N solar velocities. If this is the case, then this technique could in principle yield more precise velocities when applied to data from newer stabilized spectrographs like ESPRESSO.

Comparison to other methods
Other common methods of reducing the RV scatter by characterizing and removing stellar activity signals include GPs. In a recent paper, Langellier et al. (2021) used GPs to reduce the RMS scatter of the HARPS-N dataset by a similar amount as our ML methods. One notable difference is that we achieve this reduction in RMS scatter without using any information about the timing of the observations, potentially eliminating the need for high cadence sampling 7 .
Another promising method for predicting stellar activity signals is to track the unsigned (unpolarised) magnetic flux as a proxy. By estimating rotationally modulated RV variations and the unsigned magnetic flux daily over 8 years using spatially resolved SDO/HMI images, Haywood et al. (2020) showed that a simple fit with unsigned magnetic flux reduces rotationally modulated RV scatter by 62% (a factor of 2.6 improvement). They successfully recovered planet semi-amplitudes of 0.3 m s −1 at orbital periods of ∼ 300 days. These numbers are not directly comparable to the work presented here because of different instrumental systematics and observational baselines; however, the improvement is of similar order. The authors note, however, that the unsigned magnetic flux is not yet measurable at high precision in slowly rotating, relatively inactive stars like the Sun. While this measurement is readily available for the Sun, making similar measurements for other stars will require pushing beyond the current state of the art in measuring Zeeman broadening from stellar spectra.
Recently, Collier Cameron et al. (2021) also explored the shape changes introduced in CCFs by computing an autocorrelation function that is invariant to translation (and thereby not sensitive to planetary reflex motion) but remains sensitive to activity-driven shape changes. In this analysis, the full 5 years of the HARPS-N dataset were used, and injected planet signals with K = 0.4 m s −1 and periods ranging from 7 to 200 days were recovered. Since this analysis used the full 5 years and an older version of the DRS, these numbers are not directly comparable to our work. However, the improvement in RMS is similar, which further supports that these stellar activity-driven shape changes can serve as a useful indicator for the stellar RV contribution.

Implications for planet detection
To estimate the implications this ML method could have for planet detection, we injected a synthetic planet signal into both the full dataset of raw RVs and raw CCFs (528 days of observations). We then ran our full pipeline as described in detail in Sections 3, 4, and 6. We attempted to detect the signals in a Lomb-Scargle periodogram and assess their significance using Markov Chain Monte Carlo (MCMC).
In Figure 12, we show a periodogram of the HARPS-N RVs before and after activity corrections with a planet signal injected with a semi-amplitude of 0.4 m s −1 at a 1 year period (corresponding to a planet of 4.53M ⊕ ). In the periodogram of the uncorrected HARPS-N RVs (top panel), the injected signal is visible, but difficult to distinguish from other stronger peaks in the periodogram caused by stellar activity. However, in the periodogram of the corrected RVs (bottom panel), this planet signal is clearly the most prominent after we corrected for stellar activity signals.
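A stripped-down version of this injection test can be sketched as follows, with synthetic white-noise "corrected" RVs standing in for the real data and a least-squares fit at the known period in place of the full periodogram-plus-MCMC analysis:

```python
import numpy as np

rng = np.random.default_rng(6)
t = np.arange(528, dtype=float)            # ~daily epochs (toy), cf. 528 days
rv = 1.0 * rng.normal(size=t.size)         # toy corrected RVs, ~1 m/s scatter

# inject a circular-orbit planet signal: K = 0.4 m/s at a 1-year period
K, P = 0.4, 365.25
rv_inj = rv + K * np.sin(2 * np.pi * t / P)

# recover the semi-amplitude by linear least squares at the known period;
# fitting both quadratures absorbs the unknown orbital phase
A = np.column_stack([np.sin(2 * np.pi * t / P), np.cos(2 * np.pi * t / P)])
coef, *_ = np.linalg.lstsq(A, rv_inj, rcond=None)
K_hat = float(np.hypot(*coef))             # recovered semi-amplitude
```

The real test additionally searches over period with a periodogram and samples the posterior with MCMC, but the core question is the same: whether the injected semi-amplitude is recovered above the residual scatter.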
Using the periodograms to initialize our MCMC, we derived the phase-folded fit in Figure 13, with K = 0.53 ± 0.07 m s −1 and P = 362.69 +9.26 −8.03 days (Figure 14 in the Appendix). The MCMC uncertainties in this fit seem to be slightly underestimated, likely due to the non-Gaussian noise properties in the RVs.
The sensitivity of our method appears to compare favorably with that of GP regression. Langellier et al. (2021) found that they would need 10-15 years of HARPS-N Solar data to detect a 0.5 m s −1 RV signal at a 225 day period with 5σ confidence. We found that we can recover a 0.4 m s −1 injected signal with ∼ 5σ confidence using only 3 years of HARPS-N Solar data.

Future Work and Prospects for this Technique on Other Stars
To extend this method to other stars, we could take two different approaches. One approach would be to focus on one star at a time, fitting a simple (linear) ML model simultaneously with planet signals. This method has the advantage that it does not require the removal of astrophysical signals before fitting and that we would only need data for one star at a time. Some of the disadvantages are that it could limit the complexity of the ML model and be more likely to overfit or suffer degeneracies where it fails to properly distinguish activity from planet signals, which is a common problem for other techniques like GPs. Preliminary explorations of this type of simplified activity model have shown promising results on data from both HARPS-N (de Beurs et al. in prep) and EXPRES on stars with between 25 and 100 observations. Another approach would be to train a more complex model on all stars observed by a given spectrograph simultaneously and predict stellar activity corrections for new stars (not included in the training set) based on the entire ensemble. The advantage is that the larger training set would allow for more complex ML models that can predict the activity signals more accurately. However, this method has the disadvantage that the planet signals will need to be removed ahead of time. Some undetected planet signals will always remain in the data, meaning we would lack a perfect "ground truth" on which to train our models. Instead, we would have to hope that undetected signals would average out across the training set. In addition, the different rotation rates, spectral types, and inclination angles may be challenging to handle and require significantly more model complexity.
Adding input features to our models, such as the log R HK or Hα time series, stellar parameters like effective temperature and stellar radius, or stellar inclination angles derived from measurements of the projected rotational velocity and rotation period, may help our models make more accurate predictions.
In some ways, observations of stars at nighttime may be simpler to use as inputs to ML models. Unlike solar observations, nighttime observations have the following properties:

1. Differential extinction is significantly less of a concern for observations of other stars at nighttime. Unlike the Sun, the other stars that we observe are essentially point sources. Thus, differential extinction across the disc would not be resolved and would induce significantly weaker systematic signals.
2. There are some yearly effects on the CCFs of the sun that will not appear in stars observed at nighttime. In solar observations, the Full Width at Half Maximum (FWHM) of the CCF is modulated with 6-month and 1 year timescales. This phenomenon is due to the eccentricity of Earth's orbit, which causes the Earth's angular velocity about the Sun to vary annually and changes the relative angular velocity of the Sun's rotation that we observe. The changing relative rotational velocity affects our measurement of the rotational broadening of solar spectral features and therefore causes variations in the FWHM of the CCF (See Figure 8a in Collier Cameron et al. 2019). The six-month oscillation in the FWHM arises from the obliquity of the ecliptic plane relative to the solar equator.
On the other hand, nighttime observations will introduce new challenges of their own. For nighttime observations, we will not be able to average out granulation as well as for the Sun, due to lower cadence observations. For other stars, the spectral lines also move across the detector due to barycentric velocity changes. In addition, observations of stars other than the Sun will often be photon-limited, unlike our solar observations, in which photon noise is negligible. Noisier observations will make separating activity signals from true RV shifts more difficult. Nighttime RV observations are also strongly heteroskedastic, unlike solar observations, due to factors like the observing conditions and different exposure times; training a model may require more sophisticated weighting than we used in this work. In the future, this deep learning method could be applied to spectrographs like ESPRESSO (Pepe et al. 2020) and EXPRES (Jurgenson et al. 2016), and might perform especially well on the extremely large data sets expected to be collected by HARPS-3 (Thompson et al. 2016; Hall et al. 2018).
In future NN architectures, it may be advantageous to add pooling layers. Pooling layers take advantage of translational invariance in the input of CNNs. On the one hand, adding pooling layers could help for other stars where there may be slight shifts in the CCF due to undetected planets. On the other hand, pooling layers may prevent the detection of precise shifts in RV due to magnetic features. While our initial tests with pooling layers on solar data did not seem to help network performance, they may be helpful on more complex datasets.
Other possible future applications of our ML stellar activity model include detecting young planets orbiting young stars. Young stars' high rotation rates and high levels of stellar activity make them especially complex, but simultaneously open the door to studying planetary formation and migration mechanisms. A machine-learning technique to remove stellar activity signals could enable measuring more masses and densities for transiting planets orbiting bright young stars (Vanderburg et al. 2018; David et al. 2019; Newton et al. 2019). In this way, studying young planets around young stars is crucial to exoplanet demographic studies (Damasso et al. 2020).

CONCLUSION
Achieving the extreme RV precision necessary to detect long-period Earth-mass exoplanets requires mitigating stellar activity signals. These dominant stellar activity signals hide the ∼10 cm s −1 signatures of Earth analog exoplanets and are difficult to remove due to their unpredictable time evolution and quasi-periodic nature. Current methods for mitigating stellar activity signals often rely on high cadence and carefully timed observations, but even with these methods, the detection of an Earth analog around a Sun-like star will be very difficult.
We have demonstrated a machine-learning approach to removing stellar activity signals that does not require the frequent sampling and timing information that other methods (like GP regression) depend on. By interpreting small shape changes in the stellar spectra induced by stellar activity, our ML models can predict and remove these dominant signals. So far, we have trained and tested our methods on simulated data and observations of the Sun, and plan to apply these or similar methods to observations of other stars in the future. For stars other than our Sun, we will not have as many spectra that can serve as a training set for a single star of interest but we may be able to train on an ensemble of other stars instead. We will make our code publicly available online to the exoplanet community 8 .
We developed and tested our methods on both simulated data (generated using SOAP 2.0; Dumusque et al. 2014) and solar observations from the HARPS-N Solar Telescope (Dumusque et al. 2015). Our best performing model for simulated data was a FC NN, which successfully reduced the scatter in simulated stellar activity signals from 82.0 cm s −1 to 3.1 cm s −1 across the test set. For the HARPS-N Solar observations, our best performing models, a FC NN and a CNN, reduced the RV scatter from 175.3 cm s −1 to 106.2 cm s −1 and 103.9 cm s −1, respectively, across 3 years of observations. When comparing our results to works that use GPs for HARPS-N observations, our FC NN and CNN models achieve similar (slightly better) precision than GP regression on the same data, without the need for high cadence sampling and timing information.
We explored how much an activity correction like the one we have demonstrated on the Sun could improve the detectability of planets in similar datasets around other stars. We injected planet signals into our activity-corrected HARPS-N observations and were able to recover signals with semi-amplitudes down to 30 cm s −1 , improving upon our detection limits in uncorrected observations by more than a factor of 2. However, these are not end-to-end injection/recovery tests and represent a best-case improvement to detection sensitivity. Future work will focus on investigating how to robustly prevent the algorithms from potentially confusing planetary and stellar activity signals. Nonetheless, these tests demonstrate that these advanced techniques could potentially pave the way to revealing previously hidden planets around our closest stellar neighbors.