Overfitting Affects the Reliability of Radial Velocity Mass Estimates of the V1298 Tau Planets

Mass, radius, and age measurements of young (<100 Myr) planets have the power to shape our understanding of planet formation. However, young stars tend to be extremely variable in both photometry and radial velocity, which makes constraining these properties challenging. The V1298 Tau system of four ~0.5 Rjup planets transiting a pre-main sequence star presents an important, if stress-inducing, opportunity to directly observe and measure the properties of infant planets. Su\'arez-Mascare\~no et al. (2021) published radial-velocity-derived masses for two of the V1298 Tau planets using a state-of-the-art Gaussian Process regression framework. The planetary densities computed from these masses were surprisingly high, implying extremely rapid contraction after formation in tension with most existing planet formation theories. In an effort to further constrain the masses of the V1298 Tau planets, we obtained 36 RVs using Keck/HIRES, and analyzed them in concert with published RVs and photometry. Through performing a suite of cross validation tests, we found evidence that the preferred model of SM21 suffers from overfitting, defined as the inability to predict unseen data, rendering the masses unreliable. We detail several potential causes of this overfitting, many of which may be important for other RV analyses of other active stars, and recommend that additional time and resources be allocated to understanding and mitigating activity in active young stars such as V1298 Tau.


Young Planets as Probes of Formation
Planet formation is an uncertain process. Giant planets are thought to form with large radii, inflated due to trapped heat, then cool and contract over the first few hundred Myr of their lives (Marley et al. 2007). However, the accretion efficiency of the formation process, which sets the planets' initial entropy and radii, spans orders of magnitude of uncertainty. The processes sculpting the post-formation masses and radii of smaller terrestrial exoplanets are also uncertain. Young, terrestrial planets also have uncertain initial entropies, and for highly irradiated planets, the unknown rate of photoevaporation (itself due to uncertainties in a planet's migration history, among other physical unknowns) during and after formation compounds this ambiguity (Lopez et al. 2012, Owen & Wu 2013, Chen & Rogers 2016, Owen 2020. Measuring the masses, radii, and ages of newly-formed planets presents a path forward (Owen 2020). Young moving groups provide rigorous age constraints, and relatively model-independent methods of measuring planetary radii exist for both young directly-imaged and transiting planets (for transiting planets in particular, only stellar radius model dependencies impact the inferred planetary radius). However, in both situations, few model-independent mass measurements exist. For transiting planets, there are two complementary methods for measuring planetary masses: transit timing variations (TTVs), and stellar radial velocity (RV) timeseries.
Measuring RV masses of young planets is a difficult task, so some advocate to rely on transit timing variations (TTVs) alone to measure masses of young planets. However, not all planets transit, and only planets in multi-planet systems at or near mean motion resonance exhibit TTVs (Fabrycky et al. 2014). Even in systems that do, individual TTV mass posteriors are often covariant, since TTVs to first order constrain the planetary mass ratio (Lithwick et al. 2012, Petigura et al. 2020. In an ideal scenario, both RVs and TTVs would be used to jointly constrain planetary masses in a given system, reducing posterior uncertainty and TTV degeneracies.

Stellar Activity & Overfitting
As the instrumental errors of extremely precise RV instruments approach 10 cm s −1 , and as the RV community begins to target more active stars, accurately modeling astrophysical noise is becoming more and more critical. Young stars present a particular challenge. These are highly magnetically active (Johns-Krull 2007), with starspots that occupy significant fractions of the stellar surface and induce RV variations on the order of ∼km s −1 (Saar & Donahue 1997). These RV variations are hundreds of times larger than the activity signals of older quiet stars typically targeted by RV surveys and complicate the detection of planet-induced Doppler shifts from even close-in Jupiter-mass planets (e.g., Huerta et al. 2008;Prato et al. 2008).
Other assumptions and/or information can be leveraged to model the activity signal, even if the signal isn't easily understandable from the RVs themselves. A widely used practice involves independently constraining the rotation period from a photometric timeseries, then using an informed prior on the rotation period to model the RVs (e.g., Grunblatt et al. 2015). Other related examples include specifying a quasi-periodic kernel for a Gaussian Process regression (GPR) model (i.e., assuming that the stellar activity has a quasi-periodic form), or modeling the RVs jointly with other datasets. The latter approach achieves better model constraints either by explicitly modeling the relationship between the datasets (e.g., Rajpaul et al. 2015) or by sharing hyperparameters between datasets (e.g., Grunblatt et al. 2015, López-Morales et al. 2016. As is true for every model-fitting process, misspecifying the stellar activity model (i.e., fitting a model that is not representative of the process that generated the data) or allowing too many effective degrees of freedom can lead to overfitting.
Overfitting is a concept ubiquitous in machine learning, and in particular is often used to determine when a model has been optimally trained. One algorithm for determining whether a model is overfitting is as follows 1 : divide the data into a "training" set and an "evaluation" set (a common split is 80%/20%), and begin optimizing the model using just the training set. At each optimization step, calculate the goodness-of-fit metric for the model on the evaluation set, which is otherwise omitted from the training process altogether. This method of evaluating a model's ability to successfully predict new, or "out-of-sample," data is known as cross validation (CV).
The classic observed behavior is that the goodness-of-fit metrics for both the training and evaluation set improve as the model fits the training data better and better. At a certain point, the model begins to overfit to the training data, and the goodness-of-fit metric for the evaluation data worsens. This is because the model parameters have begun to reproduce the noise in the training set, at the expense of reproducing the signal common to both datasets. A model that is overfitting, then, can be defined as one that predicts the observations in a training set better than those in an evaluation set. An overfitting model fits aspects of the data that are not predictable or common to the entire data set, e.g., noise.
The optimally trained model is selected not by its performance relative to the training data, but by its performance relative to the evaluation data, which was omitted from the training process altogether. Making an analogy to Bayesian model comparison, we could imagine a similar process where the goodness-of-fit is evaluated for an "evaluation set" left out of the training process (i.e., posterior computation using MCMC, nested sampling, etc.) for a series of models. One benefit of this method over, e.g., formal Bayesian model comparison is that it also provides an easily-interpretable absolute metric for how well the model fits the data: if the evaluation set goodness-of-fit is significantly worse than that of the training set, we know the model is misspecified, even if it has (comparatively) the lowest Bayesian evidence.
In this study, we apply the CV technique as defined above to evaluate the predictiveness of one particular model fit to one particular star. This is intended as a case study, aiming to inspire further investigation into the extent of and causes of overfitting in RV modeling of young, active stars.

V1298 Tau
V1298 Tauri (hereafter V1298 Tau) is a young system of four ≳ 0.5 R J planets transiting a K-type pre-main sequence (PMS) star (David et al. 2019b, David et al. 2019a). Very few transiting planets have been discovered around PMS stars (other notable systems being AU Mic, Plavchan et al. 2020, Cale et al. 2021K2-33, David et al. 2016;DS Tuc, Newton et al. 2019, Benatti et al. 2019HIP 67522 Rizzuto et al. 2020;Kepler 1627A, Bouma et al. 2022andTOI 1227, Mann et al. 2022). David et al. (2019b) reported the discovery of V1298 Tau b, a 0.9 R J object with an orbital period of 24d. David et al. (2019a) discovered three additional planets in the system: V1298 Tau c at 8d, V1298 Tau d at 12d, and a single-transiting object, V1298 Tau e.
It is unclear from their radii (0.5 − 1R J ) alone whether these planets are gas giants that contracted rapidly after forming, or young terrestrial or mini-Neptune planets, which will lose a large fraction of their atmospheres to photoevaporation (Owen & Wu 2013) and/or core-powered mass loss (Ginzburg et al. 2018) as they age.
Suárez Mascareño et al. (2021), hereafter SM21, presented over 100 RVs of the system with four different instruments, and used a suite of Gaussian Process (GP) models in an effort to isolate the RV signals of the outer two transiting planets (b and e), finding masses of 0.64 and 1.16 M J for each respectively. Combined with radii of 0.91 and 0.78 R J , this implied high densities of 1.2 and 3.6 g cm −3 . Such high densities would require rapid contraction after within the 23 Myr lifetime of the system and place the outer V1298 Tau planets at the upper density boundary of even mature field exoplanets, where few theories predict planets to exist. Since V1298 Tau e only transited once while observed with K2, SM21 did not have a period constraint from transits, and derived a period of 40.2 ± 0.9 d from their RV measurements. After the publication of SM21, Feinstein et al. (2022) published updated ephemerides of all four planets using TESS photometry, including a second transit of planet e. They placed a strict lower limit of 42 d on V1298 Tau e's period, in tension with the value from SM21. In addition, Tejada Arevalo et al. (2022) performed a dynamical stability analysis using the mass measurements reported in SM21, finding that 97% of system configurations consistent with the SM21 posteriors are gravitationally unstable over the lifetime of the system. SM21 performed a rigorous and state-of-the-art analysis, comparing several complex models with Bayesian methods. However, several independent lines of evidence appear to call the masses they report into question: the tension with formation theory discussed in SM21, the updated planet e orbital period of Feinstein et al. (2022), and the improbability of long-term stability derived by Tejada Arevalo et al. (2022).

This Paper
The purpose of this paper is two-fold: 1) to demonstrate the use of CV to show that the preferred model of SM21 is overfitting the RVs, and 2) to point out several potential causes of the overfitting, which are not unique to SM21 but common throughout the literature. We do not attempt to update the mass estimates for the V1298 Tau planets in this paper. We also do not attempt to prove that the mass estimates published in SM21 are incorrect, but instead seek to call into question their reliability. Our argument is that future joint models of the stellar activity and planetary signals of V1298 Tau will need to prove their predictiveness in order to be trustworthy. The structure of this paper is as follows: in Section 2, we review the literature data scrutinized in this paper and describe one additional contemporaneous RV dataset taken with Keck/HIRES. In section 3, we demonstrate that the preferred model of SM21 is overfitting. Section 4 discusses several potential causes of this overfitting, and advises modelers on how to detect and/or avoid these subtle pitfalls. In particular, Section 4.4 argues that differential rotation is an important effect for V1298 Tau, and must be modeled carefully. We conclude in Section 5. We also provide an appendix that provides a geometric interpretation of how GPR penalizes complexity. All of the code to create the plots in this work are publicly available on GitHub 2 .

DATA
Throughout this paper, we reference several data sets: three photometric time series measured by different instruments and three RV timeseries derived from spectra measured by different instruments. Each dataset is detailed in the subsections below. All of the photometry is shown in Figure 9, and all of the RVs are shown in Figure 1.

LCO photometry
We obtained ground-based LCO photometry verbatim from SM21.

TESS photometry
2 https://github.com/sblunt/V1298Tauri We obtained TESS lightcurves from Feinstein et al. (2022), who combined timeseries photometry of V1298 Tau from TESS Sectors 43 and 44. Feinstein et al. (2022) used the 2-minute light curve created by the Science Processing Operations Center pipeline (SPOC; Jenkins et al. 2016), and binned those observations to 10 mins. We normalized the data for each TESS orbit separately, following Feinstein et al. (2022).

SM21 RVs
We obtained CARMENES and HARPS-N RVs directly from SM21. We note that SM21 excluded infrared-arm CARMENES RVs in its analysis, and we do the same here.
HARPS-N RVs are wavelength calibrated using a ThAr lamp, and the HARPS-N spectrograph covers 360-690 nm. The visible arm of the CARMENES instrument covers the spectral range 520-960 nm, and spectra from this instrument are wavelength calibrated using a Fabry-Perot etalon, anchored using hollow cathode lamps.

Keck/HIRES RVs
Between November 16, 2018, and February 6, 2020, we obtained 36 RVs using the HIRES spectrograph on the Keck I telescope (Vogt et al. 1994). Wavelength calibration was performed by passing starlight through a warm iodine cell, and data reduction was performed using the California Planet Search pipeline described in Howard et al. (2010), which is adapted from Butler et al. (1996). All HIRES RVs used in this study are given in Table 1. Some of these RVs were previously published in Johnson et al. (2022), and the processing is identical in that paper and this. The same stellar template, constructed from two stellar spectra taken on 24 Oct 2019 UT without the iodine in the light path, was used to derive RVs in both studies. In-transit RVs from that study have been excluded here. Spectra were typically taken using the C2 decker (14" x 0.861"), which enables sky subtraction, and is the CPS HIRES observer "decker of choice" for stars fainter than V∼10. However, a CPS HIRES observer "rule of thumb" is to use the shorter B5 decker (3.5" x 0.861") in poor seeing conditions, as the Doppler pipeline sky subtraction algorithm is unreliable when the stellar PSF fills the slit. Sky subtraction is not performed under such conditions. Accordingly, 7 RVs published here were calculated from spectra using the B5 decker. In both modes, HIRES has a resolving power of ∼60,000, and the iodine cell spectral grasp translates to contributions to the RV from wavelengths between 500 and 620 nm (Butler et al. 1996).

CROSS VALIDATION TESTS
Our intention in collecting additional RVs of V1298 Tau with HIRES was to jointly analyze these data together with literature data and update the masses published in SM21. However, early on in the analysis, we noticed clues that made us question our assumptions. In particular, the new data we had collected did not seem consistent with the models of SM21. In addition, many tested models converged on results that were physically unreasonable or clearly inconsistent with subsets of the data. We ultimately decided to test the predictive capability of the SM21 model that we were using as our starting point, as a check on our own assumptions. This section details the outcome of those experiments.
The main finding of this paper is that the median parameter estimate of the preferred model of SM21 (their 4p P QP 2 ) is overfitting. For convenience, we will refer to this model throughout the rest of this paper as the "SM21 preferred model." Showing that a point estimate is overfitting does not necessarily indicate that every model spanned by the posterior is overfitting. However, since the preferred model presented by SM21 (their figure 11) appears approximately Gaussian around the MAP estimates of the parameters relevant for us, (except for the kernel parameter C and the white noise jitter for CARMENES, which both peak at effectively 0), we assume that the MAP and median for this fit are close enough to make no difference, and that inferences made about the median fit hold true for other high-probability areas of parameter space.
Our goal was to test the predictiveness of the preferred SM21 model 3 using CV. In an ideal situation, we would do this by evaluating the model's performance on new HARPS-N data, unseen by the trained model. Lacking this, we constructed two ad hoc "validation sets:" a timeseries of Keck/HIRES data contemporaneous with the SM21 HARPS-N data, and the CARMENES data presented in SM21 (that the model was also trained on, but which were treated as independent from the HARPS-N data; see section 4). By chance, this results in a nearly perfect 80%/20% split for both validation sets (80.3%/19.7% for HARPS-N/CARMENES, and 78.9%/21.1% for HARPS-N/HIRES). In Figures 2 and 3, we show two visualizations of the results of performing CV on these two validation sets. Figure 2 shows the GP prediction of the SM21 preferred model, together with the HARPS-N data on which it was trained and conditioned. The contemporaneous HIRES data and CARMENES data and their residuals are overplotted. Figure 3 shows the residuals of this fit, given in terms of standard deviations from the mean GP prediction. In both figures, the residuals of the HIRES and CARMENES data have a much wider spread about 0 than the HARPS-N points. Because our intention was to evaluate the existing model, we did not re-train the GP hyperparameters in order to compute the prediction shown in Figure 2. Rather, we used the median parameters of the SM21 4p P QP 2 model, conditioned on the HARPS-N data published in that study, to predict RV values at each of the CARMENES and HIRES epochs. Our interpretation of the difference in residual distributions shown in these two figures is that the preferred SM21 model fits data included in its training set (i.e., the HARPS-N data) significantly better than contemporaneous data not included. In other words, the model is not predictive. This is a hallmark of overfitting, and indicates that the preferred SM21 model is not representative of the process generating the data.
An important counter-interpretation is that the V1298 Tau RVs measured by HARPS-N, HIRES, and CARMENES show different activity signals, and not that the preferred SM21 model is overfitting. In particular, starspots cooler than the stellar photosphere cause RVs collected in redder bands, where the contrast between spot and photosphere is lower, to show lower variability amplitudes (e.g., Carpenter et al. 2001;Prato et al. 2008;Mahmud et al. 2011). In addition, we expect different instruments to have different RV zero-point offsets. Importantly, these two effects cannot explain the increased out-of-sample residual spread observed in Figures 2 and 3 4 ; the preferred SM21 model fitted the CARMENES zero-point offset, white noise jitter value, and activity amplitude, and those values have been applied to the CARMENES data. To account for the potential differences between the HIRES and HARPS-N RVs, we applied an RV zero-point offset and scale factor (0.76) that minimizes the residual spread (i.e., we applied a best-fit linear model to the HIRES data in order to minimize χ 2 with respect to the GP model prediction). See section 4.1 for further discussion of this point.
Another potential explanation for the phenomenon observed in Figures 2 and 3 is that the activity signals observed by HARPS-N, CARMENES, and HIRES are fundamentally different; i.e., the signal observed by one instrument is not a linear combination of the signal observed by another. This might occur because, for example, all three instruments have ∼km/s instrumental systematics relative to one another, or because the shape of the activity signal changes significantly with wavelength. To rule out this explanation and provide more evidence that the effect we're seeing is actually overfitting, and not instrument-specific differences, we repeated the experiment above using only HARPS-N data. We randomly selected 80% of the HARPS-N data published in SM21, conditioned the preferred SM21 model on that subset, and computed the residuals for the random "held-out" 20%. The results are shown in Figures 4 and 5. Even though these held-out 20% were included in the training process (i.e., they informed the values of the hyperparameters), we observed substantially larger residuals than for the conditioned-on subset. This experiment provides additional evidence for overfitting, and not instrumental-or wavelength-dependent systematics.
It is worth noting that we distinguish between residual distributions (Figures 4 and 3) "by-eye" in this paper, but this technique will not generalize for more similar residual distributions. Residual diagnostic tests (see Caceres et al. 2019 for an example) will be helpful in generalizing this methodology.

POTENTIAL CAUSES OF OVERFITTING
This section points out several potential causes of the overfitting described in the previous section, and advises on how to detect and/or ameliorate these effects. We do not attempt to quantify the effect of each of these on the overfitting discussed in the previous section, but intend this as a qualitative discussion. Many of these effects are potentially relevant for stars other than V1298 Tau. Importantly, this is not a list of "mistakes," but a list of assumptions we questioned throughout the process of trying to explain why the preferred SM21 fit was overfitting. We encourage future close investigation of each of these phenomena, both for V1298 Tau and other objects. This list is not exhaustive.

Correlated Datasets vs Datasets that Share Hyperparameters
The mathematical formalism in this section is essentially identical to that of Cale et al. (2021, see their section 3.2), but was developed independently. We encourage readers to compare our explanations, and we ask readers to also cite Cale et al. (2021) whenever referencing Section 4.1 of this paper.
There is a difference between correlated measurements that are allowed to have different GP amplitudes and datasets that share GP hyperparameters but are themselves uncorrelated. We are motivated to stress this distinction by the need in RV timeseries fitting to write down the joint likelihood of a model applied to datasets taken from several Figure 2. SM21 preferred model prediction and contemporaneous observed data. The HIRES data have been scaled and offset by linear parameters that minimize the residual spread with respect to the GP model, and the median 4pP QP 2 CARMENES data RV zero-point value was been applied in order to more easily compare both datasets with the model expectations. Top: mean model prediction (gray solid line), together with contemporaneous HARPS-N (black), CARMENES (red), and HIRES (purple) RVs overplotted. Bottom: model residuals, together with 1-and 2-σ GP uncertainty bands (shaded dark and light grey regions, respectively). Takeaway: The preferred SM21 model is overfitting to the HARPS-N data, which can be seen in the increased spread about the residual=0 line for both HIRES and CARMENES data during epochs with contemporaneous HARPS-N data. different instruments. As a concrete example, let's consider three fictional RV data points, the first two from HIRES and the next one from CARMENES, to which we would like to fit a GP model. Because of the different bandpasses of HIRES and CARMENES, we might expect the same stellar activity signal to have a different amplitude when observed by these two instruments. However, we might expect the time-characteristics of the signals to be identical. In other words, we expect the CARMENES activity signal to be a scalar multiple of the HIRES activity signal. 5 As discussed in Section 3, this assumption is borne out, at least to first order, in observations of other active stars at different wavelengths (see, e.g., Mahmud et al. 2011, who investigated the RV activity of the T-Tauri object Hubble I 4 with contemporaneous infrared and optical spectra taken with different instruments), but this point warrants further scrutiny. Comparing the variability of active stars with different instruments, as well as the variability of the sun with different solar instruments, is an important endeavor.
Another important caveat is the use of different techniques for computing RVs from stellar spectra (e.g., the iodine/forward-modeling technique of HIRES vs simultaneous reference/CCF technique of CARMENES and HARPS-N). Switching from one of these techniques to another is not expected to affect an astronomer's ability to recover common Keplerian signals, but spot activity is not a simple Doppler shift. More work is needed to understand and model spot activity at the spectral level. We proceed by assuming that modeling the same spectrum using an iodine/forward-model and with a simultaneous ThAr lamp reference (as an example) will only change the effective wavelength range of the spectrum that is used to compute RV, and therefore affect only the amplitude of spot-induced variations.
Assuming linearly-related GPs for different instruments, we can write down the joint covariance matrix for our three fictional data points, allowing unique amplitude terms a C and a H for each dataset, and assuming an arbitrary kernel function k i,j describing the covariance between RVs at times t i and t j : Optimizing the hyperparameters of a fit that uses this covariance matrix to define the GP likelihood will give the desired result. SM21, following many other fits in the literature, constructed an independent covariance matrix for each RV instrument in their dataset and summed the log(likelihoods) given by these together. This allows each RV dataset to be independent; i.e., a datapoint taken by HIRES is not correlated with a datapoint taken at exactly the same time  Figure 2, except that the model prediction is computed by conditioning on a randomly-selected 80% subset of the HARPS-N data, as described in the text, as the residuals are computed for the 20% subset that was held-out. Takeaway: The effect seen in Figure 2 cannot be explained by instrument-or wavelength-dependent systematics, because the same larger residuals are seen within the data taken by only HARPS-N. by CARMENES. Figures 11 and 12 illustrate the difference between these two likelihood definitions for data for a different object (chosen because it is easier to see the effect using this dataset).
This assumption of independent data for each instrument effectively adds additional free parameters to a model, and makes it more susceptible to overfitting. This is also why, in Figures 2 and 3, we could demonstrate that the preferred SM21 model was overfitting by comparing the model prediction conditioned on HARPS-N data to the CARMENES data; the CARMENES data influenced the final values of the hyperparameters, since they were shared between the two Gaussian processes, but otherwise the datasets were treated as independent.
To illustrate the effects discussed in this paper, we used a modified version of radvel (Fulton et al. 2016), built on tinygp , that treats the models for different instruments as correlated, but allows each instrument its own GP amplitude, white noise jitter term, and RV zero-point offset term. 6 The difference between the  Figure 3, except computed using the same method as for 4. Takeaway: the larger and more uniform spread of residuals for HARPS-N data on which the model was conditioned provides more evidence that the preferred SM21 model is overfitting.
previous version of radvel and this modified version is also illustrated in Figures 11 and 12 in the Appendix. This modified version of the code is available at https://github.com/California-Planet-Search/radvel/tree/tinygp. Future work should continue to test this assumption by obtaining simultaneous (or near simultaneous) RVs for a variety of stellar types with different instruments, across a wide range of bandpasses.

P rot and P rot /2
Another practice that may have made the SM21 preferred fit susceptible to overfitting involves constructing a GP kernel with one term at the rotation period and another term at its first harmonic. In other words, the SM21 preferred model kernel has the following form: To understand the motivation for this, we first need to scrutinize the RV signal in Fourier space. Figure 6 shows the Lomb-Scargle periodogram of all RV data presented in SM21, zooming in on two important parts of period space. There are four extremely significant peaks in the RVs, which can all be explained with a single periodic signal at 2.91d, the rotation period identified by SM21. Along with a strong peak at 2.91d (hereafter P rot ), there is a signal at P rot /2, which is often observed in RVs of stars showing starspot-induced variability (Nava et al. 2020). The other two strongly significant peaks can be explained as 1-day aliases of P rot and P rot /2. In other words, the dominant RV signal is periodic, but requires a two-component sinusoidal fit (i.e., it needs more terms in its Fourier expansion) in order for the fit to reproduce the shape of the curve. This is visualized in Figure 7, which shows the RVs phase-folded to P rot . In summary, the RV curve comprises a single periodic pattern, but that pattern is not a simple sinusoid. The preferred SM21 model kernel sums two approximately quasi-periodic terms, one at P rot and one at P rot /2, because the approximate quasi-periodic kernel used in SM21 (SM21 equation 1; derived in Foreman-Mackey et al. 2017) is less flexible than the standard quasi-periodic kernel (SM21 equation 3). In other words, the approximate kernel is less capable of fitting non-sinusoidal shapes. However, each term was modeled with its own independent exponential decay timescale. This adds an additional free parameter to the fit, which exacerbates the potential for overfitting.
The most straightforward way to address this is to construct a model with fewer unnecessary free parameters, for example by equating the parameters L 1 and L 2 in SM21 equation 1. A more complicated suggestion, which would be an excellent avenue for further study, is to leverage the correlation between the photometry and RVs, following, for example, Rajpaul et al. (2015). This requires assuming (or fitting for) a relationship between a photometric datapoint and an RV datapoint at the same time. Our preliminary investigations along these lines indicate that the FF' formalism, which models an RV signal as a function of a simultaneous photometric (F) dataset and the time derivative of the photometric dataset (F'; Aigrain et al. 2012), does not allow for a good phenomenological match between the LCO photometry and the contemporaneous RVs, but the derivative of the LCO photometry appears to fit better (i.e., the RV curve appears to be possible to model as a linear combination of the F' component only) 7 . Future work could write down a joint GP formalism that models RVs as the time derivative of the photometry (such a formalism would be very similar to that of Rajpaul et al. 2015).
Regardless, in order to be confident in the relationship between the photometry and the RVs, as well as to pick out the components of the RV that do not occur at P rot , we suggest a very high-cadence (several observations per night) RV follow-up campaign with contemporaneous photometry 8 in order to develop a high-fidelity model of the stellar variability 9 . It is important to note that this campaign need not be performed by an RV instrument with 30 cm s −1 precision; Johnson et al. (2022) demonstrated 6-7 m s −1 RMS precision with HIRES over several hours, even though the stars moves by hundreds of m s −1 over even a single night. This level of instrumental RV error should be sufficient to understand the stellar activity, so long as the cadence is as high as possible.

Keplerian Parameters Enable Overfitting in the Presence of Un-modeled Noise
A Keplerian signal has five free parameters (semi-amplitude, eccentricity, argument of periastron, time of periastron, and period). A model with two Keplerian signals therefore has 10 additional free parameters than a model without. To first order, more free parameters means more model flexibility. This problem can be addressed using model comparison, which penalizes complexity. However, if there is un-modeled noise in the data, including additional Keplerian signals in the model can lead to overfitting; for example, high eccentricity Keplerian models have similar properties to delta functions, which have relatively "flat" RV curves, except for a spike in RV near periastron. With insufficient sampling, outlier data points can be overfit with eccentric Keplerian signals.
A common worry in the RV modeling community is that using GPR to model stellar activity will "soak up" Keplerian signals, leading to underestimates of Keplerian RV semi-amplitudes (discussed in Aigrain & Foreman-Mackey 2022), even when modeled jointly. However, we find evidence for the opposite effect in the SM21 preferred fit: that the Keplerian signals function as extra parameters that make the model susceptible to overfitting, and the GP is forced to compensate. Examining Figure 8, which shows the contributions to the mean model prediction from the Keplerians and the activity-only portion of the mean GP model 10 , we find that the activity model interferes with the Keplerian model where RV data exists. This is seen most readily when smoothing the activity model over several rotation periods (effectively low-pass filtering the activity model).
We can explain this behavior by imagining that there is some un-modeled noise source in the data that is inconsistent with Keplerian motion or quasi-periodic variability (see next section). If some non-physical combination of parameters fits the data better at an epoch with many data points that is affected by this noise source, this may outweigh the negative Bayesian evidence contributions from 1) the added complexity and 2) the worse fit at epochs with fewer data points. We would then expect the Keplerian model to oversubtract at epochs with fewer data points (e.g., around jd = 1725 in Figure 8).
This effect suggests that the Keplerian signals in the SM21 preferred fit are not a viable description of the RV variability at timescales greater than the rotation period. More effort certainly needs to be spent understanding this phenomenon, but in the meantime we suggest performing CV tests in order to detect overfitting of this nature.

Differential Rotation
7 This was also noted in SM21. 8 As of 1-30-23, V1298 Tau will unfortunately not be reobserved with TESS through year 6. We used tess-point (Burke et al. 2020) to make this determination. 9 It is worth pointing out that similar strategies have been successful before, e.g., to measure the mass of Kepler-78 b (Pepe et al. 2013, Howard et al. 2010 The activity-only portion is isolated following SM21, subtracting the Keplerian mean model from the total mean GP prediction.   The previous subsections all argue that the preferred SM21 fit had too many free parameters (or effective free parameters) that allowed the model to overfit. In other words, we have argued that a simpler model (one for which the GP predictions for each instrument are scalar multiples of each other, a single period is present in the kernel, and no Keplerian signals are present in the model) would be more predicitive, albiet perhaps with larger uncertainties. In this section, we suggest that this much simpler proposed model is still insufficient, because the host star has multiple, differentially rotating, active regions.
Differential rotation may not be the un-modeled noise source that we propose is affecting the SM21 preferred fit. The conclusions of this paper do not change if this is true. We discuss it here because it is potentially widely relevant, especially for young stars. We call for more work on modeling and understanding differential rotation in RVs.

Evidence for a Strong Differential Rotation Signal from Photometry
In the K2 and TESS photometry of V1298 Tau (Figure 9), two periodic signals of different amplitudes are visible by eye. These peaks are coherent in phase towards the end of both baselines, producing a larger overall photometric variability amplitude. Although each baseline covers only a portion of the beat periods implied by these different periods coming into and out of phase, the beating "envelope" is still easily distinguished. To guide the eye, we overplotted the shape of the beating envelope formed by the three dominant periods in the Lomb-Scargle periodogram of the K2 data.
Multiple closely-related periodicities are also apparent in the periodograms of the K2 and TESS data (and the LCO data, albeit at lower significance, potentially due to the lower cadence of that dataset; Figure 10). In particular, over both the K2 and TESS baselines, a dominant periodicity at 2.85 and 2.92 d, respectively, and two less prominent periodicities (one at a larger period, and one at a smaller period) are present. The multiple periodicities in the light curve, visible both in the shape of the beating envelope and in Fourier space, have often been interpreted as a smoking gun of differential rotation (see, e.g., Lanza et al. 1994, Frasca et al. 2011. It is important to note, however, that short spot lifetimes may also produce the observed photometric pattern, and have been shown in simulations to be easily confused with differential rotation (see, e.g., Basri & Shah 2020). Longer photometric time baselines than are available in the photometric data presented in this paper are needed to distinguish between the two. The conclusion of this section (that there is a noise source visible in photometry that is un-modeled in the SM21 preferred model) would remain unchanged in this case, but this interpretation has important implications for future modeling efforts. That the signals arise from a close binary is ruled out by the multiple nearby periods in the light curve (rotation of two tidally extended binary stars can produce a similar pattern, but with a single period), while astroseismic pulsations are ruled out by the amplitude and period of the variability; V1298 Tau is a PMS 1.2M ⊙ star with log g=4.48 (SM21), which we would expect to be oscillating on the scales of minutes and ≲1 ppt, not days and 20 ppt (Chaplin & Miglio 2013; see their Figure 3).

Effect on RVs
Assuming that V1298 Tau is differentially rotating, it is possible that the combination of a multiply periodic structure with insufficient cadence is leading the GP to prefer a more complex model. In other words, the data is not consistent with a quasi-periodic structure, so a simple quasi-periodic model will not be preferred over a more complex model (e.g., one with Keplerians at longer periods), even if neither is predictive. Even a secondary active region with 5% the RV amplitude of the primary structure (reasonable given the photometric amplitude ratios) would incur an RV variability of 20 m/s, significantly greater than the instrumental floor of HARPS-N, CARMENES, and HIRES.
An important clarification is that this conclusion is consistent with the discussion in Section 4.2. Although there is a clear periodic 2.91 d signal visible in Figure 7, there is also ∼200 m s −1 of scatter around this signal. It is possible that this scatter may contain coherent signals at other periods that are unresolvable with the current RV cadence.
Complicating this already complicated story is the fact that the dominant periodicity appears to change over time ( Figure 10). This provides further motivation for our major recommendation, first given in Section 4.2: V1298 Tau appears to be a multiply-periodic star with evolving periodicity. A high-cadence (several data points per night) RV campaign is necessary to construct a high-fidelity activity model. The high cadence is necessary to resolve the close periodicities due to apparent differential rotation. Care should be taken to ensure that the periods do not evolve significantly over the observing baseline, or that this effect is sufficiently modeled. Figure 9. A tour of the relevant photometry of the star V1298 Tau. Panel a: detailed view of the K2 photometry (purple points), with a beating envelope over-plotted in solid pink. The beating envelope is drawn to illustrate the effect of spot beating on overall variability amplitude, not to precisely fit the data. The envelope drawn is constructed from the beating of three sinusoids at 2.70, 2.85, and 3.00 d. Signatures of beating can be seen by eye: two peaks of different amplitudes phase up toward the end of the K2 baseline, producing a single-peaked variability pattern and a larger overall variability amplitude. Panel b: detailed view of the TESS photometry (purple points). Beating characteristics are also visible, although the baseline is shorter than that of K2. Panels c, d, and e: relative views of K2, LCO, and TESS photometry, emphasizing relative time baseline and variability amplitude. A typical error bar for each dataset is also shown in the bottom left corner of each panel. The differences in wavelength coverage and flux dilution between the K2, LCO, and TESS photometry largely account for the overall differences in amplitude of the signals. Both the K2 and TESS data cover less than one complete beat period of the two largest-amplitude periodic signals, but the LCO photometry (which is contemporaneous with the RVs of SM21) covers a longer time baseline. Panel e: All photometry, plotted on the same panel to emphasize relative time elapsed between each dataset. Takeaway: differential rotation effects are visible by eye in both the K2 and TESS datasets. In this study, we have presented evidence that the preferred model of SM21 is overfitting using two ad hoc "validation" data sets: one set of contemporaneous HIRES and CARMENES data, and one set of artificially held-out HARPS-N data. The effects that we have proposed may be responsible for the non-predictiveness of the preferred SM21 model are:

SUMMARY & DISCUSSION
• The RV datasets from different instruments are treated as uncorrelated, allowing the model more freedom.
• The SM21 preferred model includes two summed quasi-periodic terms at P rot and P rot /2 in their kernel, each with its own free exponential decay parameter. This additional free parameter grants the model unnecessary flexibility.
• The SM21 model also includes parameters describing eccentric Keplerian signals, which grant even more degrees of freedom.
• We find evidence from multiple independent photometric datasets that this star has a strong differential rotation signal, indicating that a singly (quasi)-periodic activity model is insufficient. This explains why more complex models were favored over simpler models in SM21, even though the preferred model fell victim to overfitting.
The first point, in particular, warrants further scrutiny for stars across a range of ages and spectral types. We argued in Section 4 that RV datasets taken by instruments with different bandpasses and calculated using different RV extraction techniques should be linear combinations of each other, recapitulating the observation made in Cale et al. (2021), but this assumption may not be true. Contemporaneous RV datasets made by different instruments will help test this assumption.
These authors have devoted significant person-and computer-power to producing a fit to the data presented here that take into account all of these effects. However, we have found that jointly fitting all the data using only a single rotation period forces all of the instrumental GP amplitudes to 0. We interpret this as evidence that a singly (quasi)-periodic GP model is incapable of fitting the data (i.e., a more complex model is needed), and differential rotation provides a ready (but not sole) explanation. However, the differential rotation effects are very complicated to disentangle with the current dataset. 11 Again, we suggest a high-cadence RV campaign to resolve the multiple, nearby periodicities in the RVs and construct a high-fidelity model.
One important detail to note is that the GP kernel which best-fits a highly active, rapid-rotator like V1298 Tau may be wholly inappropriate to fit the activity signal of an older, quieter, Sun-like star. In young, rapid-rotators, the activity signal is relatively long-lived, often stable across several observation epochs (e.g., Yu et al. 2019a;Carvalho et al. 2021).
On the other hand, Sun-like stars have much shorter-lived spots, sometimes evolving over the course of one or two week observing campaigns (Giles et al. 2017;Namekata et al. 2019). A GP kernel describing the activity of Sun-like stars should be more flexible, allowing for more rapidly changing and decaying signals. While a single kernel may be capable of spanning these regimes of period evolution, the attempt to construct one should be made with caution. For the time being, the best approach may be to treat the two regimes of activity with unique kernels. This analysis is imperfect and incomplete. Many of the effects we have discussed are subtle, and we encourage others to study them further. This analysis has also evolved (quite a lot) over the preparation of this study.
There are many exciting follow-up avenues for the V1298 Tau system. First, an independent determination of the planet masses with TTVs would be enormously helpful in providing a "check" for RV modelers. Second, we believe it is worthwhile to explore modeling frameworks for V1298 Tau that explicitly model the relationship between contemporaneous photometry, activity indices, and multiple RV datasets. These frameworks (such as that of Rajpaul et al. 2015 andCale et al. 2021) move beyond sharing hyperparameters between contemporaneous photometric and RV datasets and allow a function of one dataset to be directly correlated with the other, decreasing the overfitting potential. In the longer term, comparing or jointly modeling these data with Doppler tomographic information and spectrum-level measurements, as in Yu et al. (2019b);Finociety et al. (2021); Klein et al. (2022) will provide even stronger constraints.
In addition to working toward an optimal physical model of all available data, it is worth investigating alternative statistical modeling pathways to GPR, especially low computational cost techniques like autoregressive moving average (ARMA) models (Feigelson et al. 2018, Durbin & Koopman 2001. ARMA models treat the ith datapoint as a linear combination of past data points and model residuals, and "training" involves optimizing the linear coefficients. Directly comparing models constructed with ARMA and GPR would be a worthwhile exercise in general for datasets containing stellar activity, and in particular for young, active stars.
We believe that understanding the RV variability of young stars is an endeavor that will pay dividends in the near future. The relative long-term stability of activity on young stars allows for detailed study of a given spot geometry and its impact on both photometric and spectroscopic observations across multiple bands. As we work to understand how to best fit activity with GPs, young stars, particularly WTTSs, provide good laboratories on which to test our techniques.
Just as we validate the performance of a new instrument on stars with large, well-studied Keplerian signals, we must, as a field, validate the performance of our activity-modeling techniques on stars with large, well-studied activity signals before we can trust activity-models applied to Sun-like stars at 30 cm s −1 precision 12 . This starts by allocating resources to the construction of high-cadence RV datasets of young stars, and continues by studying a) the relationship between RVs and auxiliary data, such as photometry and activity indices, b) the best phenomenological models (kernels, etc) for the data, c) the best methods for validating a given model's accuracy, and d) the cadence needed to resolve periodic signals (and combinations of signals). We believe that these studies, on young stars, will pave the way for stellar activity models with 30 cm s −1 predictive capability, on which the characterization of Earth 2.0 depends.

A. GAUSSIAN PROCESSES AND OCCAM'S RAZOR
Many introductions to GPR (e.g., Aigrain & Foreman-Mackey 2022) mention that the GP likelihood has an "Occam's razor" term built in that penalizes complexity. This section briefly reviews GPR, then outlines a geometrical interpretation of the complexity penalty in order to further readers' understanding.
A Gaussian process regression model parameterizes the covariance between data points using a kernel function. A statistician may pick an arbitrary function (subject to certain mathematical requirements, see Rasmussen & Williams 2006 for the gory details) to be the kernel, which can then be used to calculate the covariance between any two data points. As an example, let's consider the periodic kernel: where η 1 is the amplitude, P rot is the variability period (often the star's rotation period), and η 3 is the harmonic complexity, or degree of "wiggly-ness" of the repeating signal. Given this model for the covariance of our data, and some data, we can make a prediction, which is the conditional probability distribution over expected values at new measurement times. This is referred to as conditioning a GP on a set of data. Importantly, Gaussian process regression does not inherently involve training (i.e., parameter tuning, generally via an optimization and/or MCMC step). Gaussian process regression is just the process of using a parametrization of your covariance matrix to predict the values and uncertainties of new data points given existing data points.
The "training" part comes in when you are optimizing the hyperparameters of your kernel (optionally jointly with parameters of a mean function, which could be a function of Keplerian orbital parameters). Now, it becomes important to compute a statistic describing how well your GP model fits your data, so that you can optimize the (hyper)parameters to obtain your result. This is where the Gaussian process likelihood comes in: where C is the covariance matrix computed for the times at which you have data, N is the number of measurements, and r is the vector of residuals (data -mean model). The first term is a constant, and does not change as a function Figure 11. Demonstration of the impact of constructing separate covariance matrices and adding the log(likelihoods). Compare with Figure 12. The data and best-fit parameters are for K2-131, published in Dai et al. (2017), for demonstration purposes only. Top: GP mean prediction (black solid line) and 1-σ uncertainties (purple filled), together with the HARPS-N data points on which the GP is conditioned (purple points). Middle: Same as top, but for PFS data. Bottom: Residuals with respect to the GP mean prediction. Takeaway: When separate covariance matrices for each RV instrument are used, contemporaneous data are uncorrelated in the model, allowing additional degrees of freedom. Figure 12. Same as Figure 11 (in particular, using the exact same data and GP hyperparameters), but here a single covariance matrix is constructed, following the suggestion in Section 4.1. Takeaway: Constructing a single covariance matrix requires that GP predictions for separate instruments are scalar multiples of one another, which is more consistent with physical expectations and results in a more constrained model than one with a separate covariance matrix for each instrument.
of the kernel hyperparameters, and the second term is analogous to χ 2 (in fact, it reduces to χ 2 in the limit of no offdiagonal covariance). The second term describes how well your mean model and correlated noise description matches your data. The third term is the "Occam's razor" term that penalizes complexity. To understand how the third term penalizes complexity, recall that the determinant of a matrix can be understood as the hypervolume between vectors defined by the columns of the matrix. To make this concrete, consider the 3x3 identity matrix: The vectors defined by the columns of this matrix are (1,0,0), (0,1,0), and (0,0,1). The volume of the 3D shape defined by these vectors (the unit cube) is 1, the same as the matrix determinant! The i -th column vector of a covariance matrix can be interpreted as the vector of covariances between a data point taken at t i and every other data point in the dataset. The determinant of this matrix, then, is the hypervolume defined by these covariance vectors. A perfectly covariant matrix, in which all data points are perfectly correlated, will consist of all 1s 13 , and the covariance vectors will all "point" in the same direction. This results in a third-term contribution of: A matrix of perfectly independent data points, on the other hand, is (a scalar multiple of) the identity matrix. The covariance vectors all "point" in orthogonal directions. This matrix results in a third-term contribution of: This exercise demonstrates that the determinant of the covariance matrix quantifies how "clustered" the covariance vectors corresponding to each data point are in hyperspace. More clustered covariance vectors get a big likelihood boost, while less clustered/more independent covariance vectors get a smaller boost. Figure 5.3 in Rasmussen & Williams (2006) decomposes the likelihood contributions of the second and third terms in Equation A2, illustrating how they combine to produce a local likelihood maximum in parameter space for a toy model.