Latent Stochastic Differential Equations for Modeling Quasar Variability and Inferring Black Hole Properties

Quasars are bright and unobscured active galactic nuclei (AGN) thought to be powered by the accretion of matter around supermassive black holes at the centers of galaxies. The temporal variability of a quasar’s brightness contains valuable information about its physical properties. The UV/optical variability is thought to be a stochastic process, often represented as a damped random walk described by a stochastic differential equation (SDE). Upcoming wide-field telescopes such as the Rubin Observatory Legacy Survey of Space and Time (LSST) are expected to observe tens of millions of AGN in multiple filters over a ten year period, so there is a need for efficient and automated modeling techniques that can handle the large volume of data. Latent SDEs are machine learning models well suited for modeling quasar variability, as they can explicitly capture the underlying stochastic dynamics. In this work, we adapt latent SDEs to jointly reconstruct multivariate quasar light curves and infer their physical properties such as the black hole mass, inclination angle, and temperature slope. Our model is trained on realistic simulations of LSST ten year quasar light curves, and we demonstrate its ability to reconstruct quasar light curves even in the presence of long seasonal gaps and irregular sampling across different bands, outperforming a multioutput Gaussian process regression baseline. Our method has the potential to provide a deeper understanding of the physical properties of quasars and is applicable to a wide range of other multivariate time series with missing data and irregular sampling.


INTRODUCTION
Active galactic nuclei (AGN) are among the brightest objects in the Universe and play a crucial role in galaxy evolution.They are believed to derive their immense brightness from the conversion of gravitational poten-tial energy into thermal radiation in the hot accretion disks of supermassive black holes (SMBHs) at the center of galaxies (Salpeter 1964;Zel'dovich 1964).Luminous AGN with unobscured accretion disks are known as quasars.These objects are so luminous they remain observable at extreme cosmological distances (Mortlock et al. 2011;Bañados et al. 2018) making them exceptional probes of the early Universe.
The stochastic variability of quasars has been studied extensively since their discovery (Greenstein 1963;Haz-ard et al. 1963;Matthews & Sandage 1963;Oke 1963;Schmidt 1963).While the accretion disk of a quasar is too small to be resolved at extragalactic distances, the temporal brightness variability found in quasar light curves can provide insight into the intrinsic physical properties of AGN.For instance, the variability amplitude increases with decreasing luminosity, rest-frame wavelength, and Eddington ratio (Wills et al. 1993;Giveon et al. 1999;Berk et al. 2004).Efforts have been made to constrain the correlation of variability parameters with black hole mass, but their robustness has been inconclusive, with studies claiming positive or negative relations depending on the degrees of observational bias present in the fit data (Wold et al. 2008;MacLeod et al. 2010;Simm et al. 2015).Inferring black hole physics from quasar light curves can teach us about the evolution of the Universe and the nature of dark matter and dark energy (Khadka & Ratra 2020;Czerny et al. 2023).
UV/optical quasar variability is typically modeled using Gaussian process regression (GPR) by fitting light curves with a specified kernel and optimizing the kernel parameters (see Aigrain & Foreman-Mackey 2022, for a review of GPR in astronomical time series).For example, Stone et al. (2022) used the GPR package Celerite (Foreman- Mackey et al. 2017) to fit 20 yr quasar light curves separately in each band with a damped random walk (DRW) kernel.Markov Chain Monte Carlo (MCMC) sampling is then used to draw from the joint posterior distribution and obtain uncertainties on the variability parameters.The measured variability parameters have been empirically shown to be related to properties of the quasar such as the black hole mass (MacLeod et al. 2010;Suberlak et al. 2021).
This UV/optical variability is thought to be powered by an X-ray driving variability source that illuminates the accretion disk (Cackett et al. 2007).The reprocessing of the driving variability in the UV/optical regions of the accretion disk leads to wavelength-dependent time lags that can be measured through continuum reverberation mapping to probe the relative size scales of the emitting regions.This also makes the UV/optical variability highly correlated, since these regions are driven by the same X-ray source (e.g., Miller et al. 2023).The time delays between bandpasses are related to properties of the accretion disk and black hole (Cackett et al. 2021;Jha et al. 2022;Wang et al. 2023).
Upcoming wide-field surveys such as the Rubin Observatory Legacy Survey of Space and Time (LSST) mark an unprecedented improvement in data quality and volume.The LSST main survey alone (10,000 deg 2 ) is projected to yield tens of millions of AGN light curves with six UV/optical bandpass filters (ugrizy) at 55−185 sam-plings per band, out to redshift z ∼ 7.5 and moderate luminosities of L ∼ 10 44 erg s −1 over a ten-year period (Abell et al. 2009).In the Deep Drilling Fields, a smaller sky area (200 deg 2 ) with finer cadence, we expect to detect 40,000 additional ultrafaint AGN at about 1000 samplings per band (Brandt et al. 2018).Machine learning (ML) algorithms are well suited to analyze the large amount of data expected from LSST and other wide-field surveys; however, these light curves pose challenges for existing techniques such as multiple bands, long gaps of missing data, nonuniform sampling, and photometric and systematic noise.
Applying a GPR analysis on the entire LSST lightcurve sample would be very computationally challenging, but a fully trained neural network (NN) could be rapidly deployed on all the tens of millions of AGN light curves expected from LSST.Several authors have applied ML methods to analyze quasar light curves.Tachibana et al. (2020) used a recurrent autoencoder to model quasar variability and found it to be superior to a DRW model in real data, but it performed similarly on simulated DRW light curves.Sá nchez-Sáez et al. ( 2021) used a recurrent variational autoencoder as a form of anomaly detection to detect changing-state AGN. Čvorović Hajdinjak et al. (2022) used conditional neural processes (Garnelo et al. 2018) as a nonparametric way of modeling AGN light curves.Sheng et al. (2022) applied stochastic recurrent NNs (RNNs; Fraccaro et al. 2016) to reconstruct light curves of simulated LSST data.In microlensed quasar light curves, Vernardos & Tsagkatakis (2019) used a convolutional NN to measure the accretion disk size and temperature profile in simulations, and Best et al. (2022) predicted the black hole mass, inclination angle, and impact angle.Park et al. (2021) introduced the first method to simultaneously reconstruct quasar light curves and predict accretion disk parameters using attentive neural processes (Kim et al. 2019).Danilov et al. (2022) used neural inference Gaussian processes and introduced the concept of the convolved DRW to model quasar variability, and thus outperformed standard GPR.
In this work we adapt latent stochastic differential equations (SDEs; Li et al. 2020) to model irregularly sampled astronomical time series and perform parameter inference.Latent SDEs are a type of generative deep learning model that can model continuous-time stochastic dynamics.They can be viewed as infinitedimensional variational autoencoders (VAEs; Kingma & Welling 2013;Rezende et al. 2014) with an SDE-induced process as their latent state.We train a latent SDE to model the dynamics of simulated ten-year LSST light curves and simultaneously predict variability and accre-tion disk parameters such as the timescale and strength of the variability, the black hole mass, the temperature slope, and the inclination angle.Unlike GPR on individual light curves, latent SDEs can model the variability across an entire training set without being restricted to any Gaussian process kernel and can directly infer accretion disk properties.Latent SDEs can also simultaneously fit all bandpasses at once and directly predict parameter posteriors without the need for computationally expensive MCMC sampling.
We present our light-curve simulation in Section 2. We introduce our latent SDE model and training in Section 3. We describe our GPR baseline in Section 4. We give our results on the performance of our model at reconstruction of light curves and inference of variability and accretion disk parameters in Section 5. Our concluding remarks and discussion are given in Section 6.

LIGHT-CURVE SIMULATION
Our training set comprises of a realistic simulation of LSST 10 yr light curves.The UV/optical variability in our simulation is modeled as a DRW X-ray source corona situated above the black hole illuminating the accretion disk (Cackett et al. 2007;Starkey et al. 2015).
To simulate the response of the accretion disk to the driving DRW, a transfer function is calculated for each spectral band (Blandford & McKee 1982).The transfer functions account for the reprocessing of the X-ray driving variability at the corona to the UV/optical variability emitted by the accretion disk, and they contain information about the quasar structure and black hole geometry.After the light curves are simulated, they are degraded to mimic LSST-like 10 yr observation cadences and noise.

Accretion Disk Model
Recent studies of quasar variability have found that quasar accretion disk sizes tend to be larger than the predictions of the thin-disk model by a factor of ∼2-3 (Edelson et al. 2015;Fausnaugh et al. 2016;Edelson et al. 2017;Jiang et al. 2017;Starkey et al. 2017;Kokubo 2018;McHardy et al. 2018;Mudd et al. 2018;Pozo Nuñez et al. 2019;Amorim et al. 2020) although this is not always the case (Edelson et al. 2019;Homayouni et al. 2019;Hernández Santisteban et al. 2020;Yu et al. 2020).We use a modified version of the thin-disk model 1 that includes special relativistic effects (Doppler beaming) and general relativistic ray tracing in a Kerr black hole geometry (Kerr 1963).All of the physical 1 The model will be open source in Best et al. (in preparation).
properties of the black hole and quasar geometry are included in the transfer functions that are modeled using the general relativistic ray-traced accretion disk simulation software Sim5 (Bursa 2017(Bursa , 2018) ) similar to the procedure of Best et al. (2022).These transfer functions assume an X-ray driving source irradiating a vertically thin, optically thick accretion disk (Shakura & Sunyaev 1973).The thin-disk model predicts an effective temperature profile of the form T eff ∝ r −3/4 .Recent studies of quasar microlensing variability studies have considered a general temperature profile of the form T eff ∝ r −β and favor shallower slopes β < 3/4 (Cornachione & Morgan 2020).Some accretion disk models predict shallower or steeper slopes than the thin-disk slope of β = 3/4.For example, the slim disk model predicts a shallower slope of β ≈ 0.5 (Abramowicz et al. 1988) while the MRI model predicts a slope of β = 7/8 (Agol & Krolik 2000).The temperature slope can also be modified due to the presence of wind outflows where the accretion rate becomes radially dependent (You et al. 2016;Li et al. 2018;Sun et al. 2018).We assume a modified version of the thin-disk plus lamppost temperature profile of the form where M is the black hole mass, r is the radial distance to the SMBH, r s = 2r g = 2GM/c 2 is the Schwarzschild radius, r ISCO is the radius of the innermost stable circular orbit (ISCO), Ṁ is the accretion rate, σ SB is the Stefan-Boltzmann constant, β is the temperature slope or power law of the accretion disk, η X is the conversion efficiency from accreted matter to X-ray radiation, and A is the albedo with full absorption at A = 0 and full reflection at A = 1.The ISCO radius only depends on the SMBH spin a where r ISCO = 6r g for a = 0 (nonrotating), 1r g for a = 1 (maximally prograde), and 9r g for a = −1 (maximally retrograde).Sun et al. (2018) proposed a wind cooling parameter corresponding to a radially dependent power law in the accretion rate, which is consistent with our the first term in Equation (1) in the case where β < 3/4.We generalize the power law to also consider the case when β > 3/4, consistent with Cornachione & Morgan (2020).This could come from wind inflows or from other divergences from the thin-disk model.The second term in Equation ( 1) is from the flux due to the lamppost corona.The Eddington ratio is defined as λ Edd = Ṁ / ṀEdd where the Eddington accretion rate is ṀEdd = 4πGM m p /ησ T c 2 , m p is the proton mass, σ T is the Thomson scattering cross section, and η is the overall radiative efficiency factor.Example of r-band transfer functions showing the effect of changing each parameter.We also show the impact of using different LSST bands in the bottom right panel.Any parameters not explicitly given are fixed: log 10 (M/M⊙) = 8, λ Edd = 0.15, h = 30rg, a = 0, β = 3/4, and z = 3.The mean times of the transfer functions are found using Equation (2) and shown by the vertical dashed lines (in the same color as their respective transfer functions).We introduce t0 in order to get rid of the initial time lag where the transfer function would be zero.
The transfer functions are then generated from the temperature profile using the irradiated disk model (Sergeev et al. 2005;Cackett et al. 2007) at each LSST wavelength to an outer radius of 2000r g with a resolution of 0.1r g .The transfer functions introduce a wavelengthdependent time lag (2) into the UV/optical light curves.We normalize the transfer functions to represent a probability distribution such that Examples of transfer functions are shown in Figure 1, where different sets of parameters are varied.In this figure we set the initial time lags to zero to better compare the impacts of different parameters.For example, a greater corona height h leads to a longer initial time lag because photons need to travel a longer distance from the corona to the disk.Each band will have the same initial time lag, however, so the lag does not impact our parameter predictions.It is clear from the figure that varying the mass has the largest effect on the transfer functions.The temperature slope also has a significant impact on the mean time delay of the transfer function.The inclination angle has little impact on the mean of the transfer functions but significantly changes its shape.The redshift, height, and Eddington ratio have more minor effects, while the black hole spin has a negligible impact.
An example of the difference in the mean time delay between the r and u bands demonstrating the impact the parameters have on the mean time delay between the two bands is shown in Appendix A. Since many parameters impact the mean time delays in similar ways, there are degeneracies that make predicting the accretion disk parameters challenging.There is also a degeneracy between the transfer functions and the X-ray driving variability.Convolving the driving signal with the transfer function kernel eliminates some of the high-frequency variations, making the new time series appear to have a longer variability timescale.This means our parameter predictions must be primarily driven by the time delays between bands, with higher-order effects arising from the differences in the variability timescales across bands.

Model of the Quasar Driving Variability
Quasar UV/optical variability is often modeled as a damped random walk, a type of Gaussian process also known as the Ornstein-Uhlenbeck process (Rasmussen & Williams 2006;Zu et al. 2013).A DRW signal X(t) is governed by the SDE: where ϵ(t) is a white noise process with mean zero and variance one, τ is the characteristic timescale, b is related to the mean of the process X = bτ , and σ is related to the standard deviation of X(t) defined by the asymptotic structure function SF ∞ = σ τ /2 (Kelly et al. 2009).GPR can be used to measure the variability parameters SF ∞ and τ , which may be correlated with the black hole mass (MacLeod et al. 2010;Suberlak et al. 2021).A smaller τ yields more high-frequency variations since the variability happens on a shorter timescale.The kernel describing the covariance between two observations separated in time by ∆t is given by where the ACF is the autocorrelation function.An alternative description of the process is the structure function (SF), which is the rms scatter of magnitude differences ∆m as a function of the temporal separation ∆t.For the DRW process, the SF takes the form The DRW is a red noise process with power spectral density (PSD) given by which implies that P (f ) ∝ f −2 at high frequencies when f ≫ (2πτ ) −1 and P (f ) is constant at lower frequencies.
If the local variability of the corona follows a DRW with characteristic timescale τ rest , then we will observe a boosted variability with τ obs = (1 + z)τ rest .We boost our DRW to the observer frame when simulating light curves, but we train our network to predict τ rest since it is the timescale that governs the intrinsic variability of the quasar and should be correlated to the physical properties of the accretion disk (for example, the black hole mass).Hereafter, τ refers to the variability timescale in the rest frame.
We model the X-ray corona driving variability using the DRW process.Although X-ray variability may be better described by a more general broken power-law PSD (McHardy et al. 2004;O'Neill et al. 2005;Uttley & McHardy 2005;Summons 2007;Markowitz 2010;Yang et al. 2022;Yuk et al. 2023), we use the simple DRW prescription to try to recover its parameters τ and SF ∞ using latent SDEs and so that we can use an appropriate kernel for our GPR baseline.The DRW is itself a special case of the bending broken power law with a lower power of 0 and higher power of −2.We postulate that latent SDEs would only show improved performance compared to GPR with more complex driving variability due to their flexibility and GPR not having a suitable kernel.
Our X-ray driving DRW is generated in magnitude to avoid the possibility of getting negative values for the flux (Czerny et al. 2023).We select a mean magnitude independently of any of the variability or black hole parameters, drawing from the distribution X = N (20.90, 1.04) mag.The mean and standard deviation of this distribution come from the average magnitude of the LSST bands across the Sloan Digital Sky Survey and Hyper Suprime-Cam survey (Ahumada et al. 2020;Chan et al. 2023).We choose this strategy to avoid biasing our parameter estimations with priors on the correlation between absolute magnitude and recovered parameters including z, λ Edd , and log 10 (M/M ⊙ ).With actual LSST data, however, where any correlation is the direct result of physical processes we would aim to understand, the network could be fine-tuned with the data  to include these processes in its training, and we would expect the parameter recovery to improve.The DRW driving variability is converted from AB magnitude to flux2 and then convolved with the transfer function kernels to obtain simulated UV/optical light curves.The light curves are then converted from flux back to magnitude.
Our simulated UV/optical light curves are not strictly DRWs because they result from the convolution of the driving DRW with the transfer functions.The convolution smooths out the driving variability, and can change the power spectrum of the light curve by eliminating some of the high-frequency variations.For the majority of cases, the timescale of the transfer functions is significantly less than the timescales of the variability τ , making the effect minor.The convolution also introduces a time delay between bands caused by the difference in the mean time of the transfer functions.These time delays are related to the accretion disk parameters that produced the transfer functions.The mean time lags introduced by the transfer functions range from less than a day for less massive black holes and up to more than a month for the most massive black holes (see the top left panel of Figure 1).

Parameter Ranges
Our light-curve simulation is determined by a mean magnitude and nine physical parameters given in Table 1.The driving X-ray variability is dependent on SF ∞ and τ .The black hole geometry is determined by its mass M and dimensionless spin parameter a.The quasar is at an inclination angle θ inc with respect to the observer and at a redshift z.We also vary the height Table 1.Parameters and ranges for our light-curve simulation.The parameters are sampled uniformly between their minimum and maximum.We also fix the radiative efficiency to η = 0.1 and X-ray radiative efficiency and albedo to ( 1 of the X-ray corona h, the temperature slope of the accretion disk β, and the Eddington ratio λ Edd .There is also the radiative efficiency that we fix to the commonly used value of η = 0.1 (e.g., Fausnaugh et al. 2018).This radiative efficiency is degenerate with the Eddington ratio, so our measurement of the Eddington ratio can be thought of as λ Edd η/0.1.We also fix the X-ray radiative efficiency and albedo to a typical value of η (Ursini et al. 2020) and the albedo A is typically small, i.e., A ∼0.1-0.2 (Qiao & Liu 2018).

Mock LSST Observations
After a UV/optical light curve is simulated, we degrade it to impose LSST-like observing cadences and noise.The total error of an LSST observation is given by σ where σ sys is the systematic error and σ rand is the photometric noise.We set the systematic error to 0.005 mag, which is the maximum value expected for LSST (Ivezić et al. 2019;Suberlak et al. 2021).The photometric noise is expected to follow where γ is a band-dependent factor, m is the magnitude at each observation, and m 5 is the 5σ depth of a point source at the zenith of each visit.We set γ u = 0.38 and γ g,r,i,z,y = 0.39 (Ivezić et al. 2019;Sheng et al. 2022).
We impose LSST-like observation cadences and noise by using rubin sim 3 with the baseline v2.1 10yrs rolling cadence to obtain the time and m 5 of each observation.We sample 100,000 points anywhere in the sky that have between 750 and 1000 total observations across the ten years to include only light curves from the main LSST survey (see Figure 1 of Prša et al. 2023).The inclusion of light curves from the Deep Drilling Fields that have higher cadences would improve the performance of our model.Once the LSST data become available, the network could be fine-tuned on the data or retrain using the exact cadence strategy as part of our simulation.
The error added to each observation is randomly sampled from a Gaussian with mean zero and variance σ 2 LSST .Observations are combined to the nearest oneday interval, averaging across σ 2 rand if there were multiple observations in a band in the same interval.We simulate 11 yr light curves with 10 yr of LSST observations so our model is trained to reconstruct some time before and after the survey.We randomly select the start time of the first LSST observation within the first year of the 11 yr light curves.

LATENT SDES
In this section we describe our latent SDE model and training.The model and code are open source and available at: https://github.com/JFagin/latentSDE.

Model Architecture
The input to the network is the brightness and error values (both in magnitude), for a total of 12 features at 3 https://github.com/lsst/rubinsim each time step (6 bands and 6 errors).For each band that is not observed at a given time step, we set both the brightness and error to a dummy value of zero to be masked by the network.The only preprocessing step required to apply the network to LSST data is to combine observations to the nearest daily interval and set the unobserved time steps to zero.
The network architecture can be broken down into three main parts: the RNN encoder, the neural SDE decoder, and the multilayer perceptron (MLP) parameter estimator.Figure 3 provides a high-level overview of the model.Overall, the network operates by first encoding the multivariate time series into a context using an RNN that includes masking of the unobserved times.Then the mean and standard deviation of the variational distribution are predicted from the context.A latent vector is drawn from the variational distribution and then the latent vector and context are used by the neural SDE solver to produce a time series with features equal to the latent size.That time series as well as the errors of the input light curve are then projected using another RNN into the mean and standard deviation of our reconstructed light curve.Another latent vector is also employed by the projector to improve the uncertainty estimation on our reconstructed light curve.In addition to the light-curve reconstruction, we predict the mean and covariance matrix for the parameter inference based on the context and latent state of our SDE.We jointly model the light curve and parameters since the light-curve dynamics should be directly related to the variability and accretion disk parameters.
The NN architecture uses a hidden size of 128 and a context size of 64.A latent size of 8 is used for the SDE with an additional latent vector of size 8 given to the RNN projector to aid in uncertainty estimation.Each hidden layer is followed by a LeakyReLU activation function (Maas et al. 2013) and layer normalization (Ba et al. 2016).The network is built using PyTorch (Paszke et al. 2019) and the SDE solver uses the torchsde 4 (Li et al. 2020) implementation.Our network has a total of 903,597 trainable parameters.The light curves are normalized to have mean zero and variance one during training (only using observed points to calculate the mean and variance and averaged across the bands).
The encoder starts with a GRU-D (Che et al. 2016) layer, a modified version of the gated recurrent unit (GRU; Chung et al. 2014) that is designed to handle both masking and irregular time intervals.We use the GRU-D to mask the unobserved portions of our light curves.The GRU-D layer is followed by two GRU layers and two fully connected layers that produce the context.We include a residual skip connection (He et al. 2015) between the GRU-D layer and the output of the second GRU layer, which is intended to improve gradient flow and training stability.Two more fully connected layers are used to produce the mean µ qz and variance σ 2 qz of the variational distribution q z from the last value, mean, and standard deviation of the context and the mean and standard deviation of the unnormalized light curve.We then draw the latent vector from the variational distribution z 0 ∼ N (µ qz , σ 2 qz ), the initial state of the SDE.The decoder includes an Itô SDE solver with diagonal noise containing networks governing the drift, diffusion, and prior drift (Li et al. 2020).We use the Euler-Maruyama numerical approximation scheme for the SDE solver.The drift and prior drift networks are both MLPs with a residual skip connection.There are three fully connected layers pre-skip and two post-skip.The diffusion network consists of three fully connected layers and is applied element-wise to satisfy the diagonal noise.In order to model both the dynamics of the time series as well as produce uncertainties on predictions, we use an RNN to project the output of the SDE and the input errors of the light curve onto the mean and logvariance of the observation space.The projector also includes a new latent vector z0 ∼ N (µ qz , σ 2 qz ) that is repeated across each time step in the input.We include the additional latent vector to enhance the uncertainty estimation since the latent SDE and its latent vector should model the dynamics of the time series, but we also want to model the uncertainty at each time step.The projector consists of a GRU-D layer followed by two GRU layers.There is also a skip connection of two fully 4 https://github.com/google-research/torchsdeconnected layers between the output of the SDE and the output of the RNN.
The parameter inference component of our model uses the same MLP architecture as the drift networks and outputs the mean and Cholesky factor of the covariance matrix for the multivariate Gaussian posterior (described formally in Section 3.2).The input into the parameter estimator is the last value, mean, and standard deviation of the context, µ qz , σ 2 qz , the output of the drift, diffusion, and prior drift networks at z 0 , and the mean and standard deviation of the unnormalized light curve.

Uncertainty Quantification and Loss Function
For light-curve reconstruction, we minimize the negative evidence lower bound (ELBO) described in Li et al. (2020).We use a negative Gaussian log-likelihood between the reconstructed and true light curve averaged across the bands: where y is the true light curve, ŷ is the predicted mean, σ2 is the predicted variance, and N is the number of time steps.In practice we predict log(σ 2 ) to force the variance to be positive.To ensure that the mean prediction closely matches the context points, we additionally evaluate the weighted mean squared error of the mean prediction with respect to the context points and their errors, averaged across the bands: where y LSST is the mean observation or context point with variance σ 2 LSST , m t is a mask that is 1 if a band is observed at a time step t and 0 if it is not, and N obs = N t=1 m t is the number of observations.We find that this additional term in the loss function leads to much better performance by overemphasizing the context points with respect to the mean prediction.
For the parameter inference, we parameterize the posterior as a multivariate Gaussian distribution and minimize its negative log-likelihood: where each y is the vector of true parameter values, ŷ is the vector of predicted means, and Σ is the predicted covariance matrix.In practice we predict the lower triangular matrix L representing the Cholesky decomposition of the covariance matrix Σ = LL ⊤ to ensure that the covariance matrix is symmetric, positive semidefinite, and nonsingular.We must also restrict the diagonal of L to be positive so the covariance matrix is positive semidefinite and the parameterization is unique, so we predict the log-diagonal of L and then take the exponential.To predict n parameters, the lower triangular matrix L will have n(n + 1)/2 free parameters.Each term in the loss is given equal weighting.When simulating the light curves, each parameter is drawn uniformly from a given range.We reparameterize the labels from their physical values to between zero and one and then take the logit (scaling the labels from −∞ to ∞) before evaluating the negative log-likelihood of our posterior.Once samples are drawn from the posterior, we take the sigmoid (forcing the predictions to be between zero and one again) and then scale them back to the original physical range.This prevents posterior probability from being wasted in physically impossible parameter space (such as restricting the spin to −1 < a < 1 for example) or outside the range of the training set.

Training
We train our network with 100,000 light curves per epoch that are randomly regenerated on the fly each epoch to prevent overfitting and improve performance.We also include a fixed validation set and a test set of 10,000 light curves each.Each set uses separate transfer functions to avoid contamination.
Our network is trained for 100 epochs using the Adam optimizer (Kingma & Ba 2017) with an initial learning rate of 0.0025, an exponential decay of 0.97, and a batch size of 50.To mitigate the problem of exploding gradients, we employ gradient clipping with a maximum gradient norm of 0.5.Linear KL-annealing (Fu et al. 2019) is used for the first 15 epochs as well as annealing for the parameter inference portion of the loss (Equation ( 11)) to initially focus on the light-curve reconstruction.Training took approximately six weeks on a single NVIDIA Tesla V100 GPU with 16 GB of VRAM.

GAUSSIAN PROCESS REGRESSION BASELINE
We aim to compare the performance of latent SDEs in light-curve reconstruction with an exact, multitask GPR baseline (Maddox et al. 2021).Our multitask GPR model can infer correlations between output variables, in our case the different LSST bandpass filters, unlike with Celerite (Foreman-Mackey et al. 2017), which fits a separate GP to each band.While some researchers, such as Hu & Tak (2020), have explored using GPR with multiband light curves, the more conventional approach is to fit separate GPs.To fairly compare GPR to our latent SDE model, we must jointly model bands so that observations in one band carry information across the other bands.This is crucial since each band is related to the same X-ray driving DRW process, and LSST observations are sparse and distributed asynchronously across the different bands.
The noise variance at each observation is set from the LSST observation strategy (see Equation ( 7)) so we use the fixed-noise GPR model.We use the absolute exponential kernel (equivalent to the Matérn-1/2 kernel), which corresponds to the DRW process (i.e., Equation ( 4)).This kernel has also been empirically shown to fit UV/optical quasar variability better than the Matérn-3/2, Matérn-5/2, rational quadratic, and squared exponential kernels (Griffiths et al. 2021).When fitting the GP, each band of the light curve is independently normalized to have mean zero and variance one, using only observed points and taking into account the error on each context point.Our fixed-noise multitask GPR baseline is implemented in BoTorch5 (Balandat et al. 2019), based on the GP library GPyTorch (Gardner et al. 2018).It uses the intrinsic co-regionalization model (ICM; Bonilla et al. 2007;Swersky et al. 2013) to correlate the kernels of different tasks (i.e., the different LSST bands in our case).

Light-curve Reconstruction Performance
After our model is trained, we apply it to the test set of light curves that are produced from driving variability and transfer functions that the model has never trained on.Figure 4 shows the light-curve reconstruction for an example in the test set.As expected, the model is more uncertain at times farther away from the context points such as during the seasonal gaps.Our network learns to tightly constrain the light curve across all bands given a context point in only a single band.This is because in our training set, the variability in each band of the light curve is derived from the same driving variability source and only differs based on the transfer function kernels.For comparison, the reconstruction of the same light curve with our multitask GPR baseline is shown in Appendix B and is similarly able to infer correlations between bands.
Table 2 compares the light-curve reconstruction performance of our latent SDE model with the GPR baseline across the test set.The latent SDE model performs better than the GPR baseline on light-curve reconstruction in terms of the rms error (RMSE), mean absolute error (MAE), and negative Gaussian log-likelihood (NGLL).In order to assess the significance of these differences, we conduct a paired t-test yielding extremely low p-values across all metrics (< 10 −10 ).This gives very high confidence that the differences in performance between the two models are not a product of random variation but can be attributed to the latent SDE model's capabilities compared to the GPR baseline.
We also show how well the uncertainty is calibrated for our light-curve reconstruction in Fig 5 .The GPR baseline is overconfident in its predictions, so the true light curve can sometimes fall outside the 3σ credible region.The latent SDE model is a little underconfident, so its predicted uncertainty may be slightly overestimated.An example of structure function recovery is given in Appendix C between the true light curve, the SF for the X-ray driving DRW, the mean latent SDE prediction, and the mean GPR prediction for the same example of a light curve shown in Figure 4.
Table 2. Light-curve reconstruction performance of our latent SDE model compared to a multioutput GPR baseline.Values reported are the median ± median absolute deviation on the median for each metric across our test set of 10,000 light curves.Lower is better for all the metrics.

Parameter Estimation Performance
In Figure 6 we compare the network's mean parameter predictions with their true values.The network is able to predict the variability parameters log 10 (τ ) and SF ∞ well across a broad parameter range.We recover the rest-frame τ well despite having poor constraints on the redshift.This is because the unknown redshift will be incorporated into the uncertainties of the predictions without a large change in the mean, and we predict log 10 (τ ) while the redshift only affects τ in a linear way.The black hole mass is harder to predict since it only affects the transfer functions that are convolved with the DRW and must be extrapolated from the time delays between bands.The inclination angle θ inc and temperature slope β have smaller effects on the mean and shape of the transfer function than the black hole mass, but we can still make meaningful predictions in many cases.This is especially the case for the highermass black holes where the transfer functions produce time delays at timescales of at least a few days.The network only has a little predictive power for the redshift z, Eddington ratio λ Edd , black hole spin a, and corona height h.This is to be expected since these parameters have only minor impact on the mean time delays between bands, as illustrated in Appendix A. In the case where the network is completely unsure about the value of a parameter, it will predict mean values near the mean of the prior with large uncertainties.We show 100 examples of predictions with uncertainties in Appendix D. Cases where the median predictions are far from the true value have larger uncertainties to account for the deviation.Figure 7 shows that our reported uncertainties of the parameter posteriors are, on average, correctly calibrated across the test set, although we are slightly underconfident in our predictions.This likely arises from the limited predictive strength of certain parameters, especially for smaller black hole masses.In such situations, our parameterization of the posterior as a Gaussian would not be ideal since there is equal probability the parameters are anywhere in our prior.
An example of the predicted posterior of our parameters is shown in Appendix E.Even though the uncertainties on individual quasar light curves are often large, with a large sample of predictions like the tens of millions of quasar light curves expected from LSST, it is possible to make precise estimates of the populationlevel distributions using hierarchical inference.Hierarchical inference can combine the parameter posteriors of individual light curves into population-level predictions given that the priors on our parameter space are known and build into the training set.

CONCLUSION AND DISCUSSION
We have presented latent SDEs as a way of simultaneously modeling quasar variability and predicting black hole properties.A pretrained latent SDE could be quickly and autonomously applied to the entire LSST quasar light-curve sample, whereas using GPR with MCMC sampling to obtain posteriors on variability parameters may be computationally infeasible.Training and inference with our model can be easily scaled with batch support across multiple GPUs, while using batched GPU support for GPR is an open area of research and not supported by Celerite for example.Our network is trained to infer the variability parameters of the quasar X-ray source corona while GPR can only infer the effective parameters after the driving variability is convolved with the transfer functions.In addition, GPR is limited to optimizing the GP's kernel parameters, whereas latent SDEs have the flexibility to infer any variability parameterization and reconstruct the light curve in a nonparametric way, bypassing the need for an analytic kernel.Typically GPR can only be applied separately on different bands, but here we used a multitask GPR baseline that is capable of correlating the GPs across the different bands.Our deep learning approach outperformed the state-of-the-art GPR baseline at reconstructing LSST-like quasar light curves.
The method presented here is completely general and can be applied to any multivariate time series with missing data and irregular sampling, with applications to other astronomical time series and beyond.In our use case, the latent SDE model learns to tightly constrain the light curve across bands using a context point in a single band due to the nature of our training set.We would expect our latent SDE model to be applicable to multivariate time series where the features (i.e. the bands in our case) are less correlated, although a larger latent size may be required to capture the additional complexity.In such a situation, the network would no longer be able to use context points in one band to tightly constrain the other bands.If the features were completely uncorrelated, it may be preferable to model each feature with a separate latent SDE instead of jointly.Parameter inference could still be performed from the joint latent space.
There is also the potential to improve the latent SDE model.Our model uses an RNN encoder and latent SDE decoder including a GRU-D layer (Che et al. 2016) to mask the unobserved times.In this work we use unidirectional RNN layers, which process time series sequentially in the forward direction.Using bidirectional RNN layers, such that the time series is encoded and decoded both forwards and backwards in time, may improve the performance of our model but would increase its computational complexity.We expect this to also smooth out the uncertainty in the reverse direction in areas where there are limited context points, such as around 2700 days in Figure 4. We may also explore other RNN archi-tectures such as stochastic RNNs, used to model quasar variability in Sheng et al. (2022).Recently Schirmer et al. (2022) introduced the continuous recurrent unit, which is a probabilistic recurrent architecture that includes a hidden state that evolves as a linear stochastic differential equation.We may also consider using other architectures instead of RNNs, such as transformers (Vaswani et al. 2023).For example, multi-time attention networks (mTANDs; Shukla & Marlin 2021) feed the time embedding to the attention mechanism to incorporate irregularly sampled time series.Transformers typically have many more parameters than RNNs, however, so training them in a latent SDE network may be difficult.Improvements to our latent SDE model will be investigated in future work.
Further developments in GPR frameworks could improve its performance on quasar light-curve reconstruction.The multitask GPR model does not explicitly account for the time delays between bands introduced by the convolution of the driving signal with the transfer functions.Latent SDEs learn to model the time delays since they are present in the training set.Latent SDEs are also able to jointly model accretion disk parameters such as the mass, inclination angle, and temperature slope from the light curve.These parameters are related to the time delays between bands and their relative smoothness, but no comparison currently exists with GPR.Promising areas for advancing GPR performance include the neural inference of Gaussian processes, as applied to quasar variability in Danilov et al. (2022), and the exploration of deep Gaussian processes (Damianou & Lawrence 2013).
Another area of application of our latent SDE model that was not explored in this work is for anomaly detection.The latent space of our model may be a powerful tool to detect quasars with anomalous variability behavior.Sá nchez-Sáez et al. (2021) used the latent space of a recurrent VAE to search for changing-state AGN, and our model could be used similarly.There is also the potential to identify periodic signals from SMBH binaries or to search for transient events.
In future work, we could improve our quasar lightcurve simulation to make it more realistic.Our simulated light curves were generated with a DRW X-ray source variability convolved with physics-based transfer functions that reprocess the driving variability into the UV/optical light curves.In reality, the X-ray variability is not exactly a DRW process but may be better described by the more general broken power-law PSD (McHardy et al. 2004;O'Neill et al. 2005;Uttley & McHardy 2005;Summons 2007;Markowitz 2010;Yang et al. 2022;Yuk et al. 2023).We use the DRW model so that we can test our model's ability to recover its parameters and to properly use the corresponding kernel for our GPR baseline.Our latent SDE model could be easily retrained or fine-tuned with more complicated variability models such as higher-order continuous-time autoregressive moving-average (CARMA) processes (Kelly et al. 2014;Stone et al. 2022;Yu et al. 2022) or generated from a general PSD using the method of Timmer & Koenig (1995).A general PSD could come from fitting X-ray variability data or be modeled by a broken power law.We expect the performance gap between latent SDEs and GPR to grow with more complex variability because latent SDEs are more flexible in situations in which no Gaussian process kernel exists.Tachibana et al. (2020), for example, found that a recurrent autoencoder performed similarly to the DRW GP for simulated DRW light curves, but the deep learning approach outperformed the GP model on real data when the variability is more complex.
In our simulation, we assumed the variability parameters were uncorrelated to our black hole properties, but there are measured correlations between the variability parameters and the black hole mass, for example, although their robustness has been inconclusive (Wold et al. 2008;MacLeod et al. 2010;Simm et al. 2015).We do not take these correlations as a prior in our simulation so that the correlations could be measured in real data without bias.If we did choose to model the correlation between the variability and accretion disk parameters, we would expect to improve the performance of the parameter predictions, since measuring the variability parameters would then help to constrain the redshift, Eddington ratio, and black hole mass.
Our simulation also did not model the additional information present in the relative fluxes of each bandpass, which could be used to infer the photometric redshift.A full treatment of the color information would require modeling the quasar spectrum and integrating transfer functions across the response function of each broadband filter.In particular, the relative fluxes of each band convey information about the quasar's redshift, as the redshift determines which spectral lines are captured by each broadband filter.By modeling the quasar spectrum and brightnesses, we anticipate a significant improvement in redshift estimation, and a minor enhancement in estimating other parameters due to the better constrained redshift.Care would need to be taken to avoid biasing predictions from inconsistencies between simulated spectra and real data.
LSST will observe too many quasars to obtain followup spectra to make precise redshift measurements for all the monitored quasars.A subset of the LSST quasar sample will get spectral measurements, and the precise redshift measurements could be used to enhance the parameter inference.There may also be mass measurements from the broad-line region (Panda et al. 2019) and these measurements could be used to better constrain the mass and other accretion disk parameters.
Once the LSST data become available, we could finetune our network in a self-supervised way using only the context points of the light curves (minimizing the loss function given in Equation ( 10) for example).We could also use supervised learning with a subset of the LSST sample, with training labels coming from spectral redshift, black hole mass estimates from the broad-line region, or other parameters that have been independently measured through a variety of ways.This could partially address the issue in ML when a network is trained with simulations that do not exactly match the real data.In particular, the network could learn the complex relationships between the variability and accretion disk parameters, the quasar brightness, and the photometric redshifts that we did not include in the training set but will be present in the real data.These relationships would then be a result of physical processes and not artificially imposed as priors in the training set, and we can expect our parameter estimates to improve.A limited number of quasars will be monitored in the Deep Drilling Fields, and our network is well suited to take advantage of the much higher sampling rate to better constrain the quasar parameters.Fine-tuning our network with the higher cadences may be especially useful.
There is also the possibility that even for parameters that are difficult to predict for individual quasar light curves, hierarchical inference could be used to estimate the population-level distributions of the parameter space.For example, measuring the population-level distribution of the temperature slope could constrain accretion disk and wind outflow models.The main effect the transfer functions have on UV/optical variability is by introducing wavelengthdependent time delays (see Equation ( 2)).We can only observe the relative time delays between bands because a constant time delay across all wavelengths is degenerate with the start time of the driving variability.Figure 8 shows an example of the difference in the mean times between the r-band and the u-band transfer functions.This demonstrates the impact each parameter has on the differences in time delay between bands and approximately how well we can expect to measure the parameters.Parameters that only change the mean time delays by less than half the time binning of our time se-ries (< 12 hr) my be infeasible to recover.There is an additional effect in that the transfer functions will also have a wavelength dependence in their standard deviation, making the variability of higher wavelength bands smoother.The inclination angle, for example, has very little impact on the mean time delay, but has a large effect on the width of the transfer function so can still be constrained in many cases (see Figure 1 to see the transfer functions for different θ inc ).   .The difference in the mean time lags between the r-band and u-band transfer functions for different parameters.We also show the impact that a change in the wavelength has on the time delay in the bottom right panel.Any parameters not explicitly given are fixed: log 10 (M/M⊙) = 8, λ Edd = 0.15, h = 30rg, a = 0, β = 3/4, and z = 3. Figure 1 shows the r-band transfer functions for the same parameter range.

C. EXAMPLE OF STRUCTURE FUNCTION RECOVERY
In Figure 10 we show an example structure of function recovery using the mean recovery of our latent SDE and GPR baseline.We could also sample different predictions from the latent space of our latent SDE to get some uncertainty on the SF recovery.The mean recov-  4 using the mean prediction of our latent SDE (green) and GPR baseline (red).The true SF (blue) comes from using the light curve before any observational effects.The DRW SF (orange) is the SF corresponding to the X-ray driving variability given in Equation ( 5), but will diverge from the truth due to the convolution of the driving variability with the transfer functions.The horizontal dashed line shows when SF = SF∞ and the vertical dashed line when ∆t = τ Figure1.Example of r-band transfer functions showing the effect of changing each parameter.We also show the impact of using different LSST bands in the bottom right panel.Any parameters not explicitly given are fixed: log 10 (M/M⊙) = 8, λ Edd = 0.15, h = 30rg, a = 0, β = 3/4, and z = 3.The mean times of the transfer functions are found using Equation (2) and shown by the vertical dashed lines (in the same color as their respective transfer functions).We introduce t0 in order to get rid of the initial time lag where the transfer function would be zero.
Figure 2 shows examples of DRWs for two values of τ .

Figure 2 .
Figure 2. Damped random walk for two different values of τ generated using Equation (3) with X = 21 mag, SF∞ = 0.1 mag, and z = 0.For comparison, both DRWs use the same white noise process so that their stochastic dynamics are identical.

Figure 3 .
Figure 3. Diagram of the latent SDE architecture.

Figure 4 .
Figure 4. Example of a simulated 11 yr light curve (black) from our test set with 10 yr of mock LSST observations or context points (blue) and our latent SDE reconstruction (orange) including at unobserved times.

Figure 5 .
Figure 5.The fraction of the truth encompassed by some probability volume (credible interval) across our test set for our latent SDE model's light-curve reconstruction (blue) compared to the GPR baseline (orange).Perfect calibration is shown by the black dashed line along the diagonal.

Figure 6 .Figure 7 .
Figure 6.The mean predictions compared to the true values for each parameter across our test set.The black dashed lines along the diagonal represent perfect predictions.
This research was made possible by the generosity of Eric and Wendy Schmidt by recommendation of the Schmidt Futures program.Matthew Graham acknowledges support from National Science Foundation (NSF) AST-2108402.V. Ashley Villar acknowledges support from NSF under grant AST-2108676.The data used in this publication were collected through the MENDEL high performance computing (HPC) cluster at the American Museum of Natural History.This HPC cluster was developed with National Science Foundation (NSF) Campus Cyberinfrastructure support through Award #1925590.The authors would like to thank Favio Neira for assistance with rubin sim.We also thank Xuechen Li and David Duvenaud for their thoughtful comments on our work.Software: Matplotlib (Hunter 2007), Numpy (Harris et al. 2020), scipy (Virtanen et al. 2020), Astropy (Astropy Collaboration et al. 2018), astroML (Vanderplas et al. 2012), corner.py(Foreman-Mackey 2016), BoTorch (Balandat et al. 2019), torchsde (Li et al. 2020), PyTorch (Paszke et al. 2019) APPENDIX A. EXAMPLE OF THE IMPACT OF PARAMETERS ON MEAN TIMES OF THE TRANSFER FUNCTIONS

Figure 9
Figure9shows an example of light-curve reconstruction using our GPR baseline.

Figure 8
Figure8.The difference in the mean time lags between the r-band and u-band transfer functions for different parameters.We also show the impact that a change in the wavelength has on the time delay in the bottom right panel.Any parameters not explicitly given are fixed: log 10 (M/M⊙) = 8, λ Edd = 0.15, h = 30rg, a = 0, β = 3/4, and z = 3. Figure1shows the r-band transfer functions for the same parameter range.

Figure 9 .Figure 10 .
Figure 9. Same as Figure 4 but using the GPR baseline described in Section 4.eries of both the latent SDE and GPR mean recoveries generally follow the true SF.D. EXAMPLE OF SCATTER IN MEAN PARAMETER INFERENCE WITH UNCERTAINTYFigure11gives 100 random examples of parameter predictions with uncertainties compared to the true values.

Figure 11 .Figure 12 .
Figure 11.Examples of median parameter predictions and 1σ uncertainties for 100 random light curves in our test set.The black dashed lines along the diagonal represent perfect predictions.