Using Bayesian Deep Learning to infer Planet Mass from Gaps in Protoplanetary Disks

Planet-induced substructures, such as annular gaps, observed in dust emission from protoplanetary disks provide a unique probe for characterizing unseen young planets. While deep learning based models have an edge over traditional methods, such as customized simulations and empirical relations, in characterizing a planet's properties, they lack the ability to quantify the uncertainty associated with their predictions. In this paper, we introduce a Bayesian deep learning network, "DPNNet-Bayesian", that predicts planet masses from disk gaps and provides the uncertainties associated with those predictions. A unique feature of our approach is that it can distinguish between the uncertainty associated with the deep learning architecture and the uncertainty inherent in the input data due to measurement noise. The model is trained on a dataset generated from disk-planet simulations using the fargo3d hydrodynamics code with a newly implemented fixed grain size module and improved initial conditions. The Bayesian framework provides a confidence interval on the validity of the prediction when applied to unknown observations. As a proof-of-concept, we apply DPNNet-Bayesian to the dust gaps observed in HL Tau. The network predicts masses of $86.0 \pm 5.5 M_{\oplus}$, $43.8 \pm 3.3 M_{\oplus}$, and $92.2 \pm 5.1 M_{\oplus}$, respectively, which are comparable to other studies based on specialized simulations.


INTRODUCTION
The observed substructures, such as concentric rings and gaps, seen in dust emission and near-infrared images of protoplanetary disks (hereafter PPDs) (ALMA Partnership et al. 2015; Andrews et al. 2016; Huang et al. 2018a) are often interpreted as signatures of an embedded planet. However, owing to limitations in current planet search techniques (Fischer et al. 2014; Lee 2018), direct detection of planets in disks remains a challenge.
Only recently have planetary signatures been measured using gas kinematics (Teague et al. 2018; Pinte et al. 2018, 2019, 2020), such as "kinks" due to velocity perturbations inside the observed dust gaps. Additionally, the direct detection via Hα emission of an accreting planet (Keppler et al. 2018; Wagner et al. 2018; Haffert et al. 2019) inside the cavity of the transition disk PDS-70 provides compelling evidence in favour of planetary gaps.
Thus, characterising large-scale planet-induced distortions observed in PPDs gives a unique opportunity to probe these unseen young planets during the epoch of their formation. This is achieved by identifying and comparing these features with theoretical models of planetary gaps generated using customized simulations (Zhang et al. 2018b; Crida et al. 2006; Paardekooper & Papaloizou 2009; Duffell & Macfadyen 2013; Fung et al. 2014; Duffell 2015; Kanagawa et al. 2015a; Ilee et al. 2020) or empirical relations (Kanagawa et al. 2016; Lodato et al. 2019). However, owing to the increasing sample size of observed disks and the complexity of disk-planet simulations, one needs to run hundreds of customized simulations to analyze each observed system. This is too expensive and inefficient to be of practical use as a detection tool. The alternative is to use deep learning models, trained with simulation data (generated only once), to detect/characterise exoplanets from observed data in a more efficient and accurate way.
In Auddy & Lin (2020) and Auddy et al. (2021), hereafter Paper 1 and Paper 2 respectively, we successfully designed deep learning models using a multi-layer perceptron (MLP) and/or a Convolutional Neural Network (CNN) to predict planet masses from planetary gaps in PPDs. Both models (DPNNet-1.0 and DPNNet-2.0) are end-to-end pipelines capable of predicting the mass of an exoplanet from an observed protoplanetary disk harbouring a gap-inducing planet. However, one of the main challenges with traditional deep learning models is quantifying the uncertainties associated with their predictions. This has direct implications for how a deep learning model performs when trained on limited and/or noisy data. Furthermore, a traditional deep learning model's inability to answer "I don't know" or "I am not confident" is concerning, particularly when treating out-of-training-distribution points (i.e., input data points outside the parameter space explored in the training dataset). It is thus crucial to understand the limitations of a deep learning model by quantifying the errors associated with its predictions and identifying the source of the uncertainty.
Recently, some of these shortcomings have been addressed by introducing probabilistic methods, such as Bayesian techniques, into deep learning architectures (Kendall & Gal 2017; Jospin et al. 2020; Wilson & Izmailov 2020). A Bayesian neural network (BNN) improves on traditional deep learning models by providing uncertainty estimates associated with its predictions. BNNs are unique in their ability to distinguish between uncertainty caused by limitations of the model and uncertainty that is inherent to the input data. Moreover, these architectures can even respond to pathological out-of-training cases with a large error bar.
A classical neural network produces deterministic point estimates from a single optimized network (e.g., see Auddy & Lin 2020). A BNN, by contrast, is a type of stochastic neural network in which the weights are given as distributions, which enables us to quantify the uncertainty associated with the network architecture. It considers an aggregation of predictions (also known as marginalization) from multiple independent predictor networks and performs a model average to obtain the output as a distribution. This is particularly powerful for constraining the error when performing a regression task to estimate parameters (such as the planet mass in our case) with deep learning models. Thus any prediction comes with a confidence interval, meeting the requirements of a scientific experiment, and one that can be compared with other methods of parameter estimation.
In this paper, we develop and introduce "DPNNet-Bayesian", a standard MLP-based model designed within the Bayesian formalism to predict planet masses from disk gaps and to quantify the uncertainty associated with each prediction. Our network is trained with synthetic data from explicit planet-disk simulations using the fargo3d hydrodynamics code with a newly implemented fixed grain size module and improved initial conditions.
The paper is organized as follows. In Section 2 we give an overview of Bayesian deep learning, its implementation, and its advantages over traditional deep learning approaches. In Section 3 we describe the disk-planet hydrodynamic simulations as well as the parameter space considered for the current study. We introduce the BNN architecture implemented in DPNNet-Bayesian in Section 4 and discuss in detail the steps used to pre-process the data. In Section 5 we give the predictions, along with the error estimates, from our trained DPNNet-Bayesian based on simulated data. In Section 6, as a proof-of-concept, we deploy DPNNet-Bayesian on observed PPDs around HL Tau and AS 209 to infer planet masses and the associated errors. Finally, in Section 7 we discuss the results and draw conclusions.

BAYESIAN DEEP LEARNING: A GENERAL OVERVIEW
A neural network architecture consists of a few thousand (or even millions of) nodes/neurons that are densely interconnected. For this architecture, let
• θ represent all the internal model parameters;
• D represent the training data, which consists of a specified set of observed features D_x (e.g., gap size) and their corresponding output labels D_y (e.g., planet mass).
Deep learning is a technique for optimizing the model parameters θ based on a training dataset D (Goodfellow et al. 2016; Shrestha & Mahmood 2019). This is done by minimizing a 'cost' function using some method of backpropagation. Usually the cost function is defined in terms of the log-likelihood of the training dataset, and thus the training process can be described as maximum likelihood estimation. This approach may suffer from over-confidence, as it is deterministic in nature. An alternative is to introduce stochastic components into the neural network. These can account for the uncertainties in the determination of the network parameters θ, which ultimately show up as an estimated error in the predictions made by the deep learning architecture (Jospin et al. 2020). For these types of neural nets, the stochastic elements can be introduced as stochastic model weights or activations. Thus, instead of the single deterministic output of a traditional deep learning model, for stochastic networks a marginalization of the output is considered, based on some statistical principle (Wilson & Izmailov 2020).
In order to design a predictive model based on BNN, one starts with a suitable choice of a deep learning architecture. The next step is to choose a systematic prescription in order to introduce the stochastic elements in the network architecture. This can be modeled in terms of some possible prior distribution of the model parameters, p(θ).
From Bayes' theorem we get

$$ p(\theta|D) = \frac{p(D_y|D_x, \theta)\, p(\theta)}{\int p(D_y|D_x, \theta')\, p(\theta')\, d\theta'}, \quad (1) $$

where D_y denotes the training labels and D_x the training features, as mentioned earlier. The Bayesian posterior distribution, p(θ|D), depicts the model parameters as a probability distribution rather than a deterministic value. The Bayesian approach allows us to compute model averages by essentially integrating over models, parameterized by the model parameters θ. The distribution of possible values of y (e.g., planet mass) given some input x (e.g., dust gap width) is given by

$$ p(y|x, D) = \int p(y|x, \theta')\, p(\theta'|D)\, d\theta'. \quad (2) $$

In equation (2), the variance of p(y|x, θ') captures the uncertainty due to measurement noise of the input data, while the variance of p(θ'|D) corresponds to the uncertainty associated with the neural network architecture. Classical training would correspond to p(θ|D) being replaced by a single constant value of θ, whereas in the Bayesian case it is typically a probability distribution with some finite variance (Wilson & Izmailov 2020; Abdar et al. 2021). Computing p(θ|D) from the training data is one of the most challenging aspects of BNN architectures owing to the large size of the parameter space. This is further discussed in the following sections.

The variational inference approach to Bayesian deep learning
Due to the large dimensionality of the sampling space, it can be quite challenging to merge the concepts of a neural network with the Bayesian approach. As the posterior distribution is extremely complex, directly estimating p(θ|D) to perform the integral in Equation (2) is not always possible. Different methods, involving various approximations, have been implemented to do this. One of the most popular and direct methods for sampling the posterior distribution, p(θ|D), is Markov Chain Monte Carlo (MCMC) (Bardenet et al. 2017). However, MCMC often has issues with scalability owing to large model sizes, and it can be computationally expensive as well.
Instead of sampling from the posterior directly, the method of variational inference can be used as an alternative (Blei et al. 2017; Mohan et al. 2022). A distribution (referred to as the variational distribution), parametrized by an arbitrary set of parameters φ, is used to approximate the posterior. The parameters φ are learned such that the distribution q_φ(θ) becomes close to the exact posterior p(θ|D). The closeness between the posterior and q_φ(θ) is quantified in terms of the KL-divergence (Zhang et al. 2018a; Jospin et al. 2020). More precisely, the formalism of the evidence lower bound (ELBO) is used for this purpose.
Maximizing the ELBO is equivalent to minimizing the KL-divergence, since ELBO = log p(D) − D_KL(q_φ ||p), where the log-evidence log p(D) is independent of φ and D_KL is the KL-divergence.
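As a concrete illustration of the quantity being minimized, the sketch below (not from the paper) compares the closed-form KL-divergence between two 1-D Gaussians with a Monte Carlo estimate of the same expectation over q, the kind of average that appears inside the ELBO; all function names and parameter values are illustrative.

```python
import numpy as np

def kl_gauss(mu_q, sig_q, mu_p, sig_p):
    """Closed-form KL(q || p) for two 1-D Gaussians."""
    return (np.log(sig_p / sig_q)
            + (sig_q**2 + (mu_q - mu_p)**2) / (2.0 * sig_p**2) - 0.5)

def kl_monte_carlo(mu_q, sig_q, mu_p, sig_p, n=200_000, seed=0):
    """Monte Carlo estimate of KL(q || p) = E_q[log q(x) - log p(x)].
    The constant -0.5*log(2*pi) terms cancel in the difference."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu_q, sig_q, size=n)
    log_q = -0.5 * ((x - mu_q) / sig_q)**2 - np.log(sig_q)
    log_p = -0.5 * ((x - mu_p) / sig_p)**2 - np.log(sig_p)
    return np.mean(log_q - log_p)
```

In variational inference the same expectation is taken over samples of the network weights, which is why the ELBO can be optimized with ordinary stochastic gradients.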

Features and advantages of BNN
Once the ELBO function is optimized and the variational distribution is learned/trained, we can use it to predict the labels of new (unseen) observations. The predictive posterior distribution p(y*|x*, D) is given by integrating out the model/variational parameters (trained using the dataset D):

$$ p(y^*|x^*, D) = \int p(y^*|x^*, \theta)\, p(\theta|D)\, d\theta, \quad (3) $$

where x* and y* represent the features and labels, respectively, and the superscript "*" denotes unseen data. In order to obtain this predictive distribution we use Monte Carlo sampling. First, we sample from the distribution of model parameters N times,

$$ \theta_j \sim p(\theta|D), \quad j = 1, \ldots, N. \quad (4) $$

Then x*_i is sampled n times from the distribution of the input features, whose width is determined by the errors associated with the feature itself,

$$ x^*_i \sim p(x^*), \quad i = 1, \ldots, n, \quad (5) $$

to estimate the predictive distribution as

$$ p(y^*|x^*, D) \approx \frac{1}{Nn} \sum_{j=1}^{N} \sum_{i=1}^{n} p(y^*|x^*_i, \theta_j). \quad (6) $$

The error/uncertainty in the predictions of a BNN can be broadly classified into two categories: epistemic and aleatoric uncertainty (Abdar et al. 2021; Hüllermeier & Waegeman 2021). "Aleatoric" uncertainty (AL) represents the uncertainty inherent in the data given as input to the model (e.g., measurement noise). In contrast, "epistemic" uncertainty (EP) represents limitations of the ML-based model, which usually stem from a lack of information about the system. Taking a larger dataset for training the model (spanning a larger parameter space) or improving the ML architecture can typically reduce the epistemic uncertainty. On the other hand, since the aleatoric uncertainty is intrinsic to the measurement noise of the input data, it cannot be reduced by solely improving the ML model. Both these uncertainties propagate to the output to give the net predictive uncertainty (Abdar et al. 2021), as shown in Figure 1.
We can measure the epistemic uncertainty, σ_EP, from the variance of the posterior distribution of the model parameters, p(θ_j|D):

$$ \sigma^2_{\rm EP} \sim {\rm Variance\ of\ } p(\theta|D). $$

σ_EP quantifies the intrinsic uncertainty associated with the model. It enables us to understand the correlation between the model performance, the quantity/quality of the data, and the parameters of the ML architecture. Thus, monitoring σ_EP can give insights on improving the overall ML model. The aleatoric uncertainty, σ_AL, is obtained from the variance of p(y*_i|x*_i, θ_j):

$$ \sigma^2_{\rm AL} \sim {\rm Variance\ of\ } p(y|x, \theta). $$
σ_AL quantifies the statistical uncertainty/error associated with the various measurements that go as input features to the neural net, both at the training stage and when the model is deployed for real-world applications. Knowing σ_AL, one can monitor the errors in the model's predictions as the measurement noise of the input features changes. The advantage of knowing both σ_EP and σ_AL is that, while testing the model, one can trace the source of uncertainty and improve the model/dataset accordingly.
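The decomposition above can be made concrete with a toy NumPy example (not the paper's network): an ensemble of sampled linear "models" stands in for posterior draws of θ, and a noisy scalar input stands in for a measured disk feature; all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-in for a trained BNN: N "posterior samples" of model weights
# (slopes/intercepts of a linear predictor, purely illustrative).
N, n = 200, 1000
w = rng.normal(2.0, 0.1, size=N)      # spread of w, b -> epistemic source
b = rng.normal(5.0, 0.2, size=N)

x_true, x_err = 0.7, 0.01             # measured feature and its noise

# Epistemic: fix the input, sample the model parameters N times.
pred_ep = w * x_true + b
sigma_ep = pred_ep.std()

# Aleatoric: fix one model (posterior mean), sample the noisy input n times.
x_draw = rng.normal(x_true, x_err, size=n)
pred_al = w.mean() * x_draw + b.mean()
sigma_al = pred_al.std()

# Total: sample both model parameters and input jointly.
x_joint = rng.normal(x_true, x_err, size=(N, n))
pred_tot = w[:, None] * x_joint + b[:, None]
sigma_tot = pred_tot.std()
```

For independent sources, the total spread is close to the two components added in quadrature, which is the behavior described in the text.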

DISK-PLANET SIMULATIONS
We model the structural variations of a dusty protoplanetary disk due to an embedded gap-opening planet using the fargo3d hydrodynamic simulation code (Benítez-Llambay & Masset 2016). We update the public version of the code by parametrizing the dust-gas interaction with a constant particle size instead of a fixed Stokes number. This includes a generalization of the steady-state profiles derived by Benítez-Llambay et al. (2019), which is presented in Appendix A. We regenerate the dataset with the updated simulations to train the BNN model. The simulation setup and the parameter space are discussed in the next section.

Protoplanetary Disk Setup
We model a simple 2D razor-thin disk with an embedded planet of mass M_P located at R = R_0. The planet is on a Keplerian orbit around a central star of mass M_* and is not migrating. The unit of time is the planet's orbital period P_0 = 2π/Ω_K0, where Ω_K0 is the Keplerian frequency. The subscript 0 denotes evaluation at the planet's location R_0.
The disk's initial gas surface density profile is given as

$$ \Sigma_g(R) = \Sigma_{g0} \left(\frac{R}{R_0}\right)^{-\sigma}, $$

where σ is the exponent of the surface density profile and Σ_g0 = 10^{−4} is a constant in code units, which is arbitrary for non-self-gravitating simulations without orbital migration. For reference, however, this surface density scale corresponds to Toomre parameters Q ≥ 80 for all of our disk models, which is sufficient for the disks to be gravitationally stable (Toomre 1964). The disk is flared with a constant flaring index F, such that the aspect ratio varies as

$$ h \equiv \frac{H}{R} = h_0 \left(\frac{R}{R_0}\right)^{F}, $$

where H is the disk scale height and h_0 is the aspect ratio at R = R_0. The disk is locally isothermal with a sound-speed profile of c_s(R) = h R Ω_K. To mimic the disk's turbulence we include viscous forces in the gas disk using the standard α-prescription (Shakura & Sunyaev 1973), where the kinematic viscosity ν = α c_s²/Ω_K and α is set by Equation (A21).
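For concreteness, the two power-law profiles above can be written as a short Python sketch; the parameter values below are example choices from within the ranges of Section 3.2, not the paper's specific models.

```python
import numpy as np

# Illustrative disk profiles in code units (R0 = 1); sigma_g0, sigma_exp,
# h0 and F are example values, not the paper's exact choices.
sigma_g0, sigma_exp = 1e-4, 1.0   # surface density scale and slope
h0, F = 0.05, 0.25                # aspect ratio at R0 and flaring index

def surface_density(R):
    """Initial gas surface density: Sigma_g = Sigma_g0 * (R/R0)^(-sigma)."""
    return sigma_g0 * R**(-sigma_exp)

def aspect_ratio(R):
    """Flared aspect ratio: H/R = h0 * (R/R0)^F."""
    return h0 * R**F
```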
We consider a single population of dust particles modeled as a pressureless fluid (Jacquet et al. 2011). The dust particle size is parameterized by the Stokes number St ≡ t_s Ω_K, where t_s is the stopping time characterizing the strength of the dust-gas coupling (Weidenschilling 1977). The initial dust surface density is Σ_d = ε Σ_g, where ε is the dust-to-gas ratio set by Equation (A19). The dust back-reaction onto the gas is included in the simulations.

Simulation Setup and Parameter Space
We run fargo3d planet-disk simulations on Graphics Processing Units (GPUs). Our computational domain is R ∈ [0.4, 3.0] R_0 and φ ∈ [0, 2π], discretized with 512 × 512 logarithmically spaced cells. The radial and azimuthal boundaries are set to their initial equilibrium solutions given in Equations (A10)-(A13). The planet mass is allowed to grow slowly over the first 100 P_0. We implement the wave-killing (de Val-Borro 2006) module to damp the planet-generated waves to zero near the boundaries, minimizing reflection and interference with the disk morphology.
We target super-Earth- to Saturn-sized planets with masses M_P ∈ [8, 120] M_⊕ around a 1 M_⊙ central star. We model the disk for a range of values of the disk aspect ratio h_0 ∈ [0.025, 0.1], viscosity parameter α_0 ∈ [10^{−4}, 10^{−2}], flaring index F ∈ [0, 0.25], and power-law slope of the surface density profile σ ∈ [0.05, 1.2]. In each disk we consider a single dust species of fixed size characterized by Stokes number St ∈ [10^{−3}, 10^{−1}] and abundance ε_0 ∈ [0.01, 0.1]. This range of St corresponds to particle radii ranging from ∼1 mm to ∼10 cm, for a typical gas surface density of 100 g cm^{−2} and an internal grain density of 1 g cm^{−3}.
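The quoted grain sizes follow from the Epstein drag relation; a minimal sketch (assuming the Epstein regime, with the fiducial gas surface density and grain density quoted in the text):

```python
import numpy as np

def grain_radius_cm(stokes, sigma_gas=100.0, rho_grain=1.0):
    """Epstein-regime grain radius (cm) at the disk midplane, from
    St = pi * a * rho_grain / (2 * Sigma_gas) => a = 2 St Sigma / (pi rho).
    Defaults are the fiducial values quoted in the text (cgs units).
    St = 1e-3 -> ~0.6 mm; St = 0.1 -> ~6 cm."""
    return 2.0 * stokes * sigma_gas / (np.pi * rho_grain)
```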

DPNNET-BAYESIAN
In this section we introduce the Bayesian implementation of the DPNNet model.

Network architecture
The DPNNet model as presented in Auddy & Lin (2020) was based on artificial neural networks with a linear layer at the end for performing regression. In the Bayesian adaptation of DPNNet, the front end of the model includes an MLP layer with 256 neurons and a ReLU activation function (this part of the network is the same as DPNNet-1.0). This is followed by a fully connected variational dense layer of 128 neurons, in which the BNN algorithms are implemented to probe the model uncertainty. We use the method of variational inference, as described in Section 2.1, to account for the stochasticity of the model parameters. Each model parameter is treated as a distribution with some variance. Stochastic sampling from this variational distribution gives an estimate of the model uncertainty. Furthermore, we use a trainable prior distribution for the parameter space, p(θ), which is used to determine the approximate posterior distribution of the model weights via variational inference.
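To illustrate the idea behind a variational dense layer, here is a minimal NumPy sketch (the actual model uses TensorFlow Probability's variational layers; the class below is a purely illustrative stand-in): each weight is a Gaussian, and every forward pass draws fresh weights, so repeated calls marginalize over the weight distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

class VariationalDense:
    """Toy variational dense layer: each weight is a Gaussian (mean, std)
    and every forward pass samples fresh weights, so repeated calls on the
    same input yield a distribution of outputs. Illustrative only."""
    def __init__(self, n_in, n_out, init_std=0.1):
        self.w_mu = rng.normal(0.0, 1.0 / np.sqrt(n_in), (n_in, n_out))
        self.w_std = np.full((n_in, n_out), init_std)
        self.b_mu = np.zeros(n_out)
        self.b_std = np.full(n_out, init_std)

    def __call__(self, x):
        w = rng.normal(self.w_mu, self.w_std)   # sample weights
        b = rng.normal(self.b_mu, self.b_std)   # sample biases
        return x @ w + b

layer = VariationalDense(8, 1)        # 8 disk features -> 1 planet mass
x = rng.normal(size=(1, 8))           # one (scaled) feature vector
preds = np.array([layer(x)[0, 0] for _ in range(200)])  # marginalization
```

The spread of `preds` is the sketch-level analogue of the epistemic uncertainty described in Section 2.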

Data acquisition and pre-processing
For training and testing the model, we produce a new dataset following the same procedure as in Paper 1. We generate a uniform random distribution of the parameters, each centered within the sampling interval, using the Latin hypercube sampling method (McKay et al. 1979; Iman et al. 1981). We run a total of 1800 simulations with the range of input parameters described in Section 3.2. From each simulation we obtain the dust and gas surface density maps of the PPD with the embedded planet. Figure 2 shows sample images of the dust surface density distribution and the corresponding radial profiles of the azimuthally averaged surface density from some of the simulations. These improved disk-planet simulations (with fixed grain sizes and initial equilibrium disk setup) result in a much more physical model for the dust-gas interaction compared to Paper 1 and Paper 2.

Figure 2. The upper panel shows the normalized dust surface density distribution of disks with different initial conditions and different planet masses at 3000 P_0. The disk parameters are indicated at the top right of each plot, with the planet mass at the bottom right. The lower panel shows the radial profile of the azimuthally averaged surface density at the beginning of the simulation (Σ_d0; dashed black) and at the end of the simulation after the gap is formed (Σ_d; solid red). The horizontal arrow marks the dust gap width, defined as the distance between the inner and outer edges of the gap (vertical dotted lines) where Σ_d(R) reaches 50% of the initial surface density Σ_d0. The cross indicates the minimum surface density Σ_dmin.
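The Latin hypercube sampling of the parameter space can be sketched as follows (a minimal hand-rolled sampler, with a subset of the Section 3.2 ranges as an example; the paper cites McKay et al. 1979 for the method itself):

```python
import numpy as np

def latin_hypercube(n_samples, bounds, seed=0):
    """Simple Latin hypercube sampler: exactly one sample per stratum
    along each dimension, scaled to the given [low, high] bounds."""
    rng = np.random.default_rng(seed)
    d = len(bounds)
    # Permute the stratum indices independently per dimension, then
    # jitter each sample uniformly within its stratum.
    u = (rng.permuted(np.tile(np.arange(n_samples), (d, 1)), axis=1).T
         + rng.uniform(size=(n_samples, d))) / n_samples
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    return lo + u * (hi - lo)

# Example: (M_P [M_earth], alpha_0, h_0, F) -- a subset of the paper's ranges
samples = latin_hypercube(1800, [(8, 120), (1e-4, 1e-2), (0.025, 0.1), (0, 0.25)])
```

Unlike plain uniform sampling, every one-dimensional stratum is hit exactly once, which gives a more even coverage of the parameter space for the same number of simulations.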
Following Paper 1 and Kanagawa et al. (2016), we define the dust gap width as

$$ w_d = \frac{R_{d,\rm out} - R_{d,\rm in}}{R_0}, $$

where R_d,in and R_d,out are the inner and outer edges of the gap, at which the dust surface density Σ_d equals a predefined threshold fraction of Σ_d0. For consistency with Paper 1, we adopt a threshold fraction of 1/2 of the initial surface density. We acknowledge that this definition of w_d has its limitations, as it fails for shallow gaps (opened by low-mass planets) whose depth is less than the threshold fraction. However, this does not limit the model's application, as DPNNet-Bayesian can be retrained for any threshold fraction or other definition of the gap width.
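A minimal sketch of this gap-width measurement on an azimuthally averaged profile (synthetic data; the edge-finding below assumes a single gap and is illustrative only):

```python
import numpy as np

def gap_width(R, sigma_d, sigma_d0, threshold=0.5):
    """Dust gap width (R already in units of R0): distance between the
    first and last radii where Sigma_d drops below threshold * Sigma_d0.
    Assumes a single gap; illustrative only."""
    below = sigma_d < threshold * sigma_d0
    if not below.any():
        return 0.0        # gap too shallow for this threshold
    idx = np.flatnonzero(below)
    return R[idx[-1]] - R[idx[0]]

# Synthetic radial profile with a Gaussian gap carved out at R = 1
R = np.linspace(0.4, 3.0, 512)
sigma_d0 = R**-1.0                                            # initial profile
sigma_d = sigma_d0 * (1 - 0.9 * np.exp(-((R - 1.0) / 0.15)**2))  # gapped profile
```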
In the current paper we only consider simulations with either one or two dust gaps. We remove models with more than two dust gaps, as the number of such samples is not adequate to effectively train the model. To build our feature space we define two dust gap widths for each simulation, w_d1 and w_d2, setting w_d2 = 0 if only a single dust gap is formed.
The initial dataset consists of 1800 simulations. We pre-process the data by eliminating runs that do not open up detectable axisymmetric gaps or that have more than two gaps. After this initial filtering we are left with data from 1098 simulations. Each simulation is characterized by its input parameters (M_P, α_0, h_0, σ, F, St, ε_0) and the associated gap width(s) in dust (w_d1, w_d2) and gas (w_g). In this paper we will not use the gas gap width as an input parameter for training the model, since it is not directly measured from observations.
Once the gap widths are measured, we assign each simulation a set of feature variables (w_d1, w_d2, ε_0, α_0, St, h_0, σ, F) that serve as the input to our model, and label the planet mass, M_P, as the target variable. The data for each feature variable are then normalized using standard (z-score) scaling, subtracting the mean and dividing by the standard deviation.
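The z-score scaling step can be sketched as follows (illustrative; the key point is that the training-set statistics are saved and reused for any later input, including observations):

```python
import numpy as np

def zscore_fit(X):
    """Return per-feature mean and std from the training set; the SAME
    statistics must be applied to any later (test/observed) inputs."""
    return X.mean(axis=0), X.std(axis=0)

def zscore_apply(X, mu, sd):
    """Standard (z-score) scaling with precomputed statistics."""
    return (X - mu) / sd

# Toy feature matrix: 1000 samples x 8 feature variables
X_train = np.random.default_rng(1).normal(5.0, 2.0, size=(1000, 8))
mu, sd = zscore_fit(X_train)
X_scaled = zscore_apply(X_train, mu, sd)
```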

Training DPNNet-Bayesian
DPNNet-Bayesian is implemented using Google TensorFlow Probability (Dillon et al. 2017), an open-source platform that integrates probabilistic methods with deep neural networks. We split the data randomly into two blocks: a training set consisting of 80% and a testing set consisting of 20% of the simulation data. Furthermore, 20% of the training dataset is used for model validation. We use the Adam optimizer with a learning rate of 0.001. The model is trained for ∼2200 epochs, and an early-stopping callback is used to prevent the model from over-fitting.
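The early-stopping logic can be sketched as a generic patience-based rule (illustrative; not necessarily the exact callback configuration used in the paper):

```python
import numpy as np

def train_with_early_stopping(losses, patience=10):
    """Generic early-stopping rule: stop once the validation loss has not
    improved for `patience` consecutive epochs; return the best epoch.
    `losses` stands in for the per-epoch validation loss."""
    best, best_epoch, wait = np.inf, 0, 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break          # no improvement for `patience` epochs
    return best_epoch

# Toy validation-loss curve: improves for 50 epochs, then plateaus noisily
rng = np.random.default_rng(3)
curve = np.concatenate([np.linspace(1.0, 0.2, 50),
                        0.2 + 0.01 * rng.random(100)])
```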

RESULTS
Once the model is trained, it is ready to be deployed to predict planet masses from observed disk features. We test the network's performance on the test dataset, to which the model has not been exposed. This enables us to quantify the network's accuracy in predicting planet masses from unseen data. The test dataset comprises all the feature variables along with the target variable (the true value of the planet mass). The feature variables (w_d1, w_d2, ε_0, α_0, St, h_0, σ, F) are normalized with the same scaling that the network was trained on. The trained DPNNet-Bayesian network takes the feature variables as input and predicts the corresponding planet mass. Unlike the standard MLP model, for the same set of input feature variables DPNNet-Bayesian gives the predicted output as a distribution. Figure 3 shows the correlation between the simulated and the predicted planet masses in Earth-mass (M_⊕) units for all the test samples. For each set of input variables we deploy the trained network 200 times to obtain the marginalization of the predicted output. We thus obtain a distribution for the predicted planet mass (as demonstrated in Figure 4), capturing the uncertainties associated with the parameters of the trained model. The error bar gives the standard deviation associated with each predicted value. The predicted values lie along the black line, indicating a strong correlation with an r²-score of r_b = 0.82.

Error Analysis using BNN
As mentioned before, the Bayesian approach offers a way to quantify and understand the uncertainties associated with predictions from deep neural networks. As an example, in Figure 4 we demonstrate the different uncertainty estimates associated with a prediction from DPNNet-Bayesian for a sample of disk features selected randomly from the test dataset. We pick a disk-planet simulation with parameters/features ε_0 = 0.04, α_0 = 0.008, St = 0.08, h_0 = 0.03, σ = 1.11, and F = 0.21, and measured dust gap widths w_d1 = 0.74 and w_d2 = 0. Since these are simulated data, the true value of the planet mass, M_P = 46 M_⊕, is known. To capture the epistemic uncertainty we deploy the trained model N times for this fixed set of disk features. We set N = 200, with the index i fixed, in Equation (6), and for each iteration the model predicts a unique and independent value. This results in a distribution of predicted planet masses, as seen in the left panel of Figure 4. The mean (μ_EP = 47.6 M_⊕), shown with the red dotted line, is taken as the final predicted value, while the standard deviation (σ_EP = 5.8 M_⊕) represents the uncertainty associated with the trained model. The measured disk features are also prone to statistical errors due to measurement techniques and limitations in the observed data. While parameters in simulated models are well constrained, that is not the case for observed disk data. In order to simulate the uncertainties associated with the measured disk parameters we use a Monte Carlo analysis. Instead of using fixed true values for the disk parameters, we sample them randomly n times from normal distributions with means set to the true values and standard deviations equal to a 1% spread for each parameter. We set n = 1000, with j fixed, in Equation (6), such that the model predicts a planet mass for each set of sampled input parameters.
The center panel of Figure 4 shows the aleatoric uncertainty, σ_AL = 8.3 M_⊕, in the predicted planet mass due to the measurement noise of the input disk parameters.
The rightmost panel of Figure 4 captures the total uncertainty in the predicted planet mass due to both the epistemic and aleatoric uncertainties. We sample the disk parameters randomly, n = 1000 times, from their respective normal distributions, and for each draw we deploy the model N = 200 times. This results in a distribution of predicted planet masses with mean μ = 49.8 M_⊕ and standard deviation σ = 8.1 M_⊕.

APPLICATION TO OBSERVATION
As a first test of our newly improved DPNNet-Bayesian model, we deploy it to estimate planet masses from the observed dust gaps in the protoplanetary disks around HL Tau and AS 209. The results are elaborated and discussed below.

HL Tau
HL Tau is a well-studied system, which shows distinct axisymmetric, possibly planet-induced, gaps in the dust emission (ALMA Partnership et al. 2015). Several studies involving hydrodynamic simulations (Dong et al. 2015; Jin et al. 2016; Dipierro et al. 2015) suggest that each of the gaps is induced by a single planet. Under the assumption that the observed gaps are all planet induced, we use the properties of the observed gaps as input features to our BNN model to predict the planet masses. DPNNet-Bayesian predicts a distribution for the planet mass in each of the gaps. For the three identified gaps at R = 10 au, 30 au, and 80 au, we obtain the gap widths (w_d1 = 0.81, 0.23, and 0.29) and adopt disk aspect ratios of h_0 = 0.05, 0.07, and 0.10, respectively, from Kanagawa et al. (2015b, 2016). HL Tau is mostly laminar with viscosity α_0 = 10^{−3} (Dipierro et al. 2015) and a central star mass of M_* = 1 M_⊙. We set a canonical value of the dust-to-gas ratio ε_0 = 10^{−2} (ALMA Partnership et al. 2015; Dong et al. 2015) and adopt a mean Stokes number St = 0.005 (Dipierro et al. 2015) for all three gaps. Since the uncertainty associated with the input parameters is not well constrained across the literature, we take a conservative approach and consider a 1-5% variation in the gap widths only. The top panel of Figure 5 shows the distribution of predicted masses in all three identified gaps for the given disk parameters and gap widths. The mean values along with the standard deviations of the predicted planet masses at 10 au, 30 au, and 80 au are 86.0 ± 5.5 M_⊕, 43.8 ± 3.3 M_⊕, and 92.2 ± 5.1 M_⊕, respectively. This standard deviation captures the total uncertainty, which includes the uncertainty associated with the BNN architecture (σ_EP estimated as 4.0 M_⊕, 3.3 M_⊕, and 5.0 M_⊕, respectively, for the three gaps) along with the aleatoric uncertainty. For the assumed variation of 1-5% in the gap widths, σ_AL is comparable to σ_EP. However, σ_AL can vary depending on the noise in the input data. These estimates are mostly consistent with the planet masses inferred from direct numerical simulations such as Dong et al. (2015); Jin et al. (2016); Dipierro et al. (2015) and from DPNNet-1.0 in Paper 1.

Figure 5. Top panel: distribution of the predicted planet mass at the three identified gaps in HL Tau obtained using DPNNet-Bayesian. The mean values along with the standard deviations of the predicted planet masses at 10 au, 30 au, and 80 au are 86.0 ± 5.5 M_⊕, 43.8 ± 3.3 M_⊕, and 92.2 ± 5.1 M_⊕, respectively. Bottom panel: distribution of predicted masses for the two identified gaps in AS 209 for the given disk parameters and gap widths. The mean predicted planet masses along with the standard deviations are 20.6 ± 2.2 M_⊕ and 36.3 ± 2.5 M_⊕ at 9 au and 99 au, respectively.

AS 209
AS 209 is another well-studied system, hosting multiple gaps at 9, 24, 35, 61, 90, 105, and 137 au. However, for simplicity we only consider the two most prominent gaps, at R = 9 and 99 au. We adopt the disk parameters and dust gap widths from Zhang et al. (2018b) and use a range of values for the dust abundance, particle size, and disk viscosity that lie within our training parameter space. For example, we set α_0 = 10^{−4}, St = 156.9 × 10^{−4}, and surface density slope σ = 1 (Fedele et al. 2018) for both gaps. For the innermost gap at R = 9 au, we assign w_d1 = 0.42, h_0 = 0.04, and ε_0 = 0.012. Similarly, for the outer gap at R = 99 au we select w_d1 = 0.31, h_0 = 0.08, and ε_0 = 0.017. Once again we consider a 1-5% variation in the gap widths as a conservative approximation to capture the uncertainty associated with their measurement. The bottom panel of Figure 5 shows the distribution of predicted masses for the two identified gaps in AS 209 for the given disk parameters and gap widths. The mean predicted planet masses along with the standard deviations are 20.6 ± 2.2 M_⊕ and 36.3 ± 2.5 M_⊕ at 9 au and 99 au, respectively. The standard deviation captures the total uncertainty, which includes the epistemic uncertainty (σ_EP is 2.0 M_⊕ and 2.1 M_⊕ for the two gaps, respectively) along with the aleatoric uncertainty. σ_AL is comparable to σ_EP for the assumed variation of 1-5% in the gap widths. However, σ_AL can vary depending on the noise in the input data. These estimates, particularly at R = 9 au, are lower than those of Zhang et al. (2018b). The discrepancy is partially due to differences in how the gap widths are measured, the uncertainty in the adopted disk parameters, and the simulation setup.

DISCUSSION AND CONCLUSION
In this paper we introduced DPNNet-Bayesian, a Bayesian neural network (BNN) based model that characterizes the masses of unseen planets from gaps observed in PPDs. Compared to its predecessor DPNNet-1.0, the Bayesian approach has the following advantages. First, it quantifies the errors associated with the predictions of the network and traces the source of the uncertainty. Second, it provides a framework for understanding the various regularization techniques used in traditional deep learning approaches (Jospin et al. 2020). Finally, whereas in traditional deep learning models the weights are initialized by some implicit prior, the Bayesian framework accounts for this prior distribution explicitly.
For a BNN, one trains a distribution/ensemble of deep learning models and then, based on the Bayesian formalism, performs model averaging/marginalization of the output. Note that the BNN approach is fundamentally different from "deep ensembles," where different initializations are used to train a set of networks and the outputs are then combined/optimized. In the BNN approach, a weighted average is computed using the posterior of the network parameters as the weights; in deep ensembles, the averaging is done directly over the outputs of the networks.
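The distinction between posterior-weighted marginalization and plain ensemble averaging can be made concrete with a toy one-parameter regression model. Everything below (the linear model, the data, the Gaussian prior) is an invented example, not part of DPNNet-Bayesian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D model f(x; w) = w * x with noisy training data ("true" w near 2).
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([2.1, 3.9, 6.2])

# Grid of candidate weights standing in for sampled network parameters.
w_grid = np.linspace(0.0, 4.0, 401)

# Un-normalized posterior p(w | D): Gaussian likelihood times Gaussian prior.
log_like = -0.5 * np.sum((y_train - np.outer(w_grid, x_train)) ** 2, axis=1)
log_prior = -0.5 * w_grid ** 2 / 10.0
post = np.exp(log_like + log_prior)
post /= post.sum()

x_new = 5.0
preds = w_grid * x_new

# BNN-style prediction: posterior-weighted average (marginalization).
bnn_mean = np.sum(post * preds)

# Deep-ensemble-style prediction: unweighted average over a few members
# (here drawn from the posterior as a stand-in for independent trainings).
members = rng.choice(w_grid, size=5, p=post)
ens_mean = np.mean(members * x_new)

print(bnn_mean, ens_mean)
```

With many members the two averages converge, but only the marginalized form carries explicit posterior weights.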
Furthermore, the error/uncertainty associated with the output of a BNN model is different from the standard metrics, such as the root mean square error (RMSE) and/or the mean absolute error (MAE), used in traditional deep learning models. Both the MAE and RMSE are measures of the error (|M_P,simulation − M_P,predicted|) when the model is deployed on a validation/testing dataset for which the "true" target values are known. However, they do not necessarily quantify the error accurately (or provide a confidence interval) when the model is applied to an unknown dataset for which the target values are not known. This can be problematic, as there is no way of knowing how accurate the prediction is, and one must rely on the validation/testing error as an indication of the accuracy of the prediction. Since a BNN model works on the principle of marginalization rather than optimization, the output is obtained by sampling from a posterior distribution, thus providing the necessary confidence interval.
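For a BNN with a heteroscedastic output head, the sampled posterior yields the standard decomposition of the predictive variance: the mean of the per-sample noise variances (aleatoric) plus the variance of the per-sample means (epistemic). A minimal sketch with made-up numbers (not taken from DPNNet-Bayesian itself):

```python
import numpy as np

# Each forward pass t with sampled weights returns a predicted mean mu_t and
# a data-noise variance sigma_t^2. The values below are hypothetical.
mu = np.array([85.2, 86.7, 84.9, 87.1, 86.1])      # predicted mass per sample, M_Earth
sigma2 = np.array([16.4, 15.8, 17.0, 16.1, 16.6])  # aleatoric variance per sample

aleatoric = sigma2.mean()      # E[sigma_t^2]: noise inherent in the input data
epistemic = mu.var()           # Var[mu_t]: uncertainty in the model itself
total = aleatoric + epistemic  # total predictive variance

mass = mu.mean()
print(f"{mass:.1f} +/- {np.sqrt(total):.1f} M_Earth")
```

The reported interval (mean ± sqrt(total)) is exactly the kind of confidence estimate unavailable from a point-prediction network scored only by RMSE/MAE.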
The performance of any ML-based model depends on the quality and/or quantity of the training dataset. The Bayesian approach helps us understand the correlation between the quality/quantity of the dataset and the model performance by monitoring the change in the model uncertainty. Knowledge of the epistemic and aleatoric uncertainties enables the model to learn from a smaller dataset by regulating over-fitting or under-fitting. Thus, the BNN architecture plays a pivotal role in improving the training process as well as in determining the optimal deep learning architecture.
We conclude by highlighting the key results:
• DPNNet-Bayesian estimates the planet mass from the gap width observed in dust emission. It takes as input the dust properties (abundance, Stokes number), the gas disk properties (aspect ratio, surface density profile, viscosity), along with the dust gap widths from observations, and predicts the embedded planet mass.
• For training DPNNet-Bayesian we generated a dataset from disk-planet simulations using the fargo3d hydrodynamics code with a newly implemented fixed grain size module and improved initial conditions. The trained model is tested on the simulated dataset in order to illustrate its applicability. The predicted mass is closely correlated with the 'true' value of the simulated planet mass.
Here ρ_p is the density of the dust particles, s is the radius of the dust particles, and Σ_g is the gas surface density. For a fixed particle size, s = constant, while the Stokes number St ∝ Σ_g⁻¹ is a function of the disk radius, since Σ_g ∝ r^(−σ), unlike the default setup where St is set to a constant.

Figure 6. The radial profile of the azimuthally averaged gas and dust surface density at the beginning of the simulation and after 1000 P_0. The steady-state profile is maintained, with little evolution in the absence of an embedded planet.
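The radial dependence of the Stokes number under the fixed-grain-size module follows directly from St ∝ Σ_g⁻¹. A quick numerical sketch (the Σ_g normalization below is arbitrary; only the St value at r = r_0 is taken from the text):

```python
import numpy as np

# With the grain radius s held fixed and Sigma_g ∝ r^{-sigma},
# the Stokes number St ∝ 1/Sigma_g grows outward as r^{sigma}.
sigma = 1.0                        # surface density slope
r = np.array([0.5, 1.0, 2.0])      # radius in units of r0
Sigma_g = 100.0 * r ** (-sigma)    # toy normalization in arbitrary units
St0 = 156.9e-4                     # Stokes number at r = r0
St = St0 * (Sigma_g[1] / Sigma_g)  # St ∝ Sigma_g^{-1}

print(St)  # equal to St0 * r**sigma
```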
To maintain a constant mass flux, i.e., $\partial_r (r \Sigma_g \delta v_{gr}) = \partial_r (r \Sigma_d \delta v_{dr}) = 0$, we set
$$ r \Sigma_g \delta v_{gr} = \mathrm{constant}. \quad (A16) $$
Here we only consider the gas component, but the same result can be arrived at for the dust as well. Furthermore, since $r \Sigma_g \propto r^{1-\sigma}$, $\delta v_{gr}$ must scale as $\delta v_{gr} \propto (r/r_0)^{\sigma-1}$ such that the net mass flux is constant. Thus, from Equations (A10) and (A17) we can establish
$$ \delta v_{gr}(r) = \delta v_{gr,0} \left( \frac{r}{r_0} \right)^{\sigma-1} = \frac{a \epsilon}{\epsilon^2 + b \epsilon + c}, \quad (A18) $$
where $a = -2\beta(\beta - 1) v_K St$, $b = \beta(2\gamma + 1)$, $c = 2\beta^2(\gamma + 1) + 2\beta^2 St^2 (\gamma + 1)$, $\delta v_{gr,0} = a_0 \epsilon_0 / (\epsilon_0^2 + b_0 \epsilon_0 + c_0)$, and the subscript "0" refers to the values at $r = r_0$. Rearranging Equation (A18), we solve for the dust-to-gas ratio and arrive at
$$ \epsilon = \frac{-(b d_0 - a) \pm \sqrt{(b d_0 - a)^2 - 4 d_0^2 c}}{2 d_0}, \quad (A19) $$
where $d_0 = [a_0 \epsilon_0 / (\epsilon_0^2 + b_0 \epsilon_0 + c_0)] (r/r_0)^{\sigma-1}$ and the negative root gives the physical solution. Thus, in order to satisfy the condition of constant mass flux for a fixed particle size, we adjust the dust-to-gas ratio accordingly. Furthermore, the viscous velocity of the gas, $V^{\rm 2D}_{\rm vis}$, without dust grains is given as (Lynden-Bell & Pringle 1974)
$$ V^{\rm 2D}_{\rm vis} = -\frac{3}{\Sigma_g R^{1/2}} \partial_R \left( \nu \Sigma_g R^{1/2} \right), \quad (A20) $$
where $\nu = \alpha c_s h$ is the kinematic viscosity. For a steady-state background solution, i.e., $V^{\rm 2D}_{\rm vis} = 0$, we set $\nu \Sigma_g R^{1/2} = \mathrm{constant}$ in Equation (A20), which is equivalent to $\nu \Sigma_g R^{1/2} = \nu_0 \Sigma_{g,0} R_0^{1/2}$, where the subscript "0" refers to the values at $r = r_0$. We initialize the disk-planet FARGO3D simulations with the updated dust-to-gas ratio, Equation (A19), and fix the particle size to a constant value using the fixed particle module. Figure 6 shows the azimuthally averaged initial dust and gas radial surface density profiles and the final surface density profile after the system has evolved for 1000 P_0. As demonstrated, the steady-state profile is maintained, with little evolution in the absence of an embedded planet.
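As a sanity check on the algebra behind Equation (A19), the selected root of the quadratic should reproduce the drift-velocity constraint it was derived from. The coefficient values below are arbitrary toys, chosen only so that the physical root lands near a typical dust-to-gas ratio of order 0.01; they are not disk values from the simulations.

```python
import numpy as np

# Equating a*eps / (eps**2 + b*eps + c) = d0 gives the quadratic
# d0*eps**2 + (b*d0 - a)*eps + c*d0 = 0, whose root is Equation (A19).
a, b, c, d0 = 3.0e-3, 0.8, 1.2, 2.5e-5   # toy coefficients, not physical values

disc = (b * d0 - a) ** 2 - 4.0 * d0 ** 2 * c
eps = (-(b * d0 - a) - np.sqrt(disc)) / (2.0 * d0)  # the "negative" (physical) root

# The root must satisfy the original constraint to floating-point precision.
residual = a * eps / (eps ** 2 + b * eps + c) - d0
print(eps, residual)
```

For these toy numbers the root evaluates to a small positive dust-to-gas ratio, while the other sign choice gives an unphysically large value.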