Disentangling CO Chemistry in a Protoplanetary Disk Using Explanatory Machine-learning Techniques

Molecular abundances in protoplanetary disks are highly sensitive to the local physical conditions, including gas temperature, gas density, radiation field, and dust properties. Often multiple factors are intertwined, impacting the abundances of both simple and complex species. We present a new approach to understanding these chemical and physical interdependencies using machine learning. Specifically, we explore the case of CO modeled under the conditions of a generic disk and build an explanatory regression model to study the dependence of CO spatial density on the gas density, gas temperature, cosmic-ray ionization rate, X-ray ionization rate, and UV flux. Our findings indicate that combinations of parameters play a surprisingly powerful role in regulating CO abundance compared to any singular physical parameter. Moreover, in general we find the conditions in the disk are destructive toward CO. CO depletion is further enhanced in an increased cosmic-ray environment and in disks with higher initial C/O ratios. These dependencies uncovered by our new approach are consistent with previous studies, which are more modeling intensive and computationally expensive. Our work thus shows that machine learning can be a powerful tool not only for creating efficient predictive models, but also for enabling a deeper understanding of complex chemical processes.


INTRODUCTION
The chemistry of protoplanetary disks, like the chemistry of the interstellar medium (ISM), is highly sensitive to the local physical conditions.However, the densities, high radiation fields, and dust growth in disks lead to unique chemical pathways that are not reflected in the ISM.To interpret line observations and effectively use molecules as tracers of important properties such as the disk mass or the cosmic ray ionization rate, it is crucial to determine how the wide range of physical conditions in disks impact chemical abundances.
CO, for example, has been recently shown to be more sensitive to the disk environment than initially thought.Specifically, gas-phase CO is found to be sub-interstellar in disks when comparing CO-derived gas masses with HD and dust-derived masses (Bergin et al. 2013;Favre et al. 2013;Ansdell et al. 2016;McClure et al. 2016;Long et al. 2017;Miotello et al. 2017;Bergin & Williams 2018;Calahan et al. 2021) even after accounting for freeze-out in the cold midplane, UV-driven photodisso- * Carnegie Postdoctoral Fellow † STScI Postdoctoral Fellow ciation in the surface layers, and isotope-selective photodissociation (Miotello et al. 2014(Miotello et al. , 2017)).This suggests that there are additional chemical processes impacting CO.Models suggest that CO can be chemically processed by cosmic rays and X-rays (e.g., Reboussin et al. 2015;Bosman et al. 2018;Dodson-Robinson et al. 2018;Schwarz et al. 2018Schwarz et al. , 2019)).These studies also find that CO processing is influenced by the temperature structure, UV radiation field, and dust distribution, confirming the significant impact of the disk environment on the chemistry.
A common approach to study how CO and other molecular species change with the environment is to generate a grid of full 2D chemical models with varying disk and stellar properties (e.g., Walsh et al. 2010Walsh et al. , 2012;;Reboussin et al. 2015;Schwarz et al. 2018Schwarz et al. , 2019)).This approach has greatly informed our understanding of molecules and has been especially helpful when interpreting observations.However, generating such a grid quickly becomes computationally expensive, which makes it challenging to carry out a detailed exploration of parameter space.
In the present paper, we explore whether complementing chemical modeling efforts with machine learn-Diop et al.
ing (hereafter ML) techniques can enable us to explore large parameter spaces and analyze large model data sets more efficiently and unveil relationships that deepen our understanding of chemical processes in disks.We focus on CO since, over the years, we have gained a solid understanding of how its chemistry is impacted by the environmental conditions, which provides a point of comparison for our purely ML-driven approach.ML is a subset of artificial intelligence that uses data-driven algorithms to unveil trends in large amounts of data.In recent years, ML has gained popularity in the field of astronomy given the increasing complexity and amount of astronomical data (see Baron 2019).It has been used, for example, to classify stellar spectra (Sánchez Almeida & Allende Prieto 2013), detect unusual galaxies (Baron & Poznanski 2017), derive star formation rates (Delli Veneri et al. 2019), calculate masses of protoplanets (Auddy & Lin 2020; Zhang et al. 2022), detect protoplanets in disks (Terry et al. 2022), search for exoplanets (Valizadegan et al. 2022), and analyze the spectral energy distributions of protoplanetary disks (Kaeufer et al. 2023).
Our goal is to assess the ability of ML in helping us understand the interplay between chemistry and physical conditions using CO as a test case.We focus on the explanatory power of ML rather than its predictive power, i.e., we aim to build a reliable predictive model that can also help us understand the relationship between the inputs (disk physical conditions) and the output (CO abundance) and whether these inputs favor the production or destruction of CO.The predictive power of ML has fully been exploited in recent studies to speed up astrochemical modeling (e.g., de Mijolla et al. 2019;Holdship et al. 2021;Grassi et al. 2021;Smirnov-Pinchukov et al. 2022).For our ML data set, instead of generating a grid of 2D chemical models, we focus on a single model since disks inherently have within them a diversity of environments due to the temperature, density, and radiation gradients.

Disk modeling
We adopt a chemical model published in Anderson et al. (2021).This model is not intended to represent the diversity of known disks (e.g., MAPS I ( Öberg et al. 2021)), but rather, it is a generic model that highlights the range of physical conditions naturally present within a single disk, which allows us to sample a large parameter space and investigate its impact on CO chemistry.Essentially, we aim to leverage the diversity of chemical environments within the disk to reveal how the CO abundance (the output of the ML model) and the phys-ical conditions (the inputs of the ML model) are linked with a regression-based analysis.We summarize here the main physical and chemical features of the model, but we direct the reader to Anderson et al. (2021) for further details.
The disk structure consists of a two-dimensional azimuthally symmetric disk with a gas mass of 0.01 M ⊙ and a dust mass of 10 −4 M ⊙ .The disk surrounds a 1 M ⊙ T Tauri star with a radius of 2.8 R ⊙ , an effective temperature of 4300 K, and an X-ray luminosity of 10 30 erg s −1 .We use two dust populations: a small dust population (0.005 -1 µm) and a large dust population (0.005 µm -1 mm), which are made up of 80 % astronomical silicates and 20 % graphite (Draine & Lee 1984).Each dust population follows an MRN size distribution (Mathis et al. 1977).The large dust population contains 90 % of the total dust mass, while the remaining mass is made up of the small dust population.The radial surface densities for the gas and dust are respectively described by: where R c = 30 au.The scale height for the gas and small grains is given by: The scale height of the large grains is five times lower in order to simulate dust settling, following Andrews et al. (2011).The densities for the small and large dust populations are given by: where f = 0.9 represents the fraction of mass in the large grains and χ = 0.2 denotes the ratio between the scale height of the large grains to that of the small grains.
Radiative transfer is performed using TORUS (Harries 2000; Harries et al. 2004;Kurosawa et al. 2004;Pinte et al. 2009) to calculate dust temperatures given the assumed gas and dust structure.The attenuation for the UV and X-ray radiation fields is carried out using the Monte Carlo radiative transfer code by Bethell & Bergin (2011a,b).The gas temperatures are estimated based on the local UV flux and gas density using fitting functions to thermochemical models of Bruderer (2013) as outlined in Cleeves et al. (2015).For the cosmic ray ionization, we use the solar system minimum model from Cleeves et al. (2015).The incident cosmic ray ionization rate is 1.1 × 10 −18 s −1 and is modulated with vertical depth (see Cleeves et al. 2015).
The model is sampled on a grid of 130 radii between 0.2 and 100 au, logarithmically spaced, and 50 heights linearly spaced in z/r (0 < z/r < 0.6) to capture the wide diversity of chemical environments spanning the disk.Figure 1 illustrates how the physical parameters vary from location to location in the disk model.
The abundance of CO as a function of time is calculated with the time-evolving gas-grain chemical code of Fogel et al. (2011), which is based on the Ohio State University gas-phase network (Smith et al. 2004).The code has been subsequently updated in Cleeves et al. (2013Cleeves et al. ( , 2014Cleeves et al. ( , 2015Cleeves et al. ( , 2018) ) and Anderson et al. (2021).The abundances are computed by applying the rate equation method.Our network is comprised of 654 species and 7039 reactions and includes neutral-neutral reactions, ion-neutral reactions, ion recombination with electrons, a limited number of grain surface reactions, freeze-out, cosmic ray ionization, thermal and non-thermal desorption, and photodissociation including self-shielding for H 2 , CO, and N 2 (Lee et al. 1996;Visser et al. 2009;Li et al. 2013).The abundances are calculated for 100 time steps between 1 yr and 3 Myr.The initial abundances are shown in Table 1.The disk has initial C/H and O/H ratios of 1.3 × 10 −4 and 2.28 × 10 −4 respectively, which corresponds to an initial C/O ratio of 0.57. Figure 1 (panel f) shows the results of the chemical calculations for CO at 1 Myr.
In addition to the fiducial disk model described above, we explore two more chemical models, (1) a model with an ISM-like cosmic ray ionization rate of ∼ 2 × 10 −17 s −1 (Webber 1998) and (2) a model with a higher initial C/O ratio (0.83) obtained by lowering the initial H 2 O abundance to 8 ×10 −6 .The latter model simulates water-rich ice sequestration motivated by Hogerheijde et al. (2011) and relatively high inferred C/O ratios in observed disk systems (Bergin et al. 2016;Cleeves et al. 2018;Miotello et al. 2019).

Machine learning approach
We aim to understand the complex web of dependencies between the physical environment of the disk and the resulting CO chemistry.Instead of taking a full 2D disk model as a single sample, we treat every (r,z) point location as a sample of the CO environment (Figure 1).As mentioned previously, CO abundance depends on many variables, however Figure 2 illustrates that while complex, the CO abundances vary smoothly across parameter space, allowing us to effectively apply ML to the CO problem.

ML data
The data for the regression model is made up of the grid of points representing the full disk (Figure 1).Each point has its own gas temperature (T gas ), gas density (n gas ), X-ray ionization rate (ζ xr ), cosmic ray ionization rate (ζ cr ), and UV flux (F uv ), which represent the independent variables, often called predictors or features in the field of ML.The CO abundances relative to hydrogen (CO/H) extracted at ∼ 1 Myr also vary across points and represent the dependent variable.

Preparing the data
We build the regression model using the Python libraries Scikit-learn (Pedregosa et al. 2011) designed for ML tasks and Statsmodels (Seabold & Perktold 2010), which is used for exploratory statistical analysis.During data cleaning, we excluded points where n gas or n CO = 0.After cleaning, the final number of data points is 3688 and their distribution in parameter space is shown in Figure 3. Since the data span many orders of magnitude in both independent and dependent variable values, we work in a base 10 logarithmic space.We randomly split the data set into a training set and a test set using the Scikit-learn function train test split.The training set is fed into the ML algorithm so that it can learn the trend.The test set simulates data that the trained model has never encountered and is used to test the model performance.Figure 2. CO abundance at ∼ 1 Myr as a function of different pairs of predictors.We note a smooth variation across parameter space for each plot, yet the relationship between CO and the entire set of predictors is complex.
We follow a 70/30 split: we use 70 % of the data for training and 30 % for testing.After splitting, we perform a feature scaling on the predictors/independent variables to ensure the coefficients of the regression model have the same units.This scaling step allows us to directly determine the significance of the predictors compared to each other.We use the robust scaling technique, defined as: where x i is the value of the predictor for the i th point, x ′ i the scaled value, median(x) the median value of the predictor, and Q 3 (x) − Q 1 (x) the interquartile range.This technique is robust against outliers and does not assume a normal distribution for the predictors.To prevent data leakage, which occurs when the test set information is available during the training phase, we exclusively use the training set to compute scaling parameters (median and interquartile range), which are subsequently used to scale the test set.

ML algorithms: multiple linear regression and polynomial regression
In this work, we use multiple linear regression and polynomial regression.A multiple linear regression is a model where the dependent variable Y and the predictors X i = {X 1 , X 2 , ..., X n } are related via: where β 0 is the intercept, β 1 , β 2 , ... β n the coefficients, and ϵ the error term.A polynomial regression has a similar form, except that it contains higher order terms and can contain cross-terms: For instance, in this work, Y corresponds to log 10 (CO abundance), while X 1 and X 2 correspond to the scaled log 10 (gas density) and log 10 (gas temperature).We feed the scaled training data to the ML algorithm, which searches for the best regression model, i.e., the model with the set of β values that minimize the error.We use the coefficients of the resulting model to determine the impact of the disk physical conditions on CO.

Evaluating model performance
We assess the performance of the model using the following metrics: the coefficient of determination or R 2 score, the root mean square error (RMSE), and the mean absolute error (MAE) defined as, where y i is the actual CO abundance (from the chemical model), ŷi the abundance predicted by the regression model, ȳ the mean of the actual abundances, and N the number of data points.The R 2 score ranges between 0 and 1 and describes the proportion of variation in the dependent variable that can be predicted from the independent variables.The closer the R 2 is to 1, the better the model.The RMSE and MAE are used to gauge the error of the model.To further evaluate the performance of the model, we visually examine residual plots and plots of the predicted vs actual CO abundances using the test set.Figure 4 summarizes our ML approach.

Molecular density is more robustly predicted than molecular abundance
For the multiple linear regression, we initially use T gas , n gas , ζ xr , ζ cr , and F uv (see Section 2.2.1) as our predictors and CO abundance (CO/H) as our dependent variable.We find that working with the abundance as the output challenges the ability of the algorithm to find a good linear model.This behavior is due to the fact that the chemical model naturally has a maximum value for the CO abundance at the limits of the C and/or O atomic budget and many CO-rich points have abundances close to this value.This aspect of the chemical .Flowchart summarizing our ML approach.We first apply a base 10 logarithmic transformation to the independent variables and the dependent variable since they span many orders of magnitude.The data are then split into a training set and a test set.The independent variables/predictors are then scaled to ensure that the coefficients of the regression model have the same units.The training data is fed to the ML algorithm, in this case, linear or polynomial regression.The performance of the resulting model is assessed using the metrics described in Section 2.2.4.We also visually examine residual plots and plots of the predicted (ML model) vs actual (chemical model) CO abundances after applying the trained ML algorithm to the test set.
model leads to the edge cluster in Figure 5 (panel A1).
To address this issue, we work instead in the smoother CO density space, i.e., abundance × hydrogen density, which is continuous and naturally removes the cluster as shown in Figure 5 (panels B1 and B2).The shift from CO abundance to CO density results in an improvement in the R 2 score, from 0.72 to 0.97.We note that this change causes CO density to be strongly predicted by the gas density, which must be taken into account in the interpretation of the coefficients of the ML model (see Section 4.1).

Even "simple" CO chemistry requires polynomial regression
While a linear model is generally preferred due to its simplicity, we find that it is not sufficiently accurate for predicting CO densities.Specifically, as shown in Figure 5 (panel B2), the residual plot reveals curvilinear trends, which suggests that the linear model does not fully capture the underlying CO behavior.
Instead, we find that a second degree polynomial provides a substantial improvement in our ability to produce accurate CO densities.Figure 6 (panels C1 and C2) shows the results of the polynomial model generated using the function PolynomialFeatures in Scikit-Learn.This model shows a good agreement between the predicted and actual CO densities and curvilinear trends are reduced.The predicted densities, especially between 10 −2.5 and 10 10 m −3 , match more closely the actual values in the polynomial model.The residual plot also shows that for most points, the predicted densities are off by 0.5 dex and the maximal deviation is ∼ 2 dex.Quantitative metrics likewise reveal that the polynomial is a better model: Table 2 shows that it has a slightly higher R 2 value as well as smaller RMSE and MAE values than the linear model.Thus, the seconddegree polynomial provides a more reliable description of CO chemistry in the disk model.

Simplified polynomial regression: how many terms are necessary?
The aim of this work is to create both an accurate and interpretable model that unveils the dependence of CO chemistry on the physical environment.To this end, we seek to analyze the coefficients of the terms in the ML model to determine the different processes affecting CO and whether these processes lead to its destruction or preservation.The polynomial model does capture the behavior of CO under the modeled disk conditions, but it is comprised of 20 terms, which makes interpretation challenging.Moreover, the terms suffer from multicollinearity, i.e., the independent variables/predictors are correlated with one another.Multicollinearity poses a challenge for our goal of building an explanatory model since it makes it difficult to isolate the impact of each predictor on CO without the influence of the remaining predictors.We thus reduce the number of terms in the polynomial to minimize the impact of multicollinearity and obtain a simplified model that is easier to interpret.
To do this, we use a combination of variation inflation factor analysis, centering, and feature importance scores from random forests (see Appendices A and B).
The results of the simplified polynomial model are shown in Figure 6 (panels D1 and D2).While the full polynomial model does have better a performance as shown by the plots (Figure 6), its slightly higher R 2 score and lower RMSE and MAE values (Table 2), the simplified model contains fewer terms, captures most of the underlying behavior of CO, and does not suffer as much from multicollinearity.Thus, we focus on this model for the remainder of the study.
The simplified model is made up of 10 terms with their coefficients shown in Figure 7.The dominant terms are log(n gas ) and [log(n gas )] 2 , which are partly influenced by the use of CO density (the product of CO/H and n gas ) instead of CO abundance.The remaining terms involve combinations of factors with T gas and n gas being the factors most commonly found in the cross-terms.The strongest correlation our analysis identifies is between CO and the gas density terms, log(n gas ) and [log(n gas )] 2 .This correlation is naturally expected since CO scales with H 2 in disks (Bergin & Williams 2018;Öberg & Bergin 2021).Our use of CO density as the dependent variable further strengthens this correlation.The negative relationship between CO density and [log(n gas )] 2 is at first glance surprising, but this term reflects the fact that denser regions tend to be colder and have higher degrees of UV extinction, resulting in CO freeze-out into ice.

Positive correlations
The terms with positive coefficients hint at processes that lead to the preservation of CO in the gas phase.The model contains four positive terms that we will label T1,  T2, T3, and T4 as follows: log(n gas ) × log(T gas ) (T1), log(T gas ) × log(F UV ) (T2), log(T gas ) × log(ζ cr ) (T3), and log(n gas ) × log(ζ xr ) (T4).T1 is picking up regions where n gas is sufficiently high to shield CO from UV radiation and T gas is high enough to prevent CO freeze-out.T2 corresponds to warm regions with moderate UV fields, which help maintain CO in the gas-phase.Since the cosmic ray rates do not vary substantially throughout the disk (Figure 1), T3 could reflect (1) warm regions where CO is in the gas phase and cosmic rays produce sufficient He + to keep CO from forming more complex species or (2) cool (≲ 20 K) midplane regions where cosmic rays induce non-thermal desorption (Hasegawa & Herbst 1993).Meanwhile, the positive correlation between CO and T4 is intriguing since high X-ray rates and densities should facilitate the formation of radicals that destroy CO.This term could be reflecting a zone with a sufficiently high X-ray ionization rate that destroys complex species that form out of CO, thus helping bring back CO to its original state.The negative terms log(T gas ) (T5), log(n gas )×log(ζ cr ) (T6), log(n gas )×log(F UV ) (T7), and log(T gas )×log(ζ xr ) (T8) highlight pathways that lead to CO destruction.T5 reveals a negative correlation between CO density and temperature, which could be highlighting the hot surface layers (T gas > 100 K) where CO is photodissociated.T6 captures regions where high densities and cosmic rays destroy CO: (1) regions where cosmic ray ionization leads to He + , which destroys CO gas and ultimately leads to the formation of hydrocarbons such as CH 4 (2) denser midplane regions where cosmic rays produce atomic hydrogen from the ionization of H 2 , which enables the hydrogenation of CO in the ice-phase, ultimately forming species like CH 3 
OH.T7 is picking up a region of higher density, which causes more frequent gasgrain collisions.With a moderate UV field present, radicals like OH and CH 3 can be efficiently formed, which can react with more volatile species like CO and convert them over time into more complex molecules that freeze out, such as CH 3 OH or CO 2 (Schwarz et al. 2018).Finally, T8 is picking up warm, irradiated regions which are relatively harsh environments where X-rays can be another source of He + .

What do we learn from the coefficients?
The coefficient analysis reveals that CO correlates strongly with the amount of gas, but it is impacted by multiple processes that can either enhance it or destroy it.Disregarding the gas density terms log(n gas ) and [log(n gas )] 2 for the reasons previously discussed, we note that the remaining negative terms have coefficients that outweigh the positive ones, suggesting a tendency toward CO depletion in the disk model.Our analysis also points to the importance of considering combinations of physical conditions, which can trigger or hinder different chemical processes, in sometimes counterintuitive ways.Considering combinations of physical parameters is especially important when conducting sensitivity analyses of molecular species.In addition, the cross-terms typically involve T gas and n gas , which confirms previous results suggesting that the gas temperature and gas density structures are crucial factors regulating CO chemistry.

What about the warm molecular layer?
Our analysis up to this point focuses on the entire disk model, but CO observations usually probe the warm molecular layer where the temperatures maintain CO in the gas-phase and are expected to reduce depletion.We investigate this by subsampling the locations (1792/3688 points) associated with the warm molecular layer (20 K < T gas < 100 K) and look for a model that captures CO chemistry in this region.We perform a multiple linear regression with T gas , n gas , ζ xr , ζ cr , and F uv as the predictors and note a strong correlation between the predicted and actual densities (Figure 8).There are a few outliers (21/538 samples in the test set) with residuals > 3 σ (σ = MAE), which explains the asymmetric residual map.Nevertheless, most of the residuals lie between -0.3 and 0.3 dex.
Within the small number of outliers, most of them lie at the edges of parameter space (Figure 9) and are located in the transition region between the atomic and molecular layers in the r-z plane.Some outliers also occupy hot regions, ∼ 100 K, where water begins to desorb, changing the oxygen chemistry in the gas and impacting CO.Both regions, i.e., the transition region where photodissociation starts to become important and the hot region with gas-phase water, undergo rapid chemical changes, where not many points exist to sample their behavior.Thus, it is not surprising the algorithm struggles to reproduce CO chemistry at these locations without additional data.
On the whole, the multiple linear regression model does well capturing CO chemistry in the warm molecu- Figure 9. Parameter space sampling for the warm molecular layer (purple) and the outliers in that region (orange; see Figure 8).The outliers occupy regions close to the edges of parameter space where the model is less well sampled.They coincide with highly photo-irradiated gas, where CO is undergoing a rapid transition via UV photodissociation or water is sublimating rapidly (near 100 K), and thus there are not many points sampling these regions.lar layer.This suggests that CO is more robust in this sub-region as compared to the full extent of the disk, which requires a polynomial model.Warmer temperatures (20 K < T gas < 100 K) thus seem to reduce the complexity of CO chemistry by hindering its chemical destruction.The implication of this finding is that predictive models of CO chemistry should be more accurate for warm regions (e.g., for disks orbiting more massive Herbig stars), compared to cooler regions including the disks of lower mass stars.

Additional model variations: cosmic ray ionization rate, initial C/O ratio, and age
We explore additional chemical models to investigate the impact of using an ISM-like cosmic ray ionization rate (∼ 10 −17 s −1 ), an increased initial C/O ratio (0.83 instead of 0.57), and different ages compared to 1 Myr (∼ 0.5, ∼ 1.5, and ∼ 2 Myr).We use the feature im-portance score from random forests to determine the importance of the terms in each model (see Appendix A for details).For all of these variations, the relative behaviors of the dominant terms do not change substantially.For all three models, i.e., the fiducial, high cosmic-ray, and high C/O models at 1 Myr, the dominant terms are the log(n gas ), [log(n gas )] 2 , and log(n gas ) × log(ζ cr ) terms (Figure 10).This consistent behavior implies that predicting CO chemistry is fairly robust even as disk physical conditions vary.

An ISM-like cosmic ray rate enhances CO depletion
For the high cosmic ray model, the cross-terms involving the radiation field and cosmic rays are more important than in the fiducial model.This can be explained by the fact that the higher rate leads to more H + 3 and He + ions, which destroy CO to ultimately form species such as CH 3 OH and CH 4 respectively (Schwarz et al. 2018(Schwarz et al. , 2019)).

A higher initial C/O enhances CO depletion
For the high C/O model, the same cross-terms as in the high cosmic ray model are more important than in the fiducial model.This effect could be due to the fact that the lower initial H 2 O abundance creates an oxygenpoor environment, which increases the chances of some of the carbon being locked up in hydrocarbons such as C 2 H instead of CO, which further reduces CO's ability to self-shield (Bergin et al. 2016;Miotello et al. 2019).With the reduced shielding, more CO can be dissociated by the radiation field, thus freeing more carbon atoms, which are needed for C 2 H formation.

CO depletion increases over time
Over the 2 Myr timescale, the term log(n gas ) × log(ζ cr ) becomes the most important term in the disk model (Figures 11) and the cross-terms involving cosmic rays and the radiation field become more important.This can be explained by the timescales needed to convert CO into species such as CO 2 , CH 4 , CH 3 OH (Re-boussin et al. 2015).Over time, the abundance of these species is enhanced because they will have had enough time to form thanks to the processing of CO, which usually involves the radiation field and cosmic rays.

Putting it all together: performance of ML approach
Our purely ML-driven approach leads to findings consistent with what we have learned about CO chemistry over the years, notably, (1) our fiducial model shows a tendency toward CO depletion, which is in agreement with the CO depletion trend commonly observed in disks (Ansdell et al. 2016;Long et al. 2017;Miotello et al. 2017;Yu et al. 2017), (2) warmer temperatures (20 K < T gas < 100 K) hinder CO processing (e.g., Reboussin et al. 2015;Schwarz et al. 2018;Bosman et al. 2018;Schwarz et al. 2019), (3) higher cosmic ray rates lead to increased CO depletion (e.g., Reboussin et al. 2015;Eistrup et al. 2016;Dodson-Robinson et al. 2018;Schwarz et al. 2018;Eistrup et al. 2018;Bosman et al. 2018;Schwarz et al. 2019), (4) CO depletion is enhanced over time (e.g., Reboussin et al. 2015;Schwarz et al. 2018;Zhang et al. 2020).ML techniques are thus a powerful complement to chemical modeling efforts.With simple algorithms like multiple linear and polynomial regression that fit our data within minutes, we have shown how ML can be exploited not only for predictions, but also for explanatory purposes.ML can handle complex multi-parameter models and the large amounts of resulting data in an efficient fashion.We believe with even more advanced techniques, we can further deepen our understanding of complex chemical processes.
It is important to note that the disk model used for this study presents some limitations.First of all, beyond the chemical evolution, our disk is static and does not account for changing physical conditions of the gas over time.Likewise, our model does not include dynamical processes such as radial drift, grain growth, and vertical mixing, which have been theoretically demonstrated to impact CO chemistry (Krijt et al. 2018(Krijt et al. , 2020)).Nevertheless, this approach is flexible enough to include additional parameters for more complex models in the future.One possible expansion of this work is to include dust-related parameters such as dust sizes, compositions, temperatures, since the dust affects the disk opacity and is vital for grain-surface chemistry.In addition, this approach can be expanded to study disks with different physical structures as well as other molecules of interest.We work here with the main CO isotopologue, 12 CO, but a similar approach can be taken for rarer isotopologues such as 13 CO and C 18 O.

CONCLUSIONS
In this work, we test the ability of machine learning (ML) to help us understand the interplay between chemistry and physical conditions in protoplanetary disks. Specifically, we investigate the impact of the physical conditions, i.e., gas density, gas temperature, cosmic-ray ionization rate, X-ray ionization rate, and UV flux, on CO chemistry by tackling the diversity of chemical environments within a single disk using linear and polynomial regression. We also use the feature importance scores from random forests to understand the impact of the cosmic-ray ionization rate, the initial C/O ratio, and the disk age. We summarize our findings below:
• Machine learning can be a highly efficient tool for understanding the behavior of complex multi-parameter models, such as astrochemical simulations.
• A linear model does not accurately capture the underlying CO response to the physical conditions in the disk model. We find that a polynomial model is required; however, the number of terms makes such a model challenging to interpret and subject to multicollinearity. Using a combination of techniques, we are able to simplify our model while maintaining its accuracy and making its meaning more transparent. The coefficient analysis reveals a tendency toward CO depletion in the disk and shows that combinations of physical parameters should be considered when conducting sensitivity analyses. Our analysis also reveals that the gas temperature and gas density are key factors influencing CO chemistry.
• CO chemistry in the warm molecular layer can be described with a multiple linear regression model, which suggests that the complex processes leading to CO depletion are reduced in this region.
• We note that an increased cosmic-ray ionization rate and an increased C/O ratio can both lead to further depletion. The former yields more He+ and H3+, which destroy CO and lead, respectively, to the formation of CH4 and CH3OH. The latter creates an oxygen-poor environment in which carbon is locked up in hydrocarbons such as C2H rather than in CO.
• Over time, the terms involving the radiation field and cosmic rays become more significant, suggesting that CO is destroyed through the build-up of species such as He+, H3+, and OH, which eventually contribute to the formation of CH4, CH3OH, and CO2.
Our ML-driven approach shows that CO chemistry is multifaceted, consistent with previous work. Thus, machine learning is a powerful tool that can complement our chemical modeling efforts by aiding the analysis of large data sets and the exploration of large parameter spaces.
We thank the anonymous referees for providing helpful comments. A.D. acknowledges support from the John F. Angle Fellowship. L.I.C. acknowledges support from the David and Lucile Packard Foundation, the Research Corporation for Scientific Advancement Cottrell Scholar Award, and NASA ATP 80NSSC20K0529. J. Pegues is supported by an STScI Postdoctoral Fellowship. The National Radio Astronomy Observatory is a facility of the National Science Foundation operated under cooperative agreement by Associated Universities, Inc.

APPENDIX
As described in Section 3.2, the full polynomial model accurately captures the relationship between the CO abundance and the physical conditions in the disk environment. However, this model contains many terms, which makes interpretation challenging, and it suffers from multicollinearity, meaning that the independent variables (predictors) are correlated with one another. Multicollinearity poses a challenge for our goal of building an explanatory model: it makes it difficult to isolate the impact of each predictor on CO without the influence of the remaining predictors. In this section, we explain how we reduce the number of terms in the polynomial model to facilitate interpretation and minimize multicollinearity. Specifically, we use random forests, centering, and a variance inflation factor analysis, which are discussed in more detail below. We ensure that the variance inflation factors of all terms in the ML model are below 10 to reduce the impact of multicollinearity (see Appendix B).

A. DECISION TREES AND RANDOM FORESTS
A decision tree is a non-parametric ML algorithm that owes its name to its tree-like structure, which is made up of nodes. The top node, where the first split is performed, is called the root node; the nodes within the tree are the internal nodes; and the terminal nodes, where predictions are made, are the leaf nodes, or leaves. To determine the condition for the root node, the algorithm runs through all features and calculates possible thresholds to find the split that minimizes the mean squared error (MSE). The best feature and threshold are then used for the first splitting condition, i.e., feature X < threshold. The split data set is then iteratively divided following a similar procedure until no further split can be performed. The prediction of a leaf is the average of the data points in that leaf. To predict the value of a new point, the point is passed through the tree until it reaches the leaf to which it belongs. Figure 12 shows an example of a decision tree. Decision trees are prone to overfitting, so random forests (RFs; Ho 1995, 1998; Breiman 2001), which overcome this issue by averaging the predictions of multiple trees, are used instead.
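As an illustration, the splitting procedure described above can be sketched with scikit-learn's DecisionTreeRegressor on synthetic data; the predictor stand-ins and threshold below are invented for the example, not taken from our disk model:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Invented stand-ins for log(n_gas) and log(T_gas), and a fake log(CO abundance)
# that drops once the density stand-in exceeds an (invented) threshold of 12.
X = np.column_stack([rng.uniform(6.0, 14.0, 70), rng.uniform(1.0, 3.0, 70)])
y = np.where(X[:, 0] < 12.0, -4.0, -6.0) + 0.1 * rng.normal(size=70)

tree = DecisionTreeRegressor(max_depth=2, random_state=0)  # shallow, for readability
tree.fit(X, y)  # each split is chosen to minimize the mean squared error

root_feature = tree.tree_.feature[0]      # feature used at the root node
root_threshold = tree.tree_.threshold[0]  # threshold of the root split
prediction = tree.predict([[11.0, 2.0]])  # new point descends to a leaf; the leaf mean is returned
```

Because the synthetic response depends almost entirely on the first column, the learned root split recovers that feature and a threshold near 12, mirroring the log(n_gas) < 12.085 split in Figure 12.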
A useful feature of RFs is the feature importance score, a dimensionless number between 0 and 1 that expresses how efficiently each predictor splits the parameter space in the RF. The higher the score, the more important the predictor. The scores of all predictors are normalized to sum to 1. We use these scores to determine the most important terms in our polynomial model in order to minimize multicollinearity, as well as to compare the fiducial disk model with the high cosmic-ray and high C/O models and with the models at different ages.
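A minimal sketch of extracting these scores, again on invented data in which the first column dominates the response by construction:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))  # three stand-in polynomial terms (hypothetical)
# Column 0 dominates the response, column 1 matters weakly, column 2 is pure noise.
y = 2.0 * X[:, 0] + 0.3 * X[:, 1] + 0.05 * rng.normal(size=300)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X, y)

scores = rf.feature_importances_  # dimensionless, normalized to sum to 1
```

The resulting scores sum to 1, and the dominant column receives the highest score, which is exactly the property we exploit when ranking polynomial terms.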

B. TACKLING MULTICOLLINEARITY
We search for multicollinearity by examining the variance inflation factor (VIF) of each predictor. The VIF is a dimensionless metric expressed as:

VIF_i = 1 / (1 − R²_i), (B1)

where R²_i is the R² score obtained when the i-th predictor is regressed on the other predictors. The higher the VIF of a predictor, the more that predictor is correlated with the other predictors. As a rule of thumb, a VIF of 1 indicates no correlation, a VIF between 1 and 5 indicates low correlation, a VIF between 5 and 10 indicates moderate correlation, and a VIF greater than 10 indicates serious multicollinearity. Our initial predictors (T_gas, n_gas, ζ_xr, ζ_cr, and F_uv) display negligible multicollinearity, since their VIFs are less than 5 (Table 3). However, once the predictors undergo the polynomial transformation, structural multicollinearity arises (Table 4). This multicollinearity is due to the creation of new predictors from existing predictors, e.g., generating X² from X, or the cross-term X1 X2 from X1 and X2. Multicollinearity is not an issue when the goal is to make predictions. However, since we want to isolate the contribution of each predictor, we need to minimize multicollinearity as much as possible. We first reduce it by centering the initial predictors (T_gas, n_gas, ζ_xr, ζ_cr, and F_uv). Centering consists of taking the original predictors, calculating their respective means, and subtracting those mean values from each data point. The polynomial transformation is then applied to the centered data points. Mathematically, centering is expressed as:

X̄_i = X_i − mean(X), (B2)

where centered terms in this work are denoted with an overbar, X̄. For example, for the gas temperature:

X̄ = log(T_gas) − mean[log(T_gas)], with X = log(T_gas). (B3)

Centering drastically reduces multicollinearity, as shown in Table 4. To reduce it further, we exploit the feature importance scores from random forests to find the most important terms in the polynomial. The resulting scores are shown in Figure 13. We then incorporate terms one by one
in the polynomial, following their importance scores in decreasing order, until the VIF of one of the predictors exceeds 10, indicating the presence of multicollinearity. We then keep only the terms from the previous step. The VIFs of the predictors in the simplified polynomial are shown in Table 5, which indicates that multicollinearity is minimized.

Figure 12. An example of a decision tree using 70 random points from our data. The top box represents the root node, the boxes in between the internal nodes, and the boxes at the bottom the leaf nodes (leaves). "Samples" is the number of points belonging to the node, "value" is the average of log(CO abundance) for the points in the node, and "mse" is the mean squared error, which the decision tree tries to minimize. The values in the leaves are the predictions. To determine the condition for the root node, the algorithm runs through all features and calculates possible thresholds to find the split that minimizes the MSE. In this example, the splitting condition is log(n_gas) < 12.085. Points that fulfill this condition go to the internal node on the left and those that do not go to the node on the right: 8 samples fulfill the condition and 62 do not. The same process is applied to the internal nodes until no further split can be performed. To predict the value of a new point, the point is passed through the tree until it reaches the leaf to which it belongs.
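The VIF definition and the effect of centering can be sketched in a few lines of numpy. The two offset predictors below are invented toy data; only the VIF formula and the centering step follow the text:

```python
import numpy as np

def vif(X):
    """Variance inflation factors for the columns of X (n_samples x n_features).
    VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing column i on the
    remaining columns (with an intercept); equivalently var(target)/var(residual)."""
    n, p = X.shape
    factors = np.empty(p)
    for i in range(p):
        target = X[:, i]
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        factors[i] = target.var() / resid.var()
    return factors

rng = np.random.default_rng(2)
x1 = 5.0 + rng.normal(size=500)  # offset predictors, like uncentered log quantities
x2 = 3.0 + rng.normal(size=500)

raw = np.column_stack([x1, x2, x1 * x2])       # cross-term built from raw predictors
c1, c2 = x1 - x1.mean(), x2 - x2.mean()        # centering, as in Eq. B2
centered = np.column_stack([c1, c2, c1 * c2])  # cross-term built from centered predictors

vif_raw, vif_centered = vif(raw), vif(centered)
# The raw cross-term is strongly collinear with x1 and x2 (VIF >> 10);
# centering before forming the cross-term removes this structural multicollinearity.
```

This reproduces the qualitative behavior of Tables 3 and 4: the cross-term built from offset predictors has a VIF well above 10, while after centering all VIFs fall close to 1.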

Figure 1. Physical structure and CO abundance as a function of (r, z) for the fiducial disk model. The disk extends radially from 0.2 au to 100 au and vertically from 0 to 60 au. The CO abundance evolves over time, while the physical structure remains static. All quantities are shown on a logarithmic scale (base 10). (a) Gas density. (b) Gas temperature. (c) Cosmic-ray ionization rate; note that the disk is largely transparent to cosmic rays, except within the inner few au. (d) X-ray ionization rate. (e) UV flux. (f) CO abundance at ∼1 Myr obtained from the chemical calculations. The abundances are expressed relative to the total amount of hydrogen, i.e., H + 2H2. Most locations have abundances around 10^−4 per H atom.

Figure 3. Parameter space sampling for the independent variables: gas temperature, gas density, cosmic-ray ionization rate, UV flux, and X-ray ionization rate. Apart from the cosmic-ray ionization rate, the parameter space samples multiple orders of magnitude in the values of these physical quantities.

Figure 4. Flowchart summarizing our ML approach. We first apply a base-10 logarithmic transformation to the independent variables and the dependent variable, since they span many orders of magnitude. The data are then split into a training set and a test set. The independent variables (predictors) are then scaled to ensure that the coefficients of the regression model have the same units. The training data are fed to the ML algorithm, in this case linear or polynomial regression. The performance of the resulting model is assessed using the metrics described in Section 2.2.4. We also visually examine residual plots and plots of the predicted (ML model) vs. actual (chemical model) CO abundances after applying the trained ML algorithm to the test set.
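The workflow summarized in Figure 4 can be sketched with scikit-learn on synthetic data; the quantities, ranges, and coefficients below are invented for illustration, not taken from the disk model:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(3)
n_gas = 10 ** rng.uniform(6, 14, size=400)    # spans many orders of magnitude
T_gas = 10 ** rng.uniform(1, 3, size=400)
X = np.log10(np.column_stack([n_gas, T_gas]))  # base-10 logarithmic transformation
# Fake log(CO density) with a cross-term, so a degree-2 polynomial is the right model.
y = -4.0 + 0.2 * X[:, 0] - 0.05 * X[:, 0] * X[:, 1] + 0.01 * rng.normal(size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling puts the regression coefficients on a common footing before the fit.
model = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_train, y_train)

r2 = model.score(X_test, y_test)            # R^2 on the held-out test set
residuals = y_test - model.predict(X_test)  # inspected visually in the residual plots
```

Since the synthetic response is exactly quadratic plus small noise, the pipeline recovers it with an R² close to 1 on the test set; on the real disk data the same assessment is made with the metrics of Section 2.2.4 and residual plots.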

Figure 5. (A) Results of the multiple linear regression in abundance space using the test set. A1: Predicted (ML model) vs. actual (chemical model) CO abundances on a log-log plot. The purple line indicates the one-to-one correspondence. The black rectangle shows the region where there is a cluster of points due to the cap on the CO abundance in the chemical model. A2: Residuals vs. predicted CO abundances. The residuals are defined as the difference between log(actual CO abundance) and log(predicted CO abundance). The black arrow shows the linear trend due to the cluster. The dashed purple lines indicate a difference of 1 dex between the actual and predicted abundances. (B) Same as panel (A), except that CO densities are used. The residuals are defined as the difference between log(actual CO density) and log(predicted CO density). Using densities removes the cluster, which significantly improves the quality of the linear fit. However, some curvilinear trends (indicated by black arrows) are noted in the residual plot (B2); thus, a more complex model is needed.

4.1. Understanding CO chemistry using the coefficients of the ML model
Our simplified polynomial model encodes a substantial amount of information about the factors affecting CO. Its terms reveal which processes are at play and whether they destroy (negative terms) or preserve (positive terms) CO. Here, we unpack the information contained in each term.
4.1.1. Dominant terms: log(n_gas) and [log(n_gas)]^2

Figure 6. (C) Results of the polynomial regression using the test set. The full polynomial model fits the data better than the linear model, and the curvilinear trends in the residuals are reduced. (D) Results of the simplified polynomial. The full polynomial matches the data better, but the simplified polynomial facilitates interpretation and, containing fewer terms, does not suffer as much from multicollinearity.

Figure 7. Coefficients of the simplified polynomial with uncertainties calculated by statsmodels.

Figure 8. Results of the multiple linear regression on the warm molecular layer (20 K < T_gas < 100 K) using the test set. The outliers (21/538 points) with residuals > 3σ (σ = MAE) are colored in black. Most points lie along the line of one-to-one correspondence and have residuals between −0.3 and 0.3 dex, as indicated by the dashed purple lines in the residual plot. This indicates that the linear model captures CO chemistry in the warm molecular layer.

Figure 10. Feature importance scores for the fiducial model, the high cosmic-ray model (∼2 × 10^−17 s^−1), and the high C/O model (0.83) at ∼1 Myr. The three models have the same dominant terms (log(n_gas), [log(n_gas)]^2, and log(n_gas) × log(ζ_cr)), but the cross-terms involving the radiation field and the cosmic rays are more important in the high cosmic-ray and high C/O models.

Figure 11. Feature importance scores for the fiducial model at different ages. Over time, the term log(n_gas) × log(ζ_cr) starts to dominate, and cross-terms involving the radiation field and cosmic rays become more important.

Figure 13. Importance scores for the full polynomial model.

Table 1. Initial abundances (relative to hydrogen) for the disk chemical models. Ices are indicated by the notation gr.

Table 2. Metrics for the linear, full polynomial, and simplified polynomial models of the full disk, and for the linear model of the warm molecular layer. The full polynomial model is a better fit than the linear model, with a slightly higher R^2 score and smaller RMSE and MAE values. The full polynomial is also a better fit than the simplified polynomial, again with a slightly higher R^2 score and lower RMSE and MAE values; however, the latter does not suffer as much from multicollinearity and is easier to interpret.

Table 3. Initial predictors and their corresponding variance inflation factors.

Table 4. Variance inflation factors for the raw and centered predictors in the polynomial model. X̄ denotes the centered predictors. T1–T8 denote the terms in the simplified polynomial model.