The ZTF Source Classification Project: III. A Catalog of Variable Sources

The classification of variable objects provides insight into a wide variety of astrophysics ranging from stellar interiors to galactic nuclei. The Zwicky Transient Facility (ZTF) provides time series observations that record the variability of more than a billion sources. The scale of these data necessitates automated approaches to make a thorough analysis. Building on previous work, this paper reports the results of the ZTF Source Classification Project (SCoPe), which trains neural network and XGBoost machine learning (ML) algorithms to perform dichotomous classification of variable ZTF sources using a manually constructed training set containing 170,632 light curves. We find that several classifiers achieve high precision and recall scores, suggesting the reliability of their predictions for 209,991,147 light curves across 77 ZTF fields. We also identify the most important features for XGB classification and compare the performance of the two ML algorithms, finding a pattern of higher precision among XGB classifiers. The resulting classification catalog is available to the public, and the software developed for SCoPe is open-source and adaptable to future time-domain surveys.


INTRODUCTION
Studying the variability of astronomical objects offers valuable insight into several open astrophysical questions, including the nature of stellar interiors (e.g. Goupil et al. 2013), dynamical interactions (e.g. Borkovits 2022), magnetic fields (e.g. Fabbian et al. 2017), cosmic distances (e.g. Fukugita et al. 1993), and galactic nuclei (e.g. Ulrich et al. 1997). Surveys such as ASAS (Pojmanski 2002), NSVS (Woźniak et al. 2004; Hoffman et al. 2009), PTF (Law et al. 2009), the Catalina Surveys (Drake et al. 2014, 2017), and ASAS-SN (Kochanek et al. 2017; Jayasinghe et al. 2018) have provided photometric time series data for millions of sources, facilitating the classification of variables and subsequent analyses. Over time, surveys have been further optimized for these purposes, covering more sky, reaching fainter limiting magnitudes, and observing at a faster cadence than their predecessors.
One such survey, the Zwicky Transient Facility (ZTF; Bellm et al. 2019; Graham et al. 2019; Masci et al. 2019; Dekany et al. 2020), covers the full observable sky from Palomar Mountain every two nights, providing time-series data for billions of sources. The survey's latest data release (DR20, spanning 2018 March–2023 October) includes 53.5 million single-exposure images yielding 4.75 billion light curves spanning g, r, and i bands. These data offer valuable insights into time-domain astronomy but also present unique challenges associated with their vast scale. While some manual studies have been performed for specific kinds of objects (e.g. van Roestel et al. 2021a), a fully manual analysis of a survey this size demands unrealistic human resources.
Any large-scale analysis therefore requires an automated approach (e.g. Huijse et al. 2014; Sen et al. 2022). One such approach involves the use of machine-learning (ML) algorithms at a survey-level scale (Mahabal et al. 2019). ML can be used to direct follow-up observations based on the latest survey data (e.g. Sravan et al. 2023). ML techniques have also been applied to large surveys to classify different kinds of sources, including microlensing events (e.g. Godines et al. 2019), transients (e.g. Stachie et al. 2020; Gomez et al. 2023; Rehemtulla & Miller 2023), and variables (e.g. Richards et al. 2011; García-Jara et al. 2022; Mistry et al. 2022).
The major variable source classification effort within ZTF is the Source Classification Project (SCoPe). Beginning with a set of sources built from existing catalogs and light-curve classifications from human experts, SCoPe maps light-curve data to standardized features, trains many binary classifiers using two supervised ML algorithms, and runs inference on unclassified sources with the goal of producing a variable source catalog for ZTF. These efforts have produced intermediate results reported in previous papers: van Roestel et al. (2021b, hereafter Paper I) established the framework to train ML algorithms and run inference on an earlier ZTF data release, and Coughlin et al. (2021, hereafter Paper II) set the light-curve variability metrics and period-finding algorithms that make up the features input to ML algorithms.
In this paper, we build on the foundation of these previous works to construct a publicly available ZTF variable source catalog containing ML classifications. Section 2 outlines the classification workflow and enumerates the standardized features generated from ZTF light curves. Section 3 describes the classification taxonomy and the two ML algorithms that label ZTF sources. Section 4 shares classifier performance and describes the resulting catalog of ZTF variables. We discuss these results in Section 5 before concluding in Section 6.

FEATURE GENERATION
Figure 1 depicts the SCoPe workflow that produces publicly available classifications starting with ZTF light curves on a kowalski MongoDB database (Duev et al. 2019). The scope-ml code underlying this workflow is available open source on GitHub and is published on PyPI. Time-series data often have characteristics that make them unsuitable to be directly input to ML algorithms. These characteristics include multiple sampling rates, small gaps in the data due to changes in nightly observing conditions, and larger gaps due to a source's varying observability from Earth throughout the year (see Section 2.1 of Paper I). These light-curve qualities do not inform the astrophysical nature of sources but may still be learned by an ML classifier. To minimize the contribution of the above characteristics to source classification, we map each ZTF light curve to a set of standardized features to be input to our ML algorithms. This section describes in further detail these features and the process of their generation (covering all parts of the workflow in Figure 1 leading to and including "Features").

Selecting ZTF Sources
We worked with ZTF DR16, which contains observations between 2018 March and 2023 January. The feature generation process begins with a query of either user-specified ZTF IDs or all ZTF sources within a single quadrant. This divides the process into numerous parallelizable portions (four quadrants per CCD, 16 CCDs per field). We utilized the SDSC Expanse cluster to run many instances of our script in parallel.

Identifying Close, Bright Sources with Gaia
Prior to computing features, we removed sources whose light curves may be influenced by nearby bright stars by querying Gaia EDR3 (Gaia Collaboration et al. 2016, 2021). We searched within a 300" radius around each ZTF source, corresponding to the maximum separation that produces a light-curve artifact for sources with Tycho B magnitudes of B < 13. Applying an empirical formula, we flagged ZTF sources for exclusion if we found neighboring stars bright enough to influence the source's light curve. We then queried light-curve data for the remaining sources.

Light-curve Features and External Catalog Data
We dropped all points from these light curves containing nonzero ZTF catflags (suggesting suboptimal data quality), and we subsequently enforced a 50-epoch minimum. For light curves meeting this requirement, we began by generating the basic statistics summarized in Table 1. We also mapped each light curve to two-dimensional histograms showing the change in magnitude and time between each pair of points (dmdt; see Section 2.2.3 of Paper I and Mahabal et al. 2017 for more details). We supplemented these features with 2" cross-match queries to ZTF Alerts and external catalogs. Additional features from these queries include the number and mean BRAAI of ZTF Alerts for each source (Duev et al. 2019) and a combination of magnitudes, errors, and parallax values from AllWISE (Wright et al. 2010; Cutri et al. 2021), Pan-STARRS1 (PS1; Kaiser et al. 2002; Chambers et al. 2016), and Gaia EDR3 (Table 2).
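As an illustration of the dmdt mapping, the following minimal sketch counts the magnitude and time differences between every pair of epochs in a two-dimensional histogram. The bin edges and the unit-sum normalization are assumptions made for this example only; the binning actually used by SCoPe is described in Paper I and Mahabal et al. (2017) and is not reproduced here.

```python
from bisect import bisect_right
from itertools import combinations

# Hypothetical bin edges, chosen only for illustration.
DT_EDGES = [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]      # days
DM_EDGES = [-2.0, -0.5, -0.1, 0.0, 0.1, 0.5, 2.0]    # mag


def dmdt_histogram(times, mags, dt_edges=DT_EDGES, dm_edges=DM_EDGES):
    """Histogram (dm, dt) over all pairs of epochs, normalized to unit sum."""
    n_dm, n_dt = len(dm_edges) - 1, len(dt_edges) - 1
    hist = [[0.0] * n_dt for _ in range(n_dm)]
    # Sort epochs in time so that every pair has dt >= 0.
    order = sorted(range(len(times)), key=lambda k: times[k])
    total = 0
    for i, j in combinations(order, 2):
        dt = times[j] - times[i]
        dm = mags[j] - mags[i]
        col = bisect_right(dt_edges, dt) - 1
        row = bisect_right(dm_edges, dm) - 1
        # Pairs falling outside the binned range are ignored.
        if 0 <= col < n_dt and 0 <= row < n_dm:
            hist[row][col] += 1
            total += 1
    if total:
        hist = [[count / total for count in r] for r in hist]
    return hist
```

Because the histogram has a fixed shape regardless of the number of epochs, it can be passed to an image-style network branch even though the underlying light curves are irregularly sampled.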

Period Finding and Fourier Features
We continued by running three period-finding algorithms on each light curve. These algorithms use GPU-accelerated implementations of Lomb-Scargle (LS; Lomb 1976; Scargle 1982), conditional entropy (CE; Graham et al. 2013), and analysis of variance (AOV; Schwarzenberg-Czerny 1998) methods to determine periods and associated significance values. The three algorithms ran on a grid of periods between 30 minutes and half the longest time baseline from among each batch of 1000 light curves. We excluded period ranges associated with common aliases in ground-based data, including several multiples of 1 day and 1 yr, along with a ∼30-day period for the Moon's orbit (see Paper II; Kramer et al. 2023). For each algorithm, the period with the highest significance value is reported as that algorithm's associated period feature.
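The grid construction and alias exclusion described above can be sketched as follows. This is a minimal illustration under stated assumptions: the frequency-uniform grid, the fractional width of the excluded windows, and the specific alias centers are hypothetical, and `significance` stands in for any of the LS, CE, or AOV statistics rather than reproducing the GPU implementations.

```python
DAY = 1.0
YEAR = 365.25

# Assumed alias centers (days): a few multiples of 1 day and 1 yr, plus ~30 days.
ALIASES = [0.5 * DAY, DAY, 2 * DAY, 30.0, 0.5 * YEAR, YEAR]


def period_grid(p_min, p_max, n):
    """Trial periods (days), evenly spaced in frequency from 1/p_max to 1/p_min."""
    f_lo, f_hi = 1.0 / p_max, 1.0 / p_min
    step = (f_hi - f_lo) / (n - 1)
    return [1.0 / (f_lo + k * step) for k in range(n)]


def is_alias(period, centers=ALIASES, frac_width=0.01):
    """True if the period lies within a fractional window around any alias center."""
    return any(abs(period - c) <= frac_width * c for c in centers)


def best_period(periods, significance, centers=ALIASES):
    """Return (period, significance) with the highest significance outside alias windows."""
    kept = [p for p in periods if not is_alias(p, centers)]
    return max(((p, significance(p)) for p in kept), key=lambda ps: ps[1])
```

For example, `period_grid(30 / 1440, 100.0, 5)` spans 30 minutes to 100 days, and a strong peak at exactly 1 day is skipped in favor of the best period outside the excluded windows.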
We also applied a fourth algorithmic approach (ELS_ECE_EAOV) that nested the previous three algorithms. For each light curve, we used the full AOV results to determine the significance values associated with periods having the highest 50 LS and CE significance values. We then selected the period with highest AOV significance from this subset of 100 values. Using the resulting periods from all four algorithms, we generated additional features from the parameters of Fourier series fits (from zeroth to fifth order) to each light curve (see Eq. 1 of Paper I). We chose the ELS_ECE_EAOV algorithm to source the single set of Fourier features and periods input to the ML algorithms and reported with classification predictions, since ELS_ECE_EAOV combines the results of each individual algorithm.

Table 1 (excerpt)
Feature name           Definition
[...]                  Zero-phase of best-fitting series (Fourier analysis)
f1_power               Normalized $\chi^2$ of best-fitting series (Fourier analysis)
f1_relamp1             Relative amplitude, first harmonic (Fourier analysis)
f1_relamp2             Relative amplitude, second harmonic (Fourier analysis)
f1_relamp3             Relative amplitude, third harmonic (Fourier analysis)
f1_relamp4             Relative amplitude, fourth harmonic (Fourier analysis)
f1_relphi1             Relative phase, first harmonic (Fourier analysis)
f1_relphi2             Relative phase, second harmonic (Fourier analysis)
f1_relphi3             Relative phase, third harmonic (Fourier analysis)
f1_relphi4             Relative phase, fourth harmonic (Fourier analysis)
i60r                   Mag ratio between 20th, 80th percentiles
i70r                   Mag ratio between 15th, 85th percentiles
i80r                   Mag ratio between 10th, 90th percentiles
i90r                   Mag ratio between 5th, 95th percentiles
inv_vonneumannratio    Inverse of von Neumann ratio (von Neumann 1941, 1942)

The goal of SCoPe is to use ML algorithms to reliably classify each ZTF source with as much detail as possible. The attainable quality of classifications varies across the broad range of ZTF sources. Factors that can affect the detail of source classifications include the quantity and quality of the data, the similarity of the training set to the source in question, and the existence of new kinds of variable sources in the data. With this in mind, we adopt two taxonomies that contain the labels we use to classify ZTF sources.
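Returning briefly to the nested period search of Section 2.4, the selection step can be sketched as below. The function and argument names are hypothetical, and the sketch assumes that all three significance arrays are evaluated on the same period grid and that larger values always indicate a more significant period. If the top LS and CE candidates overlap, the candidate set in this sketch contains fewer than 100 periods.

```python
def top_k_indices(values, k):
    """Indices of the k largest values."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]


def nested_best_period(periods, ls_sig, ce_sig, aov_sig, k=50):
    """Period with the highest AOV significance among the top-k LS and top-k CE candidates."""
    candidates = set(top_k_indices(ls_sig, k)) | set(top_k_indices(ce_sig, k))
    best = max(candidates, key=lambda i: aov_sig[i])
    return periods[best]
```

Note that a period with a high AOV significance is never selected unless LS or CE also ranks it among its own top candidates, which is the sense in which the approach nests the three algorithms.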

Ontological and Phenomenological Taxonomies
The first taxonomy is ontological (Figure 2) and contains specific kinds of astrophysical sources (see Table 3 for the ontological labels, training set abbreviations, and definitions, ordered from low to high detail). This list aims to include as many kinds of objects as feasible for expert classification review (see Section 3.4.3). When training ontological classifiers, we input the full set of features (Tables 1 and 2) to the ML algorithms.
In consideration of the value of having some information about a source (even if not a definitive ontological classification), we also employed a phenomenological taxonomy (Figure 3) with labels that describe light-curve-based features. Classifications with (p) in their definition in Table 3 denote the phenomenological labels with their training set abbreviations and definitions. Phenomenological classifiers were trained on the phenomenological subset of features (Table 1) to ensure that their classification results depended only on ZTF light curves.

Dichotomous Classifiers
We trained independent dichotomous classifiers for labels in these taxonomies having more than 50 positive examples.The choice of dichotomous classifiers allows more than one label to be assigned to a source, often with varying levels of detail.This is important not only because of the practical challenges outlined above but also because some sources merit more than one classification (e.g. an eclipsing binary system containing a flaring star).The independence of dichotomous classifiers allows for future updates to the taxonomies without a revision of the current results from each existing classifier.Although dichotomous classifiers each only consider one label, we used the hierarchical structure of our taxonomies to assist in filling missing labels before training (see Figures 2 and 3 along with Section 3.4.2).

ML Algorithms
We employed a convolutional/dense neural network (hereafter DNN; e.g. LeCun et al. 2015) and XGBoost gradient-boosted decision trees (XGB; Chen & Guestrin 2016) to perform classification (leading to "Trained models" in Figure 1). We applied a probability threshold on the input classification probabilities to determine whether to treat each source as a positive or negative example for training. A threshold that is too high could impede a classifier's ability to generalize, while too low a threshold could include too many false positives. We chose to set the threshold at 0.7 for all classifiers, thus treating moderate-to-high-confidence probabilities as positive examples during training.
Both algorithms initially performed regression to minimize a binary cross-entropy loss function, assigning a classification probability ranging between 0 and 1 for each source. We again used a probability threshold of 0.7 to map predicted classifications for training sources in order to quantify true and false positives/negatives. When a source was associated with multiple ZTF light curves, the same classification probabilities were assigned to each one. We then trained classifiers on the collection of light curves. We trained classifiers on a mix of ZTF bands in an effort to make them robust to the systematics between light curves in each band.
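The loss and thresholding steps above can be made concrete with a short sketch. The 0.7 threshold comes from the text; the helper names, the clipping constant, and the strict inequality at the threshold are assumptions of this example.

```python
import math

THRESHOLD = 0.7  # probability threshold stated in the text


def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy; eps clips probabilities away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def to_labels(probabilities, threshold=THRESHOLD):
    """Map probabilities to 0/1 labels (positive if probability > threshold)."""
    return [1 if p > threshold else 0 for p in probabilities]


def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn
```

The same `to_labels` mapping serves both purposes described in the text: binarizing the input training probabilities and binarizing the predicted probabilities before counting true and false positives/negatives.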

Deep Neural Network (DNN)
Neural networks map input to output using connected layers of artificial "neurons" inspired by their biological counterparts; each neuron performs a linear transformation of its input followed by a nonlinear activation function. The output from one neuron becomes the input of the next, and network training occurs via back-propagation of the loss function gradient through each possible path in the network. This process optimizes the weight and bias values the network's neurons use for linear transformations.
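A forward pass through a very small network illustrates the linear-transformation-plus-activation structure described above. The weights, biases, layer sizes, and the choice of a ReLU hidden activation are arbitrary values assumed for this sketch; they are not taken from the SCoPe models.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """One dense layer: activation(W x + b), with W given as a list of rows."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(features):
    """Two inputs -> two hidden neurons (ReLU) -> one output probability (sigmoid)."""
    hidden = dense(features, [[0.5, -1.0], [1.5, 0.25]], [0.0, -0.5], relu)
    (prob,) = dense(hidden, [[1.0, -2.0]], [0.1], sigmoid)
    return prob
```

Training would adjust the numbers hard-coded in `forward` by back-propagating the gradient of the loss; only the inference direction is shown here.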
The SCoPe DNN algorithm features two branches built using tensorflow with the keras API: the first is a series of fully connected dense layers interspersed with dropout layers. This branch receives all features for a given classifier except dmdt. Each dropout layer randomly sets a fraction of inputs to zero in the subsequent layer. This regularization process helps prevent overfitting, wherein the network only learns specific details of the training set at the expense of performance on unseen data.
The other DNN branch convolves a kernel function with the 2D dmdt histograms. This process, commonly used for image analysis tasks, generates a set of "features" for dmdt. Dropout and pooling layers provide regularization for this branch. The outputs of both branches are concatenated and fed through one more set of dropout and dense layers before passing through a sigmoid activation function to provide continuous outputs between 0 and 1. These outputs correspond to one classification probability per source passed into the network. Figure 4 shows a graph of DNN branches and layers.

Gradient-boosted Decision Trees (XGB)
The XGB algorithm takes a different approach to classification based on a collection of decision trees. Each tree makes splits in feature space in order to optimally separate positive and negative training examples. Instead of aggregating these results (a "bagging" approach that introduces regularization), XGB computes the gradient of the loss function with respect to the trees' predictions and uses this information to inform another round of training (a "boosting" approach). Boosting offers high performance while increasing the chance of overfitting.
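The boosting idea can be illustrated with a deliberately simplified sketch: plain first-order gradient boosting on the binary cross-entropy loss, using one-feature regression stumps as the trees. This is not the XGBoost algorithm itself, which also uses second-order gradient information and explicit regularization; the function names and the values of `rounds` and `eta` are assumptions of this example, and the feature is assumed to take at least two distinct values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_stump(xs, residuals):
    """Least-squares regression stump on one feature: (threshold, left_value, right_value)."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not right:  # a split must leave points on both sides
            continue
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - (lv if x <= t else rv)) ** 2 for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]


def boost(xs, ys, rounds=20, eta=0.3):
    """Fit a sequence of stumps, each to the negative loss gradient of the current model."""
    scores = [0.0] * len(xs)  # raw log-odds predictions
    stumps = []
    for _ in range(rounds):
        # Negative gradient of binary cross-entropy with respect to the scores.
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        scores = [s + eta * (lv if x <= t else rv) for x, s in zip(xs, scores)]
    return stumps


def predict_proba(stumps, x, eta=0.3):
    """Classification probability from the summed, shrunken stump outputs."""
    return sigmoid(sum(eta * (lv if x <= t else rv) for t, lv, rv in stumps))
```

Each round corrects the residual errors of the previous rounds, which is why additional rounds keep improving the fit to the training set and why some form of early stopping or regularization is needed to limit overfitting.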
The splits in the XGB tree-based approach facilitate a straightforward interpretation of the connection between input features and resulting classifications. As a result, the importance of each input feature in determining a classification is provided by each XGB classifier. This contrasts with DNN classifiers, whose hidden layer transformations preclude the same kind of interpretation without significant added computational expense. Phenomenological and ontological XGB classifiers were given the same feature sets as DNN classifiers except for the exclusion of the dmdt histograms.

Imputing Missing Features
The above algorithms cannot process missing features, so we adopted a heterogeneous feature imputation strategy. This strategy arose from a feature-by-feature consideration of how to appropriately fill the different kinds of missing quantities in our data. We performed no imputation for any phenomenological features, instead excluding the single light curve in the training set that was missing any of these features. We imputed zero for n_ztf_alerts and mean_ztf_alert_braai. We imputed the median for uncertainties in survey magnitude and Gaia parallax. Finally, we used K-nearest neighbor regression to impute missing magnitudes and parallax values.
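The three imputation strategies can be sketched in a few lines. Missing values are represented by `None`; the number of neighbors, the Euclidean distance, and the assumption that the predictor columns are themselves complete are choices made for this illustration and are not specified in the text.

```python
import math
from statistics import median


def impute_constant(values, fill=0.0):
    """Replace missing entries with a constant (zero for alert-based features)."""
    return [fill if v is None else v for v in values]


def impute_median(values):
    """Replace missing entries with the median of the observed entries."""
    med = median(v for v in values if v is not None)
    return [med if v is None else v for v in values]


def impute_knn(target, predictors, k=3):
    """Fill missing targets with the mean target of the k nearest complete rows.

    Distances are Euclidean in the predictor columns, assumed to have no gaps.
    """
    known = [(row, t) for row, t in zip(predictors, target) if t is not None]
    filled = []
    for row, t in zip(predictors, target):
        if t is None:
            nearest = sorted(known, key=lambda kt: math.dist(kt[0], row))[:k]
            t = sum(val for _, val in nearest) / len(nearest)
        filled.append(t)
    return filled
```

Choosing the strategy per feature reflects what a missing value means: no alerts is naturally zero, a missing uncertainty is best replaced by a typical value, and a missing magnitude or parallax can be estimated from sources with similar measured properties.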

Assigning Upstream Labels
It is possible for the labeling process to produce an incomplete list of classifications for a source. For example, a source may be confidently labeled as periodic without having been labeled as variable. This will provide an incorrect input to the variable classifier, since this example source that displays periodic variability will be treated as nonvariable during training. To address this issue, we used the hierarchy of labels in each taxonomy (Figures 2 and 3) to enforce that any labels upstream of a manual classification must have at least the same probability as the manual classification.
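The upstream-label rule amounts to walking up the hierarchy from each labeled class and raising ancestor probabilities where needed. In the sketch below, the `PARENT` mapping is a hypothetical three-level fragment with descriptive label names; the actual hierarchies are those of Figures 2 and 3.

```python
# Hypothetical fragment of a label hierarchy: child -> parent.
PARENT = {"periodic": "variable", "eclipsing": "periodic"}


def propagate_upstream(probs, parent=PARENT):
    """Raise each ancestor's probability to at least that of its descendants."""
    out = dict(probs)
    for label, p in probs.items():
        node = label
        while node in parent:
            node = parent[node]
            out[node] = max(out.get(node, 0.0), p)
    return out
```

For instance, a source labeled only as eclipsing with probability 0.9 ends up with probabilities of at least 0.9 for both the periodic and variable labels, so it is no longer treated as a negative example by those classifiers.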

Active Learning
We ran several rounds of algorithm training over the course of the project in an effort to increase the quantity and quality of our training set. After each round, we ran inference on unclassified sources using the trained classifiers ("Inference" in Figure 1). The predictions for each light curve were aggregated using the mean classification probabilities of all light curves sharing the same Gaia, AllWISE, or PS1 survey ID. This aggregation produced one set of predictions per ZTF source. We selected a subset of sources having at least one high-confidence probability (> 0.9) among the labels in our taxonomies. We used a hosted instance of the SkyPortal data platform (fritz; van der Walt et al. 2019; Coughlin et al. 2023) to visualize the light curves of this subset of sources along with any associated ML classifications having probability > 0.7. Including these moderate-confidence classifications allowed an evaluation of the classifier's decision-making threshold in addition to its highest-confidence results.
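The aggregation and selection steps can be sketched as follows. The thresholds of 0.9 and 0.7 come from the text; the data layout (one dictionary of label probabilities per light curve, keyed by a shared survey ID) and the function names are assumptions of this example, as is the simplification that every light curve carries a probability for every label.

```python
from collections import defaultdict


def aggregate_by_source(rows):
    """rows: iterable of (source_id, {label: probability}) per light curve.

    Returns the mean probability per label for each source.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for source_id, probs in rows:
        counts[source_id] += 1
        for label, p in probs.items():
            sums[source_id][label] += p
    return {sid: {label: s / counts[sid] for label, s in labels.items()}
            for sid, labels in sums.items()}


def select_for_review(sources, select_above=0.9, show_above=0.7):
    """Keep sources with any probability > select_above; show labels > show_above."""
    return {sid: {label: p for label, p in probs.items() if p > show_above}
            for sid, probs in sources.items()
            if any(p > select_above for p in probs.values())}
```

Using a lower display threshold than selection threshold means that a reviewer sees the moderate-confidence labels attached to each selected source, not only the label that triggered the selection.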
These sources were subjected to a round of human review to evaluate the classifier predictions and revise them as appropriate ("Active learning" in Figure 1). Reviewers could vote "up" (+1) or "down" (-1) on each classification, leave no vote (0), and add any labels thought to be missing. Figure 5 shows an example of the fritz interface for voting and labeling sources.
Once this process was complete, we identified all sources reviewed by at least one human and summed the numerical votes to determine which classifications to keep. If the vote sum for a classification was negative, we removed it from the source. Otherwise, we added the source and its remaining classifications to the existing training set. In this way, we could increase the number of manually labeled examples provided for classifier training without requiring a random search of current predictions.
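The vote tally reduces to a sum per classification. The sketch below assumes a hypothetical layout in which each label maps to the list of reviewer votes it received; labels with no votes sum to zero and are therefore kept.

```python
def apply_votes(classifications, votes):
    """Keep classifications whose summed votes are not negative.

    classifications: {label: probability}
    votes: {label: list of +1, 0, or -1 values from reviewers}
    """
    return {label: p for label, p in classifications.items()
            if sum(votes.get(label, [])) >= 0}
```

With two reviewers voting +1 and -1 on the same label, the sum is zero and the label survives; it is removed only when the "down" votes outnumber the "up" votes.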
The training set began as a compilation of existing source catalogs and a "seed" set of manual classifications ("initial labeling" in Figure 1). The cyclical process of active learning grew the training set to the point of containing manually reviewed labels for 85,136 sources along with features generated from 170,632 associated light curves, on which we ran a final round of classifier training for all classes having 50 or more positive examples. The training set is available to the public electronically on Zenodo.

Hyperparameter Optimization and Training
We shuffled and divided the learning set into three partitions, reserving 81% of rows for training, 9% for model validation, and 10% to obtain the test scores we report in this paper. We optimized DNN and XGB hyperparameters via two distinct tuning processes, both evaluated on the 9% of the data reserved for model validation.
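A minimal sketch of the 81/9/10 partition follows. The fractions come from the text; the fixed seed, the rounding of partition sizes, and the function name are assumptions of this example.

```python
import random


def split_learning_set(rows, fractions=(0.81, 0.09, 0.10), seed=0):
    """Shuffle rows and split them into (train, validation, test) partitions."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = round(fractions[0] * len(rows))
    n_val = round(fractions[1] * len(rows))
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]
```

The test partition is simply whatever remains after the first two cuts, so the three partitions are disjoint and together cover every row.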
For DNN, we used Weights and Biases Sweeps to investigate changes in model performance for different parameter combinations. Specifically, we optimized the amsgrad, epsilon, and lr hyperparameters. Training ran for 200 epochs. The optimal DNN hyperparameters for each classifier are saved in the default configuration file in the SCoPe code repository.
For XGB, we performed an initial piecemeal grid search separately optimizing up to two hyperparameters at a time (max_depth and min_child_weight, subsample and colsample_bytree, and eta). We followed these optimizations with one additional round each for the former two pairs of parameters using a more granular grid. XGB training then ran for up to 999 epochs unless there was no improvement in the area under the ROC curve for 10 consecutive rounds. XGB hyperparameter tuning can be reproduced by running the training code from the SCoPe repository using the training set on Zenodo. The results of training and inference are presented in the next section.
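The early-stopping criterion can be sketched independently of any particular library. The limits of 999 rounds and 10 rounds of patience come from the text; the rank-based AUC computation and the function names are assumptions of this example, which takes a precomputed list of validation AUC values rather than training a model.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparisons; ties count as one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_round_with_early_stopping(auc_per_round, patience=10, max_rounds=999):
    """Return (best_round, best_auc), stopping after `patience` rounds without improvement."""
    best_round, best_auc = -1, float("-inf")
    for i, auc in enumerate(auc_per_round[:max_rounds]):
        if auc > best_auc:
            best_round, best_auc = i, auc
        elif i - best_round >= patience:
            break
    return best_round, best_auc
```

A late improvement that would have occurred after the patience window is never seen, which is the trade-off accepted in exchange for shorter training and reduced overfitting.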

Training
We report dichotomous classifier training results using precision and recall scores. The precision is the fraction of sources labeled as positive by the classifier that are true-positive examples in the test set (i.e. purity). The recall is the fraction of positive examples correctly labeled as positive by the classifier (i.e. completeness). Table 3 shares the precision and recall scores for each DNN and XGB classifier.
Recall scores are zero in cases where an algorithm did not correctly classify any of the positive training examples. In these cases, the precision can be either undefined (no light curves classified as positive) or zero (only false-positive classifications). For some classes, both the DNN and XGB algorithms failed to achieve nonzero recall. These classes are listed at the bottom of Table 3 and identified by half-filled circles in Figures 2 and 3.
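In terms of true positives (TP), false positives (FP), and false negatives (FN), precision is TP/(TP + FP) and recall is TP/(TP + FN). The sketch below, with a hypothetical function name, also reproduces the two zero-recall cases described above by returning `None` for an undefined score.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall); a score is None when its denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    return precision, recall
```

For example, `precision_recall(0, 0, 5)` gives an undefined precision with zero recall (nothing classified as positive), while `precision_recall(0, 3, 5)` gives zero for both (only false positives).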
Figure 6 shows the number of positive examples for each class in the learning set and compares them to the median of all classes. The top and bottom panels of Figure 7 plot the precision/recall scores for DNN and XGB classifiers, respectively. Figure 8 shows a scatter-plot comparison of DNN and XGB precision/recall scores with color-coding for the number of positive training examples. Figure 9 plots the difference between test and train scores to provide quantitative insight into overfitting. Figure 10 shows a histogram counting the number of occurrences of each feature among the top three in importance for each XGB classifier.

Inference
We used the trained models to perform inference on 77 ZTF fields, including the original 20 studied in Paper I (see Figure 4 of that work). These fields represent different parts of the sky (e.g. in and out of the Galactic plane, toward and away from the bulge). To expand the sample to 77, we initially added a field on each "side" of the original 20. For example, we added fields 295 and 298 to the original pair of 296 and 297. We then added fields having immediately lower and higher declinations compared to the existing collection. Overall, the 77 fields yielded 209,991,147 sets of features and classifications, each corresponding to a ZTF light curve. The full collection of DNN and XGB classifications is available on Zenodo; a partial table of the first 10 rows of predictions for field 487 is shown in Table 4.
Figures 11 and 12 show heatmaps of predicted DNN and XGB classification probabilities for sources in fields 487, 563, and 777 (containing 18,184,402 light curves). Figure 11 plots the binned probabilities for the top four most represented classes in the training set, while Figure 12 does the same for the four least represented classes that have nonzero precision/recall scores for both algorithms.

Precision and Recall
While DNN and XGB performance varied from class to class, for both algorithms there is an association between the number of positive examples and precision/recall scores. With some exceptions, we report high precision and recall scores for classes at or above the median number of positive examples (∼3000 light curves, or ∼2% of all learning set light curves). This pattern is understandable since a classifier that is given more positive examples may generalize better than one trained on a smaller sample, as long as there remains a sufficient number of negative examples for comparison. Among the classes with the most positive examples, even the vnv classifier is given ∼25% negative examples during training, avoiding a major class imbalance.
On the other end of the range, several classifiers were given fewer than 1000 positive examples for training, resulting in a > 99% rate of negative examples. While this does not guarantee poorer classifier performance, it may make it more difficult for a classifier to learn and generalize. Many of the least represented classes achieve greater precision than recall, a result of exposing each classifier to the entire training set rather than weighting classes based on representation. While the ideal classifier achieves both high precision and recall scores, we chose to train in a way that favors precision over recall to minimize the number of false-positive classifications.
An additional factor that may contribute to the lower scores of some classifiers is the consistency of manual labels across the entire training set. The sin classifier offers a useful example: sinusoidal light-curve phenomenology should be readily identifiable based on the features input to the classifier. However, the sinusoidal label may not have been consistently applied, as human experts focused on their specific area of ontological expertise. As a result, the training set may contain sinusoidal light curves that are missing the appropriate label, reducing the sin classifier's performance. Finally, the intrinsic similarity of certain classes (e.g. contact binaries labeled as wuma and interacting blyr binaries, along with their ew and eb phenomenologies) adds further difficulty to the training of some classifiers.

DNN versus XGB Training Scores
The upper right corners of the scatter plots in Figure 8 show that for the classes containing higher numbers of positive examples the precision and recall scores are both high and comparable between DNN and XGB. There are also some classifiers trained on a more moderate number of positive examples that achieve high precision/recall. As the number of positive examples decreases, XGB tends to outperform DNN in precision scores. XGB also achieves higher recall for more classes than DNN, but the discrepancy is smaller than that for the algorithms' precision.
The general trend of higher precision for XGB classifiers may reflect a benefit of the algorithm's boosting approach as described in Section 3.3.2, even though XGB classifiers lack the dmdt feature that is input to DNN classifiers. The boosting approach does increase the chance of overfitting, which is quantified by Figure 9. In these plots, differences between test and training set precision/recall scores that are near zero suggest little overfitting. For more negative differences on the plots, the model's superior performance on the training set indicates a potential overfit. The majority of DNN and XGB classifiers have test/training score differences near zero. However, some classifiers show signs of overfitting, especially the sin, blend, ceph2, and mir DNN classifiers and the emsms, blyr, sin, rscvn, agn, rrd, ceph2, and mir XGB classifiers. These classifiers tend to be trained on fewer positive examples than classifiers that show minimal overfitting.

Feature Importance
According to Figure 10, the feature most frequently having top three importance among XGB classifiers is period_ELS_ECE_EAOV. This quantity is the light curve's period determined by the nested approach described in Section 2.4. This feature was only input to ontological classifiers, and its high importance is consistent with the well-defined period ranges known to be indicative of many ontological classes in SCoPe.
The next feature among the top three, f1_power_ELS_ECE_EAOV, is the normalized $\chi^2$ value associated with the best-fitting Fourier series to the light curve. A value close to 1 means that the fourth-order Fourier series fit best, while a value close to zero indicates that the zeroth order (constant value) provided the best fit. Given the variable and periodic nature of most SCoPe classes, it is unsurprising that this indicator of variability was highly important in many cases. The ability to use a parallax value to convert apparent to absolute magnitudes supports the inclusion of Gaia EDR3 parallax among the top features in Figure 10. We note that even negative parallax values retain a meaningful connection to an object's distance despite larger uncertainties (e.g. Luri et al. 2018), and thus their inclusion in our feature set is warranted.
Rounding out the top five features are Gaia EDR3 phot_bp_rp_excess_factor and inv_vonneumannratio. The von Neumann ratio computes the ratio between the correlated variance and the variance, and it is thus sensitive to variability. The BP − RP excess factor evaluates the flux ratio $(I_{\rm BP} + I_{\rm RP})/I_G$. This statistic, originally intended as a measure of photometric quality, may also serve as a proxy for color, which often delineates different astronomical objects.
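To make the von Neumann statistic concrete, the sketch below computes the ratio of the mean squared successive difference to the sample variance for a time-ordered list of magnitudes, along with its inverse. This is the textbook form of the ratio; the exact normalization used for the SCoPe feature is an assumption here, as are the function names.

```python
from statistics import variance


def von_neumann_ratio(mags):
    """Mean squared successive difference divided by the sample variance.

    `mags` must be ordered in time and contain at least two distinct values.
    """
    n = len(mags)
    delta2 = sum((b - a) ** 2 for a, b in zip(mags, mags[1:])) / (n - 1)
    return delta2 / variance(mags)


def inv_von_neumann_ratio(mags):
    """Inverse ratio: larger for smooth, correlated variations than for rapid scatter."""
    return 1.0 / von_neumann_ratio(mags)
```

A smoothly trending series such as `[0, 1, 2, 3, 4]` gives an inverse ratio of 2.5, whereas an alternating series such as `[0, 1, 0, 1, 0, 1]` gives 0.3, showing how the statistic separates correlated variability from point-to-point scatter.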
Our feature importance results show similarities with those obtained by Richards et al. (2012) when training a random forest classifier on ASAS sources. Both studies found the period (or fundamental oscillation frequency) to be the most important feature. Additionally, both list the Stetson J coefficient and skew among the top 10 most important features. Going forward, results like these may be used to reduce the number of features required for reliable classifications. This reduction would expedite future classification projects, potentially to the point of enabling real-time classification based on a small number of important features.

Inference Results
The heatmaps in Figures 11 and 12 show a spread of DNN/XGB classification probabilities across the 2D space for a combination of fields 487, 563, and 777 (18,184,402 light curves).Perfect agreement between algorithms would be achieved if only the diagonals on these plots contained light curves.Multiple patterns are discernible from the heatmaps in Figure 11.The largest number of sources in each heatmap is in the lower left corner, where both algorithms share a probability < 0.05.This indicates agreement among both algorithms that many light curves are unlikely to be classified with each label.For many of the visualized classifiers, there is also a concentration of light curves sharing high DNN/XGB confidence in each classification.This agreement appears as the darker squares in the upper right corners of some panels.In other cases (such as the srv classifier), very few probabilities approach unity, but a correlation is visible between DNN/XGB probabilities.
Especially for the pnp, e, and bis heatmaps, there are single columns or rows indicating a wide range of one algorithm's probabilities paired with a narrow range from the other. For example, the pnp heatmap shows a column with many light curves having a wide range of XGB probabilities but DNN probabilities between 0 and 0.05. Similarly, there is a row in the bis heatmap showing many DNN probabilities for a narrow range of XGB probabilities. While these features indicate some level of disagreement between the regression performed by the algorithms, the mapping of probabilities to dichotomous classifications using a threshold (probability > 0.7) results in the high precision/recall scores for all four classifiers shown in Figure 11.
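The structure of these heatmaps can be reproduced with a simple 2D binning of paired probabilities. The sketch below assumes 20 bins of width 0.05 per axis (consistent with the < 0.05 corner cell described above) and places DNN on the horizontal axis and XGB on the vertical axis; the axis assignment and function names are assumptions made for illustration.

```python
def bin_index(probability, n_bins=20):
    """Map a probability in [0, 1] to a bin; 1.0 falls in the last bin."""
    return min(int(probability * n_bins), n_bins - 1)


def probability_heatmap(dnn, xgb, n_bins=20):
    """Count light curves per (XGB row, DNN column) cell."""
    grid = [[0] * n_bins for _ in range(n_bins)]
    for d, x in zip(dnn, xgb):
        grid[bin_index(x, n_bins)][bin_index(d, n_bins)] += 1
    return grid


def classify(probability, threshold=0.7):
    """Dichotomous classification from a probability and a threshold."""
    return probability > threshold


# Hypothetical probabilities for four light curves.
dnn = [0.01, 0.02, 0.96, 0.03]
xgb = [0.02, 0.62, 0.97, 0.01]
grid = probability_heatmap(dnn, xgb)
```

In this toy input the second light curve lands in the first DNN column but a middle XGB row, which is the kind of cell that builds up the single columns discussed above. After thresholding at 0.7, both algorithms still assign it a negative classification.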
Figure 12 shows some familiar patterns from Figure 11 and some new ones for the four least represented labels. Again, the lower left corner contains the most light curves, indicating agreement between DNN and XGB for many near-zero probabilities. Some panels also show columns or rows indicating a wide range of probabilities from one algorithm paired with a narrow range from the other algorithm (especially for the agn heatmap).
In the case of the mir heatmap, a pattern of agreement is visible between DNN and XGB, as indicated by the high density of light curves in the plot's upper right and lower left corners. The other heatmaps in Figure 12 do not have any light curves along their top and right edges, showing that some classifiers do not yield probabilities near unity. For rrd and wvir, very few probabilities are > 0.5. This highlights the importance of considering not only absolute probabilities when evaluating SCoPe predictions but also the probabilities relative to each classifier's distribution. For example, one might select candidate W Vir variables by considering light curves having a top percentile of DNN and XGB probabilities, even if they are not high on an absolute scale.
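A relative selection of this kind could be sketched as follows. Each probability is converted to its percentile rank within that classifier's own distribution, and a light curve is kept only if it ranks highly for both algorithms. The function names and the default rank cut are hypothetical choices for illustration, not values from this work.

```python
from bisect import bisect_right


def percentile_ranks(probs):
    """Fraction of the distribution at or below each probability."""
    ordered = sorted(probs)
    n = len(ordered)
    return [bisect_right(ordered, p) / n for p in probs]


def relative_candidates(dnn, xgb, min_rank=0.99):
    """Indices of light curves ranked at or above min_rank by both algorithms."""
    dnn_ranks = percentile_ranks(dnn)
    xgb_ranks = percentile_ranks(xgb)
    return [
        i
        for i, (d, x) in enumerate(zip(dnn_ranks, xgb_ranks))
        if d >= min_rank and x >= min_rank
    ]


# Hypothetical wvir-like probabilities: none is high on an absolute scale.
dnn = [0.10, 0.40, 0.05, 0.30]
xgb = [0.20, 0.35, 0.01, 0.10]
candidates = relative_candidates(dnn, xgb, min_rank=0.75)
```

Here the second light curve is selected even though neither probability exceeds 0.5, because it sits at the top of both distributions.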

DNN versus XGB Probabilities
Although inference results for unclassified light curves cannot be compared with ground truth in the same way as our training data, there are still useful insights to be learned from the collection of predictions. For example, for fields 487, 563, and 777 we study the agreement between DNN and XGB classifications on a per-classifier basis, considering the same probability threshold of 0.7 we used for algorithm training. We find that across all classifiers having nonzero recall scores for both ML algorithms, an average of 99% of light curves have both DNN and XGB probabilities either greater than or less than 0.7 (i.e., not conflicting given this threshold). The range of this agreement is between 85% and 100% depending on the classifier.
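The agreement fraction described here can be expressed compactly. The following minimal sketch assumes paired lists of DNN and XGB probabilities for one classifier; the function name is hypothetical.

```python
def threshold_agreement(dnn, xgb, threshold=0.7):
    """Fraction of light curves on the same side of the threshold for both algorithms."""
    if len(dnn) != len(xgb):
        raise ValueError("probability lists must have equal length")
    if not dnn:
        raise ValueError("no light curves supplied")
    agree = sum((d > threshold) == (x > threshold) for d, x in zip(dnn, xgb))
    return agree / len(dnn)


# Hypothetical probabilities: only the third light curve is conflicting.
fraction = threshold_agreement([0.9, 0.1, 0.8, 0.2], [0.95, 0.05, 0.3, 0.1])
```

Because pairs of near-zero probabilities count as agreement, this fraction is dominated by nonclassifications whenever a class is rare.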
However, the classifiers showing the greatest agreement using the above method typically score so highly because few or no light curves are classified with probabilities greater than 0.7. For these classifiers, we therefore adjust the test by iteratively decreasing the DNN and XGB probability thresholds independently until at least 1000 light-curve probabilities are above each threshold. This produces little change in the agreement fractions between DNN and XGB for all classifiers, except for a very slight (∼ 0.003%) decrease in the maximum fraction of agreeing classifications.
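The adjusted test can be sketched as below. The step size by which each threshold is lowered is not specified in the text, so the value of 0.05 is an assumption, as are the function names.

```python
def lowered_threshold(probs, start=0.7, step=0.05, min_count=1000):
    """Lower the threshold from start until at least min_count probabilities exceed it.

    Assumption: a fixed step of 0.05. Returns 0.0 if the count is never reached.
    """
    n_steps = 0
    threshold = start
    while threshold > 0.0 and sum(p > threshold for p in probs) < min_count:
        n_steps += 1
        # Recompute from the start value to avoid accumulating float error.
        threshold = max(0.0, round(start - n_steps * step, 10))
    return threshold


def agreement_with_thresholds(dnn, xgb, t_dnn, t_xgb):
    """Agreement fraction when each algorithm has its own threshold."""
    agree = sum((d > t_dnn) == (x > t_xgb) for d, x in zip(dnn, xgb))
    return agree / len(dnn)


# Toy usage with min_count=3 in place of 1000.
dnn = [0.9, 0.65, 0.62, 0.4]
xgb = [0.5, 0.45, 0.41, 0.05]
t_dnn = lowered_threshold(dnn, min_count=3)
t_xgb = lowered_threshold(xgb, min_count=3)
fraction = agreement_with_thresholds(dnn, xgb, t_dnn, t_xgb)
```

The two thresholds are derived independently, so they generally differ between the algorithms for the same classifier.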
While the above results imply strong agreement between DNN and XGB, they remain biased by the fact that, especially for the more specific ontological classifications, most light curves will have probabilities close to zero, indicating a nonclassification. This is visualized by the dark squares in the lower left corners of each heatmap, especially those in Figure 12. While agreement on nonclassifications is important, an additional useful test of agreement is to consider only the top N light-curve probabilities from each classifier.
For this test of high-confidence classification agreement between DNN and XGB, we analyze a different number N of light curves for each classifier in order to account for differences in the classifiers' levels of specificity. For example, we expect far more light curves to be classified with high probabilities from the vnv classifier than we do from the mir classifier.
We therefore use the sum of all light-curve probabilities from a given classifier as a rough proxy for the relative frequency of that class. This produces a number between 0 and the 18,184,402 light curves across the three fields, thus weighting the value N on a per-class basis. For the classifiers mentioned above, we consider the top 3,691,883 vnv light curves and the top 1,710 mir light curves. Across all classifiers, we find that an average of 36% of top N classifications agree between DNN and XGB, with a range between 1% and 85%. While this analysis does not offer insight into the ground truth of these classifications, it shows that there is a mix of agreement and disagreement between the two algorithms. Areas of agreement between the two algorithms correspond to the highest-confidence classifications in the sample, while areas of disagreement represent interesting conflicts that may indicate a preferred ML algorithm for that class or anomalous light curves.
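A minimal sketch of this top N comparison follows. Two details are assumptions made for illustration: N is taken as the rounded sum of one set of probabilities, and agreement is measured as the fraction of the top N light curves shared by both algorithms. The function names are hypothetical.

```python
def class_size_proxy(probs):
    """Rounded sum of probabilities, used as a rough class-frequency proxy."""
    return round(sum(probs))


def top_n_indices(probs, n):
    """Indices of the n largest probabilities."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return set(order[:n])


def top_n_agreement(dnn, xgb, n):
    """Fraction of the top n light curves common to both algorithms."""
    if n <= 0:
        raise ValueError("n must be positive")
    shared = top_n_indices(dnn, n) & top_n_indices(xgb, n)
    return len(shared) / n


# Hypothetical probabilities for five light curves.
dnn = [0.9, 0.8, 0.1, 0.2, 0.05]
xgb = [0.85, 0.1, 0.9, 0.1, 0.05]
n = class_size_proxy(dnn)
overlap = top_n_agreement(dnn, xgb, n)
```

Unlike the threshold test, this measure ignores the many near-zero probabilities, so it isolates agreement on the most confident classifications.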
Finally, we study the connection between hierarchical classifications. SCoPe classifiers are trained independently, and the predicted classifications are not influenced by each other. To explore the results of this approach, we examine the immediate subclasses of the vnv label, pnp and i. Among light curves in fields 487, 563, and 777 with pnp probability > 0.7, 74% also have vnv probability > 0.7 for DNN. The same is true for 94% of XGB light curves. Using the same probability threshold, 91% of DNN i light curves are also vnv (92% for XGB). These results suggest that a logical hierarchy generally persists among these independent classifications, and we see the most consistency for the XGB algorithm.
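The hierarchy check reduces to a conditional fraction: among light curves passing the threshold for a subclass, how many also pass it for the parent class. The sketch below assumes paired probability lists from a single algorithm; the function name and input values are hypothetical.

```python
def hierarchy_consistency(child, parent, threshold=0.7):
    """Fraction of child-class detections that are also parent-class detections.

    Returns None when no light curve exceeds the threshold for the child class.
    """
    selected = [p for c, p in zip(child, parent) if c > threshold]
    if not selected:
        return None
    return sum(p > threshold for p in selected) / len(selected)


# Hypothetical pnp (child) and vnv (parent) probabilities from one algorithm.
pnp = [0.9, 0.8, 0.2, 0.75]
vnv = [0.95, 0.5, 0.9, 0.8]
consistency = hierarchy_consistency(pnp, vnv)
```

Running the same function separately on DNN and XGB probabilities gives one consistency value per algorithm, which is the form of the comparison quoted above.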

CONCLUSION
In this paper, we have reported the training and inference results for the open-source ZTF Source Classification Project, which trains dichotomous ML classifiers on ZTF data. The two algorithms we used, a neural network and XGBoost, achieved comparable precision and recall scores for several well-represented classes. As the number of positive training examples decreased, the classifiers displayed more noticeable differences in performance. In particular, XGB often scored higher in precision than DNN as the number of positive examples decreased. Recall scores were more comparable between the algorithms.
We used the XGB algorithm to determine feature importance across classifiers, finding that the light-curve periods were most often among the features of highest importance. This feature was followed in importance by a quantity encoding the order of the best-fitting Fourier series to the light curve, the Gaia EDR3 parallax and BP − RP excess factor, and the inverse von Neumann ratio. Future work could maximize computational efficiency by reducing the number of included features to the minimum required for reliable results.
We reported classification predictions for 209,991,147 light curves using 34 dichotomous classifiers. This catalog of DNN and XGB classification predictions, as well as the training set, is available electronically on Zenodo. The computational demands of running inference on all ZTF fields limit this paper's reported predictions to these 77 fields, which represent a wide range of regions on the sky. This variable source catalog will continue to grow as additional ZTF fields are run through the SCoPe workflow.
Future time-series ML work may streamline the feature generation process, reducing the resources required to classify a larger collection of light curves. The incorporation of additional ML algorithms may lead to improved performance on a broader variety of classes and numbers of positive examples. Finally, upcoming sources of new data will support future time-domain studies: the Legacy Survey of Space and Time at Rubin Observatory (Ivezić et al. 2019) will provide time-series data for nearly an order of magnitude more sources than ZTF, and the NEO Surveyor Mission (Mainzer et al. 2023) will similarly succeed the Wide-field Infrared Survey Explorer in the mid-IR. The SCoPe code is readily adaptable to data with different cadences and bands, and we look forward to the continued contributions this project can make to time-domain astronomy.
We are grateful to the referee for providing helpful comments that strengthened the paper. B.F.H. and M.W.C. acknowledge support from the National Science Foundation with grant Nos. PHY-2308862 and PHY-2117997. This work used Expanse at the San Diego Supercomputer Center through allocation AST200029, "Towards a complete catalog of variable sources to support efficient searches for compact binary mergers and their products," from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grant Nos. 2138259, 2138286, 2138307, 2137603, and 2138296. The Gordon and Betty Moore Foundation, through both the Data-Driven Investigator Program and a dedicated grant, provided critical funding for SkyPortal.

Figure 1. Workflow for SCoPe showing initial database queries, feature generation, and initial labeling followed by cycles of training, inference, and active learning. The training set and inference results are available publicly on Zenodo.

Figure 2. Hierarchy of ontological classes used in SCoPe. Not intended to be an exhaustive taxonomy, this collection of labels organizes the intrinsic classifications that compose the training set. Filled circles indicate labels for which we trained a classifier. Open circles show labels that were not used for training owing to their few (< 50) positive examples or solely organizational nature ("Variable source," "Accretor"). Half-filled circles identify labels with enough positive examples but lacking a successfully trained classifier (see Section 4.1).

Figure 3. Hierarchy of phenomenological classes in SCoPe, following the filled-circle pattern of Figure 2. As with the ontological hierarchy, only non-top-level classes having 50 or more positive examples were used for training. Additionally, "nonvariable" is an organizational label, since the nonvariable probability is defined as 1 − the "variable" probability.

Figure 4. DNN architecture graph showing keras layer names and activation functions along with the shapes of their input and output. "None" dimensions indicate the network's ability to work with any number of sources. The network combines a convolutional branch (taking dmdt as input) and fully connected layers (taking all other features) to yield classification probabilities. Dropout layers introduce regularization, helping to reduce overfitting.

Figure 5. fritz interface used to vote, add, and remove classifications from sources. At top are the source's name, coordinates, and current classifications. Voting options become visible when mousing over a classification. Beneath are cutouts from the Sloan Digital Sky Survey, Legacy Survey, and Pan-STARRS, along with time-series and phase-folded photometry. Below is the slider interface used to add new classifications from a selected taxonomy.

Figure 6. Number of positive examples for each classification in the learning set. The dashed line shows the median number of positive examples among all classes.

Figure 7. Test precision and recall statistics for DNN (top) and XGB (bottom) classifiers. The number of positive examples in the training set decreases from left to right.

Figure 8 .Figure 9 .
Figure 8. Scatter plots of XGB vs. DNN test precision (left) and recall (right) scores. Points are color-coded by the logarithm of the number of positive examples in the learning set.

Figure 10. Number of occurrences of features among the top three in importance for each XGB classifier.

Figure 11. Heatmaps of XGB and DNN classification probabilities (fields 487, 563, and 777) for the four classes with the most positive training examples.

Figure 12. Heatmaps of XGB and DNN classification probabilities (fields 487, 563, and 777) for the four classes with the fewest positive training examples.
Based on observations obtained with the Samuel Oschin 48-inch telescope and the 60-inch telescope at the Palomar Observatory as part of the Zwicky Transient Facility project. ZTF is supported by the National Science Foundation under grant Nos. AST-1440341 and AST-2034437 and a collaboration including current partners Caltech, IPAC, the Weizmann Institute of Science, the Oskar Klein Center at Stockholm University, the University of Maryland, Deutsches Elektronen-Synchrotron and Humboldt University, the TANGO Consortium of Taiwan, the University of Wisconsin at Milwaukee, Trinity College Dublin, Lawrence Livermore National Laboratories, IN2P3, University of Warwick, Ruhr University Bochum, and Northwestern University and former partners the University of Washington, Los Alamos National Laboratories, and Lawrence Berkeley National Laboratories. Operations are conducted by COO, IPAC, and UW.

Table 1. Definitions of Features Input to Phenomenological and Ontological Classifiers.

Table 2. Definitions of Additional Features Input to Ontological Classifiers.

Table 3. Classification Abbreviations, Names, Definitions, Number of Positive Training Examples, Precision, and Recall.