Alert Classification for the ALeRCE Broker System: The Light Curve Classifier


Published 2021 February 23 © 2021. The American Astronomical Society. All rights reserved.
Citation: P. Sánchez-Sáez et al. 2021 AJ 161 141. DOI: 10.3847/1538-3881/abd5c1


Abstract

We present the first version of the Automatic Learning for the Rapid Classification of Events (ALeRCE) broker light curve classifier. ALeRCE is currently processing the Zwicky Transient Facility (ZTF) alert stream, in preparation for the Vera C. Rubin Observatory. The ALeRCE light curve classifier uses variability features computed from the ZTF alert stream and colors obtained from AllWISE and ZTF photometry. We apply a balanced random forest algorithm with a two-level scheme where the top level classifies each source as periodic, stochastic, or transient, and the bottom level further resolves each of these hierarchical classes among 15 total classes. This classifier corresponds to the first attempt to classify multiple classes of stochastic variables (including core- and host-dominated active galactic nuclei, blazars, young stellar objects, and cataclysmic variables) in addition to different classes of periodic and transient sources, using real data. We created a labeled set using various public catalogs (such as the Catalina Surveys and Gaia DR2 variable stars catalogs, and the Million Quasars catalog), and we classify all objects with ≥6 g-band or ≥6 r-band detections in ZTF (868,371 sources as of 2020 June 9), providing updated classifications for sources with new alerts every day. For the top level we obtain macro-averaged precision and recall scores of 0.96 and 0.99, respectively, and for the bottom level we obtain macro-averaged precision and recall scores of 0.57 and 0.76, respectively. Updated classifications from the light curve classifier can be found at the ALeRCE Explorer website (http://alerce.online).


1. Introduction

Brightness variations of astrophysical objects offer key insights into their physical emission mechanisms and related phenomena. In stars, both radial and non-radial pulsations can result from a thermodynamic engine operating in partial ionization layers when stars are located inside one of the several so-called instability strips found in the Hertzsprung–Russell diagram. Eruptive events can be generated by material being lost from a star or occasionally accreted onto it, as is typical in protostars and young stellar objects (YSOs). Explosive events can occur when material is accreted onto compact objects, such as white dwarfs in the case of cataclysmic variables (CVs) or neutron stars in the case of X-ray binaries, or when stars merge. Brightness changes can also originate from the rotation of stars, caused by surface features such as starspots, and/or by stars' ellipsoidal shapes. Finally, eclipses can occur, depending on the observer's line of sight, due to the presence of binary companions, planets, and/or other circumstellar material. These and other classes of stellar variability are reviewed and summarized, for instance, in Catelan & Smith (2015), where extensive additional references can be found. In addition, there is a wide array of transients, such as kilonovae (Metzger et al. 2010), supernovae (SNe; Woosley et al. 2002), and tidal disruption events, which are beacons of destructive episodes in the life of a star (Komossa 2015). Galaxies, in turn, can also present a wide array of variability phenomena. In those hosting strongly accreting massive black holes (BHs), for instance, variations develop due to the stochastic nature of the accretion disk, corona, and jet emission, potentially related to both the BH properties and the structure of the material in the immediate vicinity (e.g., MacLeod et al. 2010; Caplar et al. 2017; Sánchez-Sáez et al. 2018).

To study the variability of individual objects in detail and use this information to probe different physical models, observations over a wide range of timescales are required. Hence, long and intensive campaigns targeting a large number of objects are crucial. In recent years, surveys that cover a significant part of the sky, revisit the same regions on timescales from days to years, and contain a large sample of serendipitous objects have become available as predecessors of the Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST; Ivezić et al. 2019).

Among these is the Zwicky Transient Facility (ZTF; Bellm 2014; Bellm et al. 2019), which had first light in 2017 and employs a powerful 47 deg² field-of-view camera mounted on the Samuel Oschin 48 inch Schmidt telescope. ZTF is designed to image the entire northern sky every three nights and scan the plane of the Milky Way twice each night to a limiting magnitude of 20.5 in gri, thus enabling a wide variety of novel multiband time series studies in preparation for the LSST.

LSST, which aims for first light in 2022, will revolutionize time domain astronomy, enabling for the first time the study of transient and variable objects over long periods of time (∼10 yr) with ≳1000 visits, down to very faint magnitudes (r ∼ 24.5 for single images of the entire sky every 3 days, ∼26.1 for yearly stacks, and ∼27.5 at full depth; 5σ), over a large sky area (>18,000 deg²).

Given the large number of sources that ZTF and LSST will observe (∼1–40 billion objects), it is critical to develop reliable and efficient variability-based selection techniques. Variability information can break degeneracies that persist when sources are characterized by color alone. These selection techniques should ideally take advantage of the multiband light curves provided by surveys like LSST and ZTF and separate different subclasses of variable and transient objects without the need for optical spectra, which are still quite expensive to obtain for such large samples.

This new generation of large-étendue survey telescopes has created a growing need for sophisticated astronomical alert-processing systems (i.e., systems that are able to detect changes of astrophysical origin in the sky). These systems involve the real-time processing of data for alert generation, real-time annotation and classification of alerts (up to 40 million events per night), and real-time reaction to interesting alerts using available astronomical resources (e.g., via Target Observation Managers, or TOMs). In order to use these resources intelligently and efficiently, the astronomical community has been developing a new generation of alert filtering systems known as "brokers." One such community broker is the project Automatic Learning for the Rapid Classification of Events (ALeRCE; Förster et al. 2020). ALeRCE is an initiative led by an interdisciplinary group of scientists from several institutions in Chile and the United States. The main aim of ALeRCE is to facilitate the study of non-moving variable and transient objects.

ALeRCE is currently processing the ZTF alert stream, providing classifications of different variable and transient objects, in preparation for the LSST era. Two classification models are currently available in the ALeRCE pipeline: a stamp classifier (or early classifier; Carrasco-Davis et al. 2020), which uses a convolutional neural network on the first detection stamp of a source to classify it into one of five broad classes (variable star, active galactic nucleus, SN, asteroid, or bogus), and a light curve classifier (or late classifier), which uses variability features computed from the light curves to classify each source into finer subclasses (currently 15) within three of the five broad classes.

In this work we present the first version of the ALeRCE light curve classifier. This classifier uses several novel features (see Section 3), and employs machine learning (ML) algorithms that can deal with the high class imbalance present in the data, following a two-level scheme. A key goal of ALeRCE is to provide fast classification of transient and variable objects in a highly scalable framework; we therefore only include in this model features that can be computed quickly, avoiding features that require more than one second to compute on the computational infrastructure currently at our disposal (for more details see Förster et al. 2020). The main advantage of this classifier is that it can separate multiple classes of transient and variable objects, using features computed from real data that could also be measured from LSST data. In particular, the light curve classifier can deal with multiple classes of stochastic variable objects (including core-, host-, and jet-dominated active galactic nuclei, YSOs, and CVs), which have normally not been included in previous classifiers that use real data to classify periodic and transient objects (e.g., Richards et al. 2009; Kim et al. 2014; Nun et al. 2016; Villar et al. 2019).
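The two-level scheme can be sketched as follows. This is a minimal illustration using scikit-learn's `RandomForestClassifier` with `class_weight="balanced_subsample"` as a stand-in for the balanced random forest actually used by the paper; the toy data, class names, and `classify` helper are illustrative, not the ALeRCE API:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy feature matrix: 300 sources x 4 features (stand-ins for the 152 real ones).
X = rng.normal(size=(300, 4))
top = rng.choice(["periodic", "stochastic", "transient"], size=300)
# Hypothetical bottom-level labels nested under each top-level class.
children = {"periodic": ["RRL", "E"], "stochastic": ["QSO", "YSO"],
            "transient": ["SNIa", "SNII"]}
bottom = np.array([rng.choice(children[t]) for t in top])

# Top level: one classifier over the three hierarchical classes.
top_clf = RandomForestClassifier(
    n_estimators=50, class_weight="balanced_subsample", random_state=0
).fit(X, top)

# Bottom level: one classifier per hierarchical branch, trained only on the
# sources belonging to that branch.
bottom_clfs = {
    t: RandomForestClassifier(n_estimators=50,
                              class_weight="balanced_subsample",
                              random_state=0).fit(X[top == t], bottom[top == t])
    for t in children
}

def classify(x):
    """Route a feature vector through the two-level scheme."""
    x = x.reshape(1, -1)
    t = top_clf.predict(x)[0]
    return t, bottom_clfs[t].predict(x)[0]

print(classify(X[0]))
```

A natural refinement, instead of hard routing, is to combine the two levels probabilistically (multiplying top- and bottom-level class probabilities); this sketch keeps only the simpler routing for clarity.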

This work attempts to separate an unprecedentedly large number of classes (15) of both transients and variable objects using real data (as opposed to using only simulated data). Previous works using real data have mostly focused on selecting either a variety of variable star classes (e.g., Debosscher et al. 2009; Richards et al. 2012; Elorrieta et al. 2016; Hosenie et al. 2019; Kim & Bailer-Jones 2016; Rimoldini et al. 2019; Zorich et al. 2020), different classes of variable objects, including variable stars and active galactic nuclei (e.g., Kim et al. 2014; Nun et al. 2016), or different classes of transients (Villar et al. 2019).

To the best of our knowledge three previous works have used real data to classify transients and variable objects, albeit considering a lower number of classes: Martínez-Palomera et al. (2018) used data from the HiTS survey (Förster et al. 2016, 2018) to classify eight transient, active galactic nuclei, and variable star classes; Narayan et al. (2018) used data from The Optical Gravitational Lensing Experiment (OGLE; Udalski et al. 1992) and the Open Supernova Catalog (OSC; Guillochon et al. 2017) to classify seven transient and variable star classes; and D'Isanto et al. (2016) used Catalina Real-Time Transient Survey (CRTS; Drake et al. 2009) data to classify six transient and variable object classes. Other works have tested techniques to classify different classes of variables and transients using synthetic data (e.g., Boone 2019) or a combination of synthetic and real data (e.g., Carrasco-Davis et al. 2019).

In addition, this work is the first attempt to separate three different classes of active galactic nuclei (core-dominated or quasi-stellar objects, hereafter "QSO;" host-dominated, hereafter "AGN;" and jet-dominated, hereafter "Blazar"). Previous works have mostly focused on separating active galactic nuclei from the rest (e.g., Butler & Bloom 2011; Peters et al. 2015; Palanque-Delabrouille et al. 2016; De Cicco et al. 2019; Sánchez-Sáez et al. 2019).

The paper is organized as follows. In Section 2 we describe the data used for this work, the procedure for the light curve construction, as well as the taxonomy and the labeled set used to train the classifier. In Section 3 we define the set of features used by the light curve classifier. In Section 4 we describe the different ML algorithms tested for the classifier. In Section 5 we compare the performance of the different models, and report the results obtained for the labeled and unlabeled ZTF sets. Finally, in Section 6 we summarize the paper, provide conclusions, and discuss the challenges found during the development of the classifier and the future work.

2. Data

2.1. Reference Data

ALeRCE has been processing the public ZTF alert stream since 2019 May, which includes g and r photometry. The ALeRCE pipeline is described in detail by Förster et al. (2020); for clarity, we provide a brief description of the light curve construction process.

The ALeRCE pipeline processes the ZTF Avro alert files. These files contain metadata and contextual information for a single event, which is defined as a flux transient, a reoccurring flux variable, or a moving object (Masci et al. 2019). To construct light curves, the ALeRCE pipeline uses the photometry of the difference image and reference image (detections), possible non-detections associated with the target during the 30 days prior to the event (the 5σ magnitude limit in the difference image based on PSF-fit photometry, called diffmaglim by ZTF), the real-bogus quality score reported by ZTF (rb, which ranges from 0 to 1, with values closer to 1 implying more reliable detections), and the morphological classification of the closest object obtained from PanSTARRS1 (Tachibana & Miller 2018). An overview of the pipeline is presented in Figure 1. In summary, the different stages of the pipeline are:

  • (1)  
    Ingestion: the ZTF public stream is ingested using Kafka;
  • (2)  
    S3 upload: the alert Avro packets are stored in AWS S3 for later access;
  • (3)  
    Crossmatch: the position of the alert is used to query external catalogs;
  • (4)  
    Stamp classifier: alerts from new objects are classified using their image cutouts (stamps);
  • (5)  
    Preprocessing: the photometry associated with a given alert is corrected to take into account the use of difference image fluxes (see details below), and simple statistics associated with the aggregated light curve are computed;
  • (6)  
    Light curve features: advanced light curve statistics (features) are computed when there are at least six detections in a given band;
  • (7)  
    Light curve classifier: the light curve classifier described in this work is applied;
  • (8)  
    Outliers: an outlier detection algorithm is applied; and
  • (9)  
    ALeRCE stream: the aggregated, annotated, and classified light curves are reported in a Kafka stream.
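The stages above can be sketched as a chain of transformations over an alert record. All function names and fields here are hypothetical stand-ins for illustration, not the actual ALeRCE implementation:

```python
from functools import reduce

# Hypothetical stand-ins for the pipeline stages listed above; each stage
# takes the alert record (a plain dict) and returns it enriched.
def ingest(a):      return {**a, "ingested": True}                 # (1) Kafka ingestion
def store_s3(a):    return {**a, "s3_key": f"avro/{a['oid']}"}     # (2) S3 upload
def crossmatch(a):  return {**a, "allwise": None}                  # (3) nearest AllWISE match, if any
def stamp_clf(a):   return {**a, "stamp_class": "VS"}              # (4) first-alert stamp class
def preprocess(a):  return {**a, "lc_corr": a.get("lc", [])}       # (5) photometric correction
def features(a):    # (6) advanced features need >= 6 detections in a band
    a = dict(a)
    a["features"] = {"ndet": len(a["lc_corr"])} if len(a["lc_corr"]) >= 6 else None
    return a
def lc_clf(a):      # (7) light curve classifier runs only if features exist
    return {**a, "lc_class": "RRL" if a["features"] else None}

PIPELINE = [ingest, store_s3, crossmatch, stamp_clf, preprocess, features, lc_clf]

alert = {"oid": "ZTF18abcxyz", "lc": [19.1] * 8}
result = reduce(lambda acc, stage: stage(acc), PIPELINE, alert)
print(result["lc_class"])  # RRL
```

The outlier detector and the output Kafka stream (steps 8 and 9) would simply be further stages appended to `PIPELINE`.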

Figure 1.

Figure 1. A scheme of the ALeRCE pipeline. ZTF alerts are ingested using Kafka, and a series of sequential and parallel steps are initiated. Alerts are stored in AWS S3, classified based on their image stamps, crossmatched with other catalogs, and their photometry is corrected to take into account difference fluxes. Aggregated light curves are used to compute basic statistics (for internal use) and, if enough data points exist, features are computed, and the light curve and outlier classifiers are applied before sending an output stream. A PostgreSQL database is populated along the way, which can then be queried.


In step (3) we are experimenting with several catalogs, but for this work we use the AllWISE public Source Catalog (Wright et al. 2010; Mainzer et al. 2011), with a match radius of 2″, to obtain W1, W2, and W3 photometry (using magnitudes measured with profile-fitting photometry, e.g., w1mpro).

The preprocessing procedure (step 5) is described in detail in Förster et al. (2020; see their Appendix A). In particular, for the light curve classifier we use the corrected light curves (lc_corr; ${\hat{m}}_{\mathrm{sci}}$ in Förster et al. 2020) for sources whose closest source in the reference image coincides with the location of the alert (within a radius of 1.″4). It is important to use the corrected light curves for variable sources in order to take into account changes in the sign of the difference between the reference and the science images, or possible changes of the reference image. For the rest of the sources the correction cannot be performed, and thus we use the light curves obtained from the difference images (lc_diff; mdiff in Förster et al. 2020); these generally correspond to transient sources. Note that these criteria do not require prior knowledge about the class of the source. Therefore, in this work we use lc_corr for sources with available corrected light curves, and lc_diff otherwise, except for the supernova parametric model features and some optical colors, for which we use lc_diff for all sources (for more details, see Section 3.1 and Appendix A).
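The selection rule just described fits in a few lines; the function name and inputs below are hypothetical, but the logic follows the stated criteria:

```python
def choose_light_curve(sep_arcsec, lc_corr, lc_diff, max_sep=1.4):
    """Pick which light curve feeds the classifier.

    Use the corrected light curve when the closest reference-image source
    coincides with the alert position (within 1.4 arcsec) and a correction
    exists; otherwise fall back to the difference-image light curve, which
    is the typical case for transients. No class knowledge is required.
    """
    if sep_arcsec <= max_sep and lc_corr is not None:
        return "lc_corr", lc_corr
    return "lc_diff", lc_diff

kind, lc = choose_light_curve(0.5, [18.9, 19.0, 19.1], [0.2, 0.1, 0.3])
print(kind)  # lc_corr
```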

2.2. Classification Taxonomy

The first version of the ALeRCE light curve classifier considers 15 subclasses of variable and transient objects, presented as a taxonomy tree defined by the ALeRCE collaboration in Figure 2. The taxonomy is subdivided in a hierarchical fashion according to both the physical properties of each class and the empirical variability properties of the light curves as follows (in parentheses we indicate the class name used by the classifier):

  • 1.  
    Transient: SN Ia, SN Ibc, SN II, and super luminous supernova (SLSN);
  • 2.  
    Stochastic: type 1 Seyfert galaxy (AGN; i.e., host-dominated active galactic nuclei), type 1 quasar (QSO; i.e., core-dominated active galactic nuclei), blazar (Blazar; i.e., beamed jet-dominated active galactic nuclei), young stellar object (YSO), and cataclysmic variable/nova (CV/Nova);
  • 3.  
    Periodic: long-period variable (LPV; includes regular, semi-regular, and irregular variable stars), RR Lyrae (RRL), Cepheid (CEP), eclipsing binary (E), δ Scuti (DSCT), and other periodic variable stars (Periodic-Other; this includes classes of variable stars that are not well represented in the labeled set, e.g., sources classified as miscellaneous, rotational or RS Canum Venaticorum-type systems in CRTS).
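For reference, the taxonomy tree above can be written down as a plain mapping (class names as used by the classifier; the variable name is illustrative):

```python
# The three hierarchical classes and their 15 bottom-level classes.
TAXONOMY = {
    "Transient": ["SNIa", "SNIbc", "SNII", "SLSN"],
    "Stochastic": ["AGN", "QSO", "Blazar", "YSO", "CV/Nova"],
    "Periodic": ["LPV", "RRL", "CEP", "E", "DSCT", "Periodic-Other"],
}

print(sum(len(v) for v in TAXONOMY.values()))  # 15
```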

Figure 2.

Figure 2. Taxonomy tree used in the current version of the ALeRCE light curve classifier.


Figure 3 shows examples of light curves of the different classes considered by the light curve classifier, obtained using ZTF data.

Figure 3.

Figure 3. Examples of ZTF light curves of the different classes considered by the light curve classifier. For the transient classes we show the difference magnitude light curves; for the stochastic classes we show the apparent magnitude light curves, and for the periodic classes we show the folded apparent magnitude light curves. Green circles and red squares indicate the g and r bands, respectively. Error bars indicate photometry associated with detections. Triangles denote limiting magnitudes and are shown only for the difference magnitude light curves.


It is important to note that there are a number of less common classes which have not been separated out yet in the ALeRCE taxonomy tree because the number of crossmatched objects in these classes is too low to train a good classification model (e.g., SNe IIb, TDEs, KNe, among others). There is a catch-all "Periodic-Other" class for periodic classes excluded in the taxonomy tree, but not for transient or stochastic classes, and thus, for the moment, these missing classes are being grouped into one or more of the existing ones.

2.2.1. Labeled Set

The labeled set (i.e., the set of sources used to define the training and testing sets) for the light curve classifier was built using sources observed by ZTF (i.e., with ZTF light curves), with known labels obtained via spectroscopic and/or photometric analysis in previous works. Further description of the labeled set construction strategy can be found in Förster et al. (2020). We obtained labels from the following catalogs: the ASAS-SN catalog of variable stars (ASASSN; Jayasinghe et al. 2018, 2019a, 2019b, 2020), the Catalina Surveys Variable Star Catalogs (CRTS; Drake et al. 2014, 2017), the LINEAR catalog of periodic light curves (LINEAR; Palaversa et al. 2013), Gaia Data Release 2 (Gaia DR2; Mowlavi et al. 2018; Rimoldini et al. 2019), the Transient Name Server database (TNS), the Roma-BZCAT Multi-Frequency Catalog of Blazars (ROMABZCAT; Massaro et al. 2015), the Million Quasars Catalog (MILLIQUAS, version 6.4c, 2019 December; Flesch 2015, 2019), the New Catalog of Type 1 AGNs (Oh2015; Oh et al. 2015), and the SIMBAD database (Wenger et al. 2000). Some additional CV labels were obtained from different catalogs (including Ritter & Kolb 2003), compiled by Abril et al. (2020; JAbril). It is worth mentioning that we only use the labels provided by these catalogs to build our data sets, and not any other information, such as periods, colors, or redshifts. A catalog containing the labeled set can be downloaded at Zenodo: 10.5281/zenodo.4279623.

Table 1 lists the number of sources in the labeled set belonging to each class (with their corresponding percentages according to their hierarchical group) and the catalogs from which the classifications were obtained. Only sources with ≥6 detections in g or ≥6 detections in r were included (considering data obtained until 2020 June 9). It is clear from the table that there is a high imbalance in the labeled set, with some classes representing less than 5% of their respective hierarchical group. Figure 4 shows the (ordered) number of sources per class for the labeled set and Figure 5 shows the fraction of sources in each class with photometry only in the g band, only in the r band, or in both bands.

Figure 4.

Figure 4. Number of sources per class for the labeled set, as reported in Table 1.

Figure 5.

Figure 5. For the sources in the labeled set, this figure shows the fraction of sources in each class with photometry: only in the g band (green); only in the r band (red); or in both bands (gray). The reasons for the non-uniformity of coverage may be physical (strongly red or blue source) or organizational (survey focused on one band only). For most classes, the vast majority of the sources (≳92%) have photometric detections in both g and r; the exceptions are the YSO, LPV, and CEP classes, where only 76% of the sources have photometry in both bands.


Table 1. Labeled Set Definition

Hierarchical Class | Class          | # of Sources^a | Source Catalogs
Transient          | SN Ia          | 1272 (74.0%)   | TNS
                   | SN Ibc         | 94 (5.5%)      | TNS
                   | SN II          | 328 (19.1%)    | TNS
                   | SLSN           | 24 (1.4%)      | TNS
                   | Total          | 1718           |
Stochastic         | QSO            | 26168 (75.4%)  | MILLIQUAS (sources with class "Q")
                   | AGN            | 4667 (13.4%)   | Oh2015, MILLIQUAS (sources with class "A")
                   | Blazar         | 1267 (3.6%)    | ROMABZCAT, MILLIQUAS (sources with class "B")
                   | YSO            | 1740 (5.0%)    | SIMBAD
                   | CV/Nova        | 871 (2.5%)     | TNS, ASASSN, JAbril
                   | Total          | 34713          |
Periodic           | LPV            | 14076 (16.2%)  | CRTS, ASASSN, Gaia DR2
                   | E              | 37901 (43.5%)  | CRTS, ASASSN, LINEAR
                   | DSCT           | 732 (0.8%)     | CRTS, ASASSN, LINEAR, Gaia DR2
                   | RRL            | 32482 (37.3%)  | CRTS, ASASSN, LINEAR, Gaia DR2
                   | CEP            | 618 (0.7%)     | CRTS, ASASSN
                   | Periodic-Other | 1256 (1.4%)    | CRTS, LINEAR
                   | Total          | 87065          |

Note.

^a Values in parentheses correspond to the fraction of sources of a given class (second column) within its corresponding hierarchical class (first column).


3. Features Used by the Classifier

The light curve classifier uses a total of 152 features. We avoid including features that take a long time to compute (for example, features that require the use of Markov Chain Monte Carlo techniques), since one of the goals of the light curve classifier is to provide fast and highly scalable classification. One hundred and forty-two of these features are computed solely from the public ZTF g and r data. We excluded the mean magnitude as a feature, to avoid any bias in the labeled-set magnitude distribution affecting the classification of fainter (or brighter) sources. Features obtained using the ZTF observed magnitudes are called detection features (56 features in the g band, 56 in the r band, and 12 multiband features, for a total of 124), and features computed using the ZTF non-detection 5σ magnitude limits (diffmaglim) are called non-detection features (nine features for each of the g and r bands, for a total of 18). These features are described in the following sections (3.1 and 3.2), as well as in Appendix A. Considering the LSST Data Products Definition Document (Jurić et al. 2019), we expect that all of these features could be measured using LSST data.

We also include as features the Galactic coordinates of each target (gal_b and gal_l), the W1 − W2 and W2 − W3 AllWISE colors, and the g − W2, g − W3, r − W2, and r − W3 colors, where g and r are the mean magnitudes of the g- and r-band light curves of a given source. In addition, we use information included in the Avro files' metadata: the sgscore1 parameter, a morphological star/galaxy score of the closest PanSTARRS1 source (Tachibana & Miller 2018) reported in the ZTF Avro files, with 0 ≤ sgscore1 ≤ 1, where values closer to 1 imply a higher likelihood of the source being a star; and the median rb (real-bogus) parameter. With these 10 extra features, the total number of features used by the classifier comes to 152.
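The bookkeeping behind the count of 152 features can be verified directly (variable names are illustrative):

```python
# Feature-count arithmetic from the text above.
per_band_detection = 56            # per band (g and r)
multiband_detection = 12
detection = 2 * per_band_detection + multiband_detection   # 124 detection features
non_detection = 2 * 9                                      # 18 non-detection features
ztf_only = detection + non_detection                       # 142 ZTF-only features
extra = 10  # gal_b, gal_l, W1-W2, W2-W3, g-W2, g-W3, r-W2, r-W3, sgscore1, median rb
total = ztf_only + extra
print(total)  # 152
```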

As mentioned in Section 2.1, in this work we only consider light curves with ≥6 epochs in g or ≥6 epochs in r. If a given source has ≥6 epochs in only one band, it is still included in the analysis, and the features associated with the missing band are set to −999. This rule applies to all the features used by the classifier: whenever a feature is not available for a given target, we assign it a value of −999.
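A minimal sketch of this missing-value convention (the function and feature names are illustrative):

```python
MISSING = -999.0

def feature_vector(computed, feature_names):
    """Assemble the full feature vector, filling features that could not be
    computed (e.g., any feature of a band with < 6 detections) with -999."""
    return [computed.get(name, MISSING) for name in feature_names]

# A source with g-band features but no r-band coverage.
names = ["Amplitude_1", "Amplitude_2", "gal_b"]
vec = feature_vector({"Amplitude_1": 0.3, "gal_b": 42.0}, names)
print(vec)  # [0.3, -999.0, 42.0]
```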

3.1. Detection Features

Most of the features used by the light curve classifier are computed using the observed magnitudes in the g and r bands (i.e., the detections). There are 56 different features computed for each band, and 12 features computed using a combination of both bands, yielding a total of 124 detection features. The definition of all these features can be found in Table 2. We split the table into three blocks. The first block contains new features defined by this work (i.e., novel features). Some of these features are further described in Section 3.1.1. The second block contains features that correspond to new variants of descriptors included in other works. Some of them are further described in Appendix A. Finally, the third block includes 22 features that come from the Feature Analysis for Time Series (FATS; Nun et al. 2015) Python package. Hereafter, features ending with "_1" are computed for the g band, and features ending with "_2" are computed for the r band, following the notation used in the ZTF Avro files.

Table 2. List of Detection Features Used by the Light Curve Classifier

Feature | Description | References
delta_period Absolute value of the difference between the Multiband_period and the MHAOV period obtained using a single bandThis work
IAR_phi*Level of autocorrelation using a discrete-time representation of a DRW modelEyheramendy et al. (2018)
MHPS parameters*Obtained from an MHPS analysis (three in total)Arévalo et al. (2012)
positive_fraction Fraction of detections in the difference images of a given band which are brighter than the template imageThis work
Power_rate* ♦Ratio between the power of the multiband periodogram obtained for the best period candidate (P) and 2 × P, 3 × P, 4 × P, P/2, P/3, or P/4This work
PPE* ♦Multiband Periodogram Pseudo EntropyThis work
(gr)_max gr color obtained using the brightest lc_diff magnitude in each bandThis work
(gr)_max_corr gr color obtained using the brightest lc_corr magnitude in each bandThis work
(gr)_mean gr color obtained using the mean lc_diff magnitude of each bandThis work
(gr)_mean_corr gr color obtained using the mean lc_corr magnitude of each bandThis work
delta_mag_fid Difference between maximum and minimum observed magnitude in a given bandThis work
ExcessVar**Measure of the intrinsic variability amplitudeAllevato et al. (2013)
GP_DRW_tau**Relaxation time τ from DRW modelingGraham et al. (2017)
GP_DRW_sigma**Amplitude of the variability at short timescales (tτ), from DRW modelingGraham et al. (2017)
Harmonics parameters**Obtained by fitting a harmonic series up to the seventh harmonic (14 in total)Stellingwerf & Donohoe (1986)
Multiband_period** ♦Period obtained using the multiband MHAOV periodogramMondrik et al. (2015)
Pvar**Probability that the source is intrinsically variableMcLaughlin et al. (1996)
SF_ML_amplitude**rms magnitude difference of the SF, computed over a 1 yr timescaleSchmidt et al. (2010)
SF_ML_gamma**Logarithmic gradient of the mean change in magnitudeSchmidt et al. (2010)
SPM features**Supernova parametric model features (seven in total)Villar et al. (2019)
Amplitude Half of the difference between the median of the maximum 5% and of the minimum 5% magnitudesRichards et al. (2011)
AndersonDarling Test of whether a sample of data comes from a population with a specific distributionNun et al. (2015)
Autocor_length Lag value where the autocorrelation function becomes smaller than Eta_e Kim et al. (2011)
Beyond1Std Percentage of points with photometric mag that lie beyond 1σ from the meanRichards et al. (2011)
Con Number of three consecutive data points brighter/fainter than 2σ of the light curveKim et al. (2011)
Eta_e Ratio of the mean of the squares of successive mag differences to the variance of the light curveKim et al. (2014)
Gskew Median-based measure of the skewNun et al. (2015)
LinearTrend Slope of a linear fit to the light curveRichards et al. (2011)
MaxSlope Maximum absolute magnitude slope between two consecutive observationsRichards et al. (2011)
Meanvariance Ratio of the standard deviation to the mean magnitudeNun et al. (2015)
MedianAbsDev Median discrepancy of the data from the median dataRichards et al. (2011)
MedianBRP Fraction of photometric points within amplitude/10 of the median magRichards et al. (2011)
PairSlopeTrend Fraction of increasing first differences minus fraction of decreasing first differences over the last 30 time-sorted mag measuresRichards et al. (2011)
PercentAmplitude Largest percentage difference between either max or min mag and median magRichards et al. (2011)
Psi_CS Range of a cumulative sum applied to the phase-folded light curveKim et al. (2011)
Psi_eta Eta_e index calculated from the folded light curveKim et al. (2014)
Q31 Difference between the 3rd and the 1st quartile of the light curveKim et al. (2014)
Rcs Range of a cumulative sumKim et al. (2011)
Skew Skewness measureRichards et al. (2011)
SmallKurtosis Small sample kurtosis of the magnitudesRichards et al. (2011)
Std Standard deviation of the light curveNun et al. (2015)
StetsonK Robust kurtosis measureKim et al. (2011)

Note. Features marked with ♦ are computed using both g and r bands at the same time. Features marked with * and ** are further described in Section 3.1.1 and Appendix A, respectively.


3.1.1. Description of a Relevant Set of Detection Features

Table 2 summarizes the definitions of the detection features used by the light curve classifier. Some of these features are worth describing in more detail since they are novel features. All other relevant features are described in Appendix A:

  • 1.  
    Periodogram pseudo entropy: To have an estimate of the confidence of the candidate period (obtained with the multiband MHAOV method), we developed a heuristic based on the entropy of the normalized periodogram peaks, which we denote as periodogram pseudo entropy (PPE). This value is computed by recovering the 100 largest values of the periodogram, normalizing them and computing the entropy of that vector. This feature is computed as
    Equation (1)
    $$\mathrm{PPE}=1+\frac{1}{\mathrm{log}\,100}\sum _{i=1}^{100}\frac{{p}_{i}}{Z}\,\mathrm{log}\left(\frac{{p}_{i}}{Z}\right),$$
    where ${p}_{i}$ is the value of the ith largest peak of the periodogram and $Z={\sum }_{i=1}^{100}{p}_{i}$. This feature takes values between zero (no clear period stands out) and one (the periodogram has a single large peak).
  • 2.  
    Power rate: Ratio between the power of the multiband periodogram obtained for the best period candidate (P) versus the power of the multiband periodogram obtained for n times this period [Power_rate_n = Power(P)/Power(n × P)], where n can take values of 2, 3, 4, 1/2, 1/3, and 1/4. We computed these ratios in order to detect cases where we measure an aliased multiple of the period instead of the true period, which is particularly a common issue for some classes of eclipsing binaries (e.g., Catelan et al. 2013; Graham et al. 2013; McWhirter et al. 2018; VanderPlas 2018, and references therein).
  • 3.  
    Irregular autoregressive (IAR) model: Introduced by Eyheramendy et al. (2018), this model is a discrete-time representation of the continuous autoregressive model of order 1 [CAR(1)], with desirable statistical properties such as strict stationarity and ergodicity, without requiring a distributional assumption. The IAR model is defined by
    ${y}_{{t}_{j}}={\phi }^{{t}_{j}-{t}_{j-1}}{y}_{{t}_{j-1}}+\sigma \sqrt{1-{\phi }^{2({t}_{j}-{t}_{j-1})}}\,{\varepsilon }_{{t}_{j}},$ (2)
    where ${\varepsilon }_{{t}_{j}}$ is a white noise sequence with zero mean and unit variance, σ is the standard deviation of ${y}_{{t}_{j}}$, and {tj } are the observational times for j = 1, ..., n. We used a modified version of the IAR model, which considers the estimated variance of the measurement errors ${\delta }_{{t}_{j}}^{2}$ in the likelihood of the model. Thus, by assuming a Gaussian distribution, the negative log-likelihood of the process is given by
    $\ell (\phi )=\tfrac{n}{2}\mathrm{ln}(2\pi )+\tfrac{1}{2}{\sum }_{j=1}^{n}\left[\mathrm{ln}{\nu }_{{t}_{j}}+\tfrac{{e}_{{t}_{j}}^{2}}{{\nu }_{{t}_{j}}}\right],$ (3)
    where ${e}_{{t}_{1}}={y}_{{t}_{1}}$, ${\nu }_{{t}_{1}}={\sigma }^{2}+{\delta }_{{t}_{1}}^{2}$, and ${\hat{y}}_{{t}_{1}}=0$ are the initial values, while ${\hat{y}}_{{t}_{j}}={\phi }^{{t}_{j}-{t}_{j-1}}\,{y}_{{t}_{j-1}}$, ${e}_{{t}_{j}}={y}_{{t}_{j}}-{\hat{y}}_{{t}_{j}}$, and ${\nu }_{{t}_{j}}\,={\sigma }^{2}(1-{\phi }^{2({t}_{j}-{t}_{j-1})})+{\delta }_{{t}_{j}}^{2}$ for j = 2, ..., n. In particular, ϕ describes the autocorrelation function of order 1 for a given light curve. We computed the maximum likelihood estimate of the parameter ϕ directly from each light curve and used it as a feature for our classifier, denoted IAR_phi.
  • 4.  
    Mexican hat power spectrum (MHPS): Arévalo et al. (2012) proposed a method to compute low-resolution power spectra from data with gaps, where the light curves are convolved with a Mexican hat filter: $F(x)\,\propto \left[1-\tfrac{{x}^{2}}{{\sigma }^{2}}\right]{e}^{-{x}^{2}/2{\sigma }^{2}}$. Gaps, or generally uneven sampling, are corrected for by convolving a unit-valued mask with the same sampling as the light curve and dividing the convolved light curve by it. This method can be used to isolate structures with a characteristic timescale ($t\sim \sigma /\sqrt{2{\pi }^{2}}$) in a given light curve, in order to estimate the light curve variance associated with that timescale. We compute the light curve variance at two different timescales of 10 and 100 days. The variance associated with the 10 days timescale ("high" frequency) is denoted MHPS_high, while the variance associated with the 100 days timescale ("low" frequency) is denoted MHPS_low. We also compute the ratio between the low and high frequency variances for a given band, denoted as MHPS_ratio. The logarithm of MHPS_ratio is therefore an estimate of the power law slope of the power spectrum of the light curve.
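To make the first two features concrete, the following NumPy-only sketch computes PPE and a power-rate feature from a precomputed periodogram. The function names and the nearest-grid-point lookup are our own simplifications, not the ALeRCE implementation.

```python
import numpy as np

def periodogram_pseudo_entropy(power, n_peaks=100):
    """Entropy-based confidence of the best period (PPE).

    Returns ~0 for a flat periodogram (no clear period stands out)
    and ~1 when a single peak dominates.
    """
    peaks = np.sort(power)[-n_peaks:]   # the 100 largest periodogram values
    q = peaks / peaks.sum()             # normalize: q_i = p_i / Z
    q = q[q > 0]                        # 0 * log(0) -> 0 by convention
    return 1.0 + np.sum(q * np.log(q)) / np.log(n_peaks)

def power_rate(power, periods, best_period, n):
    """Power_rate_n = Power(P) / Power(n * P), evaluated at the grid
    point closest to n * P (interpolation-free simplification)."""
    p_best = power[np.argmin(np.abs(periods - best_period))]
    p_mult = power[np.argmin(np.abs(periods - n * best_period))]
    return p_best / p_mult
```

A flat periodogram yields PPE ≈ 0, while a periodogram with a single nonzero peak yields PPE = 1, matching the limits described above.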
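The IAR likelihood above can also be sketched directly in NumPy. This is a minimal illustration, assuming σ² is estimated from the sample variance of the light curve and ϕ is maximized over a simple grid; it is not the authors' implementation.

```python
import numpy as np

def iar_neg_log_likelihood(phi, t, y, delta):
    """Gaussian negative log-likelihood of the IAR model (cf. Equation (3)).

    t: observation times; y: mean-subtracted magnitudes;
    delta: measurement-error standard deviations.
    """
    sigma2 = np.var(y)
    e, nu = y[0], sigma2 + delta[0] ** 2          # initial values
    nll = 0.5 * (np.log(2 * np.pi * nu) + e ** 2 / nu)
    for j in range(1, len(t)):
        a = phi ** (t[j] - t[j - 1])              # decay across the irregular gap
        e = y[j] - a * y[j - 1]                   # one-step prediction error
        nu = sigma2 * (1.0 - a ** 2) + delta[j] ** 2
        nll += 0.5 * (np.log(2 * np.pi * nu) + e ** 2 / nu)
    return nll

def iar_phi(t, y, delta, grid=None):
    """IAR_phi: maximum-likelihood estimate of phi over a (0, 1) grid."""
    if grid is None:
        grid = np.linspace(1e-4, 0.9999, 200)
    nlls = [iar_neg_log_likelihood(p, t, y, delta) for p in grid]
    return grid[int(np.argmin(nlls))]
```

For a strongly autocorrelated light curve, the recovered IAR_phi is close to the true ϕ, while white noise yields values near zero.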

3.2. Non-detection Features

For each detection, the ZTF alert stream includes 5σ magnitude limits (diffmaglim), computed from the g and r difference images of the same area of the sky obtained in the previous 30 days, in which the target associated with the alert was not detected (non-detections). These non-detections are very informative: they can tell us, for instance, whether a transient has been detected previously, whether a non-variable source has begun to exhibit variable behavior, or which range of observed magnitudes we should expect when there are no significant differences between the science and template images and no alert is generated. The light curve classifier uses nine different features defined using all the non-detections associated with a given source, computed for both the g and r bands, yielding a total of 18 non-detection features. Note that all of these features are new and have not been used before for classification. Table 3 lists the non-detection features used by the light curve classifier. As before, non-detection features ending with "_1" are computed for the g band, and features ending with "_2" are computed for the r band.
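As an illustration, several of the non-detection features in Table 3 can be sketched as follows. The interface here is hypothetical (the actual ALeRCE pipeline operates on the alert database), and only a subset of the nine features is shown.

```python
import numpy as np

def non_detection_features(mjd_nondet, diffmaglim, first_det_mjd,
                           first_det_mag, peak_mag):
    """Sketch of a few per-band non-detection features (cf. Table 3).

    mjd_nondet, diffmaglim : non-detection epochs and 5-sigma limits, band x
    first_det_mjd          : time of the first detection in any band
    first_det_mag, peak_mag: first and brightest detected magnitudes, band x
    """
    before = mjd_nondet < first_det_mjd
    after = ~before
    feats = {
        "n_non_det_before_fid": int(before.sum()),
        "n_non_det_after_fid": int(after.sum()),
    }
    if before.any():
        lims = diffmaglim[before][np.argsort(mjd_nondet[before])]
        feats["last_diffmaglim_before_fid"] = float(lims[-1])
        feats["max_diffmaglim_before_fid"] = float(lims.max())
        feats["median_diffmaglim_before_fid"] = float(np.median(lims))
        # last limit before first detection minus first detected magnitude
        feats["dmag_first_det_fid"] = float(lims[-1] - first_det_mag)
        # median limit before first detection minus peak magnitude
        feats["dmag_non_det_fid"] = float(np.median(lims) - peak_mag)
    if after.any():
        feats["median_diffmaglim_after_fid"] = float(np.median(diffmaglim[after]))
        feats["max_diffmaglim_after_fid"] = float(diffmaglim[after].max())
    return feats
```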

Table 3. List of Non-detection Features Used by the Light Curve Classifier

Feature  Description
dmag_first_det_fid  Difference between the last non-detection diffmaglim in band "x" before the first detection in any band and the first detected magnitude in band "x"
dmag_non_det_fid  Difference between the median non-detection diffmaglim in band "x" before the first detection in any band and the minimum detected magnitude (peak) in band "x"
last_diffmaglim_before_fid Last non-detection diffmaglim in the "x" band before the first detection in any band
max_diffmaglim_before_fid Maximum non-detection diffmaglim in the "x" band before the first detection in any band
max_diffmaglim_after_fid Maximum non-detection diffmaglim in the "x" band after the first detection in any band
median_diffmaglim_before_fid Median non-detection diffmaglim in the "x" band before the first detection in any band
median_diffmaglim_after_fid Median non-detection diffmaglim in the "x" band after the first detection in any band
n_non_det_before_fid Number of non-detections in the "x" band before the first detection in any band
n_non_det_after_fid Number of non-detections in the "x" band after the first detection in any band

Note. "x" stands for either the g or r band.


4. Classification Algorithms

The labeled set used in this work is highly imbalanced (see Table 1). For instance, QSOs represent 75.4% of the stochastic sources, while CV/Novae represent just 2.5%. To deal with this issue, we looked for ML algorithms in the literature that are designed to mitigate the imbalance problem. In particular, we worked with the imbalanced-learn Python package (Lemaître et al. 2017), which includes implementations of several resampling algorithms commonly used to handle data sets with strong between-class imbalance. The algorithms available in this package are fully compatible with scikit-learn methods.

In the following sections we describe the Random Forest (RF) algorithm used by the light curve classifier, as well as other tested ML algorithms. To train each classifier we randomly split the labeled set into a training set (80%) and testing set (20%) in a stratified fashion, preserving the percentage of samples for each class.
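With scikit-learn, such a stratified 80/20 split is a one-liner via `train_test_split`; the sketch below uses toy data, not the actual labeled set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labeled set: 4 classes with very different frequencies.
X = np.random.default_rng(0).normal(size=(1000, 5))
y = np.repeat(np.arange(4), [700, 200, 80, 20])

# 80/20 split; stratify=y preserves the per-class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```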

4.1. Balanced Random Forest

A decision tree (Rokach & Maimon 2008) is a predictive algorithm that uses a tree structure to perform successive partitions of the data according to a certain criterion (e.g., a cut-off value in one of the descriptors or features), producing possible decision paths and a final outcome for each path (the leaves of the tree). Decision trees are commonly used for classification, where each final leaf is associated with a given class. RFs (Breiman 2001) build multiple decision trees, where each tree is trained using a random sub-sample of elements from a given training set, selected with replacement (a bootstrap sample of the training set), and a random selection of features. The final classification is obtained by averaging the classifications provided by the individual trees. This average score can be interpreted as the probability (PRF) that the input element belongs to a given class. One of the main advantages of RF is that it naturally provides a ranking of the features used for classification, by counting the number of times each feature is selected to split the data.

Chen et al. (2004) proposed a modified RF that can deal with imbalanced data. In their model, each individual tree is trained using a sub-sample of the training set defined by generating a bootstrap sample from the minority class and then randomly selecting the same number of cases, with replacement, from the majority classes. Imbalanced-learn implements the balanced RF classifier proposed by Chen et al. (2004). For the ALeRCE light curve classifier we use their BalancedRandomForestClassifier method, selecting the hyper-parameters (number of trees, maximum number of features per tree, and maximum depth of each tree) with the k-fold cross-validation procedure available in scikit-learn, with k = 5 folds and the "macro-recall" as the target score (see its definition in Section 5.1).

4.1.1. The Two-level Classifier Approach

Considering the hierarchical structure of the taxonomy (see Section 2.2), we decided to construct a balanced RF (BRF) classifier with a two-level scheme. The first level consists of a single classifier that separates the sources into three broad classes. The second level consists of three distinct classifiers, which further resolve each class of the first level into subclasses. We then use the probabilities obtained for each independent classifier to obtain the final classification.

In more detail, the first level (top level hereafter) consists of a single classifier which classifies every source as periodic, stochastic, or transient. The second level (bottom level hereafter) consists of three distinct classifiers: transient, stochastic, and periodic. The classes considered by each of these three classifiers are the ones shown in Table 1 and Figure 2. Each classifier in the bottom level is trained using a training subset containing only those classes included in the corresponding top-level class (for instance, the transient classifier only includes sources classified as SN Ia, SN Ibc, SN II, and SLSN). It is important to note that these four classifiers are independent and process the same input feature set described in Section 3. The final classification is constructed by multiplying the probabilities obtained for each class of the top level [Ptop(transient), Ptop(stochastic), and Ptop(periodic)] by the individual probabilities obtained by the corresponding classifier in the bottom level. Namely, the probabilities of the transient classifier (PT ) are multiplied by Ptop(transient), the probabilities of the stochastic classifier (PS ) are multiplied by Ptop(stochastic), and the probabilities of the periodic classifier (PP ) are multiplied by Ptop(periodic). We denote the product of these probabilities as P. For instance, the probability of a given source being an RRL corresponds to the product of its probability of being periodic (according to the top level) and its probability of being an RRL (according to the periodic classifier):

$P(\mathrm{RRL})={P}_{\mathrm{top}}(\mathrm{periodic})\times {P}_{P}(\mathrm{RRL}),$ (4)

while the probability of being a Blazar is computed as:

$P(\mathrm{Blazar})={P}_{\mathrm{top}}(\mathrm{stochastic})\times {P}_{S}(\mathrm{Blazar}).$ (5)

Following this, the sum of the probabilities of the 15 classes for a given source adds up to one. Finally, the class of a given object is determined by selecting the class with the maximum P. Hereafter, we refer to the results presented for the bottom level of the classifier as the final predictions.
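The probability combination described above reduces to a simple product over the taxonomy; here is a self-contained sketch (class names and numbers are purely illustrative):

```python
def combine_levels(p_top, p_transient, p_stochastic, p_periodic):
    """Combine the two levels: each bottom-level probability is multiplied
    by the matching top-level probability (cf. Equations (4)-(5)).

    p_top: dict with keys 'transient', 'stochastic', 'periodic';
    the other args: dicts of subclass probabilities, each summing to 1.
    """
    final = {}
    for branch, probs in (("transient", p_transient),
                          ("stochastic", p_stochastic),
                          ("periodic", p_periodic)):
        for cls, prob in probs.items():
            final[cls] = p_top[branch] * prob
    return final

p = combine_levels(
    {"transient": 0.1, "stochastic": 0.3, "periodic": 0.6},
    {"SN Ia": 0.7, "SN II": 0.3},
    {"QSO": 0.5, "Blazar": 0.5},
    {"RRL": 0.9, "E": 0.1})
best = max(p, key=p.get)   # final class = argmax of the combined P
```

Because each bottom-level distribution sums to one, the combined probabilities over all 15 classes also sum to one.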

The best cross-validation performance was obtained with the following hyper-parameter settings: 500 trees in each classifier, unrestricted maximum depth (nodes are expanded until all leaves are pure), and a maximum number of features equal to the square root of the total number of features, except for the stochastic classifier, where we used 20% of the features. In Section 5 we present the results obtained when applying the BRF classifier to the ZTF data.

4.2. Additional ML Algorithms Tested

In addition to RF, we also tested two other supervised classification algorithms: gradient boosting and multilayer perceptron. These tests were done as a complementary analysis with the purpose of guiding future efforts in improving the light curve classifier.

Neither of these methods has a Python implementation specifically designed to handle imbalanced data sets; however, imbalanced-learn can be used to generate balanced training sets. We present the results obtained using both classifiers in Section 5.

4.2.1. Gradient Boosting

Gradient boosting (GBoost; Friedman 2001) is an ML algorithm that uses an ensemble of weak prediction models (e.g., decision trees) to produce a more robust classification. The method implements a boosting strategy (based on gradient descent) that trains a sequence of weak models, each compensating for the weaknesses of its predecessors. eXtreme Gradient Boosting (XGBoost; Chen & Guestrin 2016) is a package available in several programming languages (including Python) that implements GBoost algorithms for classification and regression in an efficient and scalable way, and it has become one of the most widely used packages for regression and classification in recent years.

For the case of GBoost we followed the same two-level strategy described in Section 4.1.1. However, since the current version of the XGBoost multi-class classifier was not designed to deal with highly imbalanced data sets (e.g., Wang et al. 2019), we trained XGBoost with a balanced training set, constructed using the RandomUnderSampler and RandomOverSampler methods available in imbalanced-learn. For the top level, periodic, and stochastic classifiers, we constructed a balanced training set by generating 10 random samples with the RandomUnderSampler method, resampling all classes, and concatenating them, in order to obtain a training set with more than 10,000 objects in total for each classifier. For the transient classifier we used the RandomOverSampler method, resampling all classes, to generate one random sample with ∼600 objects. Each classifier uses the default hyper-parameters of the XGBoost Python package, except for the number of boosting rounds, set to 500, and the objective function, set to multi-class classification with softmax predictions. As in the case of the BRF model, the class of a given object is determined by selecting the class with the maximum probability (obtained by multiplying the probabilities of the top and bottom levels).

4.2.2. Multilayer Perceptron

Artificial neural networks (ANNs) are mathematical models inspired by the human brain. ANNs are composed of elemental computational units called neurons (Haykin 1994). ANNs can be used to perform complex tasks such as classification or regression. A multilayer perceptron (MLP) corresponds to an ANN whose neurons are ordered by layers, where all neurons belonging to a given layer receive the same input vector and each unit processes this vector independently according to its own parameters. The outputs of all neurons in a layer are grouped and form the input vector for the next layer. For the case of classification, when using the softmax activation function, the final layer provides the probabilities that a given element belongs to a given class. One way of obtaining the final class is to assign the label with the maximum probability in the output layer.

For this model, we also followed the two-level strategy described in Section 4.1.1. We tested different MLP architectures, changing the number of layers and the number of neurons per layer. We used the Keras API provided by the Python version of TensorFlow 2.0 (Abadi et al. 2016). We split the original training set defined above into a new training set (80%) and a validation set (20%). In order to deal with the high imbalance of the training set we used the balanced mini-batches generator for Keras provided by the imbalanced-learn Python package. The best performance (considering the categorical cross-entropy loss and accuracy curves for the training and validation sets) was obtained using MLPs with two hidden layers with 256 and 128 neurons for all the classifiers (top level, transient, stochastic, and periodic). Regularization via the dropout method (Srivastava et al. 2014) is used to prevent overfitting. The dropout fraction is set at 0.50.
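The idea behind the balanced mini-batch generator can be illustrated with a NumPy-only sketch that draws (with replacement) the same number of indices from every class per batch. This is our simplified stand-in, not the imbalanced-learn implementation.

```python
import numpy as np

def balanced_batches(y, batch_size, rng):
    """Yield index batches with equal counts per class, sampling with
    replacement from each class's indices."""
    classes = np.unique(y)
    per_class = batch_size // len(classes)
    while True:
        idx = np.concatenate([
            rng.choice(np.flatnonzero(y == c), size=per_class, replace=True)
            for c in classes])
        rng.shuffle(idx)        # avoid class-ordered batches
        yield idx

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], [900, 80, 20])     # highly imbalanced toy labels
batch = next(balanced_batches(y, 30, rng))  # one balanced mini-batch
```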

5. Results

5.1. Results for the BRF Classifier

In order to test the performance of our BRF classifier, we generated 20 different training and testing sets using the ShuffleSplit iterator provided by scikit-learn, which uses random permutations to split the data, each time using 80% of the labeled set as the training set and 20% as the testing set, and preserving the percentage of samples for each class in the original labeled set. We then trained 20 different BRF models, one per training set, and tested their performance using the corresponding testing sets. We emphasize that the testing sets are never used in training their respective models.
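The split described (random permutations that preserve per-class fractions) matches scikit-learn's `StratifiedShuffleSplit`; a toy sketch:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy imbalanced labels standing in for the labeled set.
y = np.repeat([0, 1, 2], [700, 250, 50])
X = np.arange(len(y)).reshape(-1, 1)

# 20 independent stratified 80/20 splits.
sss = StratifiedShuffleSplit(n_splits=20, test_size=0.2, random_state=0)
splits = list(sss.split(X, y))
```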

Table 4 lists three different scores: precision, recall, and F1-score. For an individual class these scores are defined as:

${\mathrm{Precision}}_{i}=\tfrac{{\mathrm{TP}}_{i}}{{\mathrm{TP}}_{i}+{\mathrm{FP}}_{i}},$ (6)

${\mathrm{Recall}}_{i}=\tfrac{{\mathrm{TP}}_{i}}{{\mathrm{TP}}_{i}+{\mathrm{FN}}_{i}},$ (7)

${{\rm{F}}1}_{i}=2\times \tfrac{{\mathrm{Precision}}_{i}\times {\mathrm{Recall}}_{i}}{{\mathrm{Precision}}_{i}+{\mathrm{Recall}}_{i}},$ (8)

where ncl is the total number of classes, TPi is the number of true positives, FPi is the number of false positives, and FNi is the number of false negatives, for a given class i. Despite the high imbalance present in the labeled set, all classes are equally important, and thus we compute macro-averaged scores:

$\mathrm{Precision}=\tfrac{1}{{n}_{\mathrm{cl}}}{\sum }_{i=1}^{{n}_{\mathrm{cl}}}{\mathrm{Precision}}_{i},$ (9)

$\mathrm{Recall}=\tfrac{1}{{n}_{\mathrm{cl}}}{\sum }_{i=1}^{{n}_{\mathrm{cl}}}{\mathrm{Recall}}_{i},$ (10)

${\rm{F}}1=\tfrac{1}{{n}_{\mathrm{cl}}}{\sum }_{i=1}^{{n}_{\mathrm{cl}}}{{\rm{F}}1}_{i}.$ (11)
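These scores can be computed directly from the true/false positive counts and checked against scikit-learn's macro-averaged implementations (toy predictions for a 3-class problem):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 2, 0])

prec, rec = [], []
for c in np.unique(y_true):
    tp = np.sum((y_pred == c) & (y_true == c))   # true positives, class c
    fp = np.sum((y_pred == c) & (y_true != c))   # false positives
    fn = np.sum((y_pred != c) & (y_true == c))   # false negatives
    prec.append(tp / (tp + fp))
    rec.append(tp / (tp + fn))

macro_p = np.mean(prec)                          # macro precision
macro_r = np.mean(rec)                           # macro recall
macro_f1 = np.mean([2 * p * r / (p + r) for p, r in zip(prec, rec)])
```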

Table 4. Macro-averaged Scores Obtained for the BRF Classifier in the Testing Set

Classifier  Precision  Recall  F1-score
BRF—top  0.96 ± 0.01  0.99 ± 0.01  0.97 ± 0.01
BRF—bottom  0.57 ± 0.01  0.76 ± 0.02  0.59 ± 0.01

Note. The reported values correspond to the mean and standard deviation obtained from the 20 trained models.


For the particular case of the BRF classifier, Table 4 reports the mean and the standard deviation of the macro-averaged scores obtained by the 20 models when applying them to their respective testing sets.

In addition, Figures 6 and 7 show the confusion matrices obtained for the top and bottom levels, respectively. To generate these confusion matrices we used the results obtained when applying each of the 20 BRF models to their corresponding testing sets, providing for each level the median and 5th and 95th percentiles of the confusion matrices obtained by each of the 20 models.
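The aggregation used for these confusion matrices can be sketched as follows (toy labels stand in for the real predictions of the 20 BRF models):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
mats = []
for _ in range(20):                      # one matrix per trained model
    y_true = rng.integers(0, 3, size=300)
    noise = rng.integers(0, 3, size=300)
    y_pred = np.where(rng.random(300) < 0.8, y_true, noise)  # toy predictions
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2]).astype(float)
    mats.append(100.0 * cm / cm.sum(axis=1, keepdims=True))  # row percentages
mats = np.stack(mats)

median = np.percentile(mats, 50, axis=0)     # reported central value
lo, hi = np.percentile(mats, [5, 95], axis=0)  # reported percentile range
```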


Figure 6. Confusion matrix for the top level BRF classifier. The confusion matrix was obtained by generating 20 different training and testing sets, and by training 20 independent models using each training set separately. After the training, each model is applied to their respective testing set. We provide the median and 5th and 95th percentiles of the confusion matrices obtained for the 20 testing sets. To normalize the confusion matrix results as percentages, we divide each row by the total number of objects per class with known labels. We round these percentages to integer values. This level shows a high degree of accuracy with a low percentage of misclassifications.


The confusion matrix of the top level shows that the classifier can recover more than 97% of the true labels, and that the contamination between classes is below 3%. The scores obtained reflect the good performance of the top level classifier.

For the case of the bottom level we obtained an F1-score of 0.59, implying significant confusion between classes. From Figure 7 we can see that the fraction of true positives in the confusion matrix of the bottom level takes values between 50% and 100%, with a mean, median, and standard deviation of 78%, 76%, and 14%, respectively. In addition, the figure shows that confusion arises most often among classes with similar characteristics, such as among the SN classes (particularly SN Ia versus SN Ibc and SN II versus SLSN); among the blazar, AGN, and QSO classes; and among various periodic classes. The highest standard deviation of the predictions is observed for SLSN, a result of the low number of SLSN in the labeled set.


Figure 7. As in Figure 6, but for the bottom level of the BRF classifier. We provide the median and 5th and 95th percentiles of the confusion matrices obtained for the 20 testing sets. The black squares highlight the three classes of the top level (from top to bottom, transient, stochastic, and periodic, respectively). This matrix is quite diagonal, but shows more misclassification among related subtypes compared to the matrix obtained for the top level.


To complement our analysis, in Appendix B we provide the results obtained by a one-level multi-class RF model. From its results we can conclude that the light curve classifier improves considerably when a two-level strategy is followed.

5.1.1. Comparison with the GBoost and MLP Classifiers

In this work we tested two other classifiers: GBoost and MLP. For these models we present the results obtained by using 80% of the labeled set as a training set and the remaining 20% as a testing set, preserving the percentage of samples for each class in the original labeled set.

For the case of the GBoost classifier, the macro-averaged precision, recall, and F1-score of the top level have a value of 0.99. For the bottom level, the macro-averaged precision, recall, and F1-score are, respectively, 0.72, 0.72, and 0.71. On the other hand, for the MLP classifier the macro-averaged precision, recall, and F1-score of the top level are 0.94, 0.99, and 0.96, respectively. For the bottom level the macro-averaged precision, recall, and F1-score are, respectively, 0.54, 0.69, and 0.58. The confusion matrices obtained for the bottom level of the GBoost and MLP classifiers in the testing set are presented on the left and right sides of Figure 8, respectively.


Figure 8. Confusion matrices of the bottom level obtained when applying the GBoost (left) and the MLP (right) classifiers to the testing set. The black squares highlight the three hierarchical classes (from top to bottom, transient, stochastic, and periodic, respectively). To normalize the confusion matrix results as percentages, we divide each row by the total number of objects per class with known labels. We round these percentages to integer values. These matrices present a high percentage of misclassifications compared to the bottom level of the BRF model (Figure 7).


The precision and F1-score obtained by GBoost are in general better than those obtained by the BRF classifier, with the exception of the recall score of the bottom level. However, the fraction of true positives in the confusion matrix of the bottom level ranges between 5% and 100%, with a mean, median, and standard deviation of 72%, 83%, and 29%, respectively, which explains the lower recall obtained by GBoost compared to the BRF classifier. In addition, the classes with the largest fraction of true positives in the confusion matrix of GBoost correspond to the most populated classes in the labeled set, such as QSOs, which represent 75.4% of the stochastic sources; SN Ia, which represent 74.0% of the transients; or LPV, E, and RRL, which represent 16.2%, 43.5%, and 37.3% of the periodic sources, respectively. This is not observed in the results obtained by BRF, where there is no evidence of a correlation between how well represented a given class is and its fraction of true positives in the confusion matrix shown in Figure 7.

The results obtained using GBoost are promising: we obtained good scores even though the current implementations of GBoost available in the literature were not designed to deal with high imbalance in multi-class classification. Therefore, further effort should be made to incorporate GBoost in future versions of the light curve classifier. In particular, new implementations of GBoost for multi-class classification that follow approaches similar to those proposed by Chen et al. (2004) or Wang et al. (2019) should be tested, as should combinations of GBoost with data augmentation techniques (i.e., generating synthetic light curves of less populated classes using physical and/or statistical models).

For the case of the MLP classifier the scores obtained are in general lower compared to BRF and GBoost. Its confusion matrix for the bottom level (see Figure 8) presents the same issues already discussed for the case of GBoost. The fraction of true positives in the confusion matrix ranges between 11% and 98%, with a mean, median, and standard deviation of 69%, 74%, and 22%, respectively. Therefore, we conclude that more work should be done in order to obtain better results with MLP.

From these tests we can conclude that the BRF is the model that currently achieves results that are less biased toward the most populated classes in the labeled set, i.e., it is able to predict all subclasses, and the fraction of true positives does not correlate with how representative a given class in the labeled set is, compared to GBoost and MLP. Thus, we decided to use BRF as the final model for the first version of the ALeRCE light curve classifier. Future work will further exploit the potential of the GBoost and MLP classifiers. For the rest of this paper, presented results correspond to the BRF classifier.

5.1.2. Results for the BRF Classifier Excluding AllWISE Data

We tested a version of the BRF classifier that excludes the features computed using AllWISE data. The macro-averaged precision, recall, and F1-score of the top level are 0.93, 0.97, and 0.95, respectively. For the bottom level the macro-averaged precision, recall, and F1-score are, respectively, 0.53, 0.72, and 0.55. These scores are slightly lower than those obtained with the original version of the BRF classifier. As can be seen in the confusion matrix shown in Figure 9, the stochastic classes are the most affected by the lack of AllWISE data, particularly YSOs and Blazars, whose fractions of true positives decreased by 10% and 16%, respectively. This happens because of the similarities between the light curves of these and other classes, which are not easily separated using variability features alone. However, the results obtained by this version of the classifier are still good enough to be used when AllWISE data are not available, as occurs, for instance, for faint objects.


Figure 9. As in Figure 7, but for a model that excludes AllWISE data. We drop the errors, which are comparable to those listed in Figure 7, for simplicity. The results obtained for this model are reasonable, although the classification of some stochastic classes is affected by the lack of AllWISE data.


5.2. Performance of the BRF Classifier as a Function of Magnitude and Number of Detections

As can be seen in Figure 5, for some classes, such as YSOs, Cepheids, and LPVs, a non-negligible fraction of sources in the labeled set has photometry available in only one band. It is therefore important to know how well the classifier behaves when only a single band is available for a given source.

To evaluate this, we created 20 new testing sets, defined by considering only those sources with ≥6 detections in both the g and r bands, from the 20 testing sets previously generated with the ShuffleSplit iterator (see Section 5.1). We then classified each new testing set with its respective model, considering: (a) the features available for the g and r bands, (b) the features available only for the g band (i.e., we hide the features measured using the r band), and (c) the features available only for the r band (i.e., we hide the features measured using the g band). Figures 10 and 11 illustrate the results of this analysis. Figure 10 reports the recall values as a function of the average magnitude per class (i.e., the recall values are computed by comparing the true and predicted labels for the objects within each magnitude bin), with the SN subtypes grouped into the single class SN. Figure 11 reports the recall values as a function of the number of detections (i.e., the recall values are computed for each bin of the ZTF number of detections). From both figures we can infer that in general the best results are obtained when photometry from both g and r is available, with the exceptions of the QSO, CEP, and periodic-other classes.


Figure 10. Recall for each stochastic and periodic subclass, as well as all transients (grouped as SN), as a function of the average magnitude. The x-axis ranges from 13 to 21 mag; this range includes ∼90% of the sources. Black triangles show the recall curves obtained when both g and r photometry is available (considering the average magnitude in the g band), green circles when only the g band is available, and red squares when only the r band is available. The shaded regions were obtained by generating 20 different training and testing sets, and training 20 independent models using each of these sets. We report the median and 5th and 95th percentile values obtained from the 20 models. There is a wide variety of behaviors (see discussion in the text).


Figure 11. As in Figure 10, but plotting the recall as a function of the number of detections in the light curve (on a logarithmic scale). The x-axis ranges from 6 to 150 detections; this range includes ∼90% of the sources. Again, there is a wide variety of behaviors (see discussion in the text).


From Figure 10 we can also conclude that the reliability of the classification versus the average magnitude is different for each class. These distributions in general follow the magnitude distribution in the labeled set of each class considered in this model (with the exception of RRL). For instance, for the labeled set, the CEP class corresponds to one of the brightest classes, having in general r < 16, while the LPV class covers a broader range of magnitudes. On the other hand, from Figure 11, we can infer that in general the classification improves when more detections are available in both bands, with the exceptions of the QSO and periodic-other classes.

The results obtained for periodic-other are not surprising, since this class includes all the periodic classes not considered explicitly in the classifier (several different types of pulsating stars as well as rotational variables). The results observed for the CEP class are probably due to the large fraction of Cepheids in the labeled set with photometry only in the g band, which is produced by the saturation limit of the ZTF survey (12.5–13.2 mag) and the fact that Cepheids tend to be very bright, particularly in the r band; thus, for bright Cepheids, the r-band light curve is not available. The results obtained for AGNs and QSOs are likely related to incorrect labels, which we discuss further in Appendix C.1.

Figures 10 and 11 also show peculiar recall values for some classes when only one band is available. For SNe, there is a decrease in the recall curve in the g band, presumably because the g-band light curves of SNe generally decay faster than the r-band light curves, producing shorter (and thus fewer detected) g-band light curves. This trend can be seen in the SN shown in Figure 3, as well as more generally in the light curve statistics for SNe: the average number of detections in SNe light curves is 12 and 16 in the g and r bands, respectively, and the total time length of SNe light curves is 53 and 64 days in the g and r bands, respectively. The low recall obtained for bright RRL when only the g band is available may be produced by differences in the variability features measured for different RRL subtypes; this issue is further discussed in Appendix C.2. The zero recall obtained for the CV/nova class when only the g band is available is produced by the similarities in the AllWISE+ZTF colors of CV/novae and some periodic classes, as discussed in Appendix C.3. Despite this, most of the sources in the labeled set have photometry in both g and r bands (see Figure 5), and thus the low performance obtained for some classes when only one band is available does not strongly affect the presented results.

5.3. The Deployed BRF Classifier

In order to use the BRF classifier to classify the ZTF alert stream we need to train a single BRF model, which we call "the deployed BRF classifier." This "deployed" model is a fully independent classifier (i.e., it is distinct from the 20 previously trained models) and uses the same 152 features presented in Section 3. As in the previous sections, we trained the deployed BRF model using 80% of the labeled set as the training set and the remaining 20% as the testing set. The macro-averaged precision, recall, and F1-score obtained for the top level are 0.96, 0.98, and 0.97, respectively, while for the bottom level they are 0.58, 0.79, and 0.61. The reason for the large differences in the macro-averaged metrics between the top and bottom levels can be understood from Figure 12, which shows the confusion matrix of the bottom level of the deployed BRF model. This confusion matrix is in agreement with the results presented in Figure 7, and shows that some classes, such as the SN Ibc, CV/nova, YSO, CEP, and periodic-other classes, require further work in order to improve the results obtained by the classifier; this may involve data augmentation procedures and better period estimates, as discussed in the following sections.


Figure 12. As in Figure 8, but for the bottom level of the deployed BRF classifier.


5.3.1. Feature Ranking of the Deployed BRF Classifier

In Table 5 we list the top 30 features in the feature ranking for each classifier within the two-level BRF scheme (top level, transient, stochastic, and periodic). The ranking is computed from how well each feature separates the subclasses within a given classifier, with more informative features ranked higher (for more details, see Hastie et al. 2009). The table shows that, for all the classifiers, a considerable fraction of the top 30 features are colors computed using the AllWISE and ZTF photometry, new detection features (i.e., features not included in the FATS package), and non-detection features. Moreover, the ranking of features changes for each classifier.

Table 5. Feature Ranking (Top 30) for Each Layer of the Deployed BRF Classifier

Top level | Transient | Stochastic | Periodic
Feature  Rank | Feature  Rank | Feature  Rank | Feature  Rank
W1-W2  0.094 | SPM_t0_1  0.033 | W1-W2  0.109 | Multiband_period  0.089
sgscore1  0.053 | SPM_gamma_2  0.029 | sgscore1  0.058 | g-W2  0.062
positive_fraction_2  0.050 | SPM_tau_rise_2  0.028 | r-W2  0.049 | r-W2  0.034
positive_fraction_1  0.048 | SPM_tau_rise_1  0.025 | (g-r)_mean_corr  0.048 | (g-r)_max_corr  0.030
SPM_tau_rise_1  0.035 | (g-r)_max  0.023 | g-W2  0.046 | g-W3  0.028
LinearTrend_2  0.032 | SPM_t0_2  0.022 | gal_b  0.045 | (g-r)_mean  0.027
SPM_chi_1  0.031 | LinearTrend_2  0.019 | g-W3  0.037 | (g-r)_max  0.025
g-W2  0.031 | AndersonDarling_2  0.018 | (g-r)_max_corr  0.035 | GP_DRW_tau_1  0.023
g-W3  0.031 | SPM_gamma_1  0.017 | ExcessVar_2  0.033 | (g-r)_mean_corr  0.022
n_non_det_after_fid_2  0.026 | SPM_tau_fall_2  0.017 | Meanvariance_2  0.026 | IAR_phi_1  0.022
W2-W3  0.025 | dmag_first_det_fid_1  0.015 | (g-r)_mean  0.025 | Amplitude_1  0.017
SPM_gamma_1  0.024 | MHPS_low_1  0.013 | delta_mag_fid_2  0.024 | ExcessVar_1  0.017
SPM_tau_rise_2  0.023 | LinearTrend_1  0.013 | r-W3  0.023 | delta_mag_fid_1  0.016
SPM_A_2  0.023 | (g-r)_mean  0.012 | W2-W3  0.022 | Meanvariance_1  0.016
SPM_chi_2  0.020 | MHPS_ratio_1  0.011 | Amplitude_2  0.022 | r-W3  0.016
SPM_A_1  0.019 | SPM_tau_fall_1  0.011 | Std_2  0.015 | Std_1  0.015
r-W2  0.018 | SPM_beta_2  0.011 | (g-r)_max  0.015 | GP_DRW_sigma_1  0.015
ExcessVar_1  0.017 | MHPS_ratio_2  0.010 | SPM_A_2  0.014 | GP_DRW_tau_2  0.012
ExcessVar_2  0.016 | Skew_2  0.009 | PercentAmplitude_2  0.013 | PercentAmplitude_1  0.012
r-W3  0.014 | sgscore1  0.009 | SPM_A_1  0.012 | W1-W2  0.009
Rcs_2  0.013 | SPM_beta_1  0.009 | ExcessVar_1  0.011 | W2-W3  0.009
GP_DRW_sigma_2  0.013 | Power_rate_2  0.009 | IAR_phi_1  0.010 | SF_ML_amplitude_1  0.009
SPM_tau_fall_1  0.013 | IAR_phi_2  0.009 | GP_DRW_sigma_2  0.009 | SPM_A_1  0.009
Meanvariance_2  0.012 | dmag_first_det_fid_2  0.009 | delta_mag_fid_1  0.009 | Gskew_1  0.009
LinearTrend_1  0.012 | IAR_phi_1  0.009 | Pvar_2  0.007 | Q31_1  0.009
SPM_gamma_2  0.010 | last_diffmaglim_before_fid_1  0.009 | IAR_phi_2  0.007 | Autocor_length_1  0.009
Pvar_2  0.010 | PPE  0.008 | GP_DRW_tau_2  0.007 | IAR_phi_2  0.008
MHPS_low_2  0.009 | Harmonics_mag_6_1  0.008 | GP_DRW_tau_1  0.007 | SF_ML_gamma_1  0.008
SF_ML_amplitude_2  0.009 | MHPS_low_2  0.008 | MHPS_high_1  0.007 | Amplitude_2  0.007
(g-r)_max_corr  0.009 | Gskew_2  0.008 | MHPS_low_1  0.006 | delta_mag_fid_2  0.007

Note. Features marked with † correspond to non-detection features, features marked with ⋆ correspond to features computed using AllWISE data, and features marked with ‡ correspond to metadata features. Subscripts "_1" and "_2" refer to the g and r bands, respectively.

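Rankings like those in Table 5 are the impurity-based importances exposed by standard random forest implementations. A minimal sketch with scikit-learn on synthetic data follows; the two feature names borrowed from Table 5 are purely illustrative here, not recomputed ALeRCE values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 500
# Two informative toy features and two noise features; the names only
# mimic (do not reproduce) ALeRCE features.
names = ["W1-W2", "Multiband_period", "noise_a", "noise_b"]
y = rng.integers(0, 2, size=n)
X = np.column_stack([
    y + rng.normal(scale=0.3, size=n),   # strongly informative
    y + rng.normal(scale=0.6, size=n),   # informative
    rng.normal(size=n),                  # noise
    rng.normal(size=n),                  # noise
])

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Mean-decrease-in-impurity ranking, analogous to the ranks in Table 5.
ranking = sorted(zip(names, clf.feature_importances_),
                 key=lambda t: t[1], reverse=True)
for name, imp in ranking:
    print(f"{name:18s} {imp:.3f}")
```

The importances sum to 1 across all features, so the "Rank" values in Table 5 can be read as fractional contributions to the total impurity decrease.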

The top-level classifier is dominated by several types of features: ZTF and AllWISE colors; morphological properties of the images (sgscore1); variability features related to the amplitude of the variability at short and long timescales (MHPS_low, GP_DRW_sigma, Meanvariance, ExcessVar, and SPM_A); variability features that detect a smooth decrease or increase of the luminosity (LinearTrend, SPM_tau_rise, SPM_tau_fall); features related to the quality of the supernova parametric model fit (SPM_chi); and features related to transient appearance or disappearance (positive_fraction and n_non_det_after_fid).

The transient classifier, on the other hand, is dominated by the SPM features (e.g., SPM_gamma, SPM_beta, SPM_t0, SPM_tau_rise, and SPM_tau_fall). Other relevant features are the optical colors at the peak and the mean of the light curve, measured from the difference-image light curves; features that detect a smooth increase or decrease of the observed flux (LinearTrend); features that measure the level of correlation in the light curve (IAR_phi); features related to the amplitude of the variability (MHPS_low); and features related to the appearance of a transient source (dmag_first_det_fid, last_diffmaglim_before_fid_1). Note that SN rise-related features, such as SPM_t0 and SPM_tau_rise, are among the most relevant features for the classification of transients and are crucial for the early classification of SNe. Also note that SPM_t0 is not the explosion time, but a characteristic time by which the SN has risen significantly.

For the stochastic classifier, 12 of the top 30 features are related to color, morphology, and distance to the Galactic plane; the rest correspond to features related to the amplitude of the variability observed at different timescales (e.g., ExcessVar, SPM_A, Meanvariance, GP_DRW_sigma, and Amplitude) and features related to the timescale of the variability (IAR_phi, GP_DRW_tau).

Finally, the periodic classifier is clearly dominated by the Multiband_period feature, followed by different colors, features related to the amplitude of the variability (e.g., delta_mag_fid, Amplitude, ExcessVar, Meanvariance, and GP_DRW_sigma), and features related to the timescale of the variations (e.g., GP_DRW_tau and IAR_phi).

5.3.2. Results for the Unlabeled ZTF Set

We now turn to the results obtained when applying the deployed BRF classifier to all the sources in ZTF with ≥6 detections in g or ≥6 detections in r. Considering the data obtained by ZTF until 2020 June 9, there are 868,371 sources that satisfy this condition, hereafter the "unlabeled ZTF set." We define the class of a given object in the unlabeled ZTF set by selecting the class with the maximum probability returned by the deployed BRF classifier. However, users of this classifier can apply their own probability cuts to select samples for their science. The features, classifications, and probabilities obtained for the unlabeled ZTF set with data until 2020 June 9 can be downloaded from Zenodo (doi:10.5281/zenodo.4279623).
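A user-defined probability cut on top of the argmax assignment can be applied with a few lines of pandas. The column names in this sketch are illustrative and do not necessarily match the schema of the Zenodo release.

```python
import pandas as pd

# Toy classification table; column names are illustrative only.
df = pd.DataFrame({
    "oid":   ["ZTF18aaa", "ZTF18aab", "ZTF19aac", "ZTF19aad"],
    "class": ["QSO", "QSO", "RRL", "SNIa"],
    "prob":  [0.92, 0.41, 0.88, 0.35],
})

# Default assignment keeps the argmax class regardless of confidence.
# A user wanting a purer QSO sample can instead require a minimum
# probability for the assigned class:
pure_qso = df[(df["class"] == "QSO") & (df["prob"] >= 0.6)]
print(pure_qso["oid"].tolist())  # → ['ZTF18aaa']
```

Raising the threshold trades completeness for purity, which is the intended use of the released probabilities.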

It is important to note that the classifications obtained by the light curve classifier are updated every day, as new alerts are received. Whenever a new alert is received for a given source, the ALeRCE pipeline recomputes its variability features and provides an updated classification. These updated classifications can be found at the ALeRCE Explorer website using the "light curve classifier" tab and specifying the desired class. Considering the results shown in Figure 11, we expect that the quality of the classification for a given source will improve as more detections are added to the light curve. In addition, with new alerts, more objects will satisfy the condition of having ≥6 detections in g or ≥6 detections in r, and thus the size of the unlabeled ZTF set increases every day. Moreover, with new detections the size of the labeled set will increase, allowing the training of new BRF models. Updates regarding any modifications to the light curve classifier (e.g., labeled set and models) will be published on the ALeRCE Science website.

Figure 13 shows the number of candidates per class obtained for the 868,371 sources with enough alerts until 2020 June 9. Compared to Figure 4, there is no correlation between the number of sources per class in the unlabeled set and in the labeled set. The periodic-other, E, and LPV classes have the highest numbers of candidates, while the SN classes have the lowest. The distribution of candidates per class is consistent with the astrophysical number densities (i.e., we are likely not misclassifying large numbers of sources). For instance, blazars are relativistically beamed (and thus detectable to larger distances), but only over very small viewing angles, and hence are expected to be less common than QSOs and AGNs. In the case of SNe, without factoring in the amount of time a particular SN is above the magnitude limits of the search (the "control time"), we find SNe II/SNe Ia and SNe Ibc/SNe Ia ratios of 0.21 and 0.41, respectively. Computing these ratios using the numbers reported from ASAS-SN discoveries in Holoien et al. (2019) yields 0.36 and 0.09, respectively. The significant difference between the SNe Ibc/SNe Ia ratios implies that we are strongly overestimating the number of SN Ibc; given the similarities between SN Ibc and SN Ia light curves, we are likely classifying a non-negligible fraction of SN Ia as SN Ibc. This highlights the importance of including distance estimates to improve the classification of transients; we are currently working to include distance-based features in future versions of the classifier.


Figure 13. Number of candidates per class for all the sources in the unlabeled ZTF set. It can be seen that the number of sources per class for the unlabeled ZTF set does not correlate with the number of sources per class for the labeled set (see Figure 4).


To investigate the quality of the predictions, we plot the probability distributions of the top and bottom levels on the left and right sides of Figure 14, respectively. The red lines denote the median probability for each class and the green lines denote the 5th and 95th percentiles. It is clear from the figure that the probability distributions for the top level are higher than those for the bottom level. For the top level, the classes with the lowest probabilities are CV/nova and YSO. For the bottom level, the classes with the highest probabilities are LPV, QSO, and AGN, while the lowest probabilities are obtained for the different classes of SNe, YSO, CV/nova, and some periodic variables.
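The median and percentile lines of Figure 14 are simple summary statistics per class; a minimal sketch with pandas follows. The beta distributions below are synthetic stand-ins, not the actual ALeRCE probability outputs.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Toy bottom-level probabilities for three illustrative classes.
df = pd.DataFrame({
    "class": np.repeat(["LPV", "QSO", "SNIa"], 300),
    "prob": np.concatenate([
        rng.beta(8, 2, 300),   # LPV: narrow, high-probability distribution
        rng.beta(6, 3, 300),   # QSO: intermediate
        rng.beta(3, 5, 300),   # SNIa: broad, low-probability distribution
    ]),
})

# Median (red lines in Figure 14) and 5th/95th percentiles (green lines).
summary = df.groupby("class")["prob"].quantile([0.05, 0.5, 0.95]).unstack()
print(summary)
```

Classes whose 5th percentile sits near the median indicate narrow, confident distributions; a long tail toward low values flags subclasses poorly characterized by the top-ranked features.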


Figure 14. Left: normalized probability distributions of the top level of the deployed BRF classifier, split by subclass, for candidates from the unlabeled ZTF set. The reported values correspond to the probabilities obtained for each class of the top level, as indicated below the class name. Right: normalized probability distributions of the bottom level of the deployed BRF classifier, split by subclass, for candidates from the unlabeled ZTF set. The red lines show the median probability for each class. The green lines show the 5th and 95th percentiles of the probabilities. Some subclasses show broad distributions to low values, implying that they are not so well represented or characterized by the highest ranked features within the hierarchical class.


The low probabilities obtained for some classes are related to the confusion between classes observed in Figure 7. For instance, Figure 7 shows high confusion among the SN classes; accordingly, for sources classified as SN II, the median SN Ia, SN Ibc, SN II, and SLSN probabilities are 0.16, 0.19, 0.28, and 0.19, respectively. The high confusion among the SN classes may be due to the low number of sources in the labeled set, but also to the intrinsic similarities among these classes. For example, the physical mechanism responsible for the main peak of the light curves of SNe Ia and SNe Ib/c is the same: the diffusion of energy deposited by radioactive 56Ni (Arnett 2008). Indeed, Villar et al. (2019) report that 15% of their Type Ibc SNe are classified as Ia. This might be improved by performing data augmentation using, for example, Gaussian process modeling (e.g., Boone 2019). On the other hand, the low probabilities observed for CV/novae and YSOs can be produced by the similarities between their colors and those of some periodic sources, and by the fact that some CV/novae and YSOs vary very rapidly compared to the ZTF cadence, producing light curves with low autocorrelation and thus low values of the IAR_phi parameter, as is normally observed for periodic sources (excluding LPVs). These similarities can be seen in Figure 15, where we show the distributions of IAR_phi_1 and g-W3 for YSOs and CV/novae (grouped), the rest of the stochastic classes, and periodic sources from the labeled set.


Figure 15. Normalized IAR ϕ distribution (top) in the g band (IAR_phi_1; y-axis truncated at 0.4), and normalized g-W3 distribution (bottom), for YSOs and CV/novae (blue), stochastic sources (red; excluding CV/novae and YSOs), and periodic sources (yellow). It can be seen that there is an overlap in the IAR_phi_1 and g-W3 distributions of CV/novae, YSOs, and periodic sources.


Figure 16 shows the normalized r-band magnitude distributions of the different classes, considering sources present in the labeled set and candidates from the unlabeled ZTF set. In general, the magnitude distributions of the candidates are similar to or fainter than those of the sources in the labeled set. For instance, the SNe classes have candidates that are ∼0.5 mag fainter than the labeled set, and the YSO, CV/nova, and E classes have candidates that are ∼1 mag fainter. These results show that the classifier is able to detect both faint and bright candidates regardless of the luminosity biases present in the labeled set, which can be dominated by the brightest tail of the true magnitude distribution of each class. This will be particularly relevant for surveys like LSST: since the available training set will in general be ∼2–3 mag brighter than the limiting magnitudes of the single images, we expect that currently existing bright samples will allow us to detect fainter candidates.


Figure 16. Normalized magnitude distributions in the r band for sources in the labeled set (LS; red histograms) and candidates from the unlabeled ZTF set (cand.; blue histograms).


A simple way to test the performance of the BRF classifier is to verify whether the results obtained when the model is applied to the unlabeled ZTF set agree with what is astrophysically expected from previous works. For instance, younger Galactic targets like YSOs, classical Cepheids, and LPVs should reside near the Galactic Plane (e.g., Catelan & Smith 2015; Mowlavi et al. 2018), while extragalactic sources like AGNs, QSOs, blazars, and SNe should have roughly isotropic distributions, perhaps with fewer sources near the Galactic Plane due to attenuation/reddening from gas and dust (e.g., Calzetti et al. 2000; Padovani et al. 2017). The left panel of Figure 17 shows the sky distribution, in Galactic coordinates, of the LPV, CEP, and YSO candidates. Most of them are located in the Galactic Plane, and the sources located outside the plane have low BRF probabilities, consistent with the results of previous works (e.g., Mowlavi et al. 2018; Rimoldini et al. 2019). The right panel of Figure 17 shows the Galactic latitude versus the g-r color, obtained using the mean magnitude of the light curves in each band, for extragalactic candidates (QSO, AGN, blazar, SN Ia, SN Ibc, SN II, and SLSN). The fraction of extragalactic candidates observed around the Galactic Plane is low, and most of the candidates located in the plane have low probabilities. Moreover, the g-r colors of the extragalactic candidates are consistent with what is expected for these classes, with clear evidence of reddening for the candidates located around the Galactic Plane. The sky distribution, in Galactic coordinates, of the extragalactic candidates can be found in Appendix D.


Figure 17. Left: Galactic latitude (gal_b, in degrees) vs. Galactic longitude (gal_l, in degrees) for candidates expected to be observed mostly around the Galactic Plane (LPV, CEP, and YSO classes). Right: Galactic latitude (gal_b, in degrees) vs. (g-r)_max for extragalactic candidates (QSO, AGN, blazar, SN Ia, SN Ibc, SN II, and SLSN classes). The bottom-level probabilities computed by the deployed BRF classifier are color-coded according to the color bar on the right. The red contours show the density of points in each plot. It can be seen that most of the extragalactic candidates are located outside the Galactic Plane, while most of the YSO, LPV, and CEP candidates are located in the Galactic Plane.


6. Conclusions

6.1. Summary

In this paper we presented the first version of the ALeRCE light curve classifier. This classifier uses a total of 152 features, including variability features computed from ZTF light curves with ≥6 epochs in g or r bands, and colors computed using ZTF and AllWISE photometry (see Section 3), to classify each source into 15 subclasses, including periodic, transient, and stochastic variable sources (see Section 2.2). The light curve classifier uses a balanced RF classifier (see Section 4.1), constructed with a two-level scheme. The first level (top level) classifies each source as periodic, stochastic, or transient. The second level (bottom level) consists of three classifiers that further resolve each hierarchical class into subclasses.

We trained and tested the BRF classifier using a labeled set obtained by cross-matching the ZTF database with different catalogs of transients, stochastic, and periodic sources (see Section 2.2.1). For the top level we obtained macro-averaged precision, recall, and F1-score values of 0.96, 0.99, and 0.97, respectively, while for the bottom level we obtained macro-averaged precision, recall, and F1-score values of 0.57, 0.76, and 0.59, respectively.

We used the BRF classifier to classify 868,371 sources from ZTF (unlabeled ZTF set), obtaining results that are in agreement with what we expect astrophysically. For instance, most of the high probability extragalactic candidates are located outside the Galactic Plane and most of the high probability YSO, LPV, and CEP candidates are located in the Galactic Plane.

The condition of ≥6 detections in g or r normally corresponds to a timespan of 3 to 30 days since the first detection. Whenever a new detection is received for an object, the ALeRCE pipeline processes it and provides an updated classification in ∼1 s. The light curve classifier provides updated classifications for objects with new ZTF alerts every day. These updated classifications can be found on the ALeRCE Explorer website by selecting the "light curve classifier" option and specifying the desired class. Catalogs containing the labeled set, and the features and RF probabilities obtained by the top and bottom levels for the unlabeled ZTF set (up to 2020 June 9), can be downloaded from Zenodo (doi:10.5281/zenodo.4279623). In addition, more examples and instructions on how to use the ALeRCE database and classifications can be found on the ALeRCE Science website and in Förster et al. (2020), and a detailed description of the ALeRCE database can be found in the database schema. Finally, the code used to train the deployed BRF classifier can be found in the "light_curve_classifier_SanchezSaez_2020" GitHub repository, and the implementation of this classifier in the ALeRCE pipeline can be found in the "lc_classifier" GitHub repository, version 1.0.1 of which is archived in Zenodo (Sánchez-Sáez et al. 2020).

6.2. Final Remarks and Perspectives

One of the main challenges in the development of the ALeRCE light curve classifier was the high imbalance of the labeled set. For instance, transient sources represent 1.4% of the labeled set, while periodic sources represent 70.5%. Each hierarchical class also suffers from high imbalance among its subclasses; for example, within the transient class, SN Ia comprise 74.0% of the sample, while SLSN correspond to only 1.4%. We addressed this problem by using the balanced RF implementation of the imbalanced-learn Python package, which follows the procedure proposed by Chen et al. (2004): each tree is trained on a balanced sub-sample obtained by downsampling the majority classes. We also tested two other algorithms, GBoost and MLP, but concluded that more work is needed to obtain better results from them.
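The per-tree downsampling idea of Chen et al. (2004) can be sketched directly: for each tree, draw a bootstrap sample of the minority class and downsample every other class to the same size. This is a minimal illustration of the sampling step only, not the imbalanced-learn implementation; the class proportions below are synthetic.

```python
import numpy as np

def balanced_bootstrap(y, rng):
    """Indices for one balanced per-tree sample: bootstrap each class
    down to the size of the smallest class, in the spirit of
    Chen et al. (2004)."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    idx = []
    for c in classes:
        members = np.flatnonzero(y == c)
        # sample with replacement, n_min draws per class
        idx.append(rng.choice(members, size=n_min, replace=True))
    return np.concatenate(idx)

rng = np.random.default_rng(3)
# Imbalance loosely resembling the labeled set:
# few transients, many periodic sources.
y = np.array(["transient"] * 20 + ["stochastic"] * 400 + ["periodic"] * 1000)
idx = balanced_bootstrap(y, rng)
print(np.unique(y[idx], return_counts=True))
```

Each tree in the forest would call this with a fresh random state, so the ensemble still sees most of the majority-class sources across trees while each individual tree trains on a balanced sample.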

Another challenge was finding features that usefully separate the different classes. Previous works have normally used features similar to those available in the FATS Python package (e.g., Kim et al. 2014; Martínez-Palomera et al. 2018); however, our first tests demonstrated that these features were not informative enough to separate the 15 classes considered by the light curve classifier, in particular the stochastic and transient classes. Thus, novel features were designed and implemented for this work, such as the IAR_phi parameter, the MHPS features, and the non-detection features. In addition, during the development of the light curve classifier we realized that some stochastic classes are hard to separate using variability features alone. In particular, the separation of YSOs from the other stochastic classes improved significantly once we included the AllWISE colors in the feature set.

Furthermore, the computation of reliable periods was quite challenging, particularly considering that the ALeRCE pipeline requires fast computation of features. Huijse et al. (2018) demonstrated that very good results can be achieved with quadratic mutual information (QMI) estimators; however, these techniques are computationally expensive [O(n²)]. We solved this issue by using the MHAOV periodogram, which provides less reliable periods but is much faster to compute [O(n)]. Periods become increasingly unreliable as the number of data points decreases, but the classification of periodic variables can still be accurate, as other features can compensate for the decreasing quality of the periodogram (e.g., features related to the amplitude and the timescale of the variability). ALeRCE is currently working to implement period estimation methods that are both accurate and fast. Computing the periodogram is expensive for sources with a large number of detections; we are therefore exploring so-called "online" periodograms, which are updated as new samples arrive at a fraction of the computational cost of recomputing the period from scratch each time (Zorich et al. 2020), as well as other techniques that might work better with eclipsing binary light curves (e.g., Kovács et al. 2002; Mighell & Plavchan 2013).
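Periodogram-based period recovery on an irregularly sampled light curve can be illustrated as follows. The MHAOV periodogram used by ALeRCE is available in the P4J package; since SciPy only ships the classical Lomb-Scargle periodogram, the sketch below uses it as a stand-in, with a synthetic light curve whose period and amplitude are arbitrary.

```python
import numpy as np
from scipy.signal import lombscargle

rng = np.random.default_rng(4)
true_period = 0.55  # days; hypothetical RRL-like period

# Irregular sampling, mimicking an alert-stream light curve.
t = np.sort(rng.uniform(0.0, 120.0, 80))
mag = (15.0 + 0.3 * np.sin(2 * np.pi * t / true_period)
       + rng.normal(scale=0.02, size=t.size))

# Lomb-Scargle stands in here for the MHAOV periodogram of the paper.
# scipy's lombscargle expects angular frequencies, hence the 2*pi factor.
freqs = np.linspace(0.1, 5.0, 20000)               # cycles / day
power = lombscargle(t, mag - mag.mean(), 2 * np.pi * freqs)
best_period = 1.0 / freqs[np.argmax(power)]
print(f"recovered period: {best_period:.3f} d")
```

The frequency grid resolution controls both the runtime and the precision of the recovered period, which is the speed/accuracy trade-off discussed above.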

Moreover, the classification of the different SN classes was particularly challenging: first, the number of SNe in the labeled set was very small compared to other classes, and second, the light curves of the SN classes can be similar to each other, which makes their separation difficult, as discussed in Section 5.3.2. We mitigated this issue by using the BalancedRandomForestClassifier method from imbalanced-learn and by including the SPM features, whose definition is a modification of the work of Villar et al. (2019). In the future we plan to test other techniques to improve the separation of the SN classes. Previous works have performed Gaussian process regression to model SN light curves and thereby generate new light curves with different cadences (e.g., Boone 2019). Better results could also be obtained by using information about the SN host galaxy (e.g., Foley & Mandel 2013; Baldeschi et al. 2020).

In the future we also plan to perform data augmentation to improve the classification of variable objects. For variable stars, light curves can be modeled with Gaussian processes or with a combination of harmonics, and basic transformations can then be applied to these models to obtain light curves with different periods and amplitudes (e.g., Elorrieta et al. 2016; Castro et al. 2018; Martínez-Palomera et al. 2018; Aguirre et al. 2019; Hosenie et al. 2020). To the best of our knowledge, no previous attempts to perform data augmentation have been made for AGNs, QSOs, and blazars. A promising option is to use synthetic light curve generators that consider the physical processes behind the variability (e.g., Sartori et al. 2019).

Most of the features used by this classifier can be implemented and used to classify light curves from other data sets. In particular, for LSST, the non-detection features can be adapted to work with the forced photometry that will be provided with each alert (DIAForcedSources in the Data Products Definition Document; Jurić et al. 2019). LSST will also benefit from multiband ugrizy light curves. As we demonstrated in this work, light curve classification generally improves when both ZTF g and r data are available. The same should hold for LSST, and we should likely even be able to further resolve some of the subclasses presented in this work. For instance, using the z and y light curves, we should be able to separate local type 1 and type 2 active galactic nuclei, since for low-redshift sources we can detect variability from the dusty torus at these wavelengths (see Sánchez et al. 2017 and references therein), or identify high-redshift QSOs, whose emission is expected to be absorbed in the bluer bands. We encourage researchers interested in classifying stochastic and transient sources in particular to use the novel (or modified) features presented in this work, such as the IAR_phi parameter, the MHPS features, the SPM features, and the non-detection features.

It is worth noting that the ALeRCE light curve classifier is constantly being improved, and this work describes its Version 1.0. Future versions of this classifier may include new classes of variable and transient objects, as well as subclasses of sources already present in the taxonomy (e.g., RRL types ab and c; classical and type II Cepheids; contact, detached, and semi-detached eclipsing binaries; among others). We are also working to find new features and techniques that can improve the performance of the classifier. Future work will report any changes to the classifier model, such as the inclusion of data augmentation or the use of other classification strategies (e.g., semi-supervised training). We recommend that users of this classifier check the ALeRCE Science website for updates related to the different classifiers and the data processing. We are also exploring classification algorithms that are not based on manually designed features, but on features implicitly extracted via deep learning (e.g., Naul et al. 2018; Muthukrishna et al. 2019; Becker et al. 2020). However, up to this point we have found that the feature-based approach produces better results when applied to real data. Most likely, a combination of simulated and real data will be required to train reliable deep learning classification models in the future, as found by Carrasco-Davis et al. (2019).

The authors acknowledge support from ANID Millennium Science Initiative ICN12_009; ANID grants PIA AFB-170001 (F.F., I.R., C.V., E.C., D.R., L.S.G., C.S.C., D.R.U.), PIA ACT172033 (P.A.), ANID Quimal190012 (F.F.), Basal-CATA AFB-170002 (P.S.S., F.E.B., M.C., D.D.), FONDECYT Regular No. 1200710 (F.F.), 1190818 (F.E.B.), 1200495 (F.E.B.), 1171678 (P.E.), 1171273 (M.C.), and 1170305 (P.H.); FONDECYT Initiation No. 11200590 (F.E.) and No. 11191130 (G.C.V.); and FONDECYT Postdoctorado No. 3200250 (P.S.S.) and 3200222 (D.D.); and the Max-Planck Society, Germany, through a Partner Group grant (P.A.), Vicerectoría de Investigación de Pontificia Universidad Católica de Chile (ECI).

Powered@NLHPC: This research was partially supported by the supercomputing infrastructure of the NLHPC (ECM-02). This work has been possible thanks to the use of AWS-U.Chile-NLHPC credits.

Based on observations obtained with the Samuel Oschin 48 inch Telescope at the Palomar Observatory as part of the Zwicky Transient Facility project. ZTF is supported by the National Science Foundation under grant No. AST-1440341 and a collaboration including Caltech, IPAC, the Weizmann Institute for Science, the Oskar Klein Center at Stockholm University, the University of Maryland, the University of Washington, Deutsches Elektronen-Synchrotron and Humboldt University, Los Alamos National Laboratories, the TANGO Consortium of Taiwan, the University of Wisconsin at Milwaukee, and Lawrence Berkeley National Laboratories. Operations are conducted by COO, IPAC, and UW. AllWISE makes use of data from WISE, which is a joint project of the University of California, Los Angeles, and the Jet Propulsion Laboratory/California Institute of Technology, and NEOWISE, which is a project of the Jet Propulsion Laboratory/California Institute of Technology. WISE and NEOWISE are funded by the National Aeronautics and Space Administration.

Software: FATS (Nun et al. 2015), IAR 30 (Eyheramendy et al. 2018), Imbalanced-learn (Lemaître et al. 2017), Keras (Chollet et al. 2015), P4J (Huijse et al. 2018), Scikit-learn (Pedregosa et al. 2012), Tensorflow (Abadi et al. 2016), XGBoost (Chen & Guestrin 2016). For graphical representations and data manipulation: Astropy (Astropy Collaboration et al. 2013, 2018), Jupyter (Kluyver et al. 2016), Matplotlib (Hunter 2007), Numpy & Scipy (van der Walt et al. 2011), Pandas (McKinney et al. 2010), Python (Van Rossum & Drake 1995), and Seaborn (Waskom et al. 2017).

Appendix A: Further Description of Some Variability Features

In this section we provide additional description of some of the features listed in Table 2 (those marked with **). These features correspond to new variants of features included in the FATS package and other works:

  • 1.  
Damped random walk (DRW) parameters: a DRW model is defined by a stochastic differential equation that includes a damping term that pushes the signal back to its mean: ${dX}(t)=-\tfrac{1}{{\tau }_{\mathrm{DRW}}}X(t){dt}+{\sigma }_{\mathrm{DRW}}\sqrt{{dt}}\,\epsilon (t)+b\,{dt}$, with τDRW, σDRW, t > 0. Here τDRW is the characteristic time for the time series to become roughly uncorrelated, σDRW is the amplitude of the variability at short timescales (t ≪ τDRW), and ε(t) is a white noise process with zero mean and unit variance. DRW modeling is typically used to describe light curves of active galactic nuclei (Kelly et al. 2009). In this case we obtained the σDRW and τDRW parameters using Gaussian process regression with an Ornstein–Uhlenbeck kernel, as in Graham et al. (2017). GP_DRW_sigma denotes σDRW, while GP_DRW_tau denotes τDRW.
  • 2.  
Excess variance (σrms): a measure of the intrinsic variability amplitude in a given band (see Sánchez et al. 2017, and references therein). ${\sigma }_{\mathrm{rms}}^{2}=({\sigma }_{\mathrm{LC}}^{2}-{\overline{\sigma }}_{m}^{2})/{\overline{m}}^{2}$, where σLC is the standard deviation of the light curve, ${\overline{\sigma }}_{m}$ is the average photometric error, and $\overline{m}$ is the average magnitude. We denote ${\sigma }_{\mathrm{rms}}^{2}$ as ExcessVar.
  • 3.  
    Multiband period: the period is estimated using the Multi-Harmonic Analysis of Variance (MHAOV) periodogram (Schwarzenberg-Czerny 1996), which is based on fitting periodic orthogonal polynomials to the data. A single period estimate per light curve is computed by fitting both bands using the MHAOV multiband extension proposed by Mondrik et al. (2015). We denote this period as Multiband_period. For sources with detections only in g or only in r, the Multiband_period reports the single-band period. To avoid overfitting the data when few samples are available, we set the number of harmonics to one. This might not capture the best period for non-sinusoidal light curves, e.g., detached and semi-detached eclipsing binaries, returning a harmonic instead. We found that having a harmonic of the true period is in general sufficient to classify non-sinusoidal light curves correctly given that other features such as the power rate are included. We choose MHAOV for this analysis as it provides a good trade-off between performance and computational complexity. This method is now implemented and available in the P4J package (Huijse et al. 2018).
  • 4.  
    Harmonics parameters: harmonic series (Stellingwerf & Donohoe 1986) are commonly used to model and classify periodic light curves (Debosscher et al. 2007; Sarro et al. 2009; Richards et al. 2011; Elorrieta et al. 2016). In this work we fit a harmonic series up to the seventh harmonic, according to the expression
    $y({t}_{j})=\displaystyle \sum _{k=1}^{7}\left[{A}_{k}\sin \left(\frac{2\pi k\,{t}_{j}}{P}\right)+{B}_{k}\cos \left(\frac{2\pi k\,{t}_{j}}{P}\right)\right]+C,$ (A1)
    where tj corresponds to the observational time of the jth detection, P is the best candidate period computed from the multiband periodogram as above, and y(tj ) is the magnitude estimated by the harmonic model. Even though we use the Multiband_period, the harmonic model is computed using the detections in each band independently. Ak , Bk , and C for k = 1, ..., 7 are obtained by minimizing the weighted mean square error between the observed magnitudes and the model.

Note that the model is linear with respect to its parameters, so the latter can be computed using weighted linear regression. The inverse of the square of each observational error is used as a weight, which minimizes contributions from noisier observations. The cost function is given by

$\mathrm{cost}=\frac{1}{J}\displaystyle \sum _{j=1}^{J}{\left[\frac{\mathrm{mag}({t}_{j})-y({t}_{j})}{\sigma ({t}_{j})}\right]}^{2},$ (A2)

where J is the number of observations, mag(tj ) is the observed magnitude at time tj , and σ(tj ) is the observational error at time tj . The solution to the weighted least squares optimization problem is found using the Moore–Penrose pseudoinverse. This solution has the additional property of having the minimum Euclidean norm when the problem is underdetermined (Ben-Israel & Greville 2003), which in this case corresponds to having fewer than 15 observations (the model has 15 free parameters: A1, ..., A7, B1, ..., B7, and C).

Once the parameters are learned, Equation (A1) can be rewritten as

$y({t}_{j})=\displaystyle \sum _{k=1}^{7}{M}_{k}\sin \left(\frac{2\pi k\,{t}_{j}}{P}+{\phi }_{k}\right)+C,$ (A3)

with ${M}_{k}=\sqrt{{A}_{k}^{2}+{B}_{k}^{2}}$ and ${\phi }_{k}=\arctan ({B}_{k}/{A}_{k})$. In this way, the harmonics are now described by the amplitude and phase of each component. The model is shifted in time in order to have zero phase in the first harmonic, which is done following the expression $\phi {{\prime} }_{k}={\phi }_{k}-k{\phi }_{1}$, replacing ϕk by $\phi {{\prime} }_{k}$ in Equation (A3).

Finally, the parameters Mk for k = 1, ..., 7, $\phi {{\prime} }_{k}$ for k = 2, ..., 7, and the mean square error are used as features, which are denoted Harmonics_mag_1, ..., Harmonics_mag_7, Harmonics_phase_2, ..., Harmonics_phase_7, and Harmonics_mse, respectively.
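The fitting procedure above can be sketched as a weighted linear regression solved with the Moore–Penrose pseudoinverse; whether the reported mean square error is weighted by the photometric errors is an assumption of this sketch.

```python
import numpy as np

def fit_harmonics(t, mag, err, period, n_harm=7):
    """Weighted linear fit of the harmonic series (Eq. A1), converted to
    amplitude/phase form (Eq. A3) with phases referenced to the first
    harmonic."""
    omega = 2.0 * np.pi / period
    k = np.arange(1, n_harm + 1)
    # Design matrix columns: sin(k w t), cos(k w t), constant term
    X = np.hstack([np.sin(np.outer(t, k) * omega),
                   np.cos(np.outer(t, k) * omega),
                   np.ones((len(t), 1))])
    w = 1.0 / err  # rows weighted by 1/sigma, i.e., weights 1/sigma^2 in the cost
    coef = np.linalg.pinv(X * w[:, None]) @ (mag * w)  # Moore-Penrose solution
    A, B, C = coef[:n_harm], coef[n_harm:2 * n_harm], coef[-1]
    M = np.hypot(A, B)                 # Harmonics_mag_k
    phi = np.arctan2(B, A)
    phi_shift = phi - k * phi[0]       # zero phase in the first harmonic
    mse = np.mean(((mag - X @ coef) / err) ** 2)  # weighted MSE (assumption)
    return M, phi_shift, C, mse
```

Because the model is linear in its parameters, no iterative optimization is needed and the fit is exact for data generated by the model.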

  • 1.  
    Pvar: probability that the source is intrinsically variable in a given band (see Paolillo et al. 2004 and references therein). It considers the χ2 of the light curve with respect to its mean and calculates the probability Pvar = P(χ2) that a χ2 lower than or equal to the observed value could occur by chance for an intrinsically non-variable source, assuming that for each light curve its χ2 follows a probability distribution described by an incomplete gamma function Γ(ν/2, χ2/2), where ν corresponds to the number of degrees of freedom. This probability is denoted Pvar.
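A minimal sketch of this feature; using the error-weighted mean as the reference level is an assumption of this example. Note that scipy.stats.chi2.cdf evaluates exactly the regularized incomplete gamma function Γ(ν/2, χ²/2) described above.

```python
import numpy as np
from scipy import stats

def p_var(mag, err):
    """Probability that the source is intrinsically variable, from the
    chi^2 of the light curve about its (weighted) mean magnitude."""
    w = 1.0 / err**2
    mean = np.sum(w * mag) / np.sum(w)          # weighted mean magnitude
    chi2 = np.sum((mag - mean)**2 / err**2)
    nu = len(mag) - 1                            # degrees of freedom
    return stats.chi2.cdf(chi2, nu)              # P(chi^2 <= observed)
```

A value close to 1 means the observed scatter is very unlikely to arise from photometric noise alone.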
  • 2.  
    Structure function (SF) parameters: the SF quantifies the amplitude of the variability as a function of the time difference between pairs of detections (τ). In this work we consider the definition provided by Caplar et al. (2017). We model the SF as a power law: $\mathrm{SF}(\tau )={A}_{\mathrm{SF}}{\left(\tfrac{\tau }{1\mathrm{yr}}\right)}^{{\gamma }_{\mathrm{SF}}}$, where γSF corresponds to the logarithmic gradient of the change in magnitude, and ASF corresponds to the amplitude of the variability at 1 yr. SF_ML_amplitude denotes ASF, while SF_ML_gamma denotes γSF.
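As an illustration only, the sketch below fits the power law by ordinary linear regression of log |Δmag| on log τ over all detection pairs, which is a simplification of the maximum-likelihood approach of Caplar et al. (2017) used for the actual features.

```python
import numpy as np

def structure_function_fit(t, mag):
    """Fit SF(tau) = A_SF * (tau / 1 yr)^gamma_SF over all detection
    pairs via linear regression in log-log space (a simplified stand-in
    for the maximum-likelihood fit of Caplar et al. 2017)."""
    i, j = np.triu_indices(len(t), k=1)
    tau = np.abs(t[j] - t[i]) / 365.25      # pair time lags, in years
    dmag = np.abs(mag[j] - mag[i])          # pair magnitude differences
    good = (tau > 0) & (dmag > 0)           # log requires strictly positive values
    gamma, log_a = np.polyfit(np.log10(tau[good]), np.log10(dmag[good]), 1)
    return 10**log_a, gamma                 # (SF_ML_amplitude, SF_ML_gamma)
```

The slope of the log-log regression gives γSF directly, and the intercept gives the amplitude at τ = 1 yr.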
  • 3.  
    Supernova parametric model (SPM): Villar et al. (2019) introduced an analytic model describing SN light curves as a six-parameter function, which they used to characterize and classify SN light curves from the Pan-STARRS1 Medium Deep Survey (Chambers et al. 2016). This model is an extension of previous empirical efforts to analytically describe supernova light curves, including the effects of different explosion times, normalization factors, initial rise timescales, rates of decline after peak, plateau lengths, and tail decay timescales. We introduce two modifications to this model. First, we reparameterize the function so that it always remains positive whenever its parameters satisfy a simple set of inequalities. After this first modification, the model is the following:
    $F(t)=\left\{\begin{array}{ll}\frac{A\left(1-\beta ^{\prime} \frac{t-{t}_{0}}{\gamma }\right)}{1+{e}^{-(t-{t}_{0})/{\tau }_{\mathrm{rise}}}}, & t\lt {t}_{1}\\ \frac{A(1-\beta ^{\prime} )\,{e}^{-(t-{t}_{1})/{\tau }_{\mathrm{fall}}}}{1+{e}^{-(t-{t}_{0})/{\tau }_{\mathrm{rise}}}}, & t\geqslant {t}_{1}\end{array}\right.$ (A4)
    where we also use γ ≡ t1 − t0 as a parameter instead of t1. This function is positive valued when A > 0, γ > 0, τrise > 0, τfall > 0, and 0 < β' < 1.

The second difference with respect to Villar et al. (2019) is the replacement of the piecewise-defined function by a smooth transition between the two components. This is done by including a sigmoid function σ(t) = 1/(1 + exp(−t)), which transitions smoothly between zero and one. Because the parameter t1 defines the break between the two pieces of the model in Equation (A4), it cannot be optimized properly using first-order (gradient-based) methods. Our proposed model, given by the following equation, allows such methods to be used effectively to learn the parameters:

$F(t)=\frac{A(1-\beta ^{\prime} )\,{e}^{-(t-{t}_{1})/{\tau }_{\mathrm{fall}}}}{1+{e}^{-(t-{t}_{0})/{\tau }_{\mathrm{rise}}}}\,\sigma (t-{t}_{1})+\frac{A\left(1-\beta ^{\prime} \frac{t-{t}_{0}}{\gamma }\right)}{1+{e}^{-(t-{t}_{0})/{\tau }_{\mathrm{rise}}}}\,\left[1-\sigma (t-{t}_{1})\right].$ (A5)

In this particular model, for all the sources we use the light curves based on the difference images (lc_diff). This avoids contamination from unrelated host-galaxy emission, which can distort the real shape of the SN light curves. We also subtract from t the MJD of the first detection observed for a given source. We computed A (SPM_A), β' (SPM_beta), t0 (SPM_t0), γ (SPM_gamma), τrise (SPM_tau_rise), and τfall (SPM_tau_fall) for each band independently. In addition, we computed the reduced χ2 of the fit for the light curve, denoted SPM_chi. The parameters are found using the curve_fit function provided by the SciPy library (Virtanen et al. 2020).
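A sketch of the smoothly blended model and its fit follows; the unit blending scale of the sigmoid and the initial-guess strategy are assumptions of this example, not necessarily the settings used in production.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x):
    """Smooth switch between zero and one."""
    return 1.0 / (1.0 + np.exp(-x))

def spm(t, A, t0, gamma, beta, t_rise, t_fall):
    """Smooth two-component SN parametric model in the spirit of Eq. (A5):
    a linearly declining segment blended into an exponential tail via a
    sigmoid centered at t1 = t0 + gamma."""
    t1 = t0 + gamma
    rise = 1.0 / (1.0 + np.exp(-(t - t0) / t_rise))
    early = A * (1.0 - beta * (t - t0) / gamma) * rise
    late = A * (1.0 - beta) * np.exp(-(t - t1) / t_fall) * rise
    s = sigmoid(t - t1)                 # smooth switch between components
    return early * (1.0 - s) + late * s

def fit_spm(t, flux, p0):
    """Least-squares fit of the SPM with SciPy's curve_fit, as in the paper."""
    popt, _ = curve_fit(spm, t, flux, p0=p0, maxfev=10000)
    return popt
```

Because every term is differentiable in t1 (through the sigmoid), gradient-based least-squares optimization converges smoothly, which is the point of the second modification.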

Appendix B: One-level Multi-class RF Model

The first model tested for the ALeRCE light curve classifier was a simple one-level RF model with 15 classes, implemented using the imbalanced-learn Python package. This model uses 500 trees, trees grown to their maximum depth, and a maximum number of features per split equal to the square root of the total number of features. Figure 18 shows the confusion matrix obtained by this model. The precision, recall, and F1-score obtained are 0.49, 0.68, and 0.50, respectively. Clearly this model has a lower performance compared to the BRF classifier.
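For reference, a model with these hyperparameters can be sketched as follows. The paper uses imbalanced-learn's BalancedRandomForestClassifier; here scikit-learn's class_weight="balanced_subsample" option is used only as an approximate stand-in for the per-tree class balancing.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# 500 trees, unrestricted depth, sqrt(n_features) candidate features per
# split, mirroring the one-level model described in the text.
clf = RandomForestClassifier(n_estimators=500,
                             max_depth=None,
                             max_features="sqrt",
                             class_weight="balanced_subsample",
                             random_state=0)
```

A call to clf.fit(X, y) with the feature matrix and class labels trains the model; clf.predict_proba returns per-class probabilities analogous to those reported by the classifier.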


Figure 18. Confusion matrix for a one-level multi-class RF model. The black squares highlight the three hierarchical classes (from top to bottom, transient, stochastic, and periodic, respectively). To normalize the confusion matrix results as percentages, we divide each row by the total number of objects per class with known labels. We round these percentages to integer values. The performance of this model is poorer compared to the BRF classifier (see Figure 7).


Appendix C: Further Analysis of Classes with Anomalous Recall Curves

C.1. The Particular Case of AGN, QSO, and Blazar

In this work we present the first attempt to separate different types of active galactic nuclei according to their variability properties. As we mentioned in Section 2.2, we separate active galactic nuclei in the following way:

  • 1.  
    AGN: type 1 Seyfert galaxies (i.e., active galactic nuclei whose emission is dominated by the host galaxy), selected from MILLIQUAS (broad type "A"), and from Oh et al. (2015).
  • 2.  
    QSO: type 1 core-dominated active galactic nuclei (i.e., active galactic nuclei whose emission is dominated by their active nuclei), selected from MILLIQUAS (broad type "Q").
  • 3.  
    Blazar: BL Lac objects and flat spectrum radio quasars (FSRQ), selected from ROMABZCAT and MILLIQUAS.

In Figure 11 we showed the recall curves for each class as a function of the number of detections; the curves for QSOs and AGNs decrease as more detections become available, particularly in the r band. These results are puzzling, considering that we would normally expect the classification of a given class to improve as more detections are included in the light curves. In order to better understand the origin of these results, we show in Figure 19 the recall curves as a function of average magnitude and number of detections for AGNs, QSOs, and blazars grouped as a single class. In this case, the recall curves lie between 0.8 and 1.0 for every bin of magnitude (especially for g > 16 or r > 16) and number of detections. From this we can infer that the light curve classifier performs very well when selecting active galactic nuclei as a single class, but some issues remain regarding the separation of AGNs, QSOs, and blazars. There are two main explanations for these results: (a) the method cannot properly separate QSOs from AGNs and blazars, or (b) there are sources in the labeled set with incorrect labels.


Figure 19. Similar to Figures 10 and 11, but treating AGNs, QSOs, and blazars as a single class. When grouped together, the recall is much better behaved than for any individual active galactic nuclei class, highlighting remaining difficulties distinguishing between these classes.


A possible way to explore how well the light curve classifier can discriminate among AGNs, QSOs, and blazars is to check whether the features available to the classifier can separate these three populations. Figure 20 shows six different features used by the classifier, (g−r)_mean_corr, Meanvariance_1, ExcessVar_1, sgscore1, Meanvariance_2, and ExcessVar_2, for QSOs, AGNs, and blazars from the labeled set (most of these features are among the top 30-ranked features shown in Table 5). From the figure we can see that these three classes have different color distributions, different morphologies, and different variability properties. AGNs and blazars tend to be redder than QSOs (see (g−r)_mean_corr), blazars and AGNs tend to have larger variability amplitudes (see Meanvariance_1 and ExcessVar_1), and AGNs tend to have more extended morphologies compared to QSOs and blazars. These are just some examples of features that can be used to separate the three classes. After a visual inspection of the feature distributions of AGNs, QSOs, and blazars, we found that more than 30 features can be used to separate them, including, for instance, PercentileAmplitude, Q31, GP_DRW_sigma, GP_DRW_tau, MHPS_low, MHPS_high, and SF_ML_amplitude. From this we can infer that the light curve classifier should be able to separate these three populations.


Figure 20. Distribution of (g−r)_mean_corr, Meanvariance_1, ExcessVar_1, sgscore1, Meanvariance_2, and ExcessVar_2, for the QSO (blue), AGN (red), and blazar (yellow) classes from the labeled set. These features (among others) can be used to separate these classes of active galactic nuclei. Subscripts "_1" and "_2" refer, respectively, to the g and r bands.


In addition, Figure 20 highlights that the g-band features seem to separate the AGN and QSO classes better than the same features in the r band. This behavior is also seen in other features related to the amplitude of the variability, such as PercentileAmplitude, GP_DRW_sigma, and Std. These differences might be produced by the combined effect of higher contamination from the host in the r band and an intrinsically lower variability amplitude in the r band, due to the well-known anti-correlation between variability amplitude and emission wavelength (see Sánchez et al. 2017 and references therein). This lets us understand the differences observed between the g- and r-band recall curves of AGNs and QSOs shown in Figure 11, which are likely driven by these differences in the feature distributions.

On the other hand, one key source of confusion between AGNs and QSOs is the definition of the division criterion in the labeled set. MILLIQUAS uses a luminosity-based division to separate AGNs and QSOs (see Section 5 of Flesch 2015), while the variability amplitude features are also affected by the ratio between core and host luminosities. Our variability-based criteria appear to separate host-dominated from core-dominated sources, in rough correspondence with low and high core luminosities, respectively. In addition, it is hard to make strict cuts to separate the AGN, QSO, and blazar classes, considering their observational properties (e.g., luminosity, redshift, orientation). In particular, the dividing line between AGNs and QSOs has never been properly defined (e.g., Netzer 2013; Padovani et al. 2017), and thus different catalogs will provide different classifications for the same object. Note that since the redshift of the sources in the unlabeled set is generally unknown, luminosity cannot be used as a feature by the classifier. Our aim is to separate host-dominated from core-dominated sources to favor the identification of different populations of active galactic nuclei by the light curve classifier. We use the AGN and QSO classifications as representative of these classes at the extremes of the luminosity distribution, but expect strong mixing of these classes close to the dividing luminosity used in the labeled set.

In order to see whether there are sources in the labeled set with incorrect QSO or AGN labels, we crossmatched our labeled set with the SIMBAD database. There are 26,168 QSOs in the labeled set (all obtained from MILLIQUAS), and 1590 of them are classified as Seyfert or AGN by SIMBAD (6%), with 1580 having a reported redshift <1. On the other hand, there are 4667 AGNs in the labeled set, and 830 (17%) of them are classified as QSO in SIMBAD. Therefore, there are 2420 sources in the labeled set with inconsistent classification in MILLIQUAS and SIMBAD. The light curve classifier classifies 920 of these sources as QSOs and 1319 as AGNs.

To better understand this discrepancy in the classification of some sources, we crossmatched our labeled set with the catalog of spectral properties of quasars from SDSS DR14 provided by Rakshit et al. (2020), as well as with the catalog of Oh et al. (2015), in order to obtain bolometric luminosities (Lbol), BH masses (BHmass), Eddington ratios (L/LEdd), and redshifts for our sample of QSOs and AGNs. We excluded blazars from this analysis since they are not properly identified and characterized in these catalogs.

We show in Figure 21 the distribution of these spectral properties for AGNs and QSOs from the labeled set and for misclassified sources (i.e., AGNs from the labeled set classified as QSO, and vice versa). It is clear from the figure that AGNs and QSOs from the labeled set have spectral property distributions that are not strictly separated. From this we can conclude that the luminosity-based division performed by MILLIQUAS is not particularly meaningful physically. One issue, for instance, is to what extent the host contributes to the total magnitude used to classify the source as AGN or QSO. In addition, Figure 21 shows that the misclassified sources have similar distributions of redshift and luminosity. Moreover, these distributions lie exactly in the range of values where the properties of AGNs and QSOs from the labeled set overlap (redshift ∼0.5 and Lbol ∼ 1045.5 erg s−1).


Figure 21. Distribution of redshift, BHmass, Lbol, and L/LEdd of AGNs (red dashed line) and QSOs (blue dashed line) from the labeled set, and misclassified AGN (green solid line) and QSO (yellow solid line) candidates. The spectral properties were obtained from catalogs that make use of SDSS spectra (Oh et al. 2015; Rakshit et al. 2020).


From these results, we can infer that the decrease in the recall of QSOs and AGNs as a function of the number of detections arises from the discrepancies in the QSO/AGN classifications obtained from different catalogs, which in turn reflect the difficulty of making strict cuts to separate AGNs from QSOs. When more epochs are available, it is easier for the light curve classifier to perform a correct variability-based classification, and therefore to identify an inconsistent classification provided by a given catalog. We propose that by using the variability properties of active galactic nuclei we can more easily separate them as core-dominated or host-dominated, or, in other words, as bright QSOs or low-luminosity AGNs.

C.2. The Particular Case of RRL

In Section 5.2 we claimed that the low recall values obtained when only the g-band photometry is used for bright RRL can be explained by the differences in the variability features of the RRL subtypes. Figure 22 shows the Multiband_period versus the Meanvariance measured in the g and r bands for bright RRLs (mean g < 16) split into their subclasses "ab" and "c," and for E, CEP, and DSCT (grouped as a single class). From the figure we can notice that ab-type RRLs tend to have larger Meanvariance in the g band, which helps to distinguish them from E/CEP/DSCT, while c-type RRLs have values of Meanvariance in both bands similar to those of E/CEP/DSCT, which makes it difficult to tell them apart. The RRL class in our labeled set is dominated by ab-type RRLs (∼80%), and thus the light curve classifier identifies those more easily. However, for sources with g ≤ 15, the fraction of ab-type RRLs decreases to 64%, which explains the low recall values obtained for this regime of brightness.


Figure 22. Logarithm of the multiband period vs. the Meanvariance_1 (g band, left) and the Meanvariance_2 (r band, right), for bright RRLs (mean g < 16) split according to their subclasses "ab" (green) and "c" (red), and for E, CEP, and DSCT (gray). We show a zoom in the area where most of the RRLs lie. The contours show the density of points of each RRL class.


In spite of all this, the fraction of RRLs with only g-band photometry is low (see Figure 5), and thus, the low performance obtained when we only consider the g-band features for bright RRLs does not substantially affect the presented results. As a comparison, when both bands are available for bright RRLs, we classify 87.5% of them as RRLs, but when we hide the r-band features, leaving only the g band features, we recover only 17.2% of the bright RRLs.

C.3. The Particular Case of CV/Nova

In Section 5.2 we show that the recall curves of the CV/nova class are close to zero when only the g band is available. We noticed that in this case most of the CV/novae are misclassified as periodic by the top level of the classifier (i.e., the class with the highest probability is periodic), and that most of them are misclassified as CEP or RRL by the bottom level. For these sources, CV/nova is the class with the second- or third-highest probability in the bottom level, and the probability of the CV/nova class returned by the stochastic classifier [PS (CV/nova)] is larger than the probability of being a CEP or RRL returned by the periodic classifier [PP (CEP) or PP (RRL)]. This confusion of the CV/nova class with the periodic classes can also be seen in Figures 7 and 12. Therefore, the problem originates in the first level of the BRF model.

We inspected the feature distributions in both bands for the CV/nova, CEP, and RRL classes, in order to understand why the results obtained when only the g band is available differ so strongly from those obtained when only the r band is available. We did not find large differences between the g and r feature distributions, except for the g−W3 and r−W3 colors, where the confusion of CV/nova with CEP and RRL is larger for g−W3, as can be seen in Figure 23. However, these differences are not large enough to fully explain the low recall curves obtained for the CV/nova class. Another possible explanation is the large fraction of Cepheids with photometry only in the g band (see Figure 5). This may be playing a role in the results: for the BRF model, a lack of r-band features combined with AllWISE+ZTF colors similar to those of the periodic classes could suggest that the source is periodic.


Figure 23. Normalized g−W3 (left) and r−W3 (right) distributions, for active galaxies (red), CEPs (blue), RRLs (cyan), and CV/novae (yellow).


Despite all this, more than 96% of the CV/novae have photometry available in both bands, and thus the low recall obtained for the g band is not a relevant issue. We plan to explore better ways to classify CVs and novae in future work. In particular, we will focus our efforts on classifying subclasses of CVs and novae.

Appendix D: Sky Distribution of the Extragalactic Candidates

Figure 24 shows the sky distribution (in Galactic coordinates) of the extragalactic candidates (QSO, AGN, blazar, SN Ia, SN Ibc, SN II, and SLSN). As expected, only a few extragalactic candidates are observed in the Galactic plane.


Figure 24. Galactic latitude (gal_b, in degrees) vs. Galactic longitude (gal_l, in degrees) for extragalactic candidates (QSO, AGN, blazar, SN Ia, SN Ibc, SN II, and SLSN classes). The contours show the density of points in the plot. The bottom-level probabilities computed by the deployed BRF classifier are color-coded according to the color bar on the right.

