On neural architectures for astronomical time-series classification

Despite the utility of neural networks (NNs) for astronomical time-series classification, the proliferation of learning architectures applied to diverse datasets has thus far hampered a direct intercomparison of different approaches. Here we perform the first comprehensive study of variants of NN-based learning and inference for astronomical time-series, aiming to provide the community with an overview of relative performance and, hopefully, a set of best-in-class choices for practical implementations. In both supervised and self-supervised contexts, we study the effects of different time-series-compatible layer choices, namely dilated temporal convolutional NNs (dTCNs), Long Short-Term Memory (LSTM) NNs, Gated Recurrent Units (GRUs) and temporal convolutional NNs (tCNNs). We also study the efficacy and performance of encoder-decoder (i.e., autoencoder) networks compared to direct classification networks, different pathways for including auxiliary (non-time-series) metadata, and different approaches for incorporating multi-passband data (i.e., multiple time-series per source). Performance---measured on a sample of 17,604 variable stars from the MACHO survey across 10 imbalanced classes---is assessed in terms of training convergence time, classification accuracy, reconstruction error, and the generated latent variables. We find that networks with recurrent NN (RNN) layers generally outperform dTCNs and, in many scenarios, yield accuracy similar to that of tCNNs. In learning time and memory requirements, convolution-based layers are more performant. We conclude by discussing the advantages and limitations of deep architectures for variable star classification, with a particular eye towards next-generation surveys such as LSST, WFIRST and ZTF2.


INTRODUCTION
Time-domain imaging surveys continue to expand access to the photometric phase space of cadence and depth/volume. Despite many upcoming projects (e.g., the Rubin Observatory Legacy Survey of Space and Time, LSST, Ivezić et al. 2019; Euclid, Laureijs et al. 2011; and the Wide-Field Infrared Survey Telescope, WFIRST, Spergel et al. 2015) being optimized for transient (supernovae, microlensing) discovery and characterization, the data from these surveys create unprecedented opportunities to broaden our understanding of stellar variability and stellar evolution, as well as to expand the use of variable stars (VSs) as probes. Detached eclipsing binaries, for example, provide direct measurements of distance (e.g., Paczyński 1997) and fundamental stellar parameters (Torres et al. 2010), while pulsating VSs such as RR Lyrae, Miras and Cepheids, owing to their accurate period-luminosity relations, are useful tools to trace galactic structures (Kraft & Schmidt 1963; Majaess et al. 2009; Skowron et al. 2019) and calibrate the cosmic distance ladder (Freedman et al. 2001; Huang et al. 2018; Riess et al. 2018, 2019), as well as standard candles to measure distances to their host galaxies (Carretta et al. 2000; Clementini et al. 2003; Alves 2004).
Stellar variability, primarily manifest as changes in brightness and color, arises from various physical mechanisms. Intrinsic variations arise as flares, rotation, pulsations, and/or violent outbursts due to thermonuclear processes occurring in the surface layers or deeper within. Extrinsic factors that may add to the observed variability include eclipses, relativistic Doppler beaming, mutual interaction in binary systems, and/or gravitational lensing.
The classification of VSs is usually based on brightness variations, typically at visible wavelengths. While far from standard, the General Catalog of Variable Stars (GCVS; Samus' et al. 2017) maintains a taxonomy and nomenclature for VSs that distinguishes between subtypes of rotators, pulsators, eruptive variables, cataclysmic variables and eclipsing binaries, in addition to other types such as microlensing sources. In general, stellar variability is not expected to fall into a unique type of dynamical behavior, as objects may display a multitude of physical behaviors, such as rotational modulation superimposed on pulsation in RCB-type stars, rotational modulation interrupted by abrupt episodes of deep minima or a steep increase in brightness in BY Dra-type stars, or symbiotic systems pairing an M-type pulsating Mira star with an accreting white dwarf companion, as in R Aqr. A comprehensive survey of VS classification is provided by Eyer & Mowlavi (2008).
Large data volumes, the primary strength of massive surveys, also present an acute challenge: how do we discover and characterize variable stars at scale in streaming, heterogeneous, and noisy time series? Human-free classification, as part of a fully automated system to process and analyze survey data, requires the development and deployment of robust and reliable data-processing techniques that handle large volumes of data and produce tractable and reproducible results.
Ideally, featurization produces a uniform, dimensionality-reduced representation of the observations that adequately captures the intrinsic properties of the data needed for classification. However, a known issue in hand-coded, feature-based classification is that the generated low-dimensional representation may overlook subtleties in higher-order systems and compress such complexity into a set of low-level descriptors tailored for specific use-case applications. Furthermore, developing domain-specific features can be time-consuming, computationally expensive and highly dependent on expert knowledge, and may show a strong dependency on survey characteristics.
Representation learning (RL) techniques offer an alternative possibility: processing raw observables without traditional feature engineering. The benefit of fully automating the classification task using RL lies in the ability to reach a higher level of abstraction and capture complex structures embedded in the data. Distinct RL approaches to automate feature extraction from astronomical time-series have already been introduced in a broad range of studies. Techniques used include unsupervised learning algorithms (Armstrong et al. 2016), dimensionality reduction techniques, data transformations (Johnston et al. 2020), autoencoders (Naul et al. 2018) and dictionary learning (Pieringer et al. 2019).
In recent years, VS classification using deep learning (DL)/neural architectures has been explored in several works that achieved satisfactory classification performance and thus demonstrated the ability of DL systems to learn stellar variability types from light-curve measurements and auxiliary metadata. Among the most widely used DL architectures, recurrent neural networks (RNNs) have proven highly performant for periodic VS classification (Naul et al. 2018; Tsang & Schultz 2019), supernovae (SNe) classification (Charnock & Moss 2017) and online transient event detection (Muthukrishna et al. 2019; Möller & de Boissière 2019). Convolutional neural networks (CNNs) have also proven comparably performant, with faster training convergence and lower memory requirements than RNNs, in various applications such as exoplanet transit detection (Shallue & Vanderburg 2018; Schanche et al. 2019; Ansdell et al. 2018), SNe binary classification (Pasquet et al. 2019b) and Cepheid classification (Dékány et al. 2019). A review of recent contributions of DL techniques to SNe classification is given by Ishida (2019).
To classify astronomical time-series, two main approaches have emerged: either (1) design an automated system to encode the photometric observables into a set of features (semi- or self-supervised learning) that constitute the entry point to traditional algorithms (e.g., support-vector machines, NNs or tree-based classifiers), or (2) develop a DL architecture to find an optimal mapping between the photometric observables and the labels through a supervised learning scheme.
The objective of this paper is twofold: first, to provide an overview of current ML/DL techniques for variable star classification; second, to discuss through a test example the applicability of a selected set of NN architectures and the importance of data representation for the classification of stellar variables. This paper also investigates approaches for variable star classification using multiband photometric data. The paper is organized as follows. Section 2 first introduces the state of the art of ML/DL techniques for VS classification, then proceeds to present our proposed architectures for classification. In Section 3, we present variants of NN architectures and discuss their performance through a test example using public data from the MACHO VS database. The networks' performance is evaluated in terms of training convergence time, classification accuracy, properties of the generated latent representations and light-curve reconstruction. Finally, we conclude in Section 4.

State-of-the-art
Classification of variable stars can be performed in feature space or in data space. The first approach consists of finding an optimal mapping between the labels, a vector Y, and an (encoded) feature set, a matrix X enc derived from the direct observables X; the second approach focuses on finding a direct mapping between the labels Y and the observables X.
In feature space classification, traditional approaches require hand-coded feature extraction to compute a set of informative descriptors using domain knowledge. For variable stars, the most discriminant features of stellar variability depend strongly on the specific class or subclass. For instance, the dominant modes of the pulsation mechanism can be well characterized in the frequency domain, found through Fourier decomposition or periodogram analysis. In eclipsing binaries, the shape, duration and relative phase of the eclipses in their light-curves inform the type of interaction (contact, detached or semi-detached binaries) and properties such as the mass and radius ratios. Cataclysmic variables are described through the morphology of their light-curves at maximum light, the decline shape, the duration of the burst event, the event occurrence in time (distinguishing between recurrent outbursts and final explosions), the quiescent state of the star post-event and the spectral features observed during outburst. For eruptive variables, the light-curves and spectra show distinctive variations, such as sudden brightening or dimming episodes over extended periods of time, due to flares and violent processes taking place in the corona or the chromosphere of these stars.
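As an illustration of the frequency-domain featurization described above, the classic Fourier-decomposition descriptors for pulsators can be obtained with a linear least-squares fit of a low-order harmonic series to a phase-folded light curve. A minimal NumPy sketch (the harmonic order and the particular ratios returned are illustrative choices, not those of any specific cited work):

```python
import numpy as np

def fourier_features(phase, mag, order=3):
    """Fit mag(phase) ~ a0 + sum_k [a_k cos(2*pi*k*phase) + b_k sin(2*pi*k*phase)]
    by linear least squares and return amplitude ratios and phase differences,
    classic hand-coded descriptors for pulsating variable stars."""
    cols = [np.ones_like(phase)]
    for k in range(1, order + 1):
        cols += [np.cos(2 * np.pi * k * phase), np.sin(2 * np.pi * k * phase)]
    A = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(A, mag, rcond=None)
    a, b = coef[1::2], coef[2::2]
    amp = np.hypot(a, b)            # harmonic amplitudes A_k
    phi = np.arctan2(b, a)          # harmonic phases phi_k
    # Common descriptors: amplitude ratios R_k1 and phase differences phi_k1
    return {"R21": amp[1] / amp[0], "R31": amp[2] / amp[0],
            "phi21": phi[1] - 2 * phi[0], "phi31": phi[2] - 3 * phi[0]}

# Toy usage: a signal with a dominant first harmonic plus a weaker second one
phase = np.linspace(0, 1, 200, endpoint=False)
mag = 1.0 * np.cos(2 * np.pi * phase) + 0.3 * np.cos(4 * np.pi * phase)
feats = fourier_features(phase, mag)  # R21 recovers the 0.3 amplitude ratio
```

Descriptors of this kind are exactly the "low-level, use-case-specific" features the text contrasts with learned representations.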
In traditional approaches, feature engineering relies on domain knowledge, while feature learning automates the extraction procedure from the data using dimensionality reduction techniques, self-supervised networks (e.g., autoencoders), dictionary learning, or unsupervised algorithms. The extracted features constitute a discretized set of encoded information exploited by feature-based classifiers to predict labels. Among notable references, the work by Armstrong et al. (2016) exploits an unsupervised learning algorithm, Self-Organizing Maps (SOMs), to encode photometric light-curves into a set of features processed, along with additional descriptors, by a tree-based classifier, the random forest (RF; Breiman 2001), to predict labels for periodic VSs. The work of Naul et al. (2018) presents a bidirectional RNN autoencoder that discretizes the photometric observables into a set of latent variables exploited, along with ancillary metadata, by an RF classifier to predict labels for periodic variables.
For classification in data space, common techniques exploit DL to identify embedded characteristics in the data and find a direct mapping between the input observables and the output labels. Applications using DL techniques for astronomical time-series classification include (1) SNe classification using RNN architectures to process multiband photometric data and auxiliary metadata (e.g., redshift measurements) (Charnock & Moss 2017; Muthukrishna et al. 2019; Möller & de Boissière 2019), (2) online transient event detection using RNN architectures to compute timely class predictions for early-observed light-curves, in order to forecast potential pre-SN outbursts and prompt follow-up procedures before the event reaches its maximum light (Muthukrishna et al. 2019; Möller & de Boissière 2019), and (3) exoplanetary transit detection using a composite convolutional network that analyzes the full light curve and the eclipses to discern planetary transits from stellar eclipsing binaries (Shallue & Vanderburg 2018; Ansdell et al. 2018; Schanche et al. 2019).
More recently, composite architectures have been introduced for astronomical time-series classification, in the form of NNs composed of different submodules designed for specific tasks. Notable references include the work by Pasquet et al. (2019b) for SNe classification, where the authors propose a DL architecture (PELICAN) with three modules: an autoencoder branch to generate the embeddings at the bottleneck level by optimal reconstruction, a classifier for label predictions, and a contrastive module designed to reduce the discrepancy between the Test and Train sets. Binary classification of type Ia SNe is performed using multiband photometric data and ancillary metadata (redshift measurements of the host galaxies). The work by Tsang & Schultz (2019) proposes a similar approach for periodic VS classification, initially derived from the DAGMM (Deep Autoencoding Gaussian Mixture Model) network of Zong et al. (2018). The authors propose a network composed of two modules: an autoencoder branched out to a classifier module at the bottleneck level.

Figure 1. High-level architecture of the direct classifier (left) and composite (right) networks. In direct classifiers, the time-series data (X phot) and metadata (X meta) are combined through a series of neural layers and concatenations, leading to classification predictions (Y label) on which the loss function is optimized. Composite networks use X phot and a bottleneck/decoder to predict a reconstructed light curve (X rec). The bottleneck layer, along with X enc, is also used to predict Y label. Both X rec and Y label are used in the loss function of composite networks.
This section presents selected architectures for VS classification. We distinguish between two types of architectures: direct classifiers and composite networks, as shown in Figure 1. Both architectures are composed of an encoder and a classifier module. Composite networks are supplemented with a decoder connected at the bottleneck level. The encoder-decoder combination (i.e., autoencoder) aims to learn a latent representation X enc of the input photometric data X phot by optimal reconstruction, while the classifier maps the encodings to the labels Y label. By connecting the autoencoder to the classifier module, the system ideally learns a latent representation (i.e., a compressed summary) that is closely correlated to the different stellar variability types in the training data. The high-level design in Figure 1 is adaptable: modules can be adjusted to the application and data as needed. In the current work, the decoder module is a mirror image of the encoder, which is a common choice for autoencoder architectures. We evaluate the encoder for different types of NNs, such as recurrent neural networks, convolutional neural networks and their variants, while the classifier module is set to a 2-layer MLP (Multi-Layer Perceptron) for all networks to predict labels. A short description of the neural networks used in this work is provided in Appendix B.1.
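The two designs can be sketched in Keras at a high level (all layer sizes, the GRU encoder choice and the plain MSE reconstruction loss below are illustrative stand-ins, not the paper's exact configurations):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_points, n_channels, n_enc, n_classes = 200, 3, 16, 10

# Shared encoder: any time-series stack (a small GRU here for brevity).
x_phot = keras.Input(shape=(n_points, n_channels), name="X_phot")
h = layers.GRU(32)(x_phot)
x_enc = layers.Dense(n_enc, activation="linear", name="X_enc")(h)

# Classifier module: a 2-layer MLP, as in the text.
y = layers.Dense(32, activation="relu")(x_enc)
y_label = layers.Dense(n_classes, activation="softmax", name="Y_label")(y)
direct = keras.Model(x_phot, y_label)  # direct classifier

# Composite network: add a mirror decoder at the bottleneck.
d = layers.RepeatVector(n_points)(x_enc)
d = layers.GRU(32, return_sequences=True)(d)
x_rec = layers.TimeDistributed(layers.Dense(1), name="X_rec")(d)
composite = keras.Model(x_phot, [y_label, x_rec])
composite.compile(optimizer="adam",
                  loss={"Y_label": "categorical_crossentropy",
                        "X_rec": "mse"})
```

The composite model optimizes the classification and reconstruction objectives jointly; dropping the decoder branch recovers the direct classifier.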
The majority of applications for VS classification using DL exploit auxiliary features that complement the photometric observables, such as redshift measurements, detectable frequencies for periodic variables, amplitudes and colors. The photometric information in the light-curves constitutes a fundamental description of the evolutionary state of the star over time but does not contain the entirety of the available information; unless the light curve of a certain class is demonstrably different from those of other classes, additional metadata can be expected to improve classification accuracy. Typically, classification tasks require an upstream phase of data preparation. However, preprocessing transformations applied to the photometric observables may inadvertently remove discriminating features linked to the stellar variability types, thus altering the quality of the information necessary for classification. For instance, light-curves of periodic variables are preprocessed through phase-folding and data normalization. The phase-folding procedure transforms the periodic data into a compact representation in phase of one to two cycles by stacking multiple observations, thus removing the periodicity information over time. In addition, data normalization via min-max normalization yields rescaled magnitude measurements, amplitudes and errors. As a direct result, similarly shaped light-curves with different peak-to-peak amplitudes cannot be distinguished from one another.
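The two preprocessing steps, and the information each discards, can be sketched as follows (a minimal NumPy version; the two-cycle convention matches the text, the function names are ours):

```python
import numpy as np

def phase_fold(t, mag, period, n_cycles=2):
    """Fold the observation epochs on the period and tile the curve to
    n_cycles; the absolute timing (periodicity over time) is discarded."""
    phase = (t % period) / period
    order = np.argsort(phase)
    ph, m = phase[order], mag[order]
    return (np.concatenate([ph + c for c in range(n_cycles)]),
            np.tile(m, n_cycles))

def minmax(mag):
    """Min-max normalization; the peak-to-peak amplitude is erased, so
    similarly shaped curves of different amplitude become identical."""
    lo, hi = mag.min(), mag.max()
    return (mag - lo) / (hi - lo)

t = np.arange(50.0) * 0.37            # toy observation epochs (days)
mag = np.sin(2 * np.pi * t / 1.3)     # toy periodic signal, P = 1.3 d
ph, m_folded = phase_fold(t, mag, period=1.3)
m_norm = minmax(m_folded)             # rescaled into [0, 1]
```

This is precisely why the amplitude and period must be re-supplied as metadata if they are to inform the classifier.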
In the current work, we investigate the importance of the metadata through two scenarios: the network classifies the data solely based on the (preprocessed) light-curves without auxiliary metadata, or the metadata are supplemented to the system as a secondary input for label predictions.
In VS classification, multiband photometry can be processed either by transforming the photometric passband measurements into a single entity fed to the network, or by jointly processing individual encodings from each passband's measurements for label predictions. A simplified representation of these two approaches, identified in this work as the merged and hybrid approaches, is provided in the Appendix Figure A1. In the merged approach, multi-passband measurements are combined into a unique observable X phot,merged processed by the encoder module to compute the latent representation X enc. In the hybrid approach, the multi-passband measurements are independently encoded into individual features that are combined at a later stage into a compact encoded representation X enc. In both scenarios, the classifier module exploits the generated encodings X enc, along with the metadata X meta, for label predictions. Composite networks differ from direct classifiers by the addition of a decoder module connected to the encoder at the bottleneck level to generate the embeddings by optimal reconstruction.
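At the tensor level, the two multiband pathways differ only in where the band dimension is collapsed. A shape-only NumPy sketch (the random-projection `encode` is a stand-in for any learned encoder; all names and sizes are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_enc = 200, 16

def encode(x):
    """Stand-in for a learned encoder: any map from a (possibly
    multi-channel) time series to a fixed-size latent vector."""
    W = rng.standard_normal((x.size, n_enc))
    return x.ravel() @ W

red, blue = rng.standard_normal((2, n_points))   # two MACHO passbands

# Merged approach: combine bands into one observable, encode once.
x_merged = np.stack([red, blue], axis=-1)        # shape (n_points, 2)
x_enc_merged = encode(x_merged)                  # one latent vector

# Hybrid approach: encode each band independently, combine later.
x_enc_hybrid = np.concatenate([encode(red), encode(blue)])
```

The classifier then consumes `x_enc_merged` (size n_enc) or `x_enc_hybrid` (size 2 * n_enc), optionally concatenated with the metadata vector.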

APPLICATION
This section presents a set of NN-based architectures for VS classification. We discuss through a test example the performance of these networks in terms of training convergence time, label predictions, reconstruction and generated latent representations.

Data
As an exemplar, for all architectures we use public photometric data and labels from the MACHO survey (Alcock et al. 1996). The MACHO project carried out long-term photometric monitoring of stars in the Magellanic Clouds and the galactic bulge from 1992 to 1999 in search of rare microlensing events, observable in theory if the dark matter is composed of massive compact halo objects (MACHOs). The large collection of data from the survey has over the years allowed rare insight into a variety of stellar populations such as RR Lyrae, Cepheids, LPVs (Long-Period Variables) and eclipsing binaries (Cook et al. 1995). In MACHO, the LPVs are categorized into four subtypes, namely the four Wood sequences A, B, C and D, referring to the parallel sequences identified in the period-luminosity relation of these red variables (Wood et al. 1999). Photometric data in MACHO consist of magnitude measurements in two photometric filters (the MACHO red and blue filters), the associated 1-σ error measurements and the observation epochs expressed in MJD. We exploit the multiband photometric data of the public MACHO VS database (http://macho.nci.org.au/) to test our selected NN architectures, and we perform further checks on the data. In particular, we retain the confirmed set of eclipsing binaries with corrected periods from Derekas et al. (2007). The full process yields 17,604 periodic variables from the initial count of ∼24,000 periodic VSs in the MACHO database: 1,806 Cepheid variables, 9,163 RR Lyrae, 2,965 long-period variables (LPVs) and 3,670 eclipsing binaries (cf. Table 1).
The majority of ML/DL open-source software exploits input data in the form of fixed-size input tensors, and few DL architectures are able to process observables of different lengths, e.g., using generator functions on an iterable list of input data. In our approach, the architecture of the direct classifier can process fixed-size data in batch mode as well as a list of observables of different lengths using generator functions. Composite networks, however, require fixed-size inputs to comply with the implementation specifications of the decoder module. To meet these requirements, we reduce the data to a fixed-size format. Typically, data reduction can be achieved using padding, rebinning, interpolation (e.g., splines or polynomials) or model fit and prediction. More recently, model prediction using Gaussian Processes (GPs) has been used on astronomical time-series (Ambikasaran et al. 2016; Foreman-Mackey et al. 2017; Pruzhinskaya et al. 2019; Boone 2019). To describe the stellar variability in the MACHO periodic light-curves, a GP model with a quasi-periodic covariance function is fitted to each object using the open-source code of Foreman-Mackey et al. (2019a). The selected GP model is a mixture of stochastically-driven damped oscillators (SHOs), briefly discussed in Appendix B.3. Fixed-size light-curves are generated by model prediction on a time frame sampled within the range of the observed epochs. At a later stage, the reduced photometric light-curves are normalized and phase-folded to span two cycles.
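The fixed-size reduction step can be sketched as follows; for simplicity this stand-in predicts on the uniform grid with linear interpolation rather than the quasi-periodic SHO-mixture GP used in the text (the grid size and function name are ours):

```python
import numpy as np

def to_fixed_size(t, mag, n_out=200):
    """Reduce an irregularly sampled light curve to n_out points by
    model prediction on a uniform time grid spanning the observed
    epochs. np.interp is a minimal stand-in for the GP predictive
    mean described in the text."""
    t_grid = np.linspace(t.min(), t.max(), n_out)
    return t_grid, np.interp(t_grid, t, mag)

# Irregularly sampled toy periodic signal
t = np.sort(np.random.default_rng(1).uniform(0, 30, 140))
mag = np.sin(2 * np.pi * t / 2.5)
t_grid, mag_grid = to_fixed_size(t, mag)   # 140 epochs -> 200 points
```

A GP-based version would additionally propagate the per-epoch error measurements into the fit and return predictive uncertainties on the grid.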

Design and implementation
We experiment with a variety of NN architectures and discuss their absolute and relative performance. Classification using multiband photometry is evaluated using the approaches introduced in Section 2.2, i.e., the merged and hybrid approaches. We also investigate the effect of auxiliary metadata as a secondary input on the classification accuracy of the direct classifier and composite networks, via two scenarios: the network classifies the data using the information from the preprocessed light-curves alone, or the auxiliary metadata are supplemented as a secondary input for label predictions. Data entries for the different use-case scenarios are summarized in the Appendix Table D1. The autoencoder is evaluated for different NNs: RNN with LSTM cells, RNN with GRU cells, temporal CNN (tCNN) and dilated TCN (dTCN).
For VS classification, the objective of NNs is to empirically determine a mapping between the inputs, the photometric observables $X_{\rm phot} = [x_1, \cdots, x_{N_t}]^T$ and auxiliary metadata $X_{\rm meta} = [d_1, \cdots, d_{N_t}]^T$, and the output labels $Y_{\rm label} = [y_1, \cdots, y_{N_t}]^T$, where $N_t$ designates the total number of objects. In the current application, the data vectors $(x_i)_{i=1,\dots,N_t}$ refer to the reduced photometric measurements (magnitudes, observation epochs and error measurements) spanning $N_p$ datapoints for the $i$-th object, and the metadata $(d_i)_{i=1,\dots,N_t}$ are composed of $N_f$ features such as the periods, amplitudes, averaged magnitudes and colors. The elements $(y_i)_{i=1,\dots,N_t}$ are scalar values encoding the labels. For categorical classification, the labels are encoded into codewords transcribing the membership of the object in one of the variability classes $j = 1, \dots, N_C$, with $N_C$ the total number of classes; a standard approach consists in transcribing class memberships into binary codewords $\{0, 1\}$. In online transient event detection applications, a similar methodology is used to map the observables into the output space: class predictions are computed on the pixel level, i.e., a prediction vector per object that allows the evolution of the label predictions to be monitored over time, in order to forecast potential pre-SN outbursts and trigger follow-up observations within a rapid decision time before the event reaches its maximum light. In contrast, the static-type prediction approach adopted for VS classification associates a static label prediction with each fully observed light-curve. This approach is used in this study to map the observables $\{X_{\rm phot}, X_{\rm meta}\}$ into scalar values $Y_{\rm label}$. Nonetheless, our architectures can be adapted to perform online predictions on the pixel level by adjusting the input-output format.
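The binary-codeword encoding of class memberships described above is standard one-hot encoding; a minimal sketch:

```python
import numpy as np

def one_hot(y, n_classes):
    """Encode integer class memberships j = 0..n_classes-1 as binary
    codewords {0, 1}, the target format for categorical cross-entropy."""
    Y = np.zeros((len(y), n_classes))
    Y[np.arange(len(y)), y] = 1.0
    return Y

# Three objects belonging to classes 0, 2 and 1 of N_C = 3 classes
Y = one_hot(np.array([0, 2, 1]), n_classes=3)
# rows: [1,0,0], [0,0,1], [0,1,0]
```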
Detailed representations of the proposed architectures are shown in the Appendix Figures D1 to D3. After preprocessing (i.e., data reduction, phase-folding and data normalization of periodic light-curves), the networks proceed as follows. In the RNN architectures, the first part of the structure following the input layer is a stack of RNN layers that generates sequence data of fixed length. The direct classifiers and composite networks differ in the last layer of the encoder module, in which composite nets are supplemented with a fully-connected layer (a dense layer with a linear activation function) to transform the sequence data from the last RNN layer into an encoded representation X enc of fixed size N enc. If auxiliary metadata X meta are supplemented as a secondary input, the classifier exploits the augmented features to predict the labels Y label. The classifier module corresponds to an MLP of two dense layers with rectified linear unit (ReLU) and softmax activation functions. In composite networks, the decoder transforms the encoded variables into reconstructed representations, similar to Naul et al. (2018). The decoder proceeds by duplicating the embeddings a number of times equal to the length of the input data, joins the epochs and passbands information, and feeds the augmented embeddings into a stack of RNNs. The last component of the decoder module consists of a fully-connected layer that generates the final sequence data of the reconstructed light-curves. To prevent overfitting, regularization is added to the model through the dropout method (Srivastava et al. 2014), which randomly removes units during training.
Following the input layer, the tCNN architecture is composed of a series of convolutional layers with a number of filters and fixed kernel sizes for convolutions. The activation function for each convolutional layer is set to the hyperbolic tangent function, and dropout is used for regularization. In the encoder module, the output of the last convolutional layer is flattened to generate a vector output. The latter corresponds to the final encodings for direct classifiers, while composite networks exploit an additional fully-connected layer to generate fixed-size embeddings. As in the RNN architectures, the encodings are fed to the classifier module, along with metadata when applicable. In the decoder, the embeddings are processed through a reshaping substructure, stacks of transposed convolutional layers and a final fully-connected layer to generate the outputs. The reshaping substructure emulates the RNN decoder in order to reframe the encodings into a fixed-length format matching the input datapoints.
The dTCN architecture closely follows the TCN architecture (Lea et al. 2017), initially derived from the Wavenet architecture (Oord et al. 2016). Our design adds dropout functions after the causal convolutions to prevent overfitting. Following the input layer, the first component of the dTCN consists of a causal convolutional layer connected to interconnected stacks of residual blocks with dilated convolutions and gated activation units. Following the Wavenet design, a ReLU activation function and two convolutional layers are used after the stacks. The encoder outputs are transformed into a vector format, and a fully-connected layer is supplemented in composite networks. The generated encodings, augmented with metadata when applicable, are then fed into the classifier module. The decoder module is composed of a reshaping substructure, a TCN unit mirroring the encoder module and a final fully-connected layer.
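One practical consequence of the dilated stacks is that the receptive field grows geometrically with depth, which is what lets a dTCN cover long light curves with few layers. A small helper (the kernel size and doubling dilation schedule are illustrative, following the Wavenet convention):

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in time steps) of stacked dilated causal
    convolutions: 1 + sum of (kernel_size - 1) * dilation per layer."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Wavenet-style doubling schedule over 7 layers: 1, 2, 4, ..., 64
rf = receptive_field(kernel_size=2, dilations=[2 ** i for i in range(7)])
# rf == 128: the stack sees 128 consecutive time steps
```

An undilated tCNN with the same depth and kernel size would see only 8 time steps, which is why dilations matter for long sequences.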
The NNs are tested with different sets of hyperparameters (cf. Appendix Table D2). The number of parameters per model is provided in the Appendix Table D3. All models are trained using the Adam optimization algorithm (Kingma & Ba 2017) with a learning rate of 5×10 −4 , a fixed batch size per gradient update in optimization, a dropout fraction and an early-stopping procedure to interrupt the training when an optimal solution is found (i.e., convergence). A refined search of the hyperparameter space could be performed as an upstream stage of the final classification procedure. However, in this study we choose to empirically evaluate different sets of hyperparameter configurations, discuss the networks' performance, and identify the best-performing models. In training, the autoencoder (i.e., encoder-decoder) aims to minimize the reconstruction loss, the weighted MSE (Mean-Square Error) function following Naul et al. (2018), and the classifier branch is set to minimize the categorical cross-entropy loss. Best-performing models are identified from the minimum loss obtained on a Test set, i.e., the subset of data used neither in training nor in validation. Our tests used up to 2−4 cores of an Intel Xeon E5-2643v3 CPU on the UC Berkeley Savio Linux cluster. The NN architectures are implemented in the Keras (Chollet et al. 2015) and Tensorflow (Abadi et al. 2015, 2016) programming frameworks, and the RNN autoencoder branch is partly adapted from the architecture in Naul et al. (2017). To facilitate reuse and reproducibility, our benchmarking codebase is released as open source.
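The two training objectives can be written out explicitly. A NumPy sketch (taking the weights in the weighted MSE to be inverse measurement variances is our reading of the Naul et al. (2018) loss; treat that choice as an assumption):

```python
import numpy as np

def weighted_mse(x, x_rec, err):
    """Reconstruction loss: squared error weighted by the inverse
    measurement variance, so noisy epochs contribute less."""
    w = 1.0 / err ** 2
    return np.sum(w * (x - x_rec) ** 2) / np.sum(w)

def categorical_crossentropy(y_true, p_pred, eps=1e-12):
    """Classification loss on one-hot targets and softmax outputs."""
    return -np.mean(np.sum(y_true * np.log(p_pred + eps), axis=1))

# Composite-network total loss = classification + reconstruction terms
x = np.array([1.0, 2.0, 3.0]); x_rec = np.array([1.1, 2.0, 2.8])
err = np.array([0.1, 0.1, 0.2])
y = np.array([[1.0, 0.0]]); p = np.array([[0.9, 0.1]])
total = categorical_crossentropy(y, p) + weighted_mse(x, x_rec, err)
```

Direct classifiers minimize only the cross-entropy term; composite networks minimize the sum (possibly with a relative weighting between the two branches).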

Performance study
This section presents the results obtained on the public MACHO VS dataset. Computations are performed through a classical training-validation-test scheme, with 80% of the data for training-validation and 20% of unseen objects in the test prediction phase. Performance metrics are computed for all models, with a particular emphasis on the performance reached on the Test set. Metrics include the system total loss, the classification accuracy, as well as averaged precision, recall and F1-score (cf. Appendix C). The total loss corresponds to the classification loss (categorical cross-entropy) evaluated on the encoder-classifier branch, supplemented in composite networks with the weighted MSE evaluated on the encoder-decoder branch. The classification accuracy is obtained through a direct comparison between the true labels Y true and the network predictions Y pred. We also discuss the classification performance within three main types of stellar variability: short-period pulsators including the RR Lyrae and Cepheids, long-period pulsators (LPVs) and eclipsing binaries. The autoencoder performance is assessed through the quality of the reconstructed light-curves X rec, and the embeddings X enc are projected onto a 3-d representation using a data reduction algorithm. The degree of separation (or lack thereof) in the reduced latent space is discussed. The training convergence time is reported for all networks in the Appendix Table E1. The network type, size and hyperparameters influence the total time required to reach convergence, with the average training time for RNNs (LSTM and GRU cells) scaling higher than for the convolutional networks (tCNNs and dTCNs), due to the higher memory requirements of RNNs, which entail longer training. As expected, increasing the network size (i.e., number of parameters) correlates with longer training time.
For instance, direct classifiers without metadata (c F ) corresponding to hyperparameter set configuration (6) (the largest models for LSTM, GRU and dTCN) converge in training after approximately 4−5 hours for RNNs, 18 minutes for dTCNs and ∼4.8 minutes for tCNNs using one photometric band on CPU. Networks processing multiband data converge at a slower rate due to the larger data entries. After training, the prediction step is extremely fast: 0.5−3 ms per object for tCNNs, 1−10 ms per object for dTCNs and up to 3−20 ms per object for the RNNs on a CPU.

Training convergence time
To track the evolution of the total loss and the accuracy during the training and validation stages, the Appendix reports these curves up to the point when the system reaches convergence. During the validation step, the loss values exceed the losses computed during training, which indicates a proper bias-variance trade-off.

[Figure caption] The values in each box correspond to the number of predictions versus the initial count of true labels (indicated on the right). Main stellar variability groups are highlighted in red: the RR Lyrae, the Cepheids, the LPVs and the eclipsing binaries.
The system converges to reach at best ∼75% accuracy for the best-performing LSTM d F without metadata. Supplementing the metadata as a secondary input increases the accuracy up to 91%.

Label predictions
Total loss and classification accuracy for all trained models are reported in the Appendix Tables E6−E7. We identify the best-performing models for each architecture type (LSTM, GRU, tCNN and dTCN) from the minimum loss on the Test set. As previously mentioned, the total loss corresponds to the loss evaluated on the encoder-classifier branch (categorical cross-entropy), supplemented in composite networks with the loss of the encoder-decoder branch (weighted MSE). Appendix Table E2 reports the identifiers of the hyperparameter set configurations associated with the best-performing networks. The qualification of best-performing models is solely based on an empirical search of the network hyperparameter space; an improved search through this space could constitute future work to investigate the effect of hyperparameter selection on network performance and on the optimization landscape during training. Table 2 summarizes the classification accuracy of the best-performing models on the Test set. For these models, the overall accuracy ranges from 67% to 79% when processing the light-curve information without metadata. In particular, the dTCNs without metadata achieve an accuracy of 67−71%, whereas the accuracy of RNNs and tCNNs scales higher, with a 71−79% fraction of correct predictions. By incorporating the metadata −colors, amplitudes, averaged magnitudes and periods− in the models c F,meta and d F,meta , the accuracy improves by 10−20%, reaching at best ∼88−92% for RNNs, ∼82−92% for tCNNs and only ∼75−87% for dTCNs.
The NN architectures trained on single-band and multiband light-curves (the blue band B versus the multiband RB) show comparable classification accuracy. Moreover, the results do not indicate a significant difference in performance between networks processing the multiband photometry RB via the merged and hybrid approaches, because the photometric bands (red and blue) in MACHO contain equally informative characteristics of the stellar variability types. We would expect classification using sparsely-observed multi-passband photometry to benefit from the availability of several sources of information on stellar variability. In such a case, the advantage of the hybrid approach would be more prevalent for systems sequentially processing the multi-passband photometry in an optimized scheme, whereas the merged approach may call for higher memory requirements in terms of CPU/GPU usage.
Generally, classification performance is better summarized in a confusion matrix that reports the fraction of predicted labels compared to the true class labels. Optimal results correspond to a diagonal matrix with a fraction of true positives per class (i.e., diagonal elements) close to unity. Figure 2 shows the performance obtained on the Test set for the best-performing LSTM direct classifiers c F and c F,meta , which respectively reached an overall accuracy of 75% and 92%. Overall, the confusion matrices tell a similar story: the main stellar variability groups, highlighted in red, are recovered to a fair accuracy despite the overlap between subtypes and the misclassifications. The network c F provides class predictions based on the information from preprocessed (i.e., normalized and phase-folded) light-curves without metadata. The observed degeneracy is to be expected between classes of objects sharing a similar light-curve shape. By supplementing the metadata to the network, the accuracy per class for c F,meta increases and the number of false positives (i.e., off-diagonal elements in the confusion matrices) is significantly diminished. Moreover, the observed porosity between adjacent classes in the confusion matrix appears to remain within the main stellar variability types. From the confusion matrices, label predictions for the eclipsing binaries in our sample appear to overlap with other variability groups despite the use of the metadata. These misclassifications may be due to some degree of (true) label noise. In the current tests, class predictions for a few subtypes remain erroneous despite the use of metadata: the second-overtone RRL pulsators (type e) are inaccurately predicted as first-overtone RRL (type c).

Table 3. Selected subset of objects from the MACHO VS database for display.
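A confusion matrix of the kind discussed here can be tallied directly from the label vectors. A minimal sketch (the class names in the usage below are placeholders, not the MACHO taxonomy):

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows index the true class, columns the predicted class; diagonal
    entries are the per-class true positives."""
    idx = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[idx[t]][idx[p]] += 1
    return matrix
```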
To avoid confusion within subtypes, a proposed solution would be to introduce a weighting scheme on the feature contributions {X enc , X meta }, injecting some prior knowledge regarding feature importance in VS classification.
To further investigate the classification performance, we compute additional metrics (cf. Appendix Table E4). The recall (i.e., the true-positive rate or sensitivity) characterizes the ability of the network to properly retrieve the true labels, and the precision depicts the level of agreement between the predictions per class and the true labels. The F1-score combines both metrics. We report in Appendix Table E5 the averaged metrics for the best-performing LSTM direct classifiers c F and c F,meta , which respectively showcased a classification accuracy of 75% and 92%. The averaged precision, recall and F1-score are boosted respectively to 77%, 80% and 78% from an initial 51% rate. Individual metrics per class highlight an even larger improvement. However, the inability of the network to predict a few subtypes, such as the second-overtone RRL pulsators, affects the averaged recall and precision. Appendix Table E4 reports the averaged metrics for the best-performing models. Similar conclusions can be reached regarding the performance of RNNs and tCNNs, which outperform the dTCNs in the current tests.
We also investigate the accuracy within three main stellar variability groups: short-period pulsators (group 1), eclipsing binaries (group 2) and long-period variables (group 3). We report the classification accuracy on the Test set for the best-performing models per group in the Appendix Table E3. For short-period pulsators, the accuracy of the best-performing networks reaches 78−85% owing to their distinctive light-curve shapes, such as the characteristic asymmetry and steep rise of fundamental-mode pulsators. With metadata, the accuracy increases up to 92−94%. The classification of the eclipsing binaries in our sample shows a comparable improvement, with a 91−92% fraction of correct predictions when using metadata. Similarly, the LPV sample benefits from the use of metadata as a secondary input, as the best-performing networks reach 63−88% from an initial 38−54% fraction of true positives.

Reconstructed light-curves
In this section, the autoencoder performance is assessed through the quality of the reconstructed light-curves X rec . An ideal reconstruction would preserve the embedded (denoised) structure of the input data X phot . To visually assess the reconstruction quality, we select a sample of objects for display (cf. Table 3) and show the reconstructed light-curves for the best-performing composite network d F,meta in the Appendix Figures E3−E4. A subset of reconstructed light-curves for the best-performing LSTM composite network is also shown in Figure 3.

Figure 3. Reconstructed light-curves from the Test set for the best-performing LSTM composite network d F,meta .
Overall, the reconstruction results indicate a smoothing effect on the magnitude measurements as well as a correlation between the decoder performance and the input data quality. In particular, low signal-to-noise levels in the data limit the ability of the network to recover real structures, and the resulting X rec appears exaggeratedly smoothed out with no distinct features (such as the characteristic pulsation profile or peaks at maximum light). Furthermore, some pulsating variables can exhibit irregularity or low-frequency modulations that may evolve on timescales longer than the observed time range, noticeably seen for Long Secondary Period (LSP) stars in the Wood sequence D (Wood et al. 2004). In such cases, the light curves folded on the primary period appear as a mismatched superimposition of multiple cycles, as seen in the irregular LPV displayed in Figure 3. To prevent such limitations, the detection of multiperiodicity, irregularity and low-frequency modulations should be considered, in order to isolate those objects that may require a different preprocessing strategy and classification approach.
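The phase-folded representation used throughout the preprocessing can be sketched minimally. The period is assumed to be known from an upstream period search; a source with power at other periods violates this single-period assumption and exhibits exactly the superimposed-cycles failure mode described above:

```python
def phase_fold(times, period):
    """Map observation times to phases in [0, 1) using the primary period;
    sorting the measurements by phase yields the folded light-curve."""
    return [(t % period) / period for t in times]
```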
To assess the global performance reached on the Test set, we evaluate the reconstruction error as a function of a data-quality indicator (SSR) corresponding to model-fit residuals computed from the SuperSmoother algorithm (Friedman 1984). Light-curve measurements with low signal-to-noise, irregular variations or extended low-frequency modulations are associated with a high SSR.

Figure 4. Reconstruction error (MSE) as a function of the model fit residuals from the SuperSmoother algorithm (Friedman 1984) for the best-performing LSTM d F,meta (best configuration (6); B band; all datasets and Test set). The highlighted numbers (1 to 10) refer to the subset of selected objects from the Test set used to showcase the reconstruction quality of the autoencoder branch.

The distribution of the reconstruction error in Figure 4 suggests a distinct trend: the system is able to reasonably recover the morphology of objects associated with low SSR (e.g., the Cepheids n4 and n5 and the Mira star n8) as opposed to noisier and irregular light-curve profiles (e.g., objects n3, n6 and n7, which are RRL of type e and LPVs from the Wood sequences A and B).
A comparable analysis of the best-performing GRU, tCNN and dTCN composite models d F,meta reaches a similar conclusion with respect to the distribution of the reconstruction error (cf. Appendix Figure E5). The reconstructed profiles of selected objects (cf. Appendix Figures E3−E4) show comparable performance despite a few noticeable differences, such as the shockwave propagating before maximum light in the fundamental-mode RR Lyrae (object n1), which is recovered by the convolutional networks but heavily smoothed out by the LSTM and GRU. The RNNs also appear to overly smooth out the modulations of the LPV star (object n9, LPV Wood sequence D), while the convolutional nets partially restore discontinuous plateaus. The composite LSTM model also appears to preserve the shape of the primary and secondary eclipses in the detached eclipsing binary (object n10), while the convolutional nets (dTCNs and tCNNs) partially recover the depth of the eclipses. The GRU disproportionately smooths out such features.

Latent space exploration
Using dimensionality-reduction algorithms, the encoded features of the Train set can be projected into a reduced (2- or 3-d) representation. The results for the best-performing LSTM composite network d F,meta are shown in Figure 5. For better visualization, projections are shown separately for the three main variability groups in the Appendix Figures E6 to E8. We limit the analysis of the latent representation to the training dataset, as it corresponds to the learned partitioning.
The encoded features generated from the best-performing composite network are projected onto a three-dimensional representation using the UMAP algorithm (McInnes et al. 2018). The degree of separation of the clusters in the reduced representation space characterizes the type of information fed to the network. Without metadata, and solely based on the encoded morphology of the light-curves, the projection (cf. Figure 5 on the left) shows a large fraction of RR Lyrae, Cepheids, eclipsing binaries and LPVs that appear isolated in the projected space. However, the level of separability of these clusters is limited by the overlap between classes of objects sharing a similar shape of (preprocessed) light-curves, as with the majority of LPVs, overtone pulsating RR Lyrae, Cepheids and a few eclipsing binaries in our sample. In the detailed representations (cf. Figures E6 to E8), the encoded features of eclipsing binaries are clustered into a composition of a compact aggregate, a dispersed set and outliers, while short-period pulsators cluster into a compact aggregate of fundamental-mode pulsators and an overlapping blend of overtone pulsators, given their similar light-curve profiles. In the latent space, the LPVs cluster into a compact aggregate without a clear delineation between the different subtypes; this suggests a lack of discriminating features in X enc that would be necessary to distinguish the different subtypes. When metadata is supplemented, the projection of the augmented feature set {X enc , X meta } shows better separability in the reduced latent space (cf. Figure 5 on the right); this separability also coincides with the improvement in classification accuracy for the networks utilizing metadata to complement the encoded photometry.
In particular, the detailed representations for the main variability types highlight well-separated clusters of the short-period pulsators, while the eclipsing binaries cluster into a composition of a compact aggregate, a filamentary structure and dispersed outliers. The diversity in substructures for the eclipsing binaries can be explained by the different categories of binaries merged in the MACHO database (e.g., contact, detached and semi-detached stellar binaries) or by possible label contamination. Similarly, the LPV sample is well separated from the other variability groups and is projected onto a filamentary structure with enhanced separability between the different subtypes. The best-performing GRU, tCNN and dTCN networks give comparable projection results.
To summarize, composite NN architectures are able to encode the photometric observables into substructures associated with the stellar variability classes. Without metadata, the encodings generated from the photometric data cluster into distinct aggregates despite the overlap between classes of objects sharing a similar morphology of preprocessed light-curves. By supplementing the metadata as a secondary input to complement the encoded photometry, the level of separation in the latent space is enhanced, which aligns with the overall increase in classification accuracy. An examination of the nature of the overlap regions of the latent space, as well as of the properties of the different aggregates, is left for future study. We expect that similar studies using larger VS datasets will help test the potential universality of the latent space, and also reveal potential outliers that could constitute new subclasses.

CONCLUSIONS
In this work, we explored the use of NN architectures for VS classification through various use-case scenarios. These architectures allowed us to generate higher-abstraction encodings of the photometric data without the need for hand-coded feature engineering.
Two types of architectures, identified as direct classifiers and composite networks, are tested. Both networks are composed of an encoder module to transform the data into a reduced representation and a classifier to predict labels. Composite networks include a decoder module to define an encoded representation of the input data by optimal reconstruction. In our analysis, VS classification using multi-passband photometric data can be performed via two approaches: either by encoding a merged representation of all passband measurements (merged approach) or by jointly processing individual encodings (hybrid approach). Sparsely observed multi-passband photometry would benefit from adopting the latter approach.
In this work, we also experimented with a variety of NN architectures and investigated the effect of ancillary metadata on classification performance. We found that systems solely exploiting the time-series data are able to reach a ∼70% accuracy for the best-performing models. By supplementing the metadata as a secondary input, a net increase in classification accuracy is observed across all network types, reaching at best a ∼91% accuracy for the best-performing LSTM and temporal CNN models. Misclassifications for the best-performing networks are primarily restricted to the main stellar variability types, which provides a strong incentive for a multiple-stage architecture for label prediction: first predict the main stellar variability type, then the subtype. On a computational level, the training convergence time for RNN models was found to be longer, due in part to larger memory-allocation costs.
For composite networks, the reconstruction quality of the decoder module appears highly contingent on the input data properties (i.e., the signal-to-noise level and the smoothness of the light-curve profiles). Variable stars exhibiting multiperiodicity, irregularity or modulations over time appear in their phase-folded representation as a mismatch of superimposed cycles that the network is unable to learn and overly smooths out in reconstruction.
The exploration of the learned encodings indicated a clear clustering linked to the stellar variability types. Without metadata, clusters of variables appear isolated despite the overlap noticed for objects sharing similar light-curve profiles. Supplementing the metadata to the encoded information predictably lessens such degeneracy and enhances the separation between the classes; this, in turn, accounts for the increase in the fraction of correct predictions. We would expect the latent representations to highlight interesting properties in the data and point to potential outliers, unknown stellar variability types or new subtypes within known variability classes.
To conclude, various NN architectures are able to capture low-dimensional data representations and achieve excellent classification accuracy without the need for hand-coded featurization. The best-performing networks in our tests are primarily LSTM- and tCNN-based models, with the latter benefiting from shorter training convergence times and smaller memory footprints.
As a future direction, developing a baseline for an automated system able to learn a wider range of stellar variability traits should be explored. The need for general architectures is strongly motivated by the fact that massive surveys are set to produce large datasets with a blend of different types of stellar variables, including aperiodic VSs (e.g., cataclysmic stars and microlensing events) as well as periodic and quasi-periodic variables (e.g., pulsators, rotators and eclipsing binaries). The automated classification of periodic VSs presented here exploits phase-folded representations as well as information from the frequency domain, while the classification of quasi-periodic variables requires a combination of multiple data representations -phase-folded light-curves, periodograms, O-C diagrams and the time-series- to produce reliable class predictions. To meet the need for a general framework, one proposed design would consist of a multi-stage architecture with different components specialized to distinguish distinct stellar variability traits: first discriminate between the three categories of periodic, quasi-periodic and aperiodic VSs, then follow with the classification into the stellar variability types and subtypes.

Figure A1. High-level architecture of the direct classifier (top) and composite (bottom) networks across two variants of the combination of multi-passband photometric data, identified as the merged and hybrid approaches [left and right]. In the merged approach, multi-passband measurements are merged into a unique observable X phot,merged processed by the encoder module to compute the latent representation X enc . In contrast, the hybrid approach independently encodes the multi-passband measurements into individual features merged at a later time into a compact encoded representation X enc . In both scenarios, the classifier module exploits the generated encodings X enc , along with metadata X meta , for label predictions.
Composite networks differ from the direct classifiers by the addition of a decoder module connected to the encoder at the bottleneck level, generating the embeddings by optimal reconstruction.

B.1. RNN, CNN and TCN
Recurrent neural network (RNN) refers to a neural network architecture composed of nodes interconnected through a directed graph (cyclic or acyclic) along a temporal sequence. In the standard architecture, the fully-connected RNN layer is constructed such that each node is interlinked with the adjacent units. Each node corresponds to a neural unit with an activation function and a weight. The LSTM (Long Short-Term Memory, Hochreiter & Schmidhuber 1997) and the GRU (Gated Recurrent Unit, Cho et al. 2014) are variants of the RNN with a higher-level node structure composed of multiple subunits acting as internal regulators of the information propagated within the network. Both the LSTM and GRU cells exploit a forget gate and an input gate, and the LSTM utilizes an additional output gate. RNN applications on astronomical time-series include VS classification using autoencoders (Naul et al. 2018).

Convolutional neural network (CNN) refers to a NN architecture with convolutional layers that apply a series of convolutions through overlapping windows, allowing the network to capture spatial correlations in the data. The standard CNN architecture utilizes pooling layers after convolutions to downsize the data, in addition to fully-connected layers. The performance of convolution-based neural networks has been demonstrated in a broad range of astronomical data applications such as galaxy classification (Dieleman et al. 2015; Aniyan & Thorat 2017; Kim & Brunner 2017; Domínguez Sánchez et al. 2018), VS classification using asteroseismology (Hon et al. 2017), supernovae classification (Cabrera-Vives et al. 2016; Pasquet et al. 2019b; Brunel et al. 2019), photometric redshift estimation (Hoyle 2016; D'Isanto & Polsterer 2018; Pasquet et al. 2019a), cosmological parameter inference (Ntampaka et al. 2020), parameter estimation from 21-cm tomography (Gillet et al. 2019), strong lensing detection (Lanusse et al. 2018; Jacobs et al. 2019), gravitational-wave signal detection (Gebhard et al. 2017; Gabbard et al. 2018; George & Huerta 2018; Fan et al. 2019; Gebhard et al. 2019) and generator models for weak-lensing convergence maps (Mustafa et al. 2019).
Temporal convolutional neural network (TCN) refers to a NN architecture introduced in Lea et al. (2017), initially derived from the Wavenet architecture (Oord et al. 2016). The network is a composition of a series of dilated convolutions and residual blocks used to expand the filters' receptive fields and reduce the training convergence time. A detailed description of deep learning techniques is available in specialized computer science publications, such as the overview of DL techniques by Schmidhuber (2015).
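The dilated convolutions give TCNs an exponentially growing receptive field. A quick computation for a Wavenet-style stack with dilation factors 2^(j−1) (the kernel size in the usage below is illustrative, not the paper's configuration):

```python
def receptive_field(kernel_size, n_layers):
    """Receptive field (in time steps) of stacked dilated causal convolutions
    with dilation factors 2**(j-1) for j = 1..n_layers."""
    dilations = [2 ** (j - 1) for j in range(1, n_layers + 1)]
    return 1 + (kernel_size - 1) * sum(dilations)
```

With kernel size 2, four layers already cover 16 time steps, whereas four undilated layers of the same kernel would cover only 5.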

B.2. UMAP
The UMAP (Uniform Manifold Approximation and Projection) algorithm is a nonlinear dimensionality-reduction algorithm introduced by McInnes et al. (2018). Assuming a uniform distribution of the data on a locally connected Riemannian manifold, the algorithm computes a low-dimensional representation by optimizing a fuzzy-set cross-entropy between the simplicial set representations of the data and the target embeddings. UMAP has recently gained interest and use for astronomical data applications, such as the SDSS DR15 spectroscopic data classification in Clarke et al. (2019) and anomaly detection in SDSS galaxy samples in Reis et al. (2019).

B.3. Model prediction using Gaussian Processes
In preprocessing, data reduction is performed using a Gaussian Process (GP) model, which is used to generate fixed-length representations of each source. Foreman-Mackey et al. (2017) provide GP kernels suitable for astronomical time-series, with various applications including radial-velocity fitting and transit modeling. For periodic VSs, we select a GP kernel based on a composition of stochastically driven damped harmonic oscillators (SHOs) with a quasi-periodic covariance term. For each SHO model, the associated power spectral density S(ω) is defined as

S_j(ω) = \sqrt{2/\pi}\, S_{0,j}\, \omega_{0,j}^4 / \left[ (\omega^2 - \omega_{0,j}^2)^2 + \omega_{0,j}^2 \omega^2 / Q_j^2 \right],

where ω 0, j and Q j respectively refer to the frequency of the undamped oscillator and the associated quality factor of the j-th oscillator. The parameter S 0, j is proportional to the power at resonance (i.e., ω = ω 0, j ). Our data reduction approach is a two-fold process: first, we fit a GP model to the observed data X obs ; second, we use the model to predict what the source would look like over a reduced time frame T red . For periodic VSs, we use a GP model corresponding to a mixture of N SHO = 2 SHOs. In the model parameterization, the periodicity (due to pulsation or binarity) is captured by the resonance frequency, such that ω 0, j = 2π/P j . Here, P j and A j refer respectively to the period and amplitude of the variability per SHO model j. Using the MAP (maximum a posteriori) solution, model prediction is performed on a time frame T red sampled within the range of observed epochs T obs . For unevenly sampled data, predictions made at times located within large time gaps are associated with a high uncertainty. To mitigate this limitation, we developed an approach to generate a random time range within the observed T obs outside significant time gaps: using the unsupervised K-means algorithm applied to the time differences ∆T obs , observations are clustered based on their proximity in time.
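The SHO power spectral density referenced above (the SHOTerm of the celerite package by Foreman-Mackey et al. 2017) can be evaluated numerically; a direct transcription for a single oscillator, assuming the standard celerite parameterization:

```python
import math

def sho_psd(omega, s0, omega0, q):
    """Power spectral density of a stochastically driven damped harmonic
    oscillator: s0 sets the power near the resonance frequency omega0,
    and the quality factor q controls the width of the resonance peak."""
    return (math.sqrt(2.0 / math.pi) * s0 * omega0 ** 4
            / ((omega ** 2 - omega0 ** 2) ** 2 + (omega * omega0 / q) ** 2))
```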
Significant time gaps T gaps are identified within the group of large time differences, and an optional rejection criterion is applied to refine the detected time gaps so that they span at least n cycles of the primary period of the light-curve.
The reduced time frame T red is generated via random sampling of time values within the observed time range outside the identified time gaps, T set = {T obs \T gaps }. For each detected subset j of clustered observations, time values are obtained either by randomly generating values within the range of T set, j or by randomly selecting values in T set, j shifted by a random δt > 0. The second approach generates a time frame T red close to the observed T set .
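A simplified stand-in for the gap-identification step can be sketched by thresholding on the median time step rather than clustering the time differences with K-means as described above; the `gap_factor` is a hypothetical choice, not a value from the paper:

```python
def find_time_gaps(times, gap_factor=10.0):
    """Return (start, end) pairs bounding gaps whose span exceeds
    gap_factor times the median time step. This thresholding is a
    simplified proxy for K-means clustering of the time differences."""
    times = sorted(times)
    deltas = [b - a for a, b in zip(times, times[1:])]
    median = sorted(deltas)[len(deltas) // 2]
    return [(times[i], times[i + 1])
            for i, d in enumerate(deltas) if d > gap_factor * median]
```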
To illustrate the results of data reduction using the aforementioned GP model and prediction scheme, a display of MACHO light-curves is provided in Figure B1. The GP model fitted to the data is a mixture of 2 SHOs with the following parameters.

[Figure B1 legend: phase-folded representation of the light-curves; GP predictions; identified large gaps.]
where P is the primary period of the time series, (µ A , σ 2 A ) refer to the amplitude of the time series and its associated error (or a fixed variance), and (µ Q , σ 2 Q ) are strictly positive values set to separate the two oscillation modes.

C. METRICS FOR MULTICLASS CLASSIFICATION
To quantify the performances in multiclass classification, the following metrics are computed:

accuracy = (1/N_s) \sum_{j=1}^{N_C} TP^{(j)},
precision^{(j)} = TP^{(j)} / (TP^{(j)} + FP^{(j)}),
recall^{(j)} = TP^{(j)} / (TP^{(j)} + FN^{(j)}),
F1^{(j)} = 2 \cdot precision^{(j)} \cdot recall^{(j)} / (precision^{(j)} + recall^{(j)}),

with N C the total number of classes, N s the number of true samples and TP (j) , TN (j) , FP (j) and FN (j) respectively referring to the true positives, the true negatives, the false positives and the false negatives computed from the predictions on objects from the true class j.
(c) Dilation factors in dTCN correspond to {2 j−1 } j:1→r , with r > 1 the dilation rate.

Figure D2. Architectures of the direct classifier and composite tCNNs. Naming convention follows the implementation in keras.

Figure D3. Architectures of the direct classifier and composite dTCNs. The series of dilated convolutions and residual stacks follows the Wavenet architecture (Oord et al. 2016), augmented with additional dropout functions to prevent overfitting. Naming convention follows the implementation in keras.

Table D3. Network sizes corresponding to the total number of parameters per model across three different datasets (B-band only [top], and two variants of the combination of R- and B-bands [middle and bottom]). The identifiers of the hyperparameter set configurations (1) to (6) are ordered by increasing size. The largest networks for the RNNs, tCNNs and dTCNs correspond respectively to identifiers (6), (4) and (6), and the smallest configurations to (1), (3) and (1). The metadata in the current application (N f = 6) adds 96 parameters.

Table E7. Classification accuracy, defined as the fraction of true positives within the sample, computed for all trained models across three different datasets (B-band only [top], and two variants of the combination of R- and B-bands [middle and bottom]). Solid underlines highlight the best-performing models associated with the minimum loss obtained on the Test set.

The identifiers (1) to (6) refer to the hyperparameter set configurations.

MACHO -B band
Figure E3. Displays (1) of reconstructed light-curves from the Test set for the best-performing composite d F,meta on the B-band (left to right: the best-performing LSTM, GRU, tCNN and dTCN). For visualization purposes, the 1−σ error measurements of the input data are not displayed in the reconstruction results.

Figure E4. Displays (2) of reconstructed light-curves from the Test set for the best-performing composite d F,meta on the B-band (left to right: the best-performing LSTM, GRU, tCNN and dTCN). For visualization purposes, the 1−σ error measurements of the input data are not displayed in the reconstruction results.