Improving Astronomical Time-series Classification via Data Augmentation with Generative Adversarial Networks

Due to the latest advances in technology, telescopes with significant sky coverage will produce millions of astronomical alerts per night that must be classified both rapidly and automatically. Currently, classification relies on supervised machine-learning algorithms whose performance is limited by the number of existing annotations of astronomical objects and their highly imbalanced class distributions. In this work, we propose a data augmentation methodology based on generative adversarial networks (GANs) to generate a variety of synthetic light curves from variable stars. Our novel contributions, consisting of a resampling technique and an evaluation metric, can assess the quality of generative models on imbalanced datasets and identify GAN-overfitting cases that the Fréchet inception distance does not reveal. We applied our proposed model to two datasets taken from the Catalina and Zwicky Transient Facility surveys. The classification accuracy of variable stars improves significantly when training with synthetic data and testing with real data, relative to training on real data alone.


INTRODUCTION
Deep learning models have become state-of-the-art in an extensive range of tasks, such as image recognition, video analysis, and natural language processing, demonstrating their immense ability to solve complex problems and outperform existing algorithms. Based on this fact, applying deep learning models to the classification of astronomical time-series arises as an interesting approach.
Models have progressively increased their number of parameters to achieve such results, from thousands to millions. Unfortunately, architectures with such a large number of parameters are vulnerable to overfitting. Overfitting occurs when a model memorizes the data available in the training set rather than learning meaningful characteristics from the data, so that it fails to generalize and perform well when tested on new and unseen data. To avoid overfitting, models that achieve state-of-the-art results in different tasks are trained with annotated datasets that have been extensively processed and filtered, and that consist of a large number of samples for each class.
However, real-world problems present different scenarios in regard to data. For example, not only is there a small number of annotations in astronomical time-series datasets, but the annotations also have highly imbalanced class distributions. While small datasets already hinder learning by making algorithms fail at generalizing characteristics of the data, imbalanced distributions only accentuate this issue (Caruana 2000; He & Garcia 2009). These two characteristics, in addition to the irregularly time-spaced nature of astronomical observations, pose a considerable difficulty for machine learning algorithms and make the classification problem a unique challenge.
To overcome these problems, data augmentation techniques are frequently applied to transform small imbalanced datasets into large and balanced datasets. Most of these techniques, although widely applied in the domain of images, cannot be directly applied in the time domain due to its dissimilar properties. Consequently, augmentation techniques in the time domain remain a challenge and deserve more attention from the community (Wen et al. 2021).
Traditional augmentation techniques in the time domain, such as jittering, window warping, and slicing, assume that these transformations exist naturally in the data and that the augmented samples will be valid time-series with similar properties to the existing ones. Moreover, appropriate augmentation techniques are specific to the dataset (Iwana & Uchida 2021) and the task (Wen et al. 2021). An example of a dataset-specific technique is jittering, where additive Gaussian noise is often used in sensor datasets; yet this method cannot model the heteroscedastic nature of astronomical data. On the task-specific side, we could mention slicing or warping transformations that heavily discard or modify the context of the time-series, potentially altering the original samples' class information.
A generative model for data augmentation avoids assumptions about existing transformations in the data. Since we will use the generated samples for classification, the generative model should learn the class-conditional distribution of the data. The model can therefore learn to generate new realistic samples directly from the data while simultaneously preserving the class information.
Because of their ability to model complex real-world data and the wide success they have achieved across a variety of domains (Sampath et al. 2021), Generative Adversarial Networks (GANs; Goodfellow et al. 2014) are the generative models of our choice.
While previous works have explored GAN-based data augmentation methods for classification, most have focused on the image domain (Zhu et al. 2018; Frid-Adar et al. 2018; Salehinejad et al. 2018; Huang et al. 2020) and only a few on the time domain (Ramponi et al. 2018; Zhang et al. 2020). Furthermore, Ramponi et al. (2018) is the only work that addresses astronomical time-series generation.
To the best of our knowledge, none of the existing approaches is suitable for our use-case: dealing with irregularly-spaced data, allowing for both multiclass and physical parameter conditional generation, and focusing on the downstream task of classification. In addition, the literature lacks a GAN evaluation metric to select appropriate models for classification tasks.
In this work, we propose a GAN-based data augmentation methodology for time-series to improve the classification accuracy on two astronomical datasets taken from the Catalina and Zwicky Transient Facility surveys. The main contributions are: (a) proposing a GAN model capable of performing conditional generation based on class and physical parameters, suitable for irregularly-spaced time-series; (b) revealing the incapability of the standard GAN evaluation metric (FID) to assess overfitting, and proposing a novel evaluation metric that overcomes this issue to select an adequate generative model; (c) proposing a resampling technique to delay the occurrence of overfitting; and (d) designing two new data augmentation techniques for time-series that produce plausible time-series preserving the properties of the original ones.
The remainder of the paper is structured as follows: Section 2 presents a theoretical background of the work. In Section 3 the utilized datasets and their preprocessing are explained. Section 4 explains the proposed methodology. Section 5 presents the obtained results, which are discussed in Section 6, stating its strengths and weaknesses. Finally, Section 7 presents the main conclusions of this work and future steps.
Imbalanced Datasets
Let D = {(x_i, y_i)}^n_{i=1} be a dataset where x_i is a real example and y_i ∈ Y = {1, 2, ..., c} is a class label associated with x_i. D is said to be imbalanced if the distribution of Y differs significantly from the discrete uniform distribution U{1, c}. Therefore, imbalanced datasets are composed of one or more classes (majority classes) that severely outrepresent other existing classes (minority classes) (He & Garcia 2009).
Given an imbalanced dataset D, we can apply sampling techniques to transform its class distribution into a uniform one. The result of this transformation is a modified version of the original dataset, its balanced counterpart D^u.
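As a minimal sketch of this transformation, the following numpy snippet oversamples every class (with replacement) up to the majority-class count; the function name and toy data are illustrative, not part of the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def oversample_uniform(X, y):
    # Replicate samples of every class (with replacement) up to the
    # majority-class count, turning an imbalanced D into its balanced D^u.
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = np.concatenate([rng.choice(np.where(y == c)[0], size=n_max, replace=True)
                          for c in classes])
    return X[idx], y[idx]

# Imbalanced toy dataset: 90 samples of class 0, 10 of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)
X_u, y_u = oversample_uniform(X, y)
```

After the call, both classes contain 90 samples, so the class distribution of (X_u, y_u) is uniform.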

Generative Adversarial Networks
The GAN framework consists of a game between two networks. Given an input dataset of real samples x_r ∼ P_r, the generator network (G) aims to implicitly approximate the data distribution P_r by performing a mapping between a source of noise and the real sample space. The results of this mapping are fake samples x_g ∼ P_g that attempt to resemble the real ones. In contrast, the discriminator network (D) tries to distinguish between x_r and fake samples generated by G.
During the training process, the two networks compete against each other without having control of the opponent's parameters. On the one hand, G is trained to generate samples that resemble the real ones, while on the other hand D is trained to predict whether a given sample comes from the input dataset or was generated by G. At the end of the training, G will generate samples similar to those in the input dataset, and D will be unable to tell generated samples apart from real ones.
Since the creation of GANs, they have revolutionized the field of generative modeling, showing novel results especially in the domain of images. As a broad overview of the evolution process, we could mention conditional-generation models (Mirza & Osindero 2014; Odena et al. 2017), models that stabilize the erratic behavior of the original GANs (Gulrajani et al. 2017; Miyato et al. 2018), and models that generate samples with impressively high quality and resolution (Karras et al. 2018; Brock et al. 2019; Karras et al. 2019; Karras et al. 2020b), among many other models and applications. An extensive description of GAN models in computer vision is provided in Wang et al. (2022).
GANs have also been applied to the time-series domain, with significant improvements in recent years. The first model capable of generating continuous sequential data was proposed by Mogren (2016) adding recurrent neural networks to the GAN's generator and discriminator to handle the time-series' temporal evolution. This work was followed by Esteban et al. (2017) who added label-conditional generation and a focus on downstream medical tasks. More recently, Yoon et al. (2019) introduced a jointly trained embedding network that combines the unsupervised GAN framework with a supervised autoregressive model to capture the time-series' conditional temporal dynamics. Lately, Ni et al. (2020) proposed a GAN framework to deal with long time-series data based on an approximation of the Wasserstein distance using the signature feature space, avoiding the usage of costly discriminators and claiming to achieve state-of-the-art results in measures of similarity and predictive ability.
The most related work corresponds to the T-CGAN (Ramponi et al. 2018), which proposes a method to generate irregularly-spaced time-series. Still, it does not include conditional generation with physical parameters of interest, it does not perform multi-class generation, and similarly to the works mentioned above, it does not tackle the problem of model selection for a downstream task.

Wasserstein GAN
The Wasserstein GAN (WGAN; Arjovsky et al. 2017) is a widely used GAN model, well known for its training stability. This GAN leverages an approximation of the Wasserstein-1 distance to measure the dissimilarity between P_r and P_g. An upgraded version of this model is the WGAN with Gradient Penalty (WGAN-GP; Gulrajani et al. 2017), which adds a regularization term to the original WGAN loss to satisfy the Lipschitz condition on D. The WGAN-GP objectives minimized during the training process are:

L_D = E_{x_g∼P_g}[D(x_g)] − E_{x_r∼P_r}[D(x_r)] + λ E_{x̂∼P_x̂}[(‖∇_x̂ D(x̂)‖_2 − 1)²],   (1)

L_G = −E_{x_g∼P_g}[D(x_g)],   (2)

where P_x̂ is the distribution implicitly defined by sampling uniformly along linear paths between points sampled from P_r and P_g, and λ is the penalty coefficient that controls the strength of the gradient regularization.
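The WGAN-GP objectives can be sketched in plain numpy with a hypothetical linear critic whose input gradient is known in closed form (so no autograd framework is needed); all names and toy data below are illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def wgan_gp_losses(x_r, x_g, D, grad_D, lam=10.0):
    # Interpolate uniformly along straight lines between real and
    # generated samples: this defines the distribution P_x̂.
    eps = rng.uniform(size=(x_r.shape[0], 1))
    x_hat = eps * x_r + (1.0 - eps) * x_g
    # Gradient penalty: push the critic's gradient norm toward 1
    # (the Lipschitz condition).
    gp = np.mean((np.linalg.norm(grad_D(x_hat), axis=1) - 1.0) ** 2)
    loss_d = np.mean(D(x_g)) - np.mean(D(x_r)) + lam * gp
    loss_g = -np.mean(D(x_g))
    return loss_d, loss_g

# Toy linear critic D(x) = x @ w, whose gradient w.r.t. x is w everywhere;
# with a unit-norm w the gradient penalty vanishes.
w = np.array([0.6, 0.8])
D = lambda x: x @ w
grad_D = lambda x: np.tile(w, (x.shape[0], 1))

x_r = rng.normal(size=(64, 2))
x_g = rng.normal(size=(64, 2)) + 1.0
loss_d, loss_g = wgan_gp_losses(x_r, x_g, D, grad_D)
```

In a real setup, D would be a neural network and gp would be computed via automatic differentiation; the structure of the two losses is the same.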

Data Augmentation
Data augmentation refers to a set of techniques applied to a dataset used to create new samples that are slightly different from the existing ones to increase the number of samples in the dataset. It is frequently used to prevent overfitting, and it helps improve the performance of machine learning models for various applications (Wen et al. 2021). Classic examples of this in the field of images are rotations, translations, crops, and flips, among others.
In the time domain, data augmentation techniques are less standardized. Traditional techniques for time-series correspond to non-parametric transformations such as jittering, scaling, window-slicing, and window-warping (Guennec et al. 2016). Parametric techniques can also be applied in data augmentation, such as the parametric model-based augmentation for transient phenomena proposed in Pimentel et al. (2022).

Overfitting in GANs
As described in Karras et al. (2020a), overfitting in GANs occurs when training on small datasets. The less data there is, the earlier the discriminator becomes too confident in separating real from generated samples, which impedes the progress of G and eventually deteriorates the quality of the generated samples.
Even though Karras et al. (2020a) proposed Adaptive Discriminator Augmentation (ADA) as a technique to deal with overfitting in GANs, this technique requires the application of differentiable transformations to augment the training data. Since our goal is to provide a GAN-based data augmentation method motivated by the limited augmentation methods for time-series, we intentionally do not include any augmentation method (apart from oversampling) in the GAN-training process, hence we do not consider using ADA.

Evaluation of GANs
Even though the losses described in Equations 1 and 2 successfully describe the adversarial problem and quantify the distance between P_r and P_g, their high variance makes them unsuitable as a stopping criterion. Even if they did not suffer from this issue, metrics based on D are specific to their corresponding G and cannot capture general properties of the generated dataset. Consequently, the framework requires additional evaluation metrics to assess the quality of the generated samples and select the definitive generator for the downstream task.
Evaluation of generative models requires a notion of the distance between P r and P g . Defining such a measure for high dimensional distributions is a challenging task and remains an open problem (Naeem et al. 2020).
An intuitive way of comparing these distributions is as follows: if a generative model can successfully capture P r with P g , the performance on any downstream task should be similar when our data comes from any of the two distributions. Setting the downstream task to classification leads to using classification metrics for evaluation.

Classification metrics
Considering that the ultimate purpose of this work is to improve the classification of real astronomical objects, we naturally adopt the classification accuracy metric first proposed in Yang et al. (2017) and later used in Esteban et al. (2017), Santurkar et al. (2018), Shmelkov et al. (2018) and Ravuri & Vinyals (2019). For clarity, we choose to preserve the names in Esteban et al. (2017): Train on Synthetic Test on Real (TSTR) and Train on Real Test on Real (TRTR). These two scores are computed by training a classifier on synthetic (generated) data or real data and then evaluating its classification accuracy on real data.
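A toy illustration of the TSTR and TRTR scores, using a hypothetical nearest-centroid classifier and synthetic 2-D data standing in for light-curve features (everything below is an assumption for illustration, not the paper's classifier):

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_centroids(X, y):
    # Minimal stand-in classifier: one centroid per class.
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def accuracy(centroids, X, y):
    classes = np.array(sorted(centroids))
    dists = np.stack([np.linalg.norm(X - centroids[c], axis=1) for c in classes], axis=1)
    return float(np.mean(classes[np.argmin(dists, axis=1)] == y))

def make_split(n, shift=0.0):
    # Two well-separated Gaussian classes; `shift` mimics a slightly
    # biased generator.
    X = np.vstack([rng.normal(0.0, 0.3, (n, 2)), rng.normal(3.0, 0.3, (n, 2)) + shift])
    return X, np.array([0] * n + [1] * n)

X_real, y_real = make_split(50)        # real training data
X_syn, y_syn = make_split(50, 0.1)     # "generated" training data
X_test, y_test = make_split(50)        # real test data

# Train on Real, Test on Real vs. Train on Synthetic, Test on Real.
trtr = accuracy(fit_centroids(X_real, y_real), X_test, y_test)
tstr = accuracy(fit_centroids(X_syn, y_syn), X_test, y_test)
```

If the generator captures P_r well, TSTR should approach TRTR, which is exactly the comparison used throughout this work.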

Feature-based metrics
Based on the difficulty of finding meaningful metrics in the input space, quantifying the distance between the distributions P r and P g often involves mapping samples x ∈ {x r , x g } into a feature space with a transformation x → φ(x), where φ is an intermediate representation of a pre-trained classifier (Salimans et al. 2016;Heusel et al. 2017;Sajjadi et al. 2018;Kynkäänniemi et al. 2019;Naeem et al. 2020). The classifier is generally a convolutional neural network (CNN) such as the Inception-v3 (Szegedy et al. 2016), a widely used architecture in computer vision.
Since the dimensionality of φ is often lower than that of x, the distributions of the feature space are often called manifolds. We will informally understand these manifolds as connected regions with a relatively simple structure embedded in a more complex space.
When evaluating generative models, two desired characteristics are fidelity and diversity. The former describes how real the generated samples look in comparison to the real ones, while the latter measures how much of P_r the model can cover with P_g.
The Fréchet Inception Distance (FID) - This metric, proposed by Heusel et al. (2017), consists of a Wasserstein-2 distance between Φ_r and Φ_g, the distributions of φ_r and φ_g, respectively.
Under the assumption that both distributions are multivariate Gaussians, their means µ and covariances Σ are estimated to obtain a closed form of the distance:

FID = ‖µ_r − µ_g‖²_2 + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}),   (3)

where (a) denotes the first term and (b) the second. While (a) can be interpreted as a measure of fidelity that indicates the average distance between the two distributions, (b) can be interpreted as a measure of diversity that compares the variability of the two distributions.
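Given the estimated moments, the closed-form distance can be computed directly; the sketch below uses the fact that Tr((Σ_r Σ_g)^{1/2}) equals the sum of the square roots of the eigenvalues of Σ_r Σ_g (the function name is illustrative):

```python
import numpy as np

def fid_gaussian(mu_r, cov_r, mu_g, cov_g):
    # Term (a): squared distance between the means (fidelity).
    mean_term = float(np.sum((mu_r - mu_g) ** 2))
    # Term (b): Tr(Σr + Σg − 2 (Σr Σg)^{1/2}) (diversity). The trace of the
    # matrix square root equals the sum of the square roots of the
    # eigenvalues of Σr Σg; clip tiny negative values from round-off.
    eig = np.linalg.eigvals(cov_r @ cov_g).real
    cov_term = float(np.trace(cov_r) + np.trace(cov_g)
                     - 2.0 * np.sum(np.sqrt(np.clip(eig, 0.0, None))))
    return mean_term + cov_term

mu = np.zeros(2)
I = np.eye(2)
```

Identical Gaussians yield FID = 0, and shifting one mean by a unit vector yields FID = 1, isolating term (a).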
A particularly relevant limitation of FID in the presence of highly imbalanced distributions is that computing the last term in (b) requires full-rank Σ matrices, which makes the calculation of a per-class FID unfeasible if the minority classes contain fewer samples than the dimensionality of Φ. Furthermore, even if we had enough samples to compute it, a per-class score would be unreliable for the minority classes since FID is known to suffer from high bias for small sample sizes (Bińkowski et al. 2018).
Precision and Recall - Sajjadi et al. (2018) proposed separating fidelity and diversity into two relative-density-based metrics: precision and recall. These two metrics improve upon FID by identifying cases of mode dropping or mode inventing in the generated distribution, including the pathological case where different models achieve similar FID values by privileging either one of the two terms in Equation 3.
Improved Precision and Recall - Motivated by the failure of the original formulation at identifying models with poor variability, Kynkäänniemi et al. (2019) proposed improved precision and recall metrics (P&R). These metrics are computed by estimating the manifolds according to:

manifold(Φ) = ⋃_{φ∈Φ} B(φ, NND_k(φ)),   (4)

where Φ ∈ {Φ_r, Φ_g} is a collection of feature samples φ ∈ {φ_r, φ_g}, the ball B(x, r) is the solid sphere around x with radius r, and NND_k(φ) is the distance from φ to its k-th nearest neighbor within the corresponding manifold. In the presence of outliers, the KNN approach over-estimates the manifolds due to the large distances between samples.
Density and Coverage - Naeem et al. (2020) proposed density and coverage (D&C), motivated by the vulnerability of P&R to outliers. While P measures fidelity based on the binary decision of whether a feature sample φ_g belongs to the real manifold Φ_r, D counts the number of balls B(φ_r, NND_k(φ_r)) in which each φ_g is contained, adding robustness to real distributions with outliers. On the other hand, C measures diversity based on the real manifold estimate instead of the generated one, in contrast to R. In our practical case, we found that P and C saturate quickly and do not provide meaningful information. Since these metrics directly depend on the real manifold estimates, we hypothesize that this behavior is caused by the sparsity of Φ_r in the minority classes, leading to the same over-estimation issue as outliers. Consequently, we decide to use D and R as our fidelity and diversity metrics.
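The density and recall computations described in this section can be sketched with a small brute-force k-NN implementation (toy feature samples and names are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

def knn_radii(F, k):
    # NND_k: distance from each feature sample to its k-th nearest
    # neighbour within F (column 0 of the sorted distances is the
    # sample itself, at distance 0).
    d = np.linalg.norm(F[:, None, :] - F[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k]

def density(phi_r, phi_g, k=3):
    # Counts, per generated sample, how many real balls
    # B(φr, NND_k(φr)) contain it, normalized by k.
    r = knn_radii(phi_r, k)
    d = np.linalg.norm(phi_g[:, None, :] - phi_r[None, :, :], axis=-1)
    return float(np.sum(d <= r[None, :]) / (k * len(phi_g)))

def recall(phi_r, phi_g, k=3):
    # Fraction of real samples inside the estimated generated manifold.
    r = knn_radii(phi_g, k)
    d = np.linalg.norm(phi_r[:, None, :] - phi_g[None, :, :], axis=-1)
    return float(np.mean(np.any(d <= r[None, :], axis=1)))

phi_r = rng.normal(size=(40, 2))
```

When the generated features coincide with the real ones, recall is exactly 1 and density is above 1; a generator collapsed far from the real manifold drives both toward 0.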
Let B^k_r be the abbreviation of B(φ_r, NND_k(φ_r)), and let Φ̂_g be the approximation of the generated manifold described in Equation 4. We compute the D&R metrics according to:

D = (1 / (k |Φ_g|)) Σ_{φ_g ∈ Φ_g} Σ_{φ_r ∈ Φ_r} 1_{B^k_r}(φ_g),   (5)

R = (1 / |Φ_r|) Σ_{φ_r ∈ Φ_r} 1_{Φ̂_g}(φ_r),   (6)

where 1_A(x) is the indicator function defined as:

1_A(x) = 1 if x ∈ A, and 0 otherwise.   (7)

3. DATA

Datasets
Because of the recognizable shapes of their light curves when visualized in phase space, we focus on periodic variable stars. However, the framework could be effortlessly extended to other stars of interest if needed. We perform and validate our experiments on data captured by two time-domain astronomical surveys.
The Catalina Surveys Data Release 1 - This catalog, described in Drake et al. (2014) and captured with the 8.2 deg² field-of-view camera mounted on the CSS 27-inch Schmidt telescope, provides ∼61,000 light curves of periodic variable objects, with their corresponding periods and classes. To decrease the complexity of the multiclass problem induced by the large number of periodic classes provided, we only consider a subset of the periodic objects, grouped following the mapping described in Table 1.

(Table 1, excerpt: the CEP class groups Anomalous Cepheids (ACEP) and type-II Cepheids (Cep-II).)
The Zwicky Transient Facility - This survey, known by its acronym ZTF (Bellm et al. 2018), provides a public multiband stream of alerts captured by a 47 deg² field-of-view camera mounted on the Palomar 48-inch Schmidt telescope, which is capable of scanning the entire northern sky every three nights and the plane of the Milky Way every night. To enable further analysis with follow-up telescopes, the alerts are processed by alert brokers designed to provide a rapid and self-consistent classification. We use the subset of periodic variable stars present in the ZTF training set created by the ALeRCE broker (Förster et al. 2021), along with their taxonomy. This training set was constructed considering sources observed by ZTF whose labels had been cross-matched from multiple catalogs.
Previous works (Sánchez-Sáez et al. 2021, Carrasco-Davis et al. 2021) have already used ZTF data processed by the ALeRCE broker to train different machine learning algorithms. More details about the data processing can be found in Förster et al. (2021).
After pre-processing both datasets following the steps detailed in Section 3.2, we obtain the definitive versions of the datasets used in our experiments, from now on referred to as the "Catalina" and "ZTF" datasets. The class distributions of the pre-processed datasets are shown in Table 2.

Pre-processing
To use the data described in Section 3.1, some pre-processing steps need to be applied. The pre-processing consists of four main steps: period folding, outlier filtering, time sampling, and median centering.

Period folding
Since the desired characteristic shapes of periodic light curves are only visible in phase space, we start by folding the light curves with the period provided in both datasets. Denoting the light curve period as T and the observation time as t, the folding operation converts t into φ_t according to:

φ_t = (t ≡ T) / T,   (8)

where the congruence symbol ≡ in Equation 8 refers to the modulo operator with modulus T. With this operation, we transform times with a variable range of values into phases with values bounded between 0 and 1. This transformation is convenient because multiple neural networks will process the phases, and having inputs with a similar range is a desirable property when training such algorithms.
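The folding operation amounts to taking the remainder of the observation time modulo the period and scaling it by the period, e.g.:

```python
import numpy as np

def fold(t, T):
    # Map observation times to phases in [0, 1):
    # (t mod T), then scaled by the period T.
    return np.mod(t, T) / T

t = np.array([0.0, 2.5, 5.0, 7.25])   # observation times
phases = fold(t, T=2.0)               # -> [0.0, 0.25, 0.5, 0.625]
```

All phases fall in [0, 1) regardless of the original time range, which is the input normalization the networks rely on.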

Outlier filtering
Considering that some of the light curves in the datasets can include a significant amount of noise, we filter out anomalous observations within each curve of both datasets. These anomalous observations are in general isolated observations with a magnitude that does not follow the general behavior of the magnitudes in the light curve, and including them could be detrimental to the performance of our algorithms. For the Catalina dataset, the anomalous behavior is quite particular to each light curve, and a general threshold filtering cannot be applied; therefore, a different approach is needed.
The Catalina light curves are filtered by comparing each magnitude with the local statistics of its neighborhood. This comparison is performed using the z-score of the magnitudes within a window that considers only a portion of the light curve. The process is performed by sliding the window through the entire light curve with a window size w_s = 20, removing the outlier observations that satisfy z-score > 3, and repeating the pass twice per light curve, since consecutive outlier observations can significantly alter the moving window's statistics and escape detection in a single pass. The results of this filtering step are shown in Figure 1b. After this step, we perform a second filtering stage by discarding the light curves that have more than 90% of their magnitudes outside the range delimited by the class medians and class standard deviations.
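A simplified sketch of the sliding-window z-score filter (here using a centered window around each observation; function names, parameters, and toy data are assumptions, not the paper's exact implementation):

```python
import numpy as np

def local_zscore_outliers(mag, ws=20, thr=3.0):
    # Flag observations whose z-score, computed over a window of size ws
    # centered on the observation, exceeds the threshold.
    flags = np.zeros(len(mag), dtype=bool)
    for i in range(len(mag)):
        lo, hi = max(0, i - ws // 2), min(len(mag), i + ws // 2)
        w = mag[lo:hi]
        mu, sd = w.mean(), w.std()
        if sd > 0 and abs(mag[i] - mu) / sd > thr:
            flags[i] = True
    return flags

def filter_curve(phase, mag, ws=20, thr=3.0, passes=2):
    # Two passes: consecutive outliers can inflate the window statistics
    # and escape detection on the first pass.
    for _ in range(passes):
        keep = ~local_zscore_outliers(mag, ws, thr)
        phase, mag = phase[keep], mag[keep]
    return phase, mag

phase = np.linspace(0.0, 1.0, 100, endpoint=False)
mag = np.sin(2 * np.pi * phase)
mag[30] = 50.0                      # inject an isolated outlier observation
phase_f, mag_f = filter_curve(phase, mag)
```

The injected spike is removed while the smooth underlying curve is preserved.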
On the other hand, anomalous observations in the ZTF data have been already marked with a magnitude of 100. Hence, these observations can be filtered out by a simple threshold. Following the filtering steps used by Sánchez-Sáez et al. (2021), we use mag thr = 30.

Time sub-sampling
To bring the problem to a more straightforward domain, we set the length of the light curves to a predefined value for each dataset. With this simplification, we can work with convolutional architectures rather than recurrent architectures that could hinder the GAN's training stability by violating the Lipschitz constraint, adding extra complexity to the problem.
Given a light curve with an arbitrary number of m observations, we obtain the fixed-length light curves by randomly choosing n from the m available observations. Considering that we choose our points with no particular bias, this approach should give a reasonable approximation of the original light curve if n is not too small compared to m.
Since both of our real datasets contain irregularly sampled light curves, and we perform the sub-sampling step after the period folding step, choosing an observation implies selecting a magnitude with its corresponding observation phase. Both magnitudes and phases are part of the input of our models, as will be detailed in Section 4. Figure 1b shows an example of the time sub-sampling step.
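The sub-sampling step can be sketched as a random choice of n indices without replacement, keeping the (phase, magnitude) pairs together (the toy light curve below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def subsample(phase, mag, n):
    # Randomly keep n of the m available observations, with no particular
    # bias; each kept index selects a (phase, magnitude) pair.
    idx = np.sort(rng.choice(len(phase), size=n, replace=False))
    return phase[idx], mag[idx]

phase = np.linspace(0.0, 1.0, 250, endpoint=False)
mag = np.cos(2 * np.pi * phase)
phase_s, mag_s = subsample(phase, mag, n=100)   # e.g. Catalina length L = 100
```

Sorting the chosen indices keeps the phases in increasing order, and the magnitude of each kept observation still matches its phase.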
The light curve length is set to 100 observations for the Catalina dataset, whereas that of the ZTF dataset is 40 observations, consistent with the fact that ZTF is a relatively new sky survey with a lower number of observations per object compared to the Catalina Survey.
After discarding the light curves that do not have the minimum length to perform this step, we end up with approximately 41k and 56k samples in the Catalina and ZTF datasets, respectively, whose class distribution is shown in Table 2.

Median centering
The last step to get the data ready for generation is centering each light curve around 0, so that all the magnitudes have a consistent range that can be learned by the generator. This is done for each light curve by subtracting the center (median) of the magnitudes. We use the median instead of the mean because of its robustness to outlier magnitudes.
This step is necessary because G is a neural network that ends in a tanh activation, and it can only generate values in a symmetric range around zero. It is worth mentioning that we could center the data around any other offset, which would also require adding that offset to the output of the generator; the importance of this step is not the value of the offset itself, but rather the unification of all the magnitudes around a single value so that our generator can model them.
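The centering step is a one-liner; the magnitudes below are hypothetical values chosen to show the median's robustness to an outlier:

```python
import numpy as np

def median_center(mag):
    # Subtract the per-curve median so magnitudes lie in a symmetric range
    # around zero, compatible with the generator's tanh output.
    return mag - np.median(mag)

mag = np.array([14.2, 14.5, 14.3, 19.0])   # hypothetical magnitudes with an outlier
centered = median_center(mag)              # -> [-0.2, 0.1, -0.1, 4.6]
```

Unlike mean centering, the single outlier barely shifts the center, so the bulk of the curve stays near zero.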

General Description
We propose a conditional generation approach that extends the T-CGAN (Ramponi et al. 2018), adding the class and amplitude of the light curves to the conditional parameters, which in the original model include only the observation phases. The details of how the conditional parameters are included in the model are explained in Section 4.2.
A summary of the proposed methodology, that details the partitions of datasets for the models and metrics is provided in Figure 2.
We start by partitioning the pre-processed dataset D into D_train, D_val, and D_test: the train, validation, and test sets. Each class in D_val and D_test contains 20% of the total number of samples of the smallest class in D.
To train the GAN and the classifier we use D^u_train, a uniformly balanced version of the original D_train obtained through the resampling block, which will be explained in Section 4.6.
After training the GAN, we use G to create a synthetic uniformly balanced dataset D^u_gen. Since G performs conditional generation, to generate a uniformly balanced dataset we sample the conditional vectors z̃ from D^u_train. It is essential to mention that the generated dataset will follow the distribution of the dataset from which we sample the conditional vectors. For example, sampling them from D_train would imply generating a heavily imbalanced dataset. To obtain the TSTR score, we train a classifier on D^u_gen and evaluate its accuracy on a real dataset.
We compare the TSTR score to multiple TRTR scores, computed in a similar manner but using D^u_train (or slightly modified versions of it) instead of D^u_gen. This comparison is reasonable because the datasets used for evaluation (D_val and D_test) are fixed and balanced by construction: their sampling process from D is designed to have the same number of samples per class.

Data structure details
Let φ_t, a, and c denote the observation phases, amplitude, and class of a light curve, respectively. Our GAN's generator requires a sample z̃ = [φ_t, a, c] from the real dataset D_train and a latent sample z ∼ N(0, I).
The latent space dimensionality is set to 16 and 8 for the Catalina and ZTF datasets, respectively, roughly following the proportion between the light curve lengths of the two datasets. Following a CGAN-like approach (Mirza & Osindero 2014), the concatenation of z and z̃ is passed as input to G to generate synthetic samples.
The conditional parameters are also inputs of D, similarly concatenated with real or generated magnitudes. For this concatenation to be viable, we create a tensor version of the conditional parameters. Let a and c be tensor versions of a and c, and let L and N denote the light curve length and the number of classes of a dataset. We define a ∈ R^L as a vector with value a in all its components, and c ∈ R^{L×N} as a one-hot encoding of c composed of 0's and 1's vectors, where {0, 1} ∈ R^L. The tensor version of z̃ is z̃ = [φ_t, a, c] ∈ R^{L×(2+N)}. The concatenation of z̃ with real or generated magnitudes is the input of D and has dimensions L × (3 + N).
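The construction of the conditional tensor can be sketched as follows; the light-curve length, number of classes, and input values are hypothetical:

```python
import numpy as np

def conditional_tensor(phase, amplitude, cls, n_classes):
    # phase: (L,) observation phases φ_t; amplitude and class are
    # broadcast along the temporal dimension.
    L = len(phase)
    a = np.full((L, 1), amplitude)                 # a ∈ R^L, constant vector
    c = np.zeros((L, n_classes)); c[:, cls] = 1.0  # c ∈ R^{L×N}, one-hot per step
    # z̃ = [φ_t, a, c] with shape (L, 2 + N)
    return np.concatenate([phase[:, None], a, c], axis=1)

z_tilde = conditional_tensor(np.linspace(0, 1, 100, endpoint=False),
                             amplitude=0.8, cls=2, n_classes=8)
```

Concatenating one more column of (real or generated) magnitudes would give the L × (3 + N) input of D.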

Classifier details
To reduce the variance of the experiments, the classifier consists of an ensemble of 5 identical base-classifiers trained independently. The base-classifier is a CNN that receives the concatenation of the magnitudes x and phases φ t following the classification scheme in Ramponi et al. (2018). The input is forwarded through a set of convolution blocks that halve the temporal dimension, followed by dense layers. The network is trained using Adam optimizer (Kingma & Ba 2015) with α = 0.0001, β 1 = 0.9, β 2 = 0.999. Table 3 shows the detailed architecture of the base classifier. To compute all the feature-based metrics explained in Section 2.5.2, we use the output of the last convolution block of this base classifier, trained on each of the datasets separately.

GAN details
In addition to the original WGAN-GP formulation, we include additional regularization terms in Equations 1 and 2. Following an AC-GAN-like approach (Odena et al. 2017), the output of D has two components: D_rg ∈ R, which tries to separate real from fake samples, and D_y ∈ R^N, which tries to predict the class of the input. Therefore, we add a cross-entropy regularization of real and generated samples to the discriminator loss. Also, to prevent the GAN equilibrium from happening at an arbitrary offset, we add a regularization term that prevents D_rg(x_r) from drifting too far away from zero, as proposed in Karras et al. (2018). To the generator loss, we only add the cross-entropy regularization of generated samples. Consequently, the losses minimized in the proposed framework are:

L_D = E_{x_g∼P_g}[D_rg(x_g)] − E_{x_r∼P_r}[D_rg(x_r)] + λ E_{x̂∼P_x̂}[(‖∇_x̂ D_rg(x̂)‖_2 − 1)²] + ξ E_{x_r∼P_r}[D_rg(x_r)²] + ε (H_r + H_g),

L_G = −E_{x_g∼P_g}[D_rg(x_g)] + ε H_g,

where H_r = H(y_r, D_y(x_r)) and H_g = H(y_g, D_y(x_g)) correspond to the cross-entropy between the real labels and the discriminator predictions, y_g are the real labels used to generate x_g, and ξ = 0.001 and ε = 1 control the strength of each regularization term. We perform n_disc = 5 discriminator iterations per generator iteration and train for 400K generator iterations using the Adam optimizer with α = 0.0001, β_1 = 0.5, β_2 = 0.9. At training time, we compute the Exponential Moving Average (EMA; Yazici et al. 2019) of the generator weights with decay 0.999, to be used when generating samples for evaluation. A full description of the GAN architecture is shown in Table 4.

Table 3. Classifier architecture. L and N correspond to the light curve length and number of classes, respectively, and vary depending on the selected dataset, as mentioned in Section 3. The fixed block parameters ps and ks stand for pool size and kernel size. Since the convolution blocks always halve the temporal dimension, we only specify their channel dimensions c_in and c_out.
On the one hand, G receives the concatenation of the noise source z and the conditional variables z̃ as input, and forwards it through a dense layer followed by a set of strided deconvolutions, each of which doubles the temporal dimension while halving the number of channels (except for the last block). On the other hand, D receives the concatenation of the magnitudes x and the conditional tensor z̃, and forwards it through a set of strided convolutions, each of which halves the temporal dimension while doubling the number of channels, followed by a dense layer.
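A minimal PyTorch sketch of this layout follows; the widths, latent size, and output length are illustrative, and the exact channel dimensions are those of Table 4:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Dense layer, then strided deconvolutions that double the temporal
    dimension and halve the channels (except for the last block)."""
    def __init__(self, z_dim=32, cond_dim=8, l0=8, channels=(64, 32, 16)):
        super().__init__()
        self.l0, self.c0 = l0, channels[0]
        self.fc = nn.Linear(z_dim + cond_dim, channels[0] * l0)
        layers, c_in = [], channels[0]
        for c_out in channels[1:]:
            layers += [nn.ConvTranspose1d(c_in, c_out, 4, stride=2, padding=1),
                       nn.ReLU()]  # double length, halve channels
            c_in = c_out
        layers += [nn.ConvTranspose1d(c_in, 1, 4, stride=2, padding=1)]
        self.deconvs = nn.Sequential(*layers)

    def forward(self, z, cond):
        h = self.fc(torch.cat([z, cond], dim=1)).view(-1, self.c0, self.l0)
        return self.deconvs(h).squeeze(1)

class Discriminator(nn.Module):
    """Strided convolutions that halve the temporal dimension and double
    the channels, then a dense layer with two heads: D_rg and D_y."""
    def __init__(self, length=64, cond_dim=8, channels=(16, 32, 64), n_classes=5):
        super().__init__()
        convs, c_in = [], 1 + cond_dim
        for c_out in channels:
            convs += [nn.Conv1d(c_in, c_out, 4, stride=2, padding=1),
                      nn.LeakyReLU(0.2)]  # halve length, double channels
            c_in = c_out
        self.features = nn.Sequential(*convs)
        self.head = nn.Linear(c_in * (length // 2 ** len(channels)), 1 + n_classes)

    def forward(self, x, cond):
        cond_t = cond[:, :, None].expand(-1, -1, x.shape[-1])  # tile z~ over time
        out = self.head(self.features(torch.cat([x[:, None, :], cond_t], dim=1)).flatten(1))
        return out[:, 0], out[:, 1:]  # D_rg, D_y
```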

Preliminary experiment: The u-GAN
With all details and parameters provided in the sections above, we perform a preliminary experiment using D^u_train (the uniformly balanced version of D_train) as the GAN training set, to then generate D^u_gen and obtain the TSTR accuracy scores. This GAN setup will be referred to as the "u-GAN".
It is worth mentioning that this setup is the standard approach when training machine learning algorithms, where D^u_train is usually preferred over D_train because it reduces the biases towards the most populated classes induced by the highly imbalanced class distribution of D_train.
The first finding of performing this preliminary experiment is that the TSTR accuracy score can vary significantly depending on how long we train the GAN. For this reason, we analyze the behavior of different GAN models throughout the training process to find an adequate criterion for model selection. Figure 3 shows the evolution of the validation TSTR accuracies and FID scores every 10k iterations. Since computing TSTR accuracies involves training multiple classifiers, evaluating this score more frequently is unfeasible.
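The TSTR protocol itself is short to express. In this sketch, a nearest-centroid classifier stands in for the CNN ensemble purely to keep the example self-contained:

```python
import numpy as np

def tstr_accuracy(x_syn, y_syn, x_real, y_real):
    """Train on Synthetic, Test on Real: fit a classifier on generated data
    and report its accuracy on held-out real data.  A nearest-centroid rule
    replaces the CNN ensemble here for illustration only."""
    classes = np.unique(y_syn)
    centroids = np.stack([x_syn[y_syn == c].mean(axis=0) for c in classes])
    dists = ((x_real[:, None, :] - centroids[None]) ** 2).sum(axis=2)
    pred = classes[dists.argmin(axis=1)]
    return (pred == y_real).mean()
```

Evaluating a GAN checkpoint then amounts to generating a balanced synthetic set, calling this function against the real validation split, and repeating over classifiers to average out training noise, which is why frequent evaluation is expensive.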
The preliminary experiment shown in Figure 3 raises two major concerns that will be addressed in the following sections: a) The TSTR accuracy reaches an optimal value early in the GAN training and then decreases consistently, coinciding with the GAN overfitting phenomenon explained in Section 2.4.
b) The FID, the standard metric for evaluating GANs, cannot always measure the drop in sample quality reflected in the TSTR accuracy curve, as shown in Figure 3a.
The behavior detailed in a) can be understood as follows: in a balanced dataset such as D^u_train, overfitting is not only strongly influenced by the limited number of training samples, but is also exacerbated by the degree of imbalance of the original class distribution of D_train. As the imbalance grows, samples in the minority classes must be repeated excessively to match the number of samples in the majority classes, resulting in quick GAN overfitting as D rapidly learns what samples of the minority classes look like. The rapid decay in validation TSTR accuracy is problematic considering that we can only compute this metric every 10k iterations. Hence, the best model selected by this metric could be sub-optimal if the decay occurs suddenly, which motivates the proposed resampling block explained in Section 4.6.
The discrepancy described in b), although undesirable, is not surprising; it was also reported by Ravuri & Vinyals (2019), and it is entirely plausible given the limitations of FID related to mode dropping and mode inventing mentioned in Section 2.5.2. These two phenomena can drastically change how P_g relates to P_r, and thus affect the TSTR accuracy without being reflected in the FID. This suggests that FID is not always reliable in the presence of highly unbalanced datasets, and motivates the proposed G-score for model selection explained in Section 4.7.

Resampling block
Motivated by the rapid GAN overfitting shown in Figure 3, we propose a resampling operation that can successfully delay the occurrence of this behavior.
The resampling operation consists of continuously drawing samples from the N classes of a dataset D to modify its class distribution. Let S be the number of samples in D. We start by splitting D into N sub-datasets, where each dataset D_i contains only samples from the i-th class. From each sub-dataset, we draw without replacement until no samples are left; then D_i is shuffled and the sampling process continues.
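A minimal sketch of this drawing process, together with the γ-interpolated target distribution described below, could look as follows:

```python
import numpy as np

def gamma_resampler(labels, gamma, seed=0):
    """Endless stream of sample indices whose class distribution interpolates
    between the original one (gamma = 0) and the uniform one (gamma = 1).
    Within each class, indices are drawn without replacement, and the class
    pool is reshuffled once exhausted."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    # p_i = (1 - gamma) * S_i / S + gamma / N
    p = (1 - gamma) * counts / counts.sum() + gamma / len(classes)
    pools = {c: [] for c in classes}
    while True:
        c = rng.choice(classes, p=p)
        if not pools[c]:
            pools[c] = list(rng.permutation(np.flatnonzero(labels == c)))
        yield pools[c].pop()
```

Feeding GAN training batches from this generator with an intermediate γ is what delays the overfitting discussed in this section.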
The goal of this operation is to modify the class distribution of D by controlling the probability p_i of drawing a sample from each D_i. The resampling block generalizes the uniform balancing operation by extending the target class distribution to non-uniform distributions. To illustrate this clearly, we describe two edge cases. On the one hand, we could keep the original class distribution by setting p_i = S_i/S, in which case the resampling block does not affect the class distribution and is equivalent to a "shuffle and repeat" operation. On the other hand, we could obtain the balanced version of D by simply setting p_i = 1/N, which is how we get D^u_train from D_train. Apart from these two scenarios, we can also generate any dataset D^γ whose class distribution lies "in between" those of D and D^u, created by linearly interpolating between the aforementioned probabilities:

p_i^γ = (1 − γ) S_i/S + γ/N,

where the two edge cases are recovered with γ = 0 for the imbalanced D, and γ = 1 for the balanced D^u.

Table 4. GAN architecture. The latent space dimensionality, light curve length L, and number of classes N depend on the selected dataset, as mentioned in Sections 3 and 4. The fixed block parameters s and ks stand for stride and kernel size, respectively, and li represents the input length of the blocks. Since the convolution/deconvolution blocks always adjust the temporal dimension by a factor of 2, we only specify their channel dimensions cin and cout.

By using the proposed γ-resampling, we are able to control the overfitting speed of the model, as shown in Figure 4. Training a GAN with D^u (γ = 1) implies that all the samples from the minority classes are rapidly shown to the model, leading to fast overfitting. On the other hand, using D (γ = 0) implies that training batches rarely contain a sample from the minority classes (1 in every 230 samples of the Catalina dataset is a cepheid, roughly 1 cepheid every 4 batches), avoiding fast overfitting but inducing slow and unstable training. Training with D^γ (0 < γ < 1) allows a reasonable learning pace without rapid overfitting, as shown in Figure 4 for γ = 0.25. A model trained with D^γ will be referred to as the "γ-GAN".

4.7. Model selection: The G-score

As mentioned in Section 4.5, the behavior of the TSTR accuracies shown in Figure 3 evidences the need for a criterion to choose an adequate G. While using the validation TSTR accuracy for model selection might look appropriate, doing so involves training new classifiers for every candidate of G, which becomes computationally expensive. The problem then lies in finding a fast-to-compute metric that correlates with the TSTR accuracy (and implicitly with the quality of the generated samples). The natural option for this metric would be FID, but as also shown in Figure 3a, it fails to measure the decrease in quality of the generated samples reflected in the TSTR accuracy curve. Additionally, since FID is only a measure of the distance between P_g and P_r, it cannot differentiate between the fidelity and diversity of the generated samples (Naeem et al. 2020), and it provides an arbitrarily weighted average between them.
As an alternative, we propose a metric that leverages equally two measures of fidelity and diversity: density (D) and recall (R). Figure 5 shows the results of computing the per-class density and recall metrics for the Catalina dataset. The fact that D values are not bounded by one is consistent with the formula presented in Equation 5 and can happen if points in the generated manifold on average belong to more than K balls of the real manifold, which is probably caused by the over-estimation of the real manifold mentioned in Section 2.5.2, due to sparse feature spaces. An illustration of this situation is shown in Figure 6, where the sparsity in the real distribution causes the generated samples to belong, on average, to more than K = 2 balls, leading to D = (1/4)(2/2 + 3/2 + 4/2 + 3/2) = 1.5. Additionally, if we reduce the sparsity of the real distribution by removing the furthest sample (bottom left), we get D = (1/4)(1/2 + 2/2 + 3/2 + 2/2) = 1.

Figure 6. Two-dimensional scenario illustrating a case in which D is not bounded by 1. The dashed lines show the regions B_r^2: circles around the real feature samples φ_r, with radii equal to the distance to their second nearest neighbors. The numbers inside each sample φ_g denote the number of circles that enclose the sample.
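The per-sample counting behind Equation 5 can be sketched directly, following Naeem et al. (2020); feature samples are rows, and the radii are distances to the K-th nearest real neighbor:

```python
import numpy as np

def density(phi_r, phi_g, k=2):
    """Density: for each generated feature sample, count how many real
    k-NN balls B_r^k contain it, then average over generated samples and
    divide by k."""
    d_rr = np.linalg.norm(phi_r[:, None] - phi_r[None], axis=2)
    radii = np.sort(d_rr, axis=1)[:, k]  # k-th NN distance (column 0 is self)
    d_gr = np.linalg.norm(phi_g[:, None] - phi_r[None], axis=2)
    return (d_gr <= radii[None, :]).sum(axis=1).mean() / k
```

With enclosure counts (2, 3, 4, 3) and K = 2, this reproduces the D = 1.5 of Figure 6, and the metric exceeds 1 exactly when generated samples fall, on average, inside more than K real balls.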
Since R is bounded between 0 and 1 by definition, the unbounded behavior of D is undesirable because it privileges D over R in any mean we compute between them. In addition, we find that D also presents a clear bias towards the less populated classes. To overcome these problems, we perform a per-class min-max normalization of D and R according to Equation 3:

D̂_i = (D_i − D_i^min) / (D_i^max − D_i^min),    R̂_i = (R_i − R_i^min) / (R_i^max − R_i^min),
where the subscript (·)_i denotes the score of the i-th class, and the superscripts (·)^min and (·)^max denote the minimum and maximum scores of the class, respectively. After the class scores are normalized, we combine them in the equally weighted F-score described in Equation 15. Finally, considering that we are equally interested in all classes, the G-score is obtained by computing the balanced F-score (macro F-score), as shown in Equation 16.
When computing the balanced F-score, we prefer the mean of the class F-scores over the F-score of the class means, intending to weight majority and minority classes equally, as suggested by Opitz & Burst (2019).
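A compact sketch of the pipeline of Equations 15 and 16, assuming per-class density and recall have been evaluated for every saved checkpoint of a single run:

```python
import numpy as np

def g_score(density_per_class, recall_per_class):
    """G-score: min-max normalize per-class density D_i and recall R_i
    across the checkpoints of one run, combine them with a per-class
    F-score (harmonic mean), then macro-average over classes.
    Inputs have shape (n_checkpoints, n_classes)."""
    def minmax(s):
        s = np.asarray(s, dtype=float)
        span = s.max(axis=0) - s.min(axis=0)
        return (s - s.min(axis=0)) / np.where(span == 0, 1, span)
    d, r = minmax(density_per_class), minmax(recall_per_class)
    f = 2 * d * r / np.maximum(d + r, 1e-12)  # per-class F-score
    return f.mean(axis=1)  # macro F-score, one value per checkpoint
```

Because the normalization needs the per-class minima and maxima over all checkpoints, the score can only be computed after training finishes, and only compares candidates within the same run.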
The results of computing the G-score for multiple GANs trained with different values of γ are shown in Figure 7. As can be seen, the G-score curves and the validation accuracy curves from Figure 4 appear highly correlated, which becomes most evident when analyzing the γ = 0 curve for the Catalina dataset.

Baselines
To evaluate our generated datasets in the classification task, we compare the TSTR classification accuracies to multiple baselines. These baselines consist of TRTR classification accuracy scores obtained when training on augmented real datasets. It is worth mentioning that the training sets used to compute the scores are all balanced datasets, either GAN-generated (TSTR) or real-augmented (TRTR).
Acknowledging the heteroscedastic behavior of astronomical data, we do not consider jittering as a suitable operation for the problem. Additionally, we discard utilizing window-slicing techniques since our convolutional architectures work on pre-processed time-series with a fixed number of observations. Consequently, our augmentation methods consist of oversampling and different window-warping-based operations.

Oversampling
The oversampling augmentation corresponds to generating the balanced dataset D^u_train by repeating samples from the original dataset D_train, using the resampling block described in Section 4.6.

Window-warping
Let x(t) be a continuous signal sampled at times t. The window-warping operation starts by selecting a random time window delimited by the values [t_1, t_2], where all times t_w in the window satisfy t_1 ≤ t_w ≤ t_2. The warping operation expands or contracts the signal by scaling the variations ∆t within t_w and shifting the times t > t_2 accordingly, altering the time-series' length.
Since we work with folded light curves in phase space, window-warping expansion could be incongruous with the fact that the phase space has an upper bound of 1. Consequently, we derive a new transformation to avoid such incongruence: soft window-warping.

Soft window-warping
We preserve the core idea of window-warping by designing expansions and contractions that do not increase the time-series' length. Given a random window, we formulate the problem as finding a mapping t_w → f(t_w) such that the length of the transformed window is at most that of the original, that is, f(t_1) ≥ t_1 and f(t_2) ≤ t_2. We believe that expansions and contractions should be naturally performed with respect to the center of the window, expanding from the center toward the limits and contracting from the limits toward the center.
A mapping that meets these requirements is a scaled and shifted tanh(·), where the values of a, b, and c are determined by the desired behavior with respect to the center of the window. The constant k is randomly sampled in the interval [1/(2a), 2/a], and it modulates the strength of the expansions or contractions by modifying the saturation degree of the tanh(·), producing expansions when saturated and contractions otherwise.
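The exact constants a, b, and c are not reproduced here; the sketch below uses an endpoint-preserving tanh(·) parameterization with the same qualitative behavior, where larger k saturates the tanh and strengthens the warp, and, unlike in the text, the direction is chosen with an explicit flag rather than by the sampled k:

```python
import numpy as np

def soft_window_warp(t, t1, t2, k, expand=True):
    """Warp times inside [t1, t2] while keeping f(t1) = t1 and f(t2) = t2,
    so the transformed window never exceeds the original window length.
    This parameterization is an illustrative stand-in for the paper's
    mapping, not its exact form."""
    t = np.asarray(t, dtype=float)
    c, w = 0.5 * (t1 + t2), 0.5 * (t2 - t1)   # window center and half-width
    out = t.copy()
    inside = (t >= t1) & (t <= t2)
    u = (t[inside] - c) / w                   # window coordinates in [-1, 1]
    if expand:                                # push interior toward the edges
        v = np.tanh(k * u) / np.tanh(k)
    else:                                     # pull interior toward the center
        v = np.arctanh(u * np.tanh(k)) / k
    out[inside] = c + w * v
    return out
```

Applied along the phase axis this yields soft time-warping; applying the same map to the magnitude values (between the window's minimum and maximum magnitudes) yields soft magnitude-warping.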
Even though the proposed transformation is designed to be applied across the time axis, it can easily be extended to the signal axis by noting that, since the time intervals are monotonic, t_1 and t_2 are the minimum and maximum values of the window. When applying these transformations to our astronomical light curves, we consider the signal axis to be the magnitude axis and the time axis to be the phase axis. These two transformations, referred to as soft time-warping and soft magnitude-warping, are illustrated in Figure 8. The result of simultaneously applying both transformations will be referred to as soft mixed-warping.

Figure 9 shows some samples of the GAN-generated light curves. The conditional vector z̃ used to generate these samples takes the phases, amplitudes, and classes of the real data shown in the first two columns. Accordingly, as can be seen, most of the generated samples preserve the real class and amplitude. It is worth mentioning that although some generated samples present normal fluctuations in phase and magnitude with respect to the real ones, there are also samples that do not look plausible (see Figure 10). This could be attributed to the lack of truncation techniques or any type of filtering to improve the fidelity of the generated samples, which we address in Section 6.

Classification
The classification accuracies obtained by using different training sets are shown in Table 5. The first four rows show TRTR classification results when training on real data that has been augmented with the random transformations described in Section 4.8. The soft-warping transformations (rows B-D) are applied to the dataset previously balanced by oversampling. The last four rows show TSTR classification results when training on GAN-generated data, comparing the proposed γ-resampling for GAN training (γ-GAN) against uniform resampling (u-GAN), and the proposed G-score for model selection against the validation accuracy criterion.
As Table 5 shows, none of the soft-warping transformations achieves statistically significant differences with respect to the oversampling baseline (row A).
On the other hand, the benefits of using generative models are clear. Both GAN models achieve significant improvements with respect to the oversampling baseline, whether using the validation accuracy criterion or the G-score criterion for model selection.
We can also notice that using the γ-resampling can be beneficial in comparison to using the uniform approach. For both datasets, the minimum TSTR classification accuracy corresponds to the u-GAN (E for Catalina and F for ZTF), while the maximum corresponds to the γ-GAN (H for both datasets). Furthermore, for each dataset, the best TSTR accuracy is always significantly better than the worst.
Regarding the model selection criteria, the G-score shows to be an effective criterion, achieving accuracies that are at least statistically equivalent to the ones obtained by the computationally expensive validation accuracy criterion. Furthermore, it can sometimes obtain significantly better results, as shown in the ZTF dataset by the γ-GAN model.
Interestingly, the combination of the proposed γ-GAN + G-score obtains the best classification accuracies overall, statistically outperforming all existing methods for the ZTF dataset, and all but one (γ-GAN + val. accuracy) for the Catalina dataset.

Quality of generated samples
Thus far, we have presented a framework for generating realistic light curves that can be used to improve the classification of real astronomical objects. In the entire process, we constantly generate sets of samples that are then compared to the set of real samples, computing global metrics that indicate the quality of the model based on the distance between the sets. However, no metrics to evaluate the quality of individual samples have been mentioned.
In fact, Figure 9 shows that although the generated samples look generally realistic, some samples present artifacts, making them poor candidates for the classes they intend to represent. While these could easily be removed by applying truncation techniques in the latent space of G, doing so would not be informative about the quality of the individual samples themselves, preventing us from learning what makes a sample look realistic.
The selected metric to evaluate individual sample quality is the realism score (Kynkäänniemi et al. 2019), computed over the manifold representation used for the D and R metrics. Given a generated feature sample φ_g and a set of real samples Φ_r = {φ_r}, the similarity between φ_g and the real manifold Φ_r is calculated as

R(φ_g, Φ_r) = max_{φ_r} [ NND_k(φ_r) / ‖φ_g − φ_r‖_2 ],

where NND_k(φ) is the distance from φ to its k-th nearest neighbor within the corresponding manifold. Equation 19 compares the radii of the KNN-induced hyperspheres centered at φ_r to the distance between φ_r and the sample φ_g. Naturally, if φ_g does not belong to any of the hyperspheres, R will be low, and its value increases the closer φ_g is to any φ_r. The effect of ranking the generated samples of the ZTF dataset by realism score is shown in Figure 10.
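In the same feature space used for the D and R metrics, Equation 19 amounts to a ratio of k-NN radii to distances:

```python
import numpy as np

def realism_score(phi_g, phi_r, k=3):
    """Realism score (Kynkaanniemi et al. 2019): maximum over real samples
    of the ratio between the real sample's k-NN radius and its distance to
    the generated feature sample phi_g."""
    d_rr = np.linalg.norm(phi_r[:, None] - phi_r[None], axis=2)
    radii = np.sort(d_rr, axis=1)[:, k]  # k-th NN distance (column 0 is self)
    d = np.linalg.norm(phi_r - phi_g, axis=1)
    return np.max(radii / np.maximum(d, 1e-12))
```

A score above 1 means φ_g lies inside at least one real hypersphere; scores well below 1 flag the implausible samples ranked at the bottom of Figure 10.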
Because the realism score can successfully identify artifacts that could be filtered out of the dataset, we would in principle expect realism-score filtering to improve our results even further. However, this is not the case: empirically, we found no statistical differences when applying this filtering to our generated datasets. We hypothesize that these artifacts, although undesirable, are not crucial in defining the decision boundaries of the problem; hence, they have little impact on the classification accuracy. Moreover, strongly filtered datasets cause a drop in classification accuracy, probably due to their over-constrained diversity.

Classification results
Soft-warping transformations -Regarding the effects of the proposed soft-warping augmentations for classification, we can see that despite the fact that they create plausible light curves, they do not show improvements in the classification task. We hypothesize that the diversity added to the dataset by these transformations is not substantial enough for the classifiers to benefit from it.
γ-resampling - The results suggest that the proposed resampling offers a clear improvement upon uniform resampling for GAN training. We believe that this improvement comes from the delay in GAN overfitting, which provides more potentially good models to choose from before the GAN completely overfits. With respect to the no-resampling model, Figure 4 shows that models trained with γ = 0 and γ = 0.25 reach comparable accuracies, consistent with the fact that the resampling block does not add any extra information. Using the resampling block can offer more stable training that reaches similar performance in a shorter training time. This can be particularly relevant if the defined iteration horizon is not long enough to capture the peak accuracy, as in Figure 4b. For this reason, we do not think that γ should be tuned thoroughly, and we set it to γ = 0.25, placing the G-score peak within the extent of training iterations, earlier than the peak of γ = 0 but later than that of γ = 1.
G-score - Regarding the model selection criterion, the correlation between the metric and the classification results validates the G-score as a measure of the quality of the generated samples. Using this metric instead of the validation accuracy is attractive because of the subtle improvements in TSTR accuracy it provides. It also offers faster computation: computing the G-score is approximately six times faster than computing the validation TSTR accuracy.
We hypothesize that these subtle improvements come from the robustness of the G-score against overfitting. While the G-score compares D^u_gen to the entire training set D_train, the validation accuracy score is computed on the small dataset D_val for evident reasons; hence, it is more susceptible to overfitting. A fact that reinforces this hypothesis is the consistently lower variance of the models selected with the G-score criterion compared to validation accuracy.

Figure 10. Realism score ranking of the ZTF generated light curves. To rank the samples, we first generated a replica of D^u_train, computed the realism scores, and then selected the best and worst samples from the sorted realism scores.

On the other hand, computing the G-score also has some drawbacks related to the normalization step. Since the normalization requires the minimum and maximum values of the D and R metrics, we cannot compute the G-score during training; we must first completely train the models. In addition, it only allows comparison between candidates of the same run, not permitting comparisons between different runs, which likely have different normalization parameters.
6.3. Alternative to G-score

Evaluating generative models by fidelity and diversity can be posed as a multi-objective problem. Thus, we provide an alternative to the G-score that considers both objectives (D and R) simultaneously, in accordance with the problem's nature.
As an alternative to evaluating all candidates with the validation TSTR accuracy, we propose evaluating only the candidates that lie on the Pareto frontier of the raw macro-density and macro-recall. For example, in the case of the Catalina dataset, doing so would imply evaluating approximately one quarter of the total candidates. The disposition of the optima for the Catalina dataset is shown in Figure 11. Interestingly, the model selected with the validation accuracy criterion lies in the sub-optimal region, which supports the overfitting hypothesis explained in Section 6.2. On the other hand, the model selected with the G-score belongs to the Pareto frontier, which is not necessarily guaranteed considering the extra normalization step included in the computation of the G-score.
Using this alternative offers an attractive advantage: skipping the normalization step of the G-score allows comparing different GAN setups in the D-R plane, which could also be used to perform hyperparameter optimization of the models. In this scenario, we first identify the models that lie on the Pareto frontier considering all the D and R scores, and then evaluate these candidates with the validation TSTR score to choose an operating point.

Figure 11. Macro density and recall metrics for the Catalina dataset. Each point corresponds to the average of 5 independent computations of density and recall for a single GAN model.

CONCLUSIONS
In this work, we have presented a GAN-based data augmentation methodology for astronomical time-series, to improve the classification accuracy of periodic variable stars by mitigating the problems of small and imbalanced astronomical datasets.
Using our methodology, we can generate diverse synthetic datasets of irregularly sampled time-series that capture the original training sets' properties and leverage their diversity to outperform classifiers trained on real data. Motivated by the rapid overfitting of our generative model in this unbalanced setup, we propose a resampling technique (γ-resampling) to mitigate this behavior. Also, inspired by the incapability of FID to measure this overfitting, we propose a novel evaluation metric (G-score) that correlates with TSTR classification accuracy; hence it helps select a generative model among the possible candidates saved during training.
The proposed model could be extended to work with classifiers that are currently operating in real time, such as the ALeRCE light curve classifier (Sánchez-Sáez et al. 2021), boosting its performance on the ZTF stream and eventually on its successor, the Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST; Željko Ivezić et al. 2019), contributing to the understanding of the three-dimensional structure and formation of our galaxy and its neighbors.

Future Work
Although effective in this simplified setup, the presented methodology could be improved by upgrading it to a scenario where the input data has a variable length. This upgrade should involve recent GAN models that include recurrent neural networks in their architectures, such as Yoon et al. (2019) or Ni et al. (2020). In addition, the generation of data with variable length should also be addressed.
Regarding conditional generation, we used the classconditional parameter to generate datasets with uniform class distributions. Although our model permits other conditional parameters such as amplitude, in all experiments we replicated the distribution of their real counterparts. An interesting extension of the work could include analyzing how the results vary depending on the generated conditional distribution of these parameters, and other physical parameters that may be relevant to include.
Finally, all our synthetic datasets were generated by sampling z from a multivariate Gaussian distribution. Evaluating different sampling methods, such as those presented in Kynkäänniemi et al. (2019), and inspecting how they affect the qualitative and quantitative results, could be an exciting path to follow.