ABSTRACT
Large repositories of high precision light curve data, such as the Kepler data set, provide the opportunity to identify astrophysically important eclipsing binary (EB) systems in large quantities. However, the rate of classical "by eye" human analysis restricts complete and efficient mining of EBs from these data using classical techniques. To prepare for mining EBs from the upcoming K2 mission as well as other current missions, we developed an automated end-to-end computational pipeline—the Eclipsing Binary Factory (EBF)—that automatically identifies EBs and classifies them into morphological types. The EBF has been previously tested on ground-based light curves. To assess the performance of the EBF in the context of space-based data, we apply the EBF to the full set of light curves in the Kepler "Q3" Data Release. We compare the EBs identified from this automated approach against the human generated Kepler EB Catalog of EBs. When we require EB classification with 90% confidence, we find that the EBF correctly identifies and classifies eclipsing contact (EC), eclipsing semi-detached (ESD), and eclipsing detached (ED) systems with a false positive rate of only 4%, 4%, and 8%, while complete to 64%, 46%, and 32%, respectively. When classification confidence is relaxed, the EBF identifies and classifies ECs, ESDs, and EDs with a slightly higher false positive rate of 6%, 16%, and 8%, while much more complete to 86%, 74%, and 62%, respectively. Through our processing of the entire Kepler "Q3" data set, we also identify 68 new candidate EBs that may have been missed by the human generated Kepler EB Catalog. We discuss the EBFʼs potential application to light curve classification for periodic variable stars more generally for current and upcoming surveys like K2 and the Transiting Exoplanet Survey Satellite.
1. INTRODUCTION
Periodic variable stars, in particular eclipsing binary (EB) star systems, continue to be of central importance to a variety of applications in stellar astrophysics. EBs are crucial astrophysical benchmarks for stellar evolution theory as they can provide fundamental physical parameters of stars with very high precision and accuracy. Fundamental issues, from the stellar mass–radius relation using EBs in nearby clusters and the field (e.g., Lacy 1977), to the extragalactic distance scale using EBs in the Large Magellanic (Pietrzyński et al. 2013) and Small Magellanic Cloud (Graczyk et al. 2014), rely on the discovery and analysis of statistically significant samples of EBs and/or of rare but astrophysically important EBs. Indeed, EBs in which at least one component is of very low mass, in a scarce evolutionary phase, or itself a periodic variable are astrophysically very interesting and useful, yet rare and thus only small numbers of these benchmark systems are found in current catalogs. For example, there is only one known EB system comprising two brown dwarfs (Stassun et al. 2006, 2007), and only a couple of EBs containing a Classical Cepheid pulsator are currently known (e.g., Pietrzyński et al. 2010). Repositories of large time-series photometric survey data, such as the Kepler archive, provide an ideal sample from which to identify periodic variables including these interesting EBs for focused followup investigation. However, constraints on the rate of human analysis make mining this data using classical techniques prohibitive.
An intelligent data pipeline (IDP) that contains an automated light curve classifier to replace the human bottleneck is therefore necessary to keep pace with the data rates of future large surveys. Ideally, such an IDP would be capable of automatically finding EBs in light curve data sets and then accurately classifying these EBs by morphological type (e.g., eclipsing contact (EC), eclipsing semi-detached (ESD), eclipsing detached (ED)) to enable focused followup analyses. Several algorithms have been developed for this purpose (e.g., Wyrzykowski et al. 2003; Devor 2008; Graczyk et al. 2011; Devinney et al. 2012; these are discussed further in Section 4), yet a fully automated pipeline remains an open task.
An automated classification pipeline with parameters adaptable to multiple time series photometric surveys would be immediately applicable to the Kepler data set (Christiansen et al. 2013) as well as the follow-up K2 mission (Howell et al. 2014) and would be well suited to the upcoming Transiting Exoplanet Survey Satellite (TESS, Ricker et al. 2014). In addition, the development of the Large Synoptic Survey Telescope (LSST) contains a requirement that data processing enable a fast and efficient response to transient sources (i.e., automated identification of variable stars and astrophysically interesting binaries) with a robust and accurate preliminary classification (Ivezić et al. 2011). In short, there is a large need currently and into the near future for methods that can reliably process large quantities of light curve data and automatically identify astrophysical objects of interest such as EBs with minimal human intervention.
So motivated, especially the immediate applicability of an automated pipeline for the study of EBs in K2 (Welsh et al. 2013), in this paper we present a fully automated, adaptive, end-to-end computational pipeline—the Eclipsing Binary Factory (EBF). The EBF first identifies EBs, and then morphologically classifies them as either EC, ESD, or ED. The classification module of the EBF pipeline was described in the first paper in this series (Paegert et al. 2014); here we add the remaining computational modules for a complete automated pipeline. In this paper we also then measure the accuracy and completeness of our EBF generated catalog of Kepler EBs against the manually constructed Kepler EB Catalog (Prša et al. 2011; Slawson et al. 2011).
The EBF takes a modular approach to automatically process large volumes of survey data into patterns suitable for recognition by an integrated, fully trained artificial neural network (ANN). This modular pipeline is designed to accept as input the direct archival data from a time-series photometric survey where each moduleʼs parameters are adaptable to the specific characteristics of the data (e.g., photometric precision, data collection cadence, flux measurement uncertainty) and define the output options to produce the ANN generated catalog of EB classifications (i.e., EC, ESD, or ED). The computational efficiency of the ANN classifier, , where N is the number of neurons in the neural network, is quite fast—of order 0.8 seconds for classification of ∼50,000 light curves using a standard workstation. This then alleviates the current bottleneck by limiting follow-up human analysis to just the light curves automatically classified as EBs. In addition, the portability of this modular software pipeline makes the EBF a solution for the automated probabilistic identification and morphological classification of EBs in other large repositories of time-series, photometric data currently underway or planned for the near future.
The Kepler mission monitored ∼200,000 targets brighter than the magnitude limit in the Kepler spacecraftʼs field of view (FoV). Kepler captured and summed these images as long cadence, 29.4 minutes, data files (Jenkins et al. 2010) that contain one quarter, or days, of observations across the FoV for sixteen quarters until reaction wheel failure placed the instrument into safe mode. The complete scientific data set, archived in MAST at the Space Telescope Science Institute, provides simple aperture photometry (SAP) with the photometric precision necessary to detect planetary transits (Koch et al. 2010) and astero-seismological signals (Gilliland et al. 2010) from solar-like stars: about 20 ppm for 12th magnitude G2V stars for a 6.5 hour integration (van Cleve & Caldwell 2009). As such, this data set provides both the appropriate sample size and precision for use in testing the EBF pipeline. In addition, the Kepler Eclipsing Binary Catalog Version 3 (KEBC3) of manually classified EBs and a corresponding catalog of manually identified false positives in the Kepler FoV (Prša et al. 2011; Slawson et al. 2011) provides a benchmark of the EBFʼs classification accuracy (i.e., false positive rate) and completeness (i.e., false negative rate). Moreover, full processing of the Kepler data set with the EBF automated pipeline provides the opportunity to uncover some additional EBs that may have been missed by the human generated KEBC3.
We demonstrate the efficacy of the EBF by automatically identifying and morphologically classifying EBs in the ∼165,000 Kepler Quarter 3 ("Q3") SAP light curves. In Section 2 of this paper we describe the procedures and methods of each EBF module. Section 3 presents the results of the EBF applied to the Kepler "Q3" light curves. Section 3.1.1 benchmarks the pre-classification data processing modules against the metrics of the manually corrected and phased light curves in the KEBC3, while Section 3.1.2 focuses on the EBF pipelineʼs false positives and false negatives as a measure of classification accuracy and completeness. Finally, in Section 4 we discuss the future expansion of the EBF to include the automated identification and classification of other periodic variable stars such as δ Cepheid fundamental mode (DCEP-FU) and first overtones (DCEP-FO), δ Scuti (DSCT), RR Lyrae ab (RRAB) and c (RRC), and Mira (MIRA), as well as the EBFʼs applicability to other large time-series, photometric surveys. A set of newly identified candidate EBs that may have been missed by the human generated Kepler EB Catalog is provided in Appendix
2. METHODS: THE ECLIPSING BINARY FACTORY PIPELINE
The EBF Pipeline, diagrammed in Figure 1, is a compilation of five fully automated, adaptable software modules that process light curves from large astronomical surveys in order to automatically identify EBs and classify each EBʼs morphology. In this section we describe the parameters that control how systematics are removed, how periodic variable stars are identified, how phased light curves are represented as ANN recognizable patterns, and how EB candidate classifications are tuned and validated, as a function of the survey instrumentʼs photometric precision, the data set duration and cadence, the flux measurement uncertainty, and the desired level of classification confidence, respectively. The procedure of each EBF module is described below in turn; we then test the overall pipeline performance by the rate of correctly classified systems and the completeness of the EBF classified sample as compared to the KEBC3 benchmark sample classified by human analysis. We use the Kepler "Q3" long cadence data files as described in the Kepler Data Release 4 Notes KSCI-199044-001 (van Cleve 2010) to tool and test the EBF pipeline.
2.1. Data Correction Module
Corrections to the raw photometry collected from large survey instruments are generally necessary to remove systematic artifacts from the data. Though most surveys include their own data correction processes, there is no guarantee that the method employed preserves the necessary signal from which the EBF will identify and classify EBs. In the case of the Kepler "Q3" light curves corrections to the SAP flux are necessary to remove systematics such as differential velocity aberration, thermal gradients across the spacecraft, and pointing variations (Kinemuchi et al. 2012). Of course, the Kepler archive data reduction pipeline removes systematics in the "Q3" SAP data to produce pre-search conditioned data (PCD) using the method of cotrending basis vectors; however, PCD may lack the astrophysical signatures of variables whose harmonic thresholds were not met (McQuillan et al. 2012). Hence, we employ a data correction module (DCM) to remove trending artifacts while preserving the necessary signal as well as to normalize the flux in order to present the remainder of the pipeline with a standardized, detrended light curve. The DCM step is applied to all of the input light curves (i.e., no light curves are filtered out at this stage).
The DCM accepts as input the time-series flux with the associated uncertainty of each data point and detrends the flux via the sigma-clipping algorithm of Slawson et al. (2011); where data points outside a standard deviation interval of a least squares Legendre polynomial fit to order l, expressed by the generating function
are iteratively discarded until there are no remaining data points outside the interval. For each data point input, the DCM returns a point containing a normalized flux value with a new uncertainty. The EBFʼs implementation of this method allows the parameters of the module to be set to the specific characteristics of a data set by defining the sigma clipping thresholds and Legendre polynomial order. Additionally, the module allows for the input of the time intervals of known breaks in the light curves due to instrument operations. In this paper we detrend the Kepler "Q3" SAP light curves using asymmetric sigma-clipping thresholds of with the 10th order Legendre polynomial
and known breaks in the "Q3" light curves beginning at BJD 281.0, 291.0, and 322.5. Figure 2 shows the uncorrected Kepler "Q3" SAP light curve for a target and the EBF detrended, normalized light curve where systematics were removed by the DCM and the red circles indicate data gaps given by the aforementioned breaks.
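The detrending step described above can be illustrated with a minimal sketch. It assumes a plain least-squares Legendre fit on time rescaled to [-1, 1], clips residuals against the scatter of the currently retained points, and divides every input point by the final trend. The function names, the default thresholds of 3σ, and the synthetic light curve are hypothetical (the thresholds adopted in this paper are not reproduced here), and the handling of known data breaks is omitted.

```python
import random

def legendre_basis(x, order):
    """Return [P_0(x), ..., P_order(x)] via Bonnet's recurrence."""
    values = [1.0, x]
    for n in range(1, order):
        values.append(((2 * n + 1) * x * values[n] - n * values[n - 1]) / (n + 1))
    return values[: order + 1]

def solve_linear(a, b):
    """Solve a c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        partial = sum(m[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (m[r][n] - partial) / m[r][r]
    return coeffs

def detrend(time, flux, order=10, lower_sigma=3.0, upper_sigma=3.0, max_iter=20):
    """Fit a Legendre series, iteratively excluding outliers from the fit.

    Returns the flux divided by the fitted trend and a mask of the points
    that were retained in the final fit. The sigma thresholds are
    placeholder values, not those adopted in the paper.
    """
    t_min, t_max = min(time), max(time)
    basis = [legendre_basis(2.0 * (t - t_min) / (t_max - t_min) - 1.0, order)
             for t in time]
    keep = [True] * len(flux)
    k = order + 1
    for _ in range(max_iter):
        rows = [(b, f) for b, f, ok in zip(basis, flux, keep) if ok]
        normal = [[sum(b[i] * b[j] for b, _ in rows) for j in range(k)]
                  for i in range(k)]
        rhs = [sum(b[i] * f for b, f in rows) for i in range(k)]
        coeffs = solve_linear(normal, rhs)
        trend = [sum(c * p for c, p in zip(coeffs, b)) for b in basis]
        resid = [f - m for f, m in zip(flux, trend)]
        kept = [r for r, ok in zip(resid, keep) if ok]
        sigma = (sum(r * r for r in kept) / len(kept)) ** 0.5
        new_keep = [ok and -lower_sigma * sigma <= r <= upper_sigma * sigma
                    for r, ok in zip(resid, keep)]
        if new_keep == keep:
            break
        keep = new_keep
    return [f / m for f, m in zip(flux, trend)], keep

# Synthetic example: slow quadratic trend, box-shaped eclipses, Gaussian noise.
random.seed(0)
time = [90.0 * i / 599 for i in range(600)]
in_eclipse = [((t + 5.0) % 10.0) < 0.5 for t in time]
flux = [1000.0 * (1.0 + 0.01 * (t / 90.0) ** 2) * (0.8 if e else 1.0)
        + random.gauss(0.0, 1.0) for t, e in zip(time, in_eclipse)]
normalized, keep = detrend(time, flux, order=3)
```

In this sketch the eclipse points are excluded from the trend fit but are still returned in the normalized light curve, so the eclipse depth is preserved while the out-of-eclipse level settles near unity.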
2.2. Harmonics Recognition Module (HRM)
The EBF pipeline is currently optimized for identification and classification of EBs, which exhibit periodically varying light curves. Therefore the precursors to morphological classification of EBs via pattern recognition are the identification of a targetʼs periodic variability and a subsequent phase folded light curve. Hence, the detrended normalized light curve is next processed by the HRM where periodic variability is probed by the analysis of variance (AoV) method (Schwarzenberg-Czerny 1989; Devor 2005) included in the VARTOOLS1.202 package (Hartman et al. 2008). The HRM takes as input all of the normalized light curves output by the preceding DCM, but outputs only those light curves that are deemed to exhibit truly periodic behavior (of any kind, i.e., the output light curves from the HRM stage include EBs but also any other type of periodic variable).
Two separate searches for the first three harmonics, each with a coarse sub-sample setting of 0.1 and a fine-tune setting of 0.01, are conducted over the entire data set. The first search separates the data into five bins and the second uses fifty bins so that skewed harmonics arising from bin-size-related flux variance are minimized. We then compare the strongest harmonic signal from both periodograms and, in the event of disagreement between the primary periods, assign the period (along with all other associated AoV statistics) from the periodogram with the smallest false alarm probability.
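The selection between the two periodograms can be sketched as follows. The `AovResult` container and its field names are hypothetical and do not correspond to the VARTOOLS output format; the sketch only shows the rule of keeping the statistics from the search with the smallest false alarm probability.

```python
from dataclasses import dataclass

@dataclass
class AovResult:
    """Strongest peak of one AoV periodogram (hypothetical container)."""
    period: float                    # days
    snr: float                       # AoV signal to noise ratio
    false_alarm_probability: float

def select_period(five_bin, fifty_bin):
    """Keep the result, with all its statistics, that has the smallest
    false alarm probability. When both searches agree on the period the
    choice does not change the assigned period."""
    return min((five_bin, fifty_bin), key=lambda r: r.false_alarm_probability)

chosen = select_period(AovResult(2.0, 150.0, 1e-5), AovResult(4.0, 120.0, 1e-3))
```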
To eliminate light curves that are unlikely to exhibit truly significant periodic variability, we use the AoV calculated signal to noise ratio (S/N). Specifically, periodograms with an S/N below 100 are rejected by the module while the remaining light curves, now identified as candidate periodic variables, are accepted. These accepted candidates' light curves are phased by the standard equation
where ρ is the phase of an associated time value, t, for a given reference time, , and the assigned period, p.
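As an illustration, phase folding in its usual form takes the fractional part of the elapsed number of cycles. The sketch below assumes phases on the interval [0, 1); the function and argument names are hypothetical.

```python
def fold(times, reference_time, period):
    """Return the phase in [0, 1) of each time for the given period."""
    return [((t - reference_time) / period) % 1.0 for t in times]

phases = fold([10.0, 12.0, 14.0, 9.0], reference_time=10.0, period=4.0)
```

Times earlier than the reference time are wrapped into the same interval, so the last point above falls at phase 0.75.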
For an EB, we want the reference or zero phase , where , to be associated with the primary eclipse. However, not all of the periodic variables identified by the HRM will be EBs; there will be other types of periodic variables where the most physically meaningful reference phase is not an eclipse but rather a flux maximum. Therefore, to maintain generality, we associate with either the minimum or maximum flux, dependent upon whether the light curve spends most of its time above or below the median flux. Most true EBs will have flux points in the "high" state and will spend a minority of the time in eclipse. Here the determination is made by the value of the flux ratio (FR) as described by Coughlin et al. (2011):
By definition, the FR has a value on the interval (0, 1). For reference, a perfect sinusoid will have FR = 0.5. We assign to the flux minimum, indicative of an EB, for an FR less than a defined threshold, and a flux maximum assigned at otherwise.
The HRMʼs parameters that are adaptable to the data set are the period search interval, based on the duration of the survey data, and the FR threshold. The latter is based on the tolerance for ECs, which in general are not perfect sinusoids and whose FR may be greater than 0.5 if they spend more than half the time above the median flux. As we limited the Kepler input files to "Q3" (), the harmonics search was limited to a period range of 0.11–20.1 days in order to ensure at least three primary eclipses would be present in each light curve. In addition, we set the FR threshold to 0.7 based on a simple analysis of KEBC3 FRs in order to include those ECs whose light curves are not perfectly sinusoidal. Figure 3 shows a plot of the FR distribution in the KEBC3 versus cataloged morphology, and a "Q3" target as an example of an imperfect sinusoidal phased EC light curve with a . An example of a phased ESD light curve, and an example of a phased ED light curve, each with , are shown in Figure 4.
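The choice of reference phase can be sketched as below. The FR equation itself is not reproduced in this text, so the sketch assumes the form (maximum − median) / (maximum − minimum), which is consistent with the properties stated above (a value between 0 and 1, and 0.5 for a perfect sinusoid) but should be checked against Coughlin et al. (2011). The function names are hypothetical; the default threshold of 0.7 is the value adopted for "Q3".

```python
import math
import statistics

def flux_ratio(flux):
    """Assumed FR: (max - median) / (max - min)."""
    f_max, f_min = max(flux), min(flux)
    return (f_max - statistics.median(flux)) / (f_max - f_min)

def reference_at_minimum(flux, threshold=0.7):
    """True if zero phase should be tied to the flux minimum (EB-like)."""
    return flux_ratio(flux) < threshold

sinusoid = [math.sin(2.0 * math.pi * i / 1000) for i in range(1000)]
eclipse_like = [1.0] * 90 + [0.5] * 10   # mostly in the "high" state
pulse_like = [0.0] * 90 + [1.0] * 10     # mostly in the "low" state
```

With this assumed definition the eclipse-like curve has an FR of 0 and is phased on its minimum, while the pulse-like curve has an FR of 1 and is phased on its maximum.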
2.3. Pattern Generation Module (PGM)
As described in Section 2.4, the EBFʼs classification of the phased light curves for the candidate periodic variables output from the HRM is dependent upon the ANN evaluating these light curves as linearly separable patterns. In its simplest form, linearly separable means two sets of points in a plane may be divided by at least one line such that each data point belonging to one set is on one side of the line and each data point belonging to the second set is on the other side of that line. When generalized to an n-dimensional space, linear separability is formalized such that every point $\mathbf{x} \in X_0$ satisfies $\sum_{i=1}^{n} w_i x_i > k$ and every point $\mathbf{x} \in X_1$ satisfies $\sum_{i=1}^{n} w_i x_i < k$, where $X_0$ and $X_1$ are two sets of points in an n-dimensional Euclidean space and $x_i$ is the ith component of $\mathbf{x}$ for real numbers $w_1, w_2, \ldots, w_n, k$. The PGM therefore represents the phased light curves via piecewise smooth polynomial chains to pass to the ANN as linearly separable patterns.
Since EBF efficiency goes as O(N²), where N is the number of elements in , it becomes computationally expensive to process light curves possessing on the order of 10³ data points. Therefore, we condense the light curve by phase binning down to the order of 10² data points over the phase interval , with the primary eclipse centered at . In addition to meeting the efficiency requirements, the phase binning process also minimizes error by smoothing over uncorrected systematics that may present a stray minimum flux below the true primary eclipse depth.
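Phase binning of this kind can be sketched as a simple average per bin. The sketch assumes equal-width bins over phases wrapped to [0, 1) and an unweighted mean; the phase interval and weighting used by the PGM are not specified here, and the function name is hypothetical.

```python
def phase_bin(phases, fluxes, n_bins=200):
    """Average flux in equal-width phase bins.

    Returns (bin center, mean flux, number of points) for every
    non-empty bin, in order of increasing phase.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for phase, flux in zip(phases, fluxes):
        index = min(int((phase % 1.0) * n_bins), n_bins - 1)
        sums[index] += flux
        counts[index] += 1
    return [((i + 0.5) / n_bins, sums[i] / counts[i], counts[i])
            for i in range(n_bins) if counts[i] > 0]

binned = phase_bin([0.05, 0.06, 0.55], [1.0, 3.0, 5.0], n_bins=10)
```

Returning the per-bin counts makes it straightforward to apply a minimum number of points per bin afterward.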
The PGM therefore fits the phase binned curves to piecewise smooth second order polynomial chains using the polyfit algorithm of Prša et al. (2008). However, the ANNʼs use of the sigmoid activation function (see Section 2.4) leads to errors in processing large polynomial coefficients that may arise during polynomial fitting. Hence, we modify the polyfit algorithm to fit polynomial chains with components xi described by three points in a two-dimensional Euclidean space (phase, flux). These three data points represent the chainʼs two end-points, or knots, and the midpoint. Here each chain shares one knot with the next, thus a single chain is effectively described by two points in a plane. We add the additional modification of allowing either a four-polynomial chain or two-polynomial chain fit based on a χ2 goodness of fit criterion. With these modifications the PGM processes phased light curves initially containing on order data points into at most four linearly separable patterns where N = 4 for each . That is, we reduce the recognizable pattern of the phase folded light curve which may contain a very large number of data points to a maximum of sixteen points.
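To illustrate why three points are an equivalent and better-scaled description of a second order chain, the sketch below evaluates the unique parabola through two knots and a midpoint using Lagrange interpolation. This is a generic illustration rather than the modified polyfit code; the function name and the sample points are hypothetical.

```python
def chain_flux(phase, knot_a, midpoint, knot_b):
    """Evaluate the parabola through three (phase, flux) points."""
    (x0, y0), (x1, y1), (x2, y2) = knot_a, midpoint, knot_b
    l0 = (phase - x1) * (phase - x2) / ((x0 - x1) * (x0 - x2))
    l1 = (phase - x0) * (phase - x2) / ((x1 - x0) * (x1 - x2))
    l2 = (phase - x0) * (phase - x1) / ((x2 - x0) * (x2 - x1))
    return y0 * l0 + y1 * l1 + y2 * l2

# A dip from flux 1.0 down to 0.25 and back, described by three points.
value = chain_flux(0.25, (0.0, 1.0), (0.5, 0.25), (1.0, 1.0))
```

The three flux values stay within the range of the normalized light curve, whereas the equivalent coefficients of the same parabola (here 1, −3, and 3) can grow large for narrow, deep eclipses.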
Bin size selection is an adjustable parameter. This option allows for a condensed light curve based on the survey cadence and/or duration in order to manage the point density per bin. Of course, more bins lead to longer run times for the PGM and fewer bins lead to sparse light curves where the narrow eclipses of wide EBs might be smoothed over or missed. The minimum number of points per bin may also be adjusted in an attempt to capture narrow eclipses; however, from trial and error we find that no fewer than 50 phase bins and no fewer than three bins per chain is optimal.
To fit the Kepler "Q3" phased light curves we bin each light curveʼs ∼5,000 long cadence data points into 200 () phase bins with a polynomial chain that contains no fewer than three data points. The EBF generated binned light curves and linearly separable patterns, plotted as the knots and midpoints of the moduleʼs fitted flux, as examples of Kepler EBs with the morphology EC, ESD, and ED are shown in Figures 5–7, respectively.
2.4. Artificial Neural Network Classifier (Pattern Recognition Module)
The EBFʼs trained, validated, and tested ANN is the single hidden layer feed forward perceptron network diagrammed in Figure 8. This module takes as input the PGM output set of data points as a simple representation of the target light curve. These patterned inputs are recognized by the ANN and classified into the EB categories that it has been trained to recognize. The outputs of the ANN are therefore the classification of each light curve with an associated confidence level in the classification. A detailed description of the ANN, including its training, validation, and output classifications, is given in Paegert et al. (2014). Here we provide a summary of its specifications and functions.
The ANN functions as a two-stage pattern recognition classifier based on nonlinear statistical models. The input layer nodes, x1 to xn, are populated with the PGM fitted polynomial chains as an input vector of n components that is then propagated via weighted connections w(h) to the hidden layer. The activation function a(h) is in the form of the sigmoid
$$a^{(h)}(\nu) = \frac{1}{1 + e^{-\nu}}$$
where ν is the variable used by the ridge function to derive the features H1 to Hm as linear combinations of the inputs. The signals are again weighted by w(o) and propagated to the output layer where there are o target classifications coded as a 0 to 1 categorical target variable ν for the -class classification using the same sigmoid activation function to model the target nodes y1–yo as linear combinations of H1 to Hm, where n = 19, m = 20, and o = 10. In other words, the ANN accepts n = 19 input parameters—representing the 16 polynomial chain parameters from the PGM as well as the logarithm of the period in days, the total χ2 as a measure of fit, and the maximum amplitude of the light curveʼs normalized flux—and outputs a probability associated with each of the o = 10 known light curve classifications. The 10 known classifications are the three EB categories (EC, ESD, ED), the other variable types (DCEP-FU, DCEP-FO, DSCT, RRAB, RRC, MIRA), and a miscellaneous classification (MISC).
Training of the ANN was conducted by back-propagation, and gradient descent was used to minimize the error in the set of all weights θ, or , where the error function is given by
for and . is described by the softmax function
as , and was used to approximate the nonlinear input functions. As described in Paegert et al. (2014), training and validation of the ANN were done using 32,278 light curves from the All Sky Automated Survey (ASAS; Pojmański 1997). These tests showed the ANN is capable of an accuracy of 90.6%, 76.5%, and 90.3%, and a completeness of 95.8%, 63.7%, and 87.8% for ECs, ESDs, and EDs, respectively.
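A forward pass through a network of the stated dimensions (n = 19, m = 20, o = 10) can be sketched as follows. This is an illustration only: the weights are random placeholders rather than the trained EBF weights, bias terms are omitted, and the sketch assumes sigmoid hidden units followed by a softmax over the output scores so that the ten outputs sum to one.

```python
import math
import random

def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0.0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def softmax(scores):
    """Convert raw output scores into probabilities that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(pattern, w_hidden, w_output):
    """Single hidden layer feed forward pass (no bias terms)."""
    hidden = [sigmoid(sum(w * x for w, x in zip(row, pattern)))
              for row in w_hidden]
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in w_output]
    return softmax(scores)

random.seed(1)
N_INPUT, N_HIDDEN, N_OUTPUT = 19, 20, 10
w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(N_INPUT)]
            for _ in range(N_HIDDEN)]
w_output = [[random.uniform(-0.5, 0.5) for _ in range(N_HIDDEN)]
            for _ in range(N_OUTPUT)]
pattern = [random.random() for _ in range(N_INPUT)]  # placeholder input vector
probabilities = forward(pattern, w_hidden, w_output)
```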
In summary, for each input patterned light curve from the PGM, the ANN outputs a most likely classification and an associated confidence in that classification. The output classifications for EBs can be either EC, ESD, or ED. Because the ANN was originally developed on the ASAS sample (Paegert et al. 2014), there is also a classification of MISC for light curves that are not clearly identified as one of the three EB classes. In addition, the ANN outputs up to nine additional possible classifications (and associated confidence levels) for each light curve. As we describe below, this permits us to identify possible EBs at lower confidence that, e.g., have a primary classification of MISC but secondary classification of EC, ESD, or ED.
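The use of primary and secondary classifications can be sketched as a simple ranking rule. The tier labels, the function name, and the treatment of a primary EB classification below the 90% level as a relaxed-confidence candidate are assumptions made for illustration; only the 90% confidence level and the case of a MISC primary with an EB secondary are taken from the text.

```python
EB_CLASSES = {"EC", "ESD", "ED"}

def eb_candidate(probabilities, high_confidence=0.90):
    """Return (class, probability, tier) for an EB candidate, else None.

    `probabilities` maps class labels to ANN output probabilities.
    """
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    (first, p_first), (second, p_second) = ranked[0], ranked[1]
    if first in EB_CLASSES:
        tier = "high" if p_first >= high_confidence else "relaxed"
        return first, p_first, tier
    if first == "MISC" and second in EB_CLASSES:
        return second, p_second, "relaxed"
    return None

confident = eb_candidate({"EC": 0.95, "MISC": 0.03, "ED": 0.02})
secondary = eb_candidate({"MISC": 0.6, "ESD": 0.3, "RRC": 0.1})
rejected = eb_candidate({"RRAB": 0.7, "MISC": 0.2, "EC": 0.1})
```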
2.5. Candidate Validation Module (CVM)
Once ANN classifications are complete, candidate EC, ESD, and ED light curves are filtered for false positives by the CVM. We position the CVM as a post classification process to provide the option to examine some or all of the filtered candidates.
The CVM filters candidates by two distinct metrics. The first is the polyfit algorithmʼs total goodness of fit, χT2, such that
where is the normalized, detrended flux of phase bin l and xji is the ith component of the jth polynomial chain for that satisfies the requirement of linear separability from Section 2.3. The χT2 metric describes how well the PGM represents the candidate light curves and is a useful measure of whether the ANNʼs recognition of the EBF generated pattern translates to an accurate classification as an EB.
The second metric is the normalized, detrended light curveʼs scaled sum of all second derivatives for the portion of the curve between each and every data point, which we refer to as δ2 and define as
where ρi and are the phase and flux for the ith of n data points. As such, δ2 is a measure of the dispersion in the detrended light curve and when compared to the average detrended flux measurement uncertainty calculated in the DCM provides a criterion for filtering a pseudo-periodic signal that may arise from random noise.
The CVM adapts the χT2 and δ2 filter metrics by using known EBs contained in the survey. The module then calculates the upper limits on both the χT2 and δ2 from this benchmark subset. A CVM analysis of the KEBC3 EBs contained in the Kepler "Q3" data set and recovered by the ANN provides a χT2 upper limit of 100 and a δ2 upper limit of 1200 as shown by the histograms in Figure 9.
Figure 10 shows the light curves of two example Kepler targets: one where the PGM fits to some quasi-systematic artifact that the ANN consequently misclassifies as an ESD, and one where the PGM fits to a pseudo-sinusoidal signal generated by random noise that the ANN consequently misclassifies as an EC. These examples demonstrate the CVM-determined upper limits of 100 on χT2 and 1200 on δ2 filtering false positives from the automated EB catalog. Notice that a CVM filter using only the χT2 limit would have passed to the EBF catalog the target with a χT2 of only 0.01 yet a δ2 of 4752. Similarly, a CVM filter using only the δ2 limit would have passed to the EBF catalog the target with a δ2 of 1165 yet a χT2 of 3378. Thus, we apply both upper limits as the CVM cut-off parameters in order to minimize EB classification false positives.
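The combined filter amounts to requiring both metrics to fall under their upper limits. In the sketch below the function and constant names are hypothetical, and treating the limits as inclusive is an assumption; the limits of 100 and 1200 and the two example targets are those quoted above.

```python
CHI2_TOTAL_LIMIT = 100.0   # upper limit on the total goodness of fit
DELTA2_LIMIT = 1200.0      # upper limit on the dispersion metric

def passes_cvm(chi2_total, delta2,
               chi2_limit=CHI2_TOTAL_LIMIT, delta2_limit=DELTA2_LIMIT):
    """A candidate is kept only if both metrics are within their limits."""
    return chi2_total <= chi2_limit and delta2 <= delta2_limit

noisy_target = passes_cvm(chi2_total=0.01, delta2=4752.0)     # fails on dispersion
poorly_fit_target = passes_cvm(chi2_total=3378.0, delta2=1165.0)  # fails on fit
```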
3. RESULTS
We use the 2612 EBs cataloged in the KEBC3 to benchmark the EBF pipeline outputs as an initial assessment of pipeline capabilities on standard morphology ECs, ESDs, and EDs. Thus, we first remove from the benchmark sample those KEBC3 EBs flagged as eclipse timing variations (ETV) induced by third body candidates (Conroy et al. 2013), heart beat (HB), and those classified in the KEBC3 as uncertain (UNC). We also exclude those KEBC3 EBs with cataloged periods outside the Kepler "Q3" duration parameters of period and of course those cataloged EBs that do not appear in the Kepler "Q3" data set. Thus, we benchmark the EBF pipeline against an adjusted KEBC3 sample of 1198 cataloged EBs.
3.1. Eclipsing Binary Factory Performance Metrics
3.1.1. Pre-classification Metrics from the Data Correction Module, Harmonics Recognition Module, and Pattern Generation Module
We compare the metrics of the EBF pipelineʼs pre-classification output to the same metrics of the manually determined KEBC3. Note that the KEBC3 light curves contain ∼50,000 data points from more than 10 quarters of observations while the "Q3" observations include only .
We evaluate the DCMʼs performance by comparing the 200 phase binned, normalized flux values of each common target. We find the DCMʼs performance to be in excellent agreement, mean , with the KEBC3 benchmark. This DCM light curve agreement breaks down to a mean χ2 of 0.00373, 0.00264, and 0.00052 for cataloged ECs, ESDs, and EDs respectively.
The HRM assigned period agrees to within days for 99.67% of the EBF identified periodic variable candidates found in the KEBC3, though the HRM failed to identify statistically significant periodicity in 5.59% of the benchmark sample, and these light curves were therefore not considered for follow-up classification by the ANN. A further comparison of the HRM assigned periods to those manually determined in the KEBC3 reveals a period alias rate of 21.65% (i.e., the percentage of EBF determined periods that are an alias of the KEBC3 period). However, aliasing does not negatively impact ANN pattern recognition and the true period is easily determined by examining the candidate light curve. Figure 11 shows examples of both the EBF generated and the manually determined KEBC3 light curves for targets where the EBF assigns a half-period alias. It was also discovered that phase disagreement was common for targets with . Though there is no adverse effect on pattern recognition for the example targeted EC in Figure 12, the ambiguity in phasing a light curve with an FR of on the minimum flux led to diminished pipeline accuracy, as other variable types mistaken by the pipeline for EBs (i.e., δ Cepheids incorrectly phased on the flux minimum) contributed to the false positive rate.
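A period alias check of the kind used for this comparison can be sketched as a test of whether the two periods are related by a small integer factor. The function name, the maximum factor of 4, and the 1% relative tolerance are hypothetical choices, not the criteria used to compute the alias rate quoted above.

```python
def alias_ratio(ebf_period, catalog_period, max_factor=4, rel_tol=0.01):
    """Return (p, q) if ebf_period is close to (p / q) * catalog_period.

    (1, 1) means the periods agree, (1, 2) is a half-period alias,
    (2, 1) a double-period alias, and None means no simple relation.
    """
    ratio = ebf_period / catalog_period
    for factor in range(1, max_factor + 1):
        if abs(ratio - factor) <= rel_tol * factor:
            return (factor, 1)
        if abs(ratio - 1.0 / factor) <= rel_tol / factor:
            return (1, factor)
    return None

half_alias = alias_ratio(1.25, 2.5)
```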
The PGM fitted the EBF phased, detrended light curves with either four- or two-polynomial chains and a goodness of fit described by a median χT2 of 5.39 as compared to the unaltered polyfit algorithm used to fit the KEBC3 samples with a median χT2 of 3.74. This goodness of fit includes cases where the PGM patterned fit is in excellent agreement with the manually determined KEBC3 light curves, such as the target shown in Figure 13, as well as instances where the patterned fit fails to capture the full depth of the eclipses as circled in Figure 14.
The subtle deviation between a phased light curve and an EBF generated pattern may present a challenge to an automated solution estimator, yet the PGMʼs representation of the light curveʼs shape allows the ANN to successfully recover and classify these systems. This and other deviations, such as the small aberrant third chain circled in red on Figure 11, are the result of the three data points per chain minimum parameter we adopted in the PGM in order to capture narrow eclipses. Though % of the non-recovered KEBC3 benchmark sample consists of wide EDs with narrow eclipses that contain fewer than three binned data points, the three data point rule remains a hard minimum to ensure pattern generation integrity. It is important to note that the PGM failed to pattern 3.33% of the benchmark sample and that these light curves were not considered for follow-up recognition by the ANN.
3.1.2. Classification Metrics from the Artificial Neural Network and Candidate Validation Module
The performance of the ANN for cataloged binaries in the Kepler "Q3" data set is benchmarked by KEBC3 EBs with period p where p while ETVs, HBs, and UNCs were excluded. The ANN reports each of its classifications with a posterior probability P interpreted as a level of confidence for the morphological class of EC, ESD, ED, or any other periodic variable class contained in the ASAS training set (Paegert et al. 2014).
Before discussing the performance of the EBF pipeline as a whole, we consider the ANN performance alone. This is evaluated only for those light curves passed forward by the DCM, HRM, and PGM, so that the ANN performance metrics are not affected by light curves excluded by the preceding modules. The ANNʼs classification accuracy metric is calculated using the KEBC3 targets flagged in the KEBC3 database as InCat:False (ICF), i.e., false positive EBs (e.g., blends, planets, pulsators, rotators). Of these, we count as inaccurate classifications only those ICF targets that lack a cataloged morphology (e.g., pulsators, rotators). That is, the ANNʼs accuracy is given as the percentage of ANN EC, ESD, and ED classifications that are Kepler targets currently cataloged in the KEBC3 as an EB, as opposed to those currently cataloged as a pulsating or rotating variable. The ANNʼs completeness metric is simply the percentage of the adjusted KEBC3 benchmark sample that the ANN successfully recovered as an EB. Using the Kepler "Q3" data set, the ANN accuracy is found to be 93.68%, 84.27%, and 83.03% with a completeness of 94.17%, 93.62%, and 93.63% for EC, ESD, and ED, respectively.
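The two metrics defined above reduce to simple ratios of counts. The sketch below shows the arithmetic for a single morphological class; the function names and the counts in the usage lines are hypothetical and are not taken from the Kepler "Q3" results.

```python
def accuracy_percent(n_cataloged_eb, n_icf_non_eb):
    """Percentage of EB classifications that are cataloged EBs.

    n_cataloged_eb: classifications matching targets cataloged as EBs.
    n_icf_non_eb:   classifications matching ICF targets with no cataloged
                    morphology (e.g., pulsators, rotators).
    """
    total = n_cataloged_eb + n_icf_non_eb
    if total == 0:
        raise ValueError("no classifications to evaluate")
    return 100.0 * n_cataloged_eb / total


def completeness_percent(n_recovered, n_benchmark):
    """Percentage of the benchmark sample recovered as an EB."""
    if n_benchmark == 0:
        raise ValueError("empty benchmark sample")
    return 100.0 * n_recovered / n_benchmark


# Hypothetical counts, for illustration only.
print(accuracy_percent(n_cataloged_eb=90, n_icf_non_eb=10))
print(completeness_percent(n_recovered=45, n_benchmark=50))
```

The false positive rate quoted elsewhere in the text is the complement of the accuracy, i.e., 100 minus the accuracy percentage.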
The CVMʼs performance was evaluated by a manual inspection of the 91 high-confidence () EB candidate light curves generated by the EBF. Here we remove from these candidates 21 Kepler "Q3" targets that suffer from the phasing ambiguity discussed in Section 3.1.1. This manual inspection of the remaining 68 candidates found that the high-confidence CVM validated EBs are likely reliable to % accuracy, commensurate with the integrated EBF pipelineʼs accuracy as discussed below.
3.1.3. Overall Performance
To evaluate the performance of the integrated EBF pipeline, we describe the EBF generated catalog for candidate ECs, ESDs, and EDs with two distinct queries in order to evaluate EBF accuracy and completeness in the Kepler FoV both at high-confidence () and with no regard to confidence (). Here, no regard to confidence means either the primary ANN classification is EC, ESD, ED, or the primary classification is the placeholder MISC with a secondary classification of EC, ESD, or ED with any posterior probability on the interval (0, 1). The accuracy and completeness for each condition are listed in Table 1.
Table 1. EBF Pipeline Accuracy and Completeness with the Kepler "Q3" Data Release
Query | Metric | EC | ESD | ED |
---|---|---|---|---|
High confidence | Accuracy (%) | 96.12 | 95.53 | 91.93 |
High confidence | Completeness (%) | 63.93 | 45.80 | 32.40 |
No regard to confidence (a) | Accuracy (%) | 93.68 | 84.27 | 83.03 |
No regard to confidence (a) | Completeness (%) | 86.07 | 73.95 | 61.59 |
Note. (a) Classifications with no regard to confidence. Either the primary ANN classification is EC, ESD, ED, or the placeholder MISC with a secondary classification of EC, ESD, or ED with any posterior probability on the interval (0, 1).
These two queries exemplify the range of utility in the EBFʼs automated output of candidate Kepler FoV EBs. For example, a high-confidence () query for ECs, in the interest of analyzing the characteristics of a homogeneous sample of Kepler ECs with a high certainty of morphological classification, recovers about 64% of those available in the "Q3" data set with only about 4% false positives. Conversely, a query with no regard to confidence () in the interest of a statistical analysis using a large number of Kepler ECs, without regard to the certainty of their morphology (), recovers about 86% of those in the "Q3" data with about 6% false positives.
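The selection logic behind the two queries can be sketched as follows. The record layout (a dictionary with primary and secondary classes and their posterior probabilities), the function names, and the use of a "greater than or equal to 0.90" comparison for the high-confidence case are assumptions made for this example; the text specifies only a 90% confidence requirement and the MISC fallback rule.

```python
EB_CLASSES = {"EC", "ESD", "ED"}


def high_confidence_class(candidate, threshold=0.90):
    """Return the EB class if the primary classification meets the threshold."""
    if candidate["primary"] in EB_CLASSES and candidate["primary_prob"] >= threshold:
        return candidate["primary"]
    return None


def any_confidence_class(candidate):
    """Return the EB class with no regard to confidence.

    Accepts a primary EB classification, or a primary MISC placeholder whose
    secondary classification is an EB class with probability in (0, 1).
    """
    if candidate["primary"] in EB_CLASSES:
        return candidate["primary"]
    if (candidate["primary"] == "MISC"
            and candidate["secondary"] in EB_CLASSES
            and 0.0 < candidate["secondary_prob"] < 1.0):
        return candidate["secondary"]
    return None


# Hypothetical classifier outputs, for illustration only.
confident = {"primary": "EC", "primary_prob": 0.97,
             "secondary": "ESD", "secondary_prob": 0.02}
marginal = {"primary": "ED", "primary_prob": 0.55,
            "secondary": "ESD", "secondary_prob": 0.30}
fallback = {"primary": "MISC", "primary_prob": 0.60,
            "secondary": "ESD", "secondary_prob": 0.25}
pulsator = {"primary": "RRAB", "primary_prob": 0.95,
            "secondary": "EC", "secondary_prob": 0.03}

for record in (confident, marginal, fallback, pulsator):
    print(high_confidence_class(record), any_confidence_class(record))
```

Under these assumptions the high-confidence query returns a subset of the targets returned by the query with no regard to confidence, which is consistent with the trade-off between accuracy and completeness shown in Table 1.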
3.2. New High Confidence Candidate Eclipsing Binaries Identified with the Eclipsing Binary Factory
In addition, for the high-confidence condition we find 68 CVM validated EB candidates that do not appear in the KEBC3. These newly identified candidates are provided in the Appendix.
4. DISCUSSION
Previous works have developed aspects of an automated EB classifier, although to date a fully automated end-to-end pipeline has remained an open challenge. For example, the EB identification algorithm of Wyrzykowski et al. (2003) automatically identified candidates in the Optical Gravitational Lensing Experiment (OGLE), but then relied on visual inspection of all candidate light curves for manual classification. Similarly, the semi-automated method of Graczyk et al. (2011) relied on visual inspection to manually remove artifacts from the OGLE-III candidate light curves. Though the pipeline of Devor (2008) automatically filtered 97% of the Trans-Atlantic Exoplanet Survey (TrES) data as non-periodic, a manual removal of 86% of the remaining light curves was necessary to finalize the catalog. More advanced algorithms described by Devinney et al. (2012), such as the Eclipsing Binaries via Artificial Intelligence (EBAI) method (Prša et al. 2008), used sophisticated ANNs to estimate the characteristic parameters of detached EBs in the Kepler Field; however, the classification of EB morphological sub-classes remained largely a manual task.
In contrast to the algorithm of Wyrzykowski et al. (2003), where the authors are only able to estimate that their EB catalog is complete to with an efficiency of , and where the classification of EB sub-classes is achieved only through visual inspection, the EBFʼs completeness and accuracy rates are precisely benchmarked by the KEBC3 and based upon the ANNʼs fully automated classification of sub-classes. Additionally, the EBFʼs single-layer ANN is a much simpler classifier with a more robust training set (i.e., 19 input nodes and trained on ∼32,000 distinct ASAS exemplars). The Wyrzykowski et al. (2003) method requires the conversion of OGLE photometric light curves into representational pixel images that are then analyzed by a input node three-layer ANN trained on variations of only 10 OGLE EB exemplars.
The OGLE-III analysis of Graczyk et al. (2011) is based on a 13 step algorithm that, when complete, reports a false positive rate of and thus requires a manual inspection of each and every EB candidate light curve. In contrast, the EBFʼs automated filter reduces false positives to less than . The semi-automatic method of Devor (2008) also requires manual classification of the EB sub-classes and is further constrained to the analysis of ED and ESD TrES light curves while the EBF extends the result to include the EC class. The EBF pipeline again represents an advance in full automation and adds the ability to automatically classify the EC sub-class when compared to more advanced solution estimators, such as the EBAI method (Prša et al. 2008).
The purpose of the EBF pipeline is to alleviate the human bottleneck in discovering astrophysically interesting EB systems by automatically identifying and classifying these systems from large volumes of survey data as a foundation for an IDP designed for the future of astroinformatics. The performance results using the Kepler "Q3" Data Release demonstrate the EBFʼs accuracy for those EBs flagged at high confidence; however, the completeness metric leaves room for improvement. An examination of the pipelineʼs unclassified KEBC3 EBs reflects the 5.59% exclusion rate of the AoV-S/N output filter, as well as the 3.33% failure rate of the PGM. Together, these mean that 8.92% of the benchmark sample never reaches the ANN. In addition, we discover that a simple evaluation of the FR in relation to an arbitrary threshold of is insufficient to discriminate between ECs and other variable types. Thus, the pipelineʼs ability to first reliably identify harmonics, as well as to correctly phase and fit the input data without manual adjustment, is clearly an area in need of refinement.
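The attrition figures quoted above can be combined in a short worked calculation. The sketch assumes, as the quoted sum implies, that the two losses are disjoint and are both expressed as percentages of the full benchmark sample; the resulting ceiling on completeness is an inference from those numbers rather than a value stated in the text.

```python
# Percentages of the KEBC3 benchmark sample quoted in the text.
HRM_EXCLUDED_PERCENT = 5.59  # no statistically significant periodicity found
PGM_FAILED_PERCENT = 3.33    # light curves the PGM could not pattern


def lost_before_ann(*stage_losses_percent):
    """Sum per-stage losses, each a percentage of the full benchmark sample."""
    return round(sum(stage_losses_percent), 2)


lost = lost_before_ann(HRM_EXCLUDED_PERCENT, PGM_FAILED_PERCENT)

# Largest completeness the pipeline could reach even with a perfect classifier,
# given that this share of the sample is never presented to the ANN.
completeness_ceiling = round(100.0 - lost, 2)

print(lost, completeness_ceiling)
```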
These period and subsequent phasing errors can potentially be mitigated in the future by layering the HRM with multiple harmonic search methods for varied period length. For example, the computationally inexpensive Fast Chi Squared (Fχ²) method of Palmer (2009) and the Box-fitting Least Squares transit search algorithm (Kovács et al. 2002) may supplement the AoV method. This multi-layered harmonics search could be combined with a revised PGM where a post-fitting χ² analysis may be used to detect errors and validate assigned periods. Phasing may then be subjected to other constraints (e.g., the symmetry versus asymmetry of the primary eclipse ingress and egress for those targets with FR ). Training deficits, such as the lack of a sufficient number of wide EB light curves with narrow transits in the ASAS exemplars, as well as an inherent morphological ambiguity in binary systems, can potentially be rectified with further training, testing, and validation using expanded survey data sets.
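One way a χ² comparison could help distinguish a true period from a half-period alias is sketched below. This is a toy illustration, not the proposed PGM revision: per-bin mean fluxes stand in for the fitted polynomial chains, and the box-shaped light curve, sampling, uncertainty, and bin count are all assumed values. Folding an EB with unequal eclipse depths at half its true period overlays the primary and secondary eclipses, which inflates the scatter about any single-valued model.

```python
def folded_chi2(times, fluxes, sigma, period, n_bins=20):
    """Chi-squared of the fluxes about per-bin means after folding at `period`."""
    bins = [[] for _ in range(n_bins)]
    for t, f in zip(times, fluxes):
        phase = (t / period) % 1.0
        index = min(int(phase * n_bins), n_bins - 1)
        bins[index].append(f)
    chi2 = 0.0
    for members in bins:
        if not members:
            continue
        mean = sum(members) / len(members)
        chi2 += sum(((f - mean) / sigma) ** 2 for f in members)
    return chi2


def toy_eb_flux(t, period):
    """Noise-free box-shaped EB: deep primary eclipse, shallower secondary."""
    phase = (t / period) % 1.0
    if phase < 0.1:
        return 0.5
    if 0.5 <= phase < 0.6:
        return 0.8
    return 1.0


TRUE_PERIOD = 2.0  # days, assumed for the toy example
times = [0.013 * i for i in range(2000)]
fluxes = [toy_eb_flux(t, TRUE_PERIOD) for t in times]

chi2_true = folded_chi2(times, fluxes, sigma=0.01, period=TRUE_PERIOD)
chi2_half = folded_chi2(times, fluxes, sigma=0.01, period=TRUE_PERIOD / 2)
print(chi2_true < chi2_half)
```

For systems with nearly equal eclipse depths the two folds become hard to distinguish by this statistic alone, which is one reason additional phasing constraints of the kind mentioned above may still be needed.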
The EBFʼs adaptability, facilitated by the modular parameters described in Section 2, and the automated catalogʼs query options demonstrate the pipelineʼs utility for a variety of applications. Applying the EBF to other current time-domain photometric surveys, such as K2 and the upcoming TESS mission, should provide similar results. In addition, the EBFʼs ANN is currently tooled and trained to recognize and classify the other types of periodic variables discussed in Paegert et al. (2014; i.e., the ASAS periodic variable classifications DCEP-FU, DCEP-FO, DSCT, RRAB, RRC, and MIRA). Only a benchmark of variables in the Kepler FoV, parallel to the KEBC3, is needed to tool and test the EBFʼs expansion into automatically cataloging these and other types of periodic variables.
5. CONCLUSION
The EBF generated a fully automated catalog of EB systems in the Kepler FoV from the Kepler "Q3" Data Release. This catalog, benchmarked by the manually generated KEBC3, identified and classified to 90% certainty EC, ESD, and ED systems () with an accuracy of 96.12%, 95.53%, and 91.93%, while complete to 63.93%, 45.80%, and 32.40%, respectively. When classification certainty is not considered, the EBF catalog identified and classified ECs, ESDs, and EDs () with an accuracy of 93.68%, 84.27%, and 91.93%, while complete to 86.07%, 73.95%, and 61.59%, respectively. In addition, the EBF identified 68 new candidate EBs that may have been missed by the human generated KEBC3 (see the Appendix).
This demonstrates the EBF pipeline as an effective alternative to the constraints of human analysis when probing large time-domain, photometric data sets. As such, the EBF computational pipeline is a viable framework for the development of a more complex IDP containing a follow-up automated solution estimator that would enable the automatic determination of physical EB parameters. Future work on the EBF will include enhancing the harmonics identification capabilities with multiple search methods and phasing constraints beyond the FR, a revised pattern generator to constrain polynomial chain fits to smaller errors, and more robust training for the ANN on wide binary systems. In addition, an expansion of EBF searches in the Kepler data, and in the upcoming K2 data, to include the periodic variables of type δ Cepheid (fundamental and first overtones), δ Scuti, RR Lyrae (ab and c), and Mira will be published in follow-up papers. Over the long term, the EBF general classification approach can be adopted by future surveys such as TESS and LSST.
We are grateful to A. Prša, K. Conroy, and all those at Villanova University who contribute to the Kepler Eclipsing Binary Catalog, to J. Pepper for his insights and experience with large surveys, and to R. Siverd for his expert assistance with harmonics. Kepler was selected as the 10th mission of the Discovery Program. Funding for this mission is provided by NASA, Science Mission Directorate.
Appendix: ECLIPSING BINARY FACTORY HIGH-CONFIDENCE NEW ECLIPSING BINARY CANDIDATES
In Table A1, we provide the KIC, R.A., decl., period, and Kepler magnitude for 68 high-confidence (), astrophysically interesting EB candidates newly identified by the EBF pipeline from the Kepler "Q3" light curves that are not currently included in the human generated KEBC3. Figures A1–A3 show a sample of "Q3" candidate light curves. The list of candidates also includes three targets with a current disposition of "CANDIDATE" in the Kepler Objects of Interest (KOI). Figure A4 shows the light curves for KOI K00188.01, K00830.01, and K00183.01.
Table A1. High-confidence () Candidate EBs Newly Identified by the EBF Pipeline from the Kepler "Q3" Light Curves that are not Currently Included in the Human Generated Kepler Eclipsing Binary Catalog v3
KIC | R.A. | Decl. | Period (d) | KEPMAG |
---|---|---|---|---|
1872192 | 292.29273 | 37.32021 | 0.669979 | 13.74 |
2164791 | 292.427369 | 37.556019 | 6.703367 | 15.22 |
2285420 | 286.776414 | 37.63955 | 5.347988 | 13.5 |
2696217 | 286.90421 | 37.90544 | 12.807157 | 13.45 |
3441414 | 290.84053 | 38.54963 | 8.114992 | 11.51 |
3761175 | 295.0293 | 38.89788 | 5.162696 | 12.83 |
3831911 | 285.05514 | 38.98501 | 7.339436 | 15.02 |
3836276 | 286.88712 | 38.90174 | 8.176842 | 15.32 |
3847822 | 290.738025 | 38.971723 | 7.256877 | 11.85 |
4164363 | 293.210621 | 39.2407 | 5.791975 | 11.37 |
4175707 | 295.698821 | 39.26685 | 3.746223 | 15.02 |
4271063 | 293.72196 | 39.38955 | 0.989814 | 14.11 |
4451525 | 287.60961 | 39.59948 | 16.264671 | 13.88 |
4926962 | 292.68917 | 40.04848 | 4.463288 | 13.13 |
5000179 | 288.35829 | 40.17466 | 3.632168 | 13.76 |
5083330 | 286.319321 | 40.23009 | 7.814972 | 15.73 |
5083911 | 286.585518 | 40.228043 | 12.133361 | 12.96 |
5357901 | 290.35802 | 40.56774 | 3.797303 | 14.74 |
5358624 | 290.58173 | 40.5774 | 3.525784 | 15.22 |
5391520 | 298.830296 | 40.507433 | 3.984854 | 11.93 |
5857714 | 284.69325 | 41.16556 | 0.163569 | 11.21 |
6207355 | 292.59219 | 41.53943 | 5.210091 | 13.46 |
6425891 | 285.25158 | 41.8777 | 0.951153 | 14.77 |
6507888 | 286.7031 | 41.99742 | 7.453959 | 15.27 |
6613627 | 293.97651 | 42.04829 | 0.226224 | 12.55 |
6616211 | 294.67262 | 42.01478 | 3.85509 | 11.79 |
6924320 | 282.394541 | 42.44624 | 16.505877 | 15.49 |
6953069 | 293.20455 | 42.48449 | 0.294107 | 15.74 |
7053456 | 297.03561 | 42.53181 | 11.211459 | 12.0 |
7219251 | 296.87916 | 42.70233 | 3.190739 | 13.84 |
7363684 | 292.06889 | 42.93308 | 0.729569 | 13.72 |
7431838 | 287.45459 | 43.04541 | 3.036295 | 12.09 |
7434110 | 288.35027 | 43.04639 | 7.223437 | 15.99 |
7523733 | 290.58686 | 43.11656 | 3.121057 | 11.81 |
7549265 | 297.31412 | 43.14078 | 1.962315 | 13.8 |
7748449 | 290.22456 | 43.49293 | 15.705378 | 13.43 |
7905209 | 296.36599 | 43.62922 | 0.209777 | 13.27 |
8029556 | 292.09914 | 43.88178 | 18.040183 | 15.35 |
8098181 | 292.01466 | 43.97829 | 0.986971 | 12.86 |
8432040 | 292.30794 | 44.48024 | 7.991428 | 11.95 |
8520065 | 299.48401 | 44.50633 | 9.365808 | 13.99 |
8694772 | 293.97896 | 44.85794 | 15.187454 | 14.83 |
8707639 | 297.887694 | 44.868264 | 7.785182 | 12.71 |
8774912 | 298.697591 | 44.90004 | 15.137412 | 12.99 |
8881943 | 290.48391 | 45.18651 | 7.777886 | 14.79 |
8972908 | 297.87374 | 45.24537 | 4.195616 | 12.58 |
9011963 | 287.87358 | 45.31096 | 8.400634 | 15.5 |
9283572 | 292.93673 | 45.73978 | 1.217898 | 14.36 |
9511937 | 284.28423 | 46.11628 | 0.277782 | 15.54 |
9651668 | 292.85574 | 46.39118 | 2.68435 | 14.29 |
9700181 | 286.87755 | 46.43377 | 2.846435 | 12.74 |
9726020 | 297.60386 | 46.48611 | 10.642508 | 12.94 |
10389596 | 284.504783 | 47.550999 | 2.608079 | 13.36 |
10669516 | 293.44307 | 47.98749 | 10.485432 | 13.03 |
10817620 | 298.71137 | 48.13449 | 2.946091 | 13.96 |
10982373 | 294.90293 | 48.47721 | 10.01776 | 11.7 |
10990083 | 297.79189 | 48.49364 | 10.715948 | 13.24 |
11087095 | 293.33433 | 48.69283 | 0.77833 | 15.57 |
11152786 | 298.17105 | 48.7121 | 8.229868 | 15.41 |
11236035 | 287.51454 | 48.93461 | 2.318711 | 14.65 |
11284185 | 284.144111 | 49.036 | 15.705099 | 13.66 |
11653122 | 286.610501 | 49.79567 | 2.650525 | 14.76 |
11922402 | 295.85559 | 50.28799 | 5.917946 | 15.93 |
12102573 | 286.24793 | 50.61843 | 10.942348 | 11.81 |
12105278 | 288.17649 | 50.67654 | 7.110381 | 12.32 |
12314646 | 295.343081 | 51.07426 | 13.579375 | 14.14 |
12458797 | 290.53101 | 51.30857 | 0.241319 | 13.89 |
12602567 | 290.586697 | 51.602235 | 10.241997 | 13.02 |