The following article is Open access

Applying Random Forest Classification to Ultracool Dwarf Discovery in Deep Surveys. I. Color Classification with SDSS, UKIDSS, and WISE Photometry


Published April 2022 © 2022. The Author(s). Published by the American Astronomical Society.
Citation: Zijie Gong et al 2022 Res. Notes AAS 6 74, DOI 10.3847/2515-5172/ac6521


Abstract

In this first of two studies, we apply a random forest model to classify ultracool dwarfs (UCDs) from broadband color information. Using the Skrzypek et al. ultracool dwarf sample and a set of background sources, we trained a random forest classifier based on 28 colors derived from optical and infrared photometry from SDSS, UKIDSS, and WISE. Our model achieves 99.7% accuracy in segregating L- and T-type UCDs from background sources, and 97% accuracy in separating spectral subgroups. A separate random forest regressor model achieved a spectral classification precision of 1.3 subtypes. We applied these models to a 12.6 deg² region with overlapping SDSS, UKIDSS, and WISE coverage and identified 35 UCD candidates. Five of these are previously reported sources, four of which are photometrically or spectroscopically classified UCDs. Our random forest model can be applied to multiple surveys to greatly expand the known census of UCDs.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Ultracool dwarfs (UCDs) are the lowest-mass stars and brown dwarfs, with effective temperatures Teff ≲ 3000 K and spectral classifications ≳M6 (Kirkpatrick 2005). As intrinsically faint sources, these objects are relatively rare in large imaging surveys. Nevertheless, several thousand UCDs have been found in wide-field optical and infrared imaging surveys such as SDSS (York et al. 2000), UKIDSS (Lawrence et al. 2007), and WISE (Wright et al. 2010), and in deep imaging surveys such as the Dark Energy Survey (Carnero Rosell et al. 2019). Identifying rare sources in rich data sets is an ideal problem for machine learning methods such as random forest (RF) classification (Breiman 2001), which has previously been used to classify M dwarfs (Hardegree-Ullman et al. 2019) and to perform star-galaxy classification (Miller et al. 2017; Clarke et al. 2020) using photometric data. Here, we explore the application of a hierarchical RF model to segregate and classify UCDs using multi-color photometry.

2. Methods

Our UCD training set was drawn from the compilation of Skrzypek et al. (2016), which includes 1341 photometrically classified late-M, L, and T dwarfs with photometry from the SDSS, UKIDSS, and WISE surveys. From this sample, we selected a subset of 233 sources with L and T photometric classifications and complete photometry in 8 photometric bands: SDSS i, z; UKIDSS Y, J, H, K; and WISE W1, W2. We also drew a sample of 4055 "background" (non-UCD) sources from a 1° radius circular field centered at α = 12h, δ = +10° with overlapping SDSS, UKIDSS, and WISE coverage. Our classification features comprised 28 colors derived from the measured photometry. We constructed two RF classifier models: one to segregate UCDs from non-UCDs, and a second to classify UCDs into four spectral type groups: L0–L4.5, L5–L9.5, T0–T4.5, and T5–T9.5. We also trained an RF regressor model to derive decimal classifications for the UCD sample. We used the scikit-learn package (Pedregosa et al. 2011) to design and train these RF models. From our sample, 76% was used as the training set, 9% as the validation set, and 15% as the test set.
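As a minimal sketch of the feature construction and data split described above, the following assumes that the 28 colors are all unique pairs of the 8 bands (8 choose 2 = 28) and that the UCD and background samples were pooled before a simple shuffled split; neither detail is stated explicitly in the text. The function names and the example magnitudes are hypothetical and for illustration only.

```python
from itertools import combinations
import random

# Bands in the order listed in the text (SDSS, UKIDSS, WISE).
BANDS = ["i", "z", "Y", "J", "H", "K", "W1", "W2"]


def make_colors(mags):
    """Return every pairwise color (earlier band minus later band).

    Assumption: the 28 colors are all unique band pairs.
    """
    return {f"{a}-{b}": mags[a] - mags[b] for a, b in combinations(BANDS, 2)}


def split_indices(n, frac_train=0.76, frac_val=0.09, seed=0):
    """Shuffle range(n) and cut it into train/validation/test index lists.

    The remaining fraction (15% by default) becomes the test set.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(frac_train * n)
    n_val = round(frac_val * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


# Hypothetical magnitudes for a single source (not taken from any catalog).
example_mags = {"i": 21.9, "z": 19.8, "Y": 18.7, "J": 17.6,
                "H": 16.9, "K": 16.3, "W1": 15.8, "W2": 15.5}
colors = make_colors(example_mags)

# 233 UCDs + 4055 background sources, pooled (an assumption).
train, val, test = split_indices(233 + 4055)
print(len(colors), len(train), len(val), len(test))
```

Under these assumptions the split of 4288 sources yields roughly 3259 training, 386 validation, and 643 test sources.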

The validation set was used to adjust the hyperparameters of the RF models, including the number of trees (30), tree depth (no limit), split criterion (Gini impurity), and use of bootstrapping. This tuning step was used to guard against both underfitting and overfitting, and to maximize our classification/regression metrics of precision, accuracy, recall, and F1 score (Chinchor 1992). The RF models were trained in the Google Colab environment (Carneiro et al. 2018).
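The hyperparameters listed above map directly onto scikit-learn's RandomForestClassifier. The sketch below is not the authors' code: it uses synthetic data as a stand-in for the 28 colors (one feature is artificially offset for the minority class), and the sample sizes, offset, and random seed are assumptions chosen only to make the example run.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the color table: 28 features per source.
n_bg, n_ucd = 400, 40
X = rng.normal(0.0, 0.3, size=(n_bg + n_ucd, 28))
y = np.concatenate([np.zeros(n_bg, dtype=int), np.ones(n_ucd, dtype=int)])
X[y == 1, 0] += 2.0  # make feature 0 informative for the "UCD" class

# Hold out 15% for testing, keeping the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0
)

# Hyperparameters quoted in the text: 30 trees, unlimited depth,
# Gini criterion, bootstrapping enabled.
clf = RandomForestClassifier(
    n_estimators=30,
    max_depth=None,
    criterion="gini",
    bootstrap=True,
    random_state=0,
)
clf.fit(X_train, y_train)

test_accuracy = clf.score(X_test, y_test)
importances = clf.feature_importances_  # one value per color, sums to 1
print(test_accuracy, int(np.argmax(importances)))
```

The feature_importances_ attribute is the kind of quantity a feature importance analysis would rank; here it simply recovers the artificially informative feature.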

3. Random Forest Performance

Figure 1 displays the confusion matrices for the UCD/non-UCD and UCD spectral group classifiers. The former achieved an accuracy of 99.7% on the test sample, while the latter achieved an accuracy of 97%. For the UCD/non-UCD classifier, our feature importance analysis found that the i − z color had the most predictive power for identifying UCD candidates. For the UCD spectral group classifier (Figure 1(c)), we found that the W1 − W2, K − W2, and i − J colors had the most predictive power for determining UCD spectral type. These colors also have clear monotonic trends with spectral type across the L and T dwarf sequence. We found reasonable agreement between the spectral types predicted by the RF regressor model and the types reported in Skrzypek et al. (2016), with an average classification error of 1.3 subtypes.
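For reference, the metrics named in Section 2 follow from the four cells of a binary confusion matrix. The sketch below uses hypothetical counts chosen only so that the accuracy comes out near 99.7%; they are not the values shown in Figure 1, and the function name is an illustrative choice.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    tp/fn: UCDs classified correctly/incorrectly;
    tn/fp: non-UCDs classified correctly/incorrectly.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for illustration only (NOT taken from Figure 1).
metrics = binary_metrics(tp=34, fp=1, fn=1, tn=607)
print({name: round(value, 3) for name, value in metrics.items()})
```

Because UCDs are a small minority of the sample, accuracy alone can look high even when a few UCDs are missed, which is why precision and recall for the UCD class are worth tracking separately.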


Figure 1. Random forest classification of ultracool dwarfs. (a) Confusion matrix for our UCD/non-UCD classifier. The numbers of sources classified in the test sample are indicated. Diagonal elements indicate correct classifications; off-diagonal elements indicate misclassifications. (b) Confusion matrix for UCD spectral group classification. (c) Feature importance plot indicating the relative importance (vertical axis) of various colors in determining UCD spectral group classifications. (d) Comparison of actual and predicted spectral types for UCDs based on the RF regressor model; the orange line indicates perfect agreement.


4. Application as Discovery Tool

Once the RF models were re-trained on the entire Skrzypek et al. (2016) sample, we applied them to a sample of 13,483 sources with overlapping SDSS, UKIDSS, and WISE photometry in a 2° radius circular field (12.6 deg²) centered at α = 10h, δ = +5°. Our UCD/non-UCD classifier selected 35 sources as candidate UCDs; our UCD spectral group classifier and spectral type regressor identified most of these as early L dwarfs with classifications between L1 and L6. Five of these sources have SIMBAD entries. One, J095924.95+061628.2, was identified as a candidate white dwarf by Gentile Fusillo et al. (2019), and hence is unlikely to be a UCD. The other four sources are all identified as spectroscopically confirmed or candidate UCDs, suggesting a reliability of ≈80% (four of five) for our RF model. Additional follow-up of the other candidates will allow us to quantify this reliability more accurately. At 80% reliability, the corresponding surface density of UCDs (2.2 deg⁻²) yields ∼9000 L and T dwarfs in the roughly 4000 deg² of overlap area between UKIDSS, SDSS, and WISE (Lodieu et al. 2017).
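The surface density and total yield quoted above can be reproduced with a few lines of arithmetic. This sketch uses only the numbers given in the text and treats the 2° radius field as a flat circle, which is an approximation that is adequate at this field size; the variable names are illustrative.

```python
import math

radius_deg = 2.0            # search field radius from the text
n_candidates = 35           # sources selected by the UCD/non-UCD classifier
reliability = 4 / 5         # four of the five SIMBAD matches are UCDs
overlap_area_deg2 = 4000.0  # approximate UKIDSS/SDSS/WISE overlap area

# Flat-sky area of the circular field, in square degrees.
field_area_deg2 = math.pi * radius_deg ** 2

# Expected true UCDs per square degree after correcting for reliability.
surface_density = n_candidates * reliability / field_area_deg2

# Extrapolate to the full overlap area.
expected_total = surface_density * overlap_area_deg2

print(f"field area      = {field_area_deg2:.1f} deg^2")
print(f"surface density = {surface_density:.1f} deg^-2")
print(f"expected UCDs   = {expected_total:.0f}")
```

The result, a little under 9000 sources, is consistent with the ∼9000 quoted in the text; the estimate rests on a reliability derived from only five sources, so its uncertainty is large.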

This research was conducted as part of the ENLACE bi-national summer research program at UC San Diego. We thank Dr. Olivia Graeve for organizing this program and for her mentorship. This research has made use of the SIMBAD database, operated at CDS, Strasbourg, France.

Software: astropy (Astropy Collaboration et al. 2018), astroquery (Ginsburg et al. 2019), scikit-learn (Pedregosa et al. 2011).
