Measuring Chemical Likeness of Stars with Relevant Scaled Component Analysis

Damien de Mijolla and Melissa K. Ness 2022 ApJ 926 193. DOI: 10.3847/1538-4357/ac46a0. Published 2022 February 24. © 2022. The Author(s). Published by the American Astronomical Society.


Abstract

Identification of chemically similar stars using elemental abundances is core to many pursuits within Galactic archeology. However, measuring the chemical likeness of stars using abundances directly is limited by systematic imprints of imperfect synthetic spectra in abundance derivation. We present a novel data-driven model that is capable of identifying chemically similar stars from spectra alone. We call this relevant scaled component analysis (RSCA). RSCA finds a mapping from stellar spectra to a representation that optimizes recovery of known open clusters. By design, RSCA amplifies factors of chemical abundance variation and minimizes those of nonchemical parameters, such as instrument systematics. The resultant representation of stellar spectra can therefore be used for precise measurements of chemical similarity between stars. We validate RSCA using 185 cluster stars in 22 open clusters in the Apache Point Observatory Galactic Evolution Experiment survey. We quantify our performance in measuring chemical similarity using a reference set of 151,145 field stars. We find that our representation identifies known stellar siblings more effectively than stellar-abundance measurements. Using RSCA, 1.8% of pairs of field stars are as similar as birth siblings, compared to 2.3% when using stellar-abundance labels. We find that almost all of the information within spectra leveraged by RSCA fits into a two-dimensional basis, which we link to [Fe/H] and α-element abundances. We conclude that chemical tagging of stars to their birth clusters remains prohibitive. However, using spectra directly offers a noticeable gain, and our approach is poised to benefit from larger data sets and improved algorithm designs.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

The field of Galactic astronomy has entered a transformative era. Large-scale surveys, such as the Apache Point Observatory Galactic Evolution Experiment (APOGEE), Gaia, and the Galactic Archaeology with HERMES (GALAH), are providing millions of high-quality spectroscopic and astrometric measurements of stars across the Milky Way (De Silva et al. 2015; Majewski et al. 2017; Gaia Collaboration et al. 2018). Future large-scale surveys, which will release even more high-quality data, are on the horizon (Bonifacio et al. 2016; de Jong et al. 2016; Kollmeier et al. 2017).

In this landscape of high-volume, high-quality stellar astronomy, fully extracting the scientifically relevant information from stellar spectra remains a difficult problem. Classically, this has been done by comparing observations to synthetic spectra generated from theoretical models (e.g., García Pérez et al. 2016). However, the precision with which stellar labels can be derived under such an approach is ultimately limited by the faithfulness with which synthetic spectra reproduce observations. Because of computational constraints and gaps in knowledge, synthetic spectra do not perfectly match observations, a discrepancy sometimes referred to as the "synthetic gap" (O'Briain et al. 2021). Computational models used to generate synthetic spectra rely on incomplete stellar line lists and usually must make simplifying assumptions, for example, that stellar atmospheres are one-dimensional, in hydrostatic equilibrium, and in local thermodynamic equilibrium. In addition, even beyond these issues, observations are affected by further systematics, such as telluric lines introduced by the Earth's atmosphere (e.g., Holtzman et al. 2015) and telescope imperfections/aberrations.

Ultimately this synthetic gap limits our ability to extract information from stellar spectra. In Ting et al. (2017) and Ting & Weinberg (2021) it was shown that stellar spectra contain more chemical information than is captured in bulk metallicity and α-enhancement alone. The precision of derived individual stellar abundances from large surveys, however, may be limited by an inability to fully extract information given approximate models, rather than by the signal-to-noise of observations.

This is problematic because much interesting science requires measuring the chemical similarity between stars with a precision beyond that currently delivered by the modeling pipelines of large stellar surveys. In particular, high-precision chemical measurements are needed for strong chemical tagging (Freeman & Bland-Hawthorn 2002). This is an ambitious Galactic archeology endeavor aiming to identify stellar siblings (stars born from the same molecular cloud) using chemical information derived from spectroscopy long after clusters gravitationally dissipate. In practice, whether such chemical tagging is theoretically possible at scale is still an open question, but it may be answered with large-scale surveys like GALAH (De Silva et al. 2015; Buder et al. 2021). For this form of chemical tagging to be successful, stellar siblings must share a near-identical chemical composition, and chemical compositions must vary sufficiently between clusters. Even if strong chemical tagging reveals itself to be impossible at large scale, precise chemical similarity measurements would still be useful in reconstructing the broad nature of our Galaxy's evolution (e.g., Coronado et al. 2020; Kamdar et al. 2020).

These issues motivate the development of methods capable of extracting information from stellar spectra while overcoming the synthetic gap between observations and theoretical models. Several data-driven methods have been developed for this purpose. Methods such as those proposed by Ness et al. (2015), Casey et al. (2016), Leung & Bovy (2018), Ting et al. (2019), O'Briain et al. (2021), and Das & Sanders (2019) improve the precision of stellar labels by leveraging data-driven interpolators between stellar spectra and labels, reducing the impact of noise and systematics on derived parameters. However, as such approaches still rely on synthetic spectra, they do not fully alleviate the systematic errors arising from mismatches between theoretical and observed spectra. Recently, methods for finding chemically similar stars directly from stellar spectra, without reliance on synthetic spectra, have been developed (Bovy 2016; Price-Jones & Bovy 2017; Cheng et al. 2021; de Mijolla et al. 2021). This category of method works by removing the effect of nonchemical parameters on stellar spectra, thus isolating the chemical information within the spectra. Such approaches are not without drawbacks. Although they remove the dependency on synthetic models, they still require a comprehensive and precise determination of all nonchemical factors of variation. Additionally, they must make simplifying assumptions about the cross-dependencies between chemical and nonchemical factors of variation, which may have an impact on accuracy.

In this paper, we present a new approach for identifying chemically similar stars from spectroscopic data, which we name relevant scaled component analysis (RSCA) because of its similarities with relevant component analysis (RCA; Shental et al. 2002). Our approach is grounded in the machine-learning subfield of metric learning. Instead of estimating individual chemical abundances, we project spectra directly into a lower-dimensional subspace in which distances between spectra are made to encode a useful notion of chemical similarity between stars. Crucially, as our approach for transforming stellar spectra does not rely at any stage on synthetic spectra or quantities derived from these, its performance is not hindered by inaccuracies in stellar modeling.

A novelty of our work is that instead of using synthetic spectra to learn this notion of chemical similarity, we make use of spectra from known open clusters with open-cluster membership information. Open clusters are groups of stars born together that remain gravitationally bound after birth and up to the present day. They are relatively rare, as most stellar clusters dissipate rapidly after birth (Portegies Zwart et al. 2010). However, they are extremely useful tools in modern Galactic astronomy. In particular, open clusters, which can be identified using astrometry, display near-identical chemical abundances, although a small scatter may exist at the 0.01–0.02 dex level, and up to ≲0.05 dex for some elements (e.g., Bovy 2016; Ness et al. 2018; Liu et al. 2019; Cheng et al. 2021). Open clusters have found many uses in modern astronomy, for example to obtain high-precision measurements of the radial abundance gradients in the Milky Way (Friel 1995; Magrini et al. 2017) or to benchmark and calibrate stellar-survey abundance measurements (García Pérez et al. 2016). Here, we use open clusters as a gold standard for learning a notion of chemical similarity. In our approach, we take the viewpoint that if open clusters are indeed chemically homogeneous, then a successful metric for encoding chemical similarity will be one in which open-cluster stellar siblings are highly clustered.

Our algorithm, RSCA, has several properties that make it suitable for the task at hand, of measuring chemical similarity between stars:

  • 1.  
    It is fully data driven. Chemical similarity is measured without any reliance on theoretical models. This offers a measure of chemical similarity that is independent of the systematics introduced in traditional stellar modeling (e.g., Jofré et al. 2017), and offers a means of validating existing discoveries.
  • 2.  
    It is computationally efficient. As the method is linear, processing spectra from the full APOGEE stellar survey can be done in minutes. The most computationally intensive step of the approach is a principal component analysis (PCA) decomposition.
  • 3.  
    It is interpretable. In its current formulation, measuring chemical similarity using our method amounts to evaluating Euclidean distances between stars projected on a hyperplane of the stellar spectra space.
  • 4.  
    It is precise. We find the method, using spectra, to be more effective at identifying stellar siblings from open clusters than is possible using stellar-abundance measurements. We believe this to be in large part because our method bypasses the synthetic gap introduced by spectral modeling. Furthermore, our experiments suggest that the performance could be further improved, for example, with a larger data set of open-cluster stars or by taking into account the error on the flux, which we do not currently do.

The paper is organized as follows. In Section 2.1, we outline the conceptual ideas behind our approach for measuring chemical similarity. We then briefly introduce PCA in Section 2.2, which is a core component of our algorithm. In Section 3, we present our algorithm, RSCA, in detail. This is implemented using open clusters observed by the APOGEE survey in Section 4, and evaluated in light of the field distribution of stars. Its trade-offs and implications are discussed in Section 5.

2. Concepts and Assumptions

2.1. Chemical Similarity as Metric Learning

The characteristics within a stellar spectrum are caused by the interplay of many factors of variation. These include chemical and physical parameters of the star and the instrumental systematics associated with the telescope, as well as interstellar dust along the line of sight. Measuring chemical similarities requires disentangling the imprint left on the spectra by chemical factors of variation from that left by the other nonchemical factors of variation. Our goal is to identify chemically similar stars from their spectra, for stars that span a range of physical stellar parameters (i.e., effective temperatures and surface gravities). We approach this task from a data-driven perspective, and build an algorithm for identifying stars that are as chemically similar as birth siblings, using open-cluster spectra.

For our method, we assume that open clusters are close to chemically homogeneous because of their common birth origin (Ness et al. 2018) but are not special in any other way (at least in terms of their spectra). That is to say, we assume that the only information within spectra useful for recognizing open clusters is their chemical information, and so a model that identifies open clusters from spectra must do so by extracting that chemical information.

We frame the task of building such a model recognizing open clusters as a metric-learning task. That is to say, we build a data-driven model converting stellar spectra into a representation in which Euclidean distances convey the uncalibrated probability of stars originating from a shared open cluster. To accomplish this, the training objective of our data-driven algorithm can be understood as transforming stellar spectra into a representation in which the distance between intracluster stars is minimized and the distance between intercluster stars is maximized.

Distances in the representation resulting from such an optimization procedure will organically quantify the chemical similarity of stars. Nonchemical factors of variation, such as stellar temperatures and instrumental systematics, will not contribute to the representation, as their presence would make distances between stellar siblings larger. Instead, such a representation will only contain those factors of variation of spectra that are discriminative of open clusters, i.e., the chemical factors of variation. Crucially, chemical factors of variation will contribute to distances in the representation in proportion to how precisely they can be estimated from stellar spectra. Stronger chemical features will be more strongly amplified than weaker chemical features.

The utility of this data-driven approach is that it is independent of imperfect model atmosphere approximations and other issues affecting synthetic spectra. This provides a high-fidelity technique to turn to specific applications within Galactic archeology, such as the chemical tagging of stars that are most chemically similar (Freeman & Bland-Hawthorn 2002).

The assumption underpinning our work is that this chemical information will be the only information within stellar spectra useful for distinguishing open clusters and so will be the only information captured by our model. If this assumption is true, then the representation induced by the model will measure a form of chemical similarity between stellar spectra.

However, since open clusters, in addition to sharing a common age and near-identical birth abundances, are also gravitationally bound, they can be identified from their spatial proximity if such information is available in the spectra. As such spatial information does not robustly transfer toward identifying dissolved clusters, it must not be captured by our model. In this work, we apply our algorithm to pseudo-continuum normalized spectra with diffuse interstellar bands masked, which we assume not to contain any information about spatial location, so that our representation after training contains only chemical information. Assuming that pseudo-continuum normalized spectra do not encode any spatial information is plausible, since after continuum normalization, the spectrum should not contain significant information about stellar distance. With the impact of reddening removed and diffuse interstellar bands masked, a spectrum should also not contain any information about the stellar extinction and interstellar medium along the line of sight of the star. We examine the validity of these assumptions in later sections. Ultimately, it is worth emphasizing that our method only exploits features in proportion to their discriminative power at recognizing open clusters. Therefore, we can expect our model not to rely heavily on nonrobust features (provided that these are significantly less informative than robust features). This also relies on our open-cluster training data being representative of the parameter space that should be marginalized out, i.e., our model does not learn to associate intercluster stars via $\mathrm{log}g$ and Teff, which could happen if the evolutionary state of the observed cluster stars were similar within clusters and different between clusters.

2.2. Principal Component Analysis

Our metric-learning algorithm, RSCA, first uses PCA to transform the data into a (lower-dimensional) basis that represents the primary variability of the ensemble of spectra we work with.

The principal components of a data set X, of shape ND × NF, containing ND data points and NF features, form an ordered orthogonal basis of the feature space with special properties. In the principal-component basis, basis vectors are ordered by the amount of variance they capture. They have the property that, for any k, the hyperplane spanned by the first k axes of the basis is the k-dimensional hyperplane that maximally captures the data variance. In PCA, the number of principal components used, k, is a hyperparameter controlling the trade-off between the amount of information preserved in the data set X after compression and the degree of compression.

The principal-component basis corresponds to the unit-norm eigenvectors of the covariance matrix of X ordered by eigenvalue magnitude. This can be obtained through diagonalization of the covariance matrix. The principal-component basis can also be formulated as the maximum-likelihood solution of a probabilistic latent model, which is known as probabilistic principal component analysis (PPCA; see Bishop 2006). This probabilistic formulation is useful in that it enables one to obtain the principal components for a data set containing missing values by marginalizing over these.

As we will make further use of later in this paper, the principal components also allow for generating a sphering transformation. This is a linear transformation of the data set to a new representation, in which the covariance matrix of the data set X is the identity matrix. Sphering using PCA is carried out by performing a change-of-basis to a modified principal-component basis in which the principal components are divided by the square root of their associated eigenvalues.
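To make the sphering step concrete, here is a minimal NumPy sketch of a PCA-based sphering transformation. This is our own illustration rather than the paper's released implementation; the function name pca_sphere is hypothetical.

```python
import numpy as np

def pca_sphere(X):
    """Sphere a data set X (N_D x N_F, rows = data points) via PCA.

    Returns the transformed data, whose sample covariance is the
    identity, and the matrix W parameterizing the linear map.
    """
    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)          # N_F x N_F covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenpairs, ascending order
    order = np.argsort(eigvals)[::-1]       # reorder by variance captured
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs / np.sqrt(eigvals)          # divide each PC by sqrt(eigenvalue)
    return Xc @ W, W
```

After this transformation, `np.cov(Xc @ W, rowvar=False)` is the identity matrix up to numerical precision.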

3. Relevant Scaled Component Analysis Algorithm

The inputs to RSCA are individual stellar spectra, some of which belong to open clusters, and some of which form a reference field sample. The output of RSCA is, for each spectrum, an NK-dimensional vector whose dimensions are scaled such that distances between the vectors of pairs of stars encode the chemical similarity between those stars. We step through this in detail below.

3.1. Overview

Let us define Xclust as the matrix representation of a data set containing the spectra of known open-cluster stars. Analogously, let us define Xpop as the matrix representation of a larger data set of stellar spectra in the field (with unknown cluster membership). These matrices are, respectively, of shapes ${N}_{{{\rm{d}}}_{\mathrm{clust}}}\times {N}_{{\rm{b}}}$ and ${N}_{{{\rm{d}}}_{\mathrm{pop}}}\times {N}_{{\rm{b}}}$, where ${N}_{{{\rm{d}}}_{\mathrm{clust}}}$ is the number of open-cluster stars, ${N}_{{{\rm{d}}}_{\mathrm{pop}}}$ the number of stars in the large data set, and Nb the number of spectral bins. For our purposes, we assume access to only a limited number of open-cluster stars, such that ${N}_{{{\rm{d}}}_{\mathrm{clust}}}\ll {N}_{{{\rm{d}}}_{\mathrm{pop}}}$. We also assume that the spectra in these matrices are pseudo-continuum normalized spectra, with diffuse interstellar bands masked in a process following that described in Appendix A. Pseudo-continuum spectra are normalized rest-frame spectra in which the effects of interstellar reddening and atmospheric absorption are removed, in a process described in Majewski et al. (2017).

RSCA takes as inputs Xclust and Xpop. Through a series of linear transformations, RSCA maps these matrices into new matrices of shapes ${N}_{{{\rm{d}}}_{\mathrm{clust}}}\times {N}_{{\rm{K}}}$ and ${N}_{{{\rm{d}}}_{\mathrm{pop}}}\times {N}_{{\rm{K}}}$, whose entries are the stellar spectra transformed to a metric-learning representation of dimensionality K. Euclidean distances in this new metric-learning representation can then be used to measure chemical similarity between spectra. As all the steps of RSCA are linear transformations, the mapping converting from spectra to the metric-learning representation can be parameterized by a single NK × Nb matrix and used to convert unseen spectra (or for visualization purposes), as sketched below.
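Since each step is linear, the whole pipeline can be collapsed into a single matrix applied to new spectra. Below is a minimal sketch of this composition, with hypothetical per-step matrices and ignoring mean-centering offsets for brevity.

```python
import numpy as np

# Hypothetical matrices produced by the four steps:
# W_pca : (N_b, N_K) PCA compression      (Step 1)
# W_sph : (N_K, N_K) sphering             (Step 2)
# W_rep : (N_K, N_K) reparameterization   (Step 3)
# s     : (N_K,)     per-dimension scales (Step 4)
def compose_rsca(W_pca, W_sph, W_rep, s):
    """Collapse the four linear RSCA steps into one (N_b, N_K) map."""
    return W_pca @ W_sph @ W_rep * s  # broadcasting rescales each column

# Unseen spectra (one per row) then map to the metric representation as:
# Z_new = X_new @ W_rsca, with W_rsca = compose_rsca(W_pca, W_sph, W_rep, s)
```

This is the transpose convention of the NK × Nb matrix described above, acting on spectra stored as rows.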

We provide in Figure 1 a graphical depiction of the linear transformations involved in the RSCA algorithm. RSCA works by first projecting the spectra onto a set of basis vectors with PCA (Step 1). For visualization purposes this basis is made two-dimensional, although it would normally be higher dimensional. After this PCA compression (Step 1), stellar siblings are represented as same-colored dots whose x, y coordinates correspond to coordinates in the PCA basis. Once in the PCA basis, a series of linear transformations are applied to the data. For improved clarity, we keep the data fixed throughout our algorithm visualization and represent linear transformations as changes of basis (black arrows). Steps 2 and 3 of the algorithm find a new basis that more aptly captures spectral variability among stellar siblings, and Step 4 of the algorithm rescales the basis vectors of this basis based on a comparison between their spectral flux variance among stellar siblings and among field stars. The outcome of the RSCA algorithm is a new representation of the spectra in which dimensions that are unhelpful for discriminating stellar siblings are minimized in amplitude (through a stretching-out of basis vectors). Conversely, dimensions that are helpful in recognizing stars within the same open clusters are made larger (through a squeezing of basis vectors). The K-dimensional vector for each star can be collapsed into a measure of chemical similarity through the Euclidean distance between the scaled representations output from RSCA, $d=\sqrt{{\sum }_{k=1}^{K}{({n}_{k}-{n}_{k}^{{\prime} })}^{2}}$ for any pair of stars $n,n^{\prime} $.


Figure 1. Schematic depiction of RSCA. The algorithm proceeds by first encoding stellar spectra into a lower-dimensional representation, made two-dimensional for illustrative purposes. In this representation, stellar siblings, which are represented by same-colored dots, are not initially identifiable by their Euclidean distance in the basis (represented by black arrows). The objective of the metric-learning algorithm (dashed blue) is to find a new basis in which distances are informative about which stars are stellar siblings. This objective is realized through three linear steps: a sphering transformation on the data set, a reparametrization to a suitable basis, and a scaling of the basis vectors.


We now walk step-by-step through the successive linear transformations involved in the RSCA algorithm. To follow along, pseudo-code for RSCA is provided in Appendix D, and the full source code of our project, which contains a Python implementation, is made available online.

3.2. Step 1: Compress the Spectra with PCA to Reduce the Risk of Overfitting

In the first step of our approach, denoted as (1) "Compress spectra" in Figure 1, we apply PCA to Xpop to convert the population of stellar spectra into a lower-dimensional representation. This dimensionality-reduction step serves to make the algorithm more data efficient, which is crucial given the risk of overfitting from the small number of open clusters within our data set.

As some spectral bins are flagged as untrustworthy, we use PPCA, a variant of the PCA algorithm that can accommodate missing values. After finding the principal components of Xpop, we compress the data by discarding all but the K largest principal components, where K is a hyperparameter requiring tuning. Then, the data sets Xpop and Xclust are each projected onto these K basis vectors, yielding Zpop and Zclust, the representations of the spectra in the PCA basis of Xpop. These have shapes of ${N}_{{{\rm{d}}}_{\mathrm{pop}}}\times {N}_{{\rm{K}}}$ and ${N}_{{{\rm{d}}}_{\mathrm{clust}}}\times {N}_{{\rm{K}}}$.
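A sketch of this compression step is below. Where the paper marginalizes over censored bins with PPCA, we mean-impute them as a crude stand-in to keep the example short; compress_spectra and its arguments are hypothetical names.

```python
import numpy as np
from sklearn.decomposition import PCA

def compress_spectra(X, censored, K=30):
    """Step 1: project spectra onto the K leading principal components.

    X        : (N_d, N_b) pseudo-continuum-normalized spectra
    censored : (N_d, N_b) boolean mask, True where a bin is untrustworthy
    """
    X = X.copy()
    # Crude stand-in for PPCA marginalization: replace censored bins by
    # the per-wavelength mean of the uncensored measurements.
    col_means = np.nanmean(np.where(censored, np.nan, X), axis=0)
    X[censored] = np.broadcast_to(col_means, X.shape)[censored]
    pca = PCA(n_components=K)
    Z = pca.fit_transform(X)  # (N_d, K) compressed representation
    return Z, pca
```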

3.3. Metric Learning: Sphering, Reparameterization, and Rescaling

Step (1) of our procedure, the PCA compression, is a preprocessing step. The steps that follow fall into the realm of a general-purpose metric-learning algorithm. These rely on assumptions about the PCA-compressed spectra being satisfied. Performance should be robust to small departures from these assumptions, but will still ultimately be tied to how well these assumptions are respected. We lay out our assumptions for steps (2)–(4) below.

3.3.1. Assumptions

First, we assume that the data (i.e., spectra) in Zpop are well approximated as being drawn from a multivariate Gaussian distribution. That is to say, if we define μpop and Σpop as the mean and covariance of Zpop, then the stars within Zpop can be treated as samples drawn from zpop ∼ N(μpop, Σpop).

Next, we make the assumption that individual clusters are themselves approximately Gaussian in the PCA-compressed space. That is to say, we posit that the members of open clusters are well approximated as being samples drawn from a distribution zclust ∼ N(μclust, Σclust). Crucially, we assume that all open clusters share the same covariance matrix, Σclust, and only differ in their mean, μclust. This is perhaps our strongest and most important assumption: the stars within different clusters are distributed following a shared covariance matrix (i.e., clusters have the same shape irrespective of their location in the representation). It is this assumption that allows a linear transformation, i.e., a transformation that acts the same across the whole representation, to be an effective approach for measuring chemical similarity. This assumption of what is effectively cluster translation invariance can be interpreted as assuming that the scatter among stellar spectra in physical and chemical parameters should be the same for all clusters irrespective of the clusters' parameters, which is a sensible assumption. Connecting these assumptions back to Figure 1, each of steps 2–4 requires that stars in any population, within clusters and within the field, follow a multivariate Gaussian with an invariant covariance matrix for each individual cluster.

3.3.2. Step 2: Sphering to Transform the Population Covariance Matrix into the Identity Matrix

Together, the sphering and reparametrization steps of RSCA transform the spectra to a vector representation in which Σpop and Σclust, the covariance matrices of the field stars and clusters, respectively, are diagonal matrices. As there are then no off-diagonal terms in the covariance matrices, this ensures that the variances along basis vectors fully capture the covariance information among stellar siblings and among field stars.

In Step (2) of our algorithm, "Sphere" in Figure 1, we linearly transform the vector representation of spectra such that the data set Zpop has an identity covariance matrix after transformation. This linear transformation takes the form of a sphering transformation applied to Zpop, where in our experiments we use the PCA-sphering scheme. After this step, the field population has unit variance along every direction, so the variability among stellar spectra and among stellar siblings is fully captured by the subsequent steps.

3.3.3. Step 3: Reparameterization to Diagonalize the Cluster Covariance Matrix

Steps 1 and 2 of RSCA operate on the field population as a whole. In steps 3 and 4, we transform and scale the representation so that stellar-sibling likeness can be recognized against the field, and so we consider the variability of stars within individual clusters.

The next step of our algorithm after the sphering transform, (3) "Reparametrize" in Figure 1, is a change of basis in which the covariance matrix, Σclust, is diagonalized. Since we do not have direct access to Σclust, as an intermediary step, a data set approximately distributed according to N(0, Σclust) is created from the open-cluster data set. This is done by subtracting from each star's vector representation the mean representation of all stars belonging to the same cluster, ${\widehat{\mu }}_{\mathrm{clust}}$, such that each cluster becomes zero-centered. It is worth noting that as ${\widehat{\mu }}_{\mathrm{clust}}$ is estimated from a limited number of samples, it will not exactly match the true μclust, and so the resultant population will only approximately be distributed according to N(0, Σclust).

As PCA basis vectors correspond to (unit-norm) eigenvectors of covariance matrices, the PCA basis obtained by applying PCA to the zero-centered open-cluster data set parametrizes a transformation to a representation in which Σclust is (approximately) diagonal. A change-of-basis to this PCA basis thus parametrizes the desired diagonalization of the cluster covariance matrix. Because the basis vectors of the PCA basis have by construction unit-norm, the covariance matrix Σpop will still be the identity matrix after this change-of-basis, and hence both Σclust and Σpop will be diagonal matrices as desired.
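The two steps can be sketched together in a few lines: the estimate of Σclust comes from the pooled, zero-centered cluster residuals, and the final rotation is orthonormal, so the field covariance stays the identity. The names below are hypothetical, and this is a sketch under the assumptions of Section 3.3.1 rather than the released implementation.

```python
import numpy as np

def sphere_and_reparametrize(Z_pop, Z_clust, cluster_ids):
    """Steps 2-3: sphere using the field sample, then rotate so that the
    (pooled) within-cluster covariance is approximately diagonal."""
    # Step 2: sphering transform estimated on the field population.
    mu = Z_pop.mean(axis=0)
    lam, V = np.linalg.eigh(np.cov(Z_pop - mu, rowvar=False))
    W_sph = V / np.sqrt(lam)  # field covariance -> identity
    S_pop, S_clust = (Z_pop - mu) @ W_sph, (Z_clust - mu) @ W_sph

    # Step 3: zero-center each cluster, then diagonalize the covariance
    # of the pooled residuals with an orthonormal change of basis R.
    resid = np.vstack([S_clust[cluster_ids == c]
                       - S_clust[cluster_ids == c].mean(axis=0)
                       for c in np.unique(cluster_ids)])
    _, R = np.linalg.eigh(np.cov(resid, rowvar=False))
    return S_pop @ R, S_clust @ R  # field covariance remains the identity
```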

3.3.4. Step 4: Scaling to Maximize Discriminative Power in Identifying Chemically Similar Stars

In the final step of the metric-learning algorithm, denoted by (4) "Rescale" in Figure 1, basis vectors are scaled in proportion to their dimension's usefulness at recognizing open clusters. This is done by applying a separate scaling factor to each dimension of the representation. Here, the independent scaling of dimensions is justified by Σclust and Σpop being diagonal covariance matrices.

We use the individual clusters within Xclust and the field population Xpop to measure the variance along each dimension and thereby determine our scaling factor. We design our scaling factor such that, after transformation, distances between pairs of random stars quantify the ratio between the probability that a pair of stars originates from the same (open) cluster and the probability that it does not. To put it another way, we seek to scale dimensions such that pairs of stars that are more likely to originate from the same cluster, as compared to originating from different clusters, have a smaller separation (i.e., Euclidean distance) in the representation than less-likely pairs.

Under our set of assumptions, along a dimension $i$ among the $K$ dimensions, random stellar siblings (intracluster stars) are distributed as ${z}_{\mathrm{clust},i}\sim N({\mu }_{\mathrm{clust},i},{\sigma }_{\mathrm{clust},i}^{2})$, where ${\mu }_{\mathrm{clust},i}$ and ${\sigma }_{\mathrm{clust},i}$ are the mean and standard deviation along dimension $i$ (at this stage in the algorithm). Accordingly, using the standard formula for the sum of normally distributed variables, the one-dimensional distance along $i$ between random stellar siblings ${z}_{\mathrm{clust},1i}$ and ${z}_{\mathrm{clust},2i}$ follows a half-normal distribution, ${d}_{\mathrm{clust},i}=|{z}_{\mathrm{clust},1i}-{z}_{\mathrm{clust},2i}|\sim |N(0,2{\sigma }_{\mathrm{clust},i}^{2})|$. Likewise, the distance ${d}_{\mathrm{pop},i}$ between pairs of random field stars follows a similar half-normal distribution, ${d}_{\mathrm{pop},i}\sim |N(0,2{\sigma }_{\mathrm{pop},i}^{2})|$, where ${\sigma }_{\mathrm{pop},i}$ is the standard deviation among field stars along dimension $i$.

For a pair of stars observed a distance $d_i$ away from each other along a dimension $i$, the ratio between the probability of the pair originating from the same cluster (intracluster) and the probability of the pair not originating from the same cluster (intercluster) is

$${r}_{i}=\frac{p({d}_{i}\,|\,\mathrm{intracluster})}{p({d}_{i}\,|\,\mathrm{intercluster})}=\frac{\frac{1}{{\sigma }_{\mathrm{clust},i}\sqrt{\pi }}\exp \left(-\frac{{d}_{i}^{2}}{4{\sigma }_{\mathrm{clust},i}^{2}}\right)}{\frac{1}{{\sigma }_{\mathrm{pop},i}\sqrt{\pi }}\exp \left(-\frac{{d}_{i}^{2}}{4{\sigma }_{\mathrm{pop},i}^{2}}\right)},\tag{1}$$

which evaluates to (as distances $d_i$ are by design greater than 0)

$${r}_{i}={A}_{i}\exp \left(-\frac{{d}_{i}^{2}}{2{\sigma }_{{r}_{i}}^{2}}\right),\tag{2}$$

where

$${A}_{i}=\frac{{\sigma }_{\mathrm{pop},i}}{{\sigma }_{\mathrm{clust},i}}\tag{3}$$

and

$${\sigma }_{{r}_{i}}^{2}=\frac{2\,{\sigma }_{\mathrm{clust},i}^{2}\,{\sigma }_{\mathrm{pop},i}^{2}}{{\sigma }_{\mathrm{pop},i}^{2}-{\sigma }_{\mathrm{clust},i}^{2}}.\tag{4}$$

As dimensions are assumed to be independent, the probability ratio accounting for all dimensions is the product of the probability ratios of the separate dimensions:

$$r=\prod _{i=1}^{K}{r}_{i}=C\,\exp \left(-\sum _{i=1}^{K}\frac{{d}_{i}^{2}}{2{\sigma }_{{r}_{i}}^{2}}\right),\tag{5}$$

where $C={\prod }_{i=1}^{K}{A}_{i}$.

From this expression, it can be seen that multiplying each dimension by a scaling factor of $\tfrac{1}{{\sigma }_{{r}_{i}}}$ leads to a representation in which the probability ratio, r, is a monotonic function of Euclidean distance, such that pairs of stars with a smaller Euclidean separation have a higher probability of originating from the same open cluster, as compared to their probability of originating from different clusters, than pairs with a larger Euclidean separation.

Scaling dimensions by $\tfrac{1}{{\sigma }_{{r}_{i}}}$ thus induces a representation in which distances, d, measured in the scaled representation directly encode the probability ratio of stars originating from the same cluster, as desired for our metric-learning approach. Using this scaling factor, however, requires evaluating the ${\sigma }_{\mathrm{clust},i}$'s and ${\sigma }_{\mathrm{pop},i}$'s along all dimensions. Because the representation has been sphered, the population's standard deviation is unity along all directions (${\sigma }_{\mathrm{pop},i}=1$). We estimate the intracluster standard deviations using a pooled variance estimator:

$${\widehat{\sigma }}_{\mathrm{clust},i}^{2}=\frac{{\sum }_{j}({n}_{j}-1)\,{\sigma }_{{ji}}^{2}}{{\sum }_{j}({n}_{j}-1)},\tag{6}$$

where ${\sigma }_{{ji}}^{2}$ refers to the sample variance along dimension i for the sample of stars belonging to an open cluster j containing nj stars in Xclust. To make the algorithm more robust to the presence of any outliers in the data set, such as misclassified stellar siblings, we use the median absolute deviation (MAD) as an estimator for the sample standard deviation, σji. That is,

$${\sigma }_{{ji}}=1.4826\times \mathrm{median}\left(|{X}_{i}-\tilde{X}|\right),\tag{7}$$

where Xi and $\tilde{X}$ are, respectively, the data values and median along a dimension, and 1.4826 is the factor that makes the MAD a consistent estimator of the standard deviation for normally distributed data.

To better understand the effect of our scaling factor on the representation it is applied to, it is instructive to look at its impact on the distances between stars along the dimensions of a representation. When stars belonging to the same cluster have, along a dimension, a standard deviation similar to that of the full population of stars (i.e., σclust ≈ σpop), the dimension carries no information for recognizing cluster member stars, and the scaling factor accordingly suppresses it (σr → ∞, so the scaling factor 1/σr → 0). On the other hand, for dimensions where the population's standard deviation, σpop, is significantly larger than the cluster standard deviation, σclust, the population's standard deviation is no longer relevant and σr ∝ σclust. That is to say, the scaling devolves into measuring distances in units of the intracluster standard deviation.
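A sketch of the rescaling step, combining the pooled MAD estimator of Equations (6)–(7) with the scaling of Equation (4) (taking σpop = 1 after sphering), is below; the function and variable names are ours, not the paper's.

```python
import numpy as np

def rescale(S_pop, S_clust, cluster_ids):
    """Step 4: divide each dimension by sigma_r_i (Equation (4))."""
    num, dof = 0.0, 0
    for c in np.unique(cluster_ids):
        members = S_clust[cluster_ids == c]
        # MAD-based robust estimate of the intracluster scatter (Eq. (7)).
        mad = np.median(np.abs(members - np.median(members, axis=0)), axis=0)
        sigma_j = 1.4826 * mad
        num += (len(members) - 1) * sigma_j**2  # pooled variance (Eq. (6))
        dof += len(members) - 1
    sigma_clust2 = num / dof
    # Equation (4) with sigma_pop = 1; clip so that dimensions with
    # sigma_clust >= sigma_pop are (numerically) fully suppressed.
    sigma_clust2 = np.minimum(sigma_clust2, 1.0 - 1e-9)
    sigma_r = np.sqrt(2.0 * sigma_clust2 / (1.0 - sigma_clust2))
    return S_pop / sigma_r, S_clust / sigma_r
```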

4. Experiments on APOGEE Data

We validate our approach for encoding chemical similarity by testing its performance on real data from the APOGEE survey Data Release 16 (DR16; Ahumada et al. 2020). The APOGEE survey (Majewski et al. 2017) is an infrared, high-resolution, high signal-to-noise spectroscopic survey. The survey uses a 300-fiber spectrograph (Wilson et al. 2019) installed on the Sloan Foundation 2.5 m Telescope at Apache Point Observatory (Gunn et al. 2006) and a similar spectrograph on the du Pont telescope at Las Campanas Observatory (Bowen & Vaughan 1973).

4.1. Data Set Preparation

For our experiments we use spectra from the public APOGEE DR16 (Ahumada et al. 2020) to create Xclust and Xpop, our data sets of open-cluster and field stars. Our field data set, Xpop, contains spectra for 151,145 red-giant-like stars matching a set of quality cuts on the sixteenth APOGEE data release described below. Our open-cluster data set, Xclust, contains spectra for 185 stars distributed across 22 open clusters, obtained after further quality cuts using the Open Cluster Chemical Abundance and Mapping (OCCAM) value-added catalog (Donor et al. 2020), a catalog containing information about candidate open clusters observed by APOGEE. We also create baseline data sets, Y and Yclust, containing stellar abundances for the stars in Xpop and Xclust. We include abundances for 21 species in Y and Yclust: C, C I, N, O, Na, Mg, Al, Si, S, K, Ca, Ti, Ti II, V, Cr, Mn, Fe, Co, Ni, Cu, and Ce. These abundances are derived from the X_H entry in the allStar FITS file.

To create the data set of field stars, Xpop, we make the following cuts. With the intention of only preserving red-giant stars, we discard all but those stars for which 4000 < Teff < 5000 K and $1.5\lt \mathrm{log}g\lt 3.0$ dex, where we use the Teff and $\mathrm{log}g$ derived by the APOGEE Stellar Parameter and Chemical Abundances Pipeline (ASPCAP). In addition, we exclude any stars for which some stellar abundances of interest were not successfully estimated by the ASPCAP pipeline, by removing any star containing abundances set to −9999.99 for any of our 21 species of interest. These abundance cuts are applied using the X_H and X_Fe ASPCAP fields rather than the named elemental abundance fields. This selection excludes an additional ∼5% of stars. We also exclude all spectra for which the STAR_BAD flag is set in ASPCAPFLAG. The pseudo-continuum spectra of the remaining stars, as found in the AspcapStar FITS file, were used to create the matrix Xpop, in which each row contains the spectrum of one star. As described in Majewski et al. (2017), this pseudo-continuum normalization procedure involves a series of transformations that aim to standardize the stellar spectra. These involve shifting to the star's rest frame, removing interstellar reddening and atmospheric absorption effects, and normalizing the spectrum by the fitted pseudo-continuum.

To create the data set of open-cluster member stars, Xclust, we cross-match our filtered data set with the OCCAM value-added catalog (Donor et al. 2020) so as to identify all candidate open clusters observed by APOGEE. We only keep spectra of stars with open-cluster membership probability CG_PROB > 0.8 (Cantat-Gaudin et al. 2018). After this cross-match, we further filter the data set by removing those clusters containing only a single member star, as these are not useful for us. Additionally, we discard one star, with APOGEE ID 2M19203303+3755558, found to have a highly anomalous metallicity. After this procedure, 185 OCCAM stars remain, distributed across 22 clusters. We do not cut any stars based on their signal-to-noise ratio. The stars in Xpop have a median signal-to-noise ratio of 157.2 and an interquartile range of 102.0–272.4, while those in Xclust have a median signal-to-noise ratio of 191.4 and an interquartile range of 117.7–322.6.

Because of cosmic rays, bad telluric line removal, or instrumental issues, the measurements for some bins of stellar spectra are untrustworthy. We censor such bad bins to prevent them from impacting our low-dimensional representation. Censored bins are treated as missing values in the PPCA compression. In this work, we have censored any spectral bin for which the error (as found in the AspcapStar FITS file error array) exceeds a threshold value of 0.05. Additionally, we censor for all stars in the data set those wavelength bins located near strong interstellar absorption features. More details about the model-free procedure for censoring interstellar features can be found in Appendix A.

4.2. Measuring Chemical Similarity

Evaluating how good a representation is at measuring the chemical similarity of stars requires a goodness-of-fit indicator for assessing the validity of its predictions. We use the "doppelganger rate" as our indicator. This is defined as the fraction of random pairs of stars appearing as similar to, or more similar than, stellar siblings according to the representation, where similarity is measured in terms of distance, d, in the studied representation. It is worth noting that this procedure for estimating doppelganger rates is related to but different from the probabilistic approach presented in Ness et al. (2018).

We estimate doppelganger rates on a per-cluster basis by measuring distances between pairs of stars in the RSCA output representation. For each cluster in Xclust, the doppelganger rate is calculated as the fraction of pairs composed of one cluster member and one random star whose distance in the studied representation, dinter−family, is less than the median distance among all cluster pairs, dintra−family. That is, dintra−family are pairs composed of two confirmed cluster members within Xclust, and dinter−family are pairs composed of one random field star selected from Xpop and one cluster member from the studied cluster in Xclust. When calculating dinter−family, we only consider pairs of stars with similar extinction and radial velocity, that is to say, with ΔAK_TARG < 0.05 and ΔVHELIO_AVG < 5. By only comparing stars at similar extinctions and similar velocities, we ensure that any model being investigated cannot reduce its doppelganger rate by exploiting extinction or radial velocity information in the spectra.

So as to facilitate comparisons between different representations, we aggregate the per-cluster doppelganger rates into a "global" doppelganger rate, which gives an overall measurement of a representation's effectiveness at identifying open clusters. The global doppelganger rate is obtained by averaging the per-cluster doppelganger rates through a weighted average in which clusters are weighted by their size in Xclust.
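A minimal sketch of the per-cluster statistic is below, omitting the extinction and radial velocity pair cuts described above; the array names are hypothetical.

```python
import numpy as np

def doppelganger_rate(R_clust, cluster_ids, R_pop, target):
    """Fraction of (member, field star) pairs closer than the median
    intracluster pair distance, for one cluster `target`."""
    members = R_clust[cluster_ids == target]
    # Median Euclidean distance over all distinct intracluster pairs.
    intra = [np.linalg.norm(a - b)
             for i, a in enumerate(members) for b in members[i + 1:]]
    d_med = np.median(intra)
    # Distances from every member to every field star.
    inter = np.linalg.norm(members[:, None, :] - R_pop[None, :, :], axis=-1)
    return np.mean(inter < d_med)
```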

There is an added subtlety to assessing a representation through its global doppelganger rate. There are very few open-cluster stars in the data set. Therefore, RSCA as a data-driven procedure applied to open clusters is susceptible to overfitting to the open-cluster data set. To prevent overfitting from affecting results, we carry out a form of cross-validation in which clusters are excluded from the data set used for the derivation of their own doppelganger rate. In this scheme, calculating the global doppelganger rate of an RSCA representation requires repeated application of our algorithm, each time on a different subset with one cluster removed, as many times as there are open clusters.
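The leave-one-cluster-out loop can then be sketched as follows, where fit_rsca is a hypothetical callable bundling Steps 1–4 and returning a function that maps spectra to the metric representation.

```python
import numpy as np

def global_doppelganger_rate(X_clust, cluster_ids, X_pop, fit_rsca):
    """Cluster-size-weighted average of leave-one-cluster-out rates."""
    rates, weights = [], []
    for c in np.unique(cluster_ids):
        train = cluster_ids != c  # hold out the cluster being evaluated
        to_metric = fit_rsca(X_pop, X_clust[train], cluster_ids[train])
        rates.append(doppelganger_rate(to_metric(X_clust), cluster_ids,
                                       to_metric(X_pop), target=c))
        weights.append(np.sum(~train))  # weight by cluster size
    return np.average(rates, weights=weights)
```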

We caution that our cross-validation approach has some implications for the derived doppelganger rates. Because every cluster's doppelganger rate is evaluated on a slightly different data subset, the quoted distances and doppelganger rates are not comparable from cluster to cluster.

4.3. PCA Dimensionality

The number of principal components used in the compression or encoding stage of RSCA (Step 1) is an important hyperparameter requiring tuning. In Figure 2, we plot the doppelganger rate against the number of principal components, both with and without using the cross-validation procedure described in Section 4.2. Results without cross-validation display significant overfitting and are mostly shown in an effort to highlight the importance of the cross-validation procedure.


Figure 2. Global doppelganger rates as a function of the number of PCA components used to encode spectra. Performance with cross-validation is shown in blue while performance without cross-validation is shown in green.


This figure illustrates how, unsurprisingly, RSCA's performance is strongly dependent on the PCA dimensionality. Doppelganger rates decrease with increasing PCA dimensionality up to a dimensionality of 30. At K > 30, doppelganger rates start increasing because of overfitting. Since RSCA reaches its peak performance on our data set at 30 PCA components (compared to 7514 bins in the raw spectral representation), all further quoted results and figures use the first 30 PCA components.

That RSCA's performance still improves up to a dimensionality of 30 is interesting. This demonstrates that a hyperplane of dimension at least 30 is required for capturing the intrinsic variations of APOGEE stellar spectra, a number noticeably larger than the 10-dimensional hyperplane found in Price-Jones & Bovy (2017). Methodological differences between studies may partially explain this discrepancy. For example, the PCA fit in Price-Jones & Bovy (2017) was applied to spectra displaying limited instrumental systematics and with nonchemical imprints on the spectra removed beforehand. It should also be noted that this does not mean that the chemical space is 30-dimensional. Some PCA dimensions may capture instrumental systematics or nonchemical factors of variation, such as residual sky absorption and emission and interstellar dust imprints. Also, since chemical species leave nonlinear imprints and PCA is linear, each nonlinear chemical dimension may require multiple PCA components to be fully captured.

4.4. RSCA Interpretability

We now study which spectral features are leveraged by the RSCA algorithm when recognizing open clusters. In the RSCA rescaled basis (Step 4), the dimensions are scaled in proportion to their perceived usefulness at measuring chemical similarity. Therefore, the factors of variation judged most important by RSCA will correspond to the most strongly scaled dimensions of the representation (i.e., those with the largest $\tfrac{1}{{\sigma }_{{r}_{i}}}$). Figure 3 shows the relationship between [Fe/H] and the three most strongly scaled features for a representation obtained by running the RSCA algorithm with a PCA dimensionality of 30.


Figure 3. Three features judged most important by the metric-learning approach plotted against [Fe/H] for the 151,145 stars in Xpop and the fourth-most-important feature plotted against VHELIO_AVG (radial velocity ASPCAP label). Locations of the 185 stars in Xclust (the open-cluster data set used to train the metric-learning model) are shown by orange markers.


As seen from the left-most panel, there is a close relationship between "Feature #1", the RSCA dimension with the largest associated scaling factor, and the ASPCAP [Fe/H] label. The relationship is close to a one-to-one mapping, which illustrates how this feature traces the metallicity content of the stellar spectra. Because "Feature #1" (as a direction in a hyperplane of stellar spectra) is a linear function of the stellar spectra and metallicity is a nonlinear feature, some degree of scatter in the relationship is expected.

The relationship between "Feature #2" and [Fe/H] (second panel) exhibits the same bimodality as observed when plotting α enhancements [α/Fe] against [Fe/H] (Leung & Bovy 2018). This indicates that "Feature #2" captures α-element enhancements. It is particularly noteworthy that we are able to recover the α-element bimodality when the open clusters in our data set (orange markers) are located only in the low-α sequence. This demonstrates the metric model's capacity to extrapolate to abundance values outside the range of values covered in the open-cluster training data set. This provides evidence that the model may still be effective for stars atypical of those in the open-cluster data set.

The relationship between "Feature #3" and metallicity (third panel) is not as easily interpreted. Given the nonlinear nature of metallicity, it is possible that it encodes residual metallicity variability not captured by "Feature #1", but it is also possible that it contains some further independent chemical dimension.

This figure illustrates a nice property of RSCA. Because dimensions of the RSCA representation correspond to eigenvectors of the covariance matrix, Σclust, the RSCA algorithm, at least to first order, assigns the distinct factors of variation within spectra to separate dimensions of the representation. That is to say, that the dimensions of the RSCA representation capture distinct factors of variation, such as the metallicity or the α-element abundance, rather than a combination of factors of variation. Additionally, the most important factors of variation for recognizing open clusters occupy the dimensions with the largest scaling factors. This property makes RSCA particularly versatile. For example RSCA can be used to separate out high- and low-α-abundance stars in the disk, or to select low-metallicity stars. Additionally, it is likely that because of this property RSCA could be used to search for hidden chemical factors of variation within stellar spectra, although this has not been attempted in this paper.

We found that some of the dimensions of the RSCA representation showed trends with radial velocity. An example of a dimension showing a trend with radial velocity is shown in the last panel of Figure 3, and an investigation into the detailed causes of the radial velocity trends is presented in Appendix B. The existence of such trends indicates that, even after the ASPCAP pseudo-continuum normalization procedure, which shifts spectra to the rest frame, RSCA is still capable, at least weakly, of exploiting radial velocity information in the spectra to recognize stellar siblings. Because only a subset of the dimensions shows such trends, a representation tracing only chemistry can be obtained by keeping only those dimensions that show no trends with radial velocity. In this work, we propose to keep only the first three dimensions of the representation. While this choice might appear particularly stringent, as we will show in the coming sections, these three dimensions contain the bulk of the discriminative power of the representation (see Table 1).

Table 1. Global Doppelganger Rates Obtained by the RSCA Model Applied to Stellar Spectra in Which All but the N Most Strongly Scaled Dimensions of a 30-Dimensional RSCA Representation Are Discarded

N	Doppelganger Rate
1	0.0962 ± 0.0212
2	0.0219 ± 0.0025
3	0.0198 ± 0.0028
4	0.0182 ± 0.0021
5	0.0180 ± 0.0021
6	0.0188 ± 0.0017
7	0.0184 ± 0.0017
30	0.0199 ± 0.0015

Note. As the implementation of the PPCA algorithm used in this paper yielded stochastic PCA components, doppelganger rates from spectra correspond to the mean across 10 runs, with error bars corresponding to the standard deviation among runs.


4.5. Comparison of Using RSCA versus Measured Abundances in Calculating Chemical Likeness

In this section, we compare the effectiveness of measuring chemical similarity using a data-driven approach on the spectra with that achievable using measured stellar abundances. To do so, we compare the doppelganger rates that are obtained by RSCA to those from using stellar-abundance labels. The results of such a comparison are shown in Figure 4. For this figure, doppelganger rates are measured consistently from abundances and the RSCA approach (see figure caption for more detail), such that any differences in performance can be attributed to underlying differences in the information content of the representations. For this comparison, we omit the PCA dimensionality-reduction step when working with abundances, but otherwise apply the exact same algorithmic approach to abundances and to the spectra. We also compare the performance of the RSCA metric-learning approach (shown in red) to that of alternative, simpler representation rescaling approaches applicable for recognizing open clusters from abundances (shown in blue and green). We remind the reader that the doppelganger rates are evaluated for pairs of stars at similar extinctions and radial velocities. This guarantees that the doppelganger rate cannot be artificially reduced through our model exploiting information relating to radial velocity or extinction. Per-cluster doppelganger rates are also provided in Appendix E.


Figure 4. Global doppelganger rates estimated for varying metric-learning approaches and representations. On the x-axis, "spectra" refers to doppelganger rates obtained from spectra X after dimensionality reduction with PPCA to a 30-dimensional space; "all abundances" to doppelganger rates obtained from a representation formed from the full set of APOGEE abundances in Y; and "abundance subset" to doppelganger rates obtained using a representation formed only from the abundances for the following species: Fe, Mg, Ni, Si, Al, C, and N. Global doppelganger rates "on raw" (blue) are obtained by measuring distances in the raw representation without any transformation; "on scaled" (green) are obtained by applying the scaling transform to the raw representation without preliminary application of the sphering and reparametrization transforms (steps 1 and 4 for spectra, and only Step 4 for abundances, which do not need dimensionality reduction); "on transformed" (red) are obtained by applying all steps of the proposed metric-learning approach (steps 1, 2, 3, and 4 for spectra and steps 2, 3, and 4 for abundances). As the implementation of the PPCA algorithm used in this paper yielded stochastic PCA components, doppelganger rates from spectra correspond to the mean across 10 runs, with error bars corresponding to the standard deviation among runs.


From this figure, we see that executing all steps of RSCA is crucial for obtaining low doppelganger rates when working directly with spectra. This is seen from how low doppelganger rates are only obtained with full application of our metric-learning approach (red). On the other hand, when working with stellar abundances, the RSCA approach brings only limited benefits, as seen from the small difference in doppelganger rates between measuring distances in the raw abundance space (blue) and in the transformed space (red) obtained by applying our full metric-learning approach minus the PCA compression. This result is not surprising: it reflects how most steps of the RSCA approach are designed to generate a representation comparable to stellar labels, that is to say, one in which all factors of variation other than chemical factors of variation are removed. The RSCA approach does, however, still bring some benefit when working with abundances, as seen from the slightly lower doppelganger rates of the transformed representation.

This figure also shows that excluding some chemical species improves the doppelganger rate. This is seen from how the doppelganger rate is lower when using a carefully chosen subset of species (right) than when using the full set of abundances (center). This can appear counterintuitive, as it implies that more data leads to worse performance, but in this specific case, where the uncertainties on abundances are not accounted for, it can be justified by the low intrinsic dimensionality of chemical space. Since many species carry essentially the same information, adding species with higher uncertainty into the representation adds noise without contributing additional information beneficial for recognizing open clusters. We expect that such an effect would disappear when accounting for uncertainties on stellar labels, but it is still a good illustration of the brittleness of abundance-based chemical tagging.

The combination of species shown in red is the set of species that was found, after manual investigation, to yield the lowest doppelganger rates. This is the combination of individual stellar element abundance labels Fe, Mg, Ni, Si, Al, C, and N (measured with respect to Fe, with the exception of Fe, which is measured with respect to H). The doppelganger rate from this combination of species, 0.023, despite being the smallest doppelganger rate achieved from stellar labels, is higher than the doppelganger rate obtained from stellar spectra, 0.020 (2%). That our method is able to produce better doppelganger rates from spectra than from stellar labels highlights the existence of information within stellar spectra that is not adequately captured by stellar labels. While stellar labels are derived from synthetic spectra that only approximately replicate observations, our fully data-driven model makes direct use of the spectra, translating into lower doppelganger rates.

4.6. Dimensionality of Chemical Space

That the PCA representation is 30-dimensional does not mean that all 30 dimensions carry information useful for recognizing open clusters. To get a grasp of the dimensionality of the chemical space captured by RSCA, we calculated doppelganger rates for RSCA representations in which only the dimensions with the largest scaling factors are kept (i.e., the other dimensions are excluded from the distance calculations). We calculated doppelganger rates multiple times, each time preserving a different number of dimensions. The results of this investigation are shown in Table 1. From this table, we see that the dimensionality of spectra appears to be, at least to first order, extremely low. The top two dimensions of the RSCA model (as shown in Figure 3) match the performance obtained from stellar labels, while the top four dimensions exceed the performance of the full representation: the four-dimensional representation is even more effective at recognizing chemically identical stars than the full RSCA representation, which itself was more effective than stellar labels.

It is not a new result that the dimensionality of the chemical space probed by APOGEE is low. Recent research suggests that, at the precision captured by APOGEE labels, chemical abundances of disk stars live in a very low-dimensional space. For example, Ness et al. (2019) and Ting & Weinberg (2021) found that [Fe/H] and stellar age, or [Fe/H] and [Mg/Fe], could predict all other elemental abundances to within, or close to, measurement precision (nonetheless, Ting & Weinberg 2021 and Weinberg et al. 2021 argue that correlations of abundance residuals imply underlying intrinsic structure in abundance space, even if the scatter of these residuals is only moderately larger than the per-star observational uncertainties). However, while previous analyses have depended on abundances to show this, here we do so directly from spectra. Because our methodology directly picks up on factors of variation and, if not controlled for, can pick up on factors as weak as diffuse interstellar bands, we can be confident that any remaining chemical factors of variation are either (i) too nonlinear for our linear model to capture, (ii) very weak spectral features, or (iii) not particularly discriminative of open clusters, as would be the case for chemical variations arising from internal stellar processes or from accretion of planetary materials.
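As an illustration of this truncation experiment, the following minimal sketch computes a simplified global doppelganger rate while keeping only the top-k RSCA dimensions. It assumes Z_clust and Z_pop are RSCA representations of cluster and field stars with columns already ordered by decreasing scaling factor, and that cluster_ids labels each row of Z_clust; the threshold-based rate below is a simplified stand-in for the statistic used in the paper, and all names are illustrative.

import numpy as np
from scipy.spatial.distance import cdist, pdist

def doppelganger_rate(Z_clust, cluster_ids, Z_pop, k):
    # Keep only the k dimensions with the largest scaling factors.
    Zc, Zp = Z_clust[:, :k], Z_pop[:, :k]
    # Distances between all pairs of stellar siblings (same cluster).
    intra = np.concatenate([pdist(Zc[cluster_ids == c])
                            for c in np.unique(cluster_ids)
                            if np.sum(cluster_ids == c) > 1])
    threshold = np.median(intra)
    # Fraction of cluster-star/field-star pairs at least as close as
    # the median pair of stellar siblings.
    return np.mean(cdist(Zc, Zp) < threshold)

for k in (1, 2, 4, 8, 30):
    print(k, doppelganger_rate(Z_clust, cluster_ids, Z_pop, k))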

4.7. Impact of Data Set Size

Our method learns to measure chemical similarity directly from open clusters, without reliance on external information. Because of this, its performance is tightly linked to the quality and quantity of data available. Figure 5 attempts to estimate our method's dependency on the size of the open-cluster data set. In this figure, for varying PCA dimensionalities, we plot the expected doppelganger rate for an open-cluster data set containing a given number of open clusters, given by the x-axis. We estimate the expected doppelganger rate for a given number of open clusters by estimating and averaging the doppelganger rates over data subsets containing that number of clusters. From this figure, we see that the larger PCA dimensionalities still benefit from the addition of open clusters, suggesting that performance would likely improve further with access to additional open clusters. We may also expect the addition of stars to existing open clusters to improve the doppelganger rates. Such larger data sets may additionally enable the use of more complex nonlinear metric-learning approaches, which may yield further improvements not captured in this figure.
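A hedged sketch of this subsampling experiment is given below, averaging over random cluster subsets as in Figure 5. It assumes a fit_rsca(X_pop, X_clust, cluster_ids, n_components) routine returning RSCA representations (a NumPy version of such a routine is sketched in Appendix D) together with the doppelganger_rate helper sketched above; both names are illustrative rather than from a released implementation.

import numpy as np

rng = np.random.default_rng(0)

def expected_rate(X_pop, X_clust, cluster_ids, n_clusters,
                  n_components, n_trials=50):
    names = np.unique(cluster_ids)
    rates = []
    for _ in range(n_trials):
        # Draw a random subset of clusters and refit the model on it.
        subset = rng.choice(names, size=n_clusters, replace=False)
        mask = np.isin(cluster_ids, subset)
        Zp, Zc = fit_rsca(X_pop, X_clust[mask], cluster_ids[mask],
                          n_components)
        rates.append(doppelganger_rate(Zc, cluster_ids[mask], Zp,
                                       k=n_components))
    return np.mean(rates)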

Figure 5.

Figure 5. Expected global doppelganger rates when training a metric-learning model on only a subset of all open clusters in Xclust, with the number of clusters given by the x-axis. Results for different PCA dimensionalities used for compressing stellar spectra are represented by different colored lines. Clusters used in the expected doppelganger rate calculations were chosen randomly from Xclust, and quoted results are averages over 50 repeated trials.


5. Discussion

We have presented a novel approach for identifying chemically similar stars from spectroscopy, based on training a metric-learning model on the spectra of open-cluster stars. This approach has several appealing properties. It is end-to-end data driven, in the sense that it relies neither on synthetic models nor on labels derived from synthetic models. The method only makes use of open-cluster members, which can themselves be identified with minimal reliance on theoretically derived quantities (Gao 2014; Castro-Ginard et al. 2018; Agarwal et al. 2021). This makes the method insensitive to the domain gap of synthetic spectra. Additionally, whereas traditional spectral-fitting approaches require instrumental systematics to be fully suppressed, lest they further exacerbate the domain gap, our fully data-driven approach, at least in theory, automatically learns to ignore most variations due to instrumental systematics.

We expect the approach we have developed to perform particularly well in a number of regimes: for example, on low-resolution spectra, where blended features lead to compounding model inaccuracies, and on M-type stars, where molecular features complicate the retrieval process. In general, our method is likely to be efficient and effective wherever theoretical models are inaccurate or the observed spectra themselves are plagued by complex systematics, as is the case for M dwarfs (Behmard et al. 2019; Birky et al. 2020).

Although our data-driven algorithm shows excellent performance, one may wonder whether there is still room for further gains, particularly since strong chemical tagging, if ever possible, will require improved chemical similarity measurements (Ting & Weinberg 2021). There are reasons to be hopeful here.

First, our data-driven model comes with clear limitations that are not inherent to the approach but rather imposed by our modeling choices. There are many algorithmic choices for building a metric-learning model optimized to distinguish open clusters (e.g., Shental et al. 2002; Goldberger et al. 2004; Weinberger & Saul 2009; Murphy 2021), and ours is only one of them. Our algorithm, RSCA, differentiates itself from other algorithms in that it returns a representation that is a linear transformation of the spectra and in which the basis vectors are ordered. The ordered basis helps with interpretability and with constraining the chemical dimensionality of stellar spectra. However, the ability of RSCA to extract the chemical content of stellar spectra is constrained by its linear nature. Although this linearity is convenient for avoiding overfitting, enabling better out-of-distribution performance, and facilitating cross-validation, it also artificially limits the precision with which "pure" features containing only chemical signals can be learned. One can be hopeful that a suitably regularized nonlinear metric-learning model, such as a twin neural network (Chopra et al. 2005; Murphy 2021), could surpass our model. However, building such a model from our limited number of open clusters would present its own unique challenges.
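To make the nonlinear alternative concrete, the following is a minimal PyTorch sketch of a twin network trained with a contrastive loss on pairs of spectra labeled by whether they share a birth cluster. This is not the method of this paper, only an illustration of the family of models referred to above; the architecture, embedding dimension, and margin are arbitrary, and with only 185 cluster stars such a model would need far heavier regularization than shown here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwinEmbedder(nn.Module):
    # Shared encoder mapping a spectrum to a low-dimensional embedding;
    # chemical similarity is then measured as embedding distance.
    def __init__(self, n_pixels, n_dim=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_pixels, 256), nn.ReLU(),
                                 nn.Linear(256, n_dim))

    def forward(self, x):
        return self.net(x)

def contrastive_loss(z1, z2, same_cluster, margin=1.0):
    # same_cluster is a float tensor of 0/1 pair labels. Siblings are
    # pulled together; non-siblings are pushed apart up to the margin.
    d = F.pairwise_distance(z1, z2)
    pos = same_cluster * d.pow(2)
    neg = (1.0 - same_cluster) * F.relu(margin - d).pow(2)
    return (pos + neg).mean()

Both spectra in a pair pass through the same encoder with shared weights, which is what makes the model "twin."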

Second, because the approach is entirely data driven, its performance is inexorably linked to the quality and quantity of data available. This makes it poised to benefit from new open-cluster discoveries and/or deliberately targeted observations. Improvements may also be possible by leveraging other sources of chemically similar stars, such as wide binaries.

Our method also comes with caveats. Data-driven methods do not extrapolate well outside of their training data set, so performance may be lower for clusters that are atypical compared to those in the open-cluster reference set. Since open clusters are typically composed of younger stars (Portegies Zwart et al. 2010), performance may be reduced on older cluster stars. However, given the tight relationships between RSCA dimensions and chemical parameters for stars in Xpop, such an effect is likely to be small. Additionally, our model makes no use of the error information in spectra, which is valuable information that could likely be exploited for even better performance.

Another downside of our approach is its coarse-grained nature. While stellar labels provide a fine-grained view of chemical similarity, with a breakdown into the composition of individual species, our approach only provides an overall measurement of chemical similarity. This limits the types of scientific problems the approach can be used to answer. It may, however, be possible to extend the method from measuring overall chemical similarity to measuring individual elemental abundances, for example by applying it to windows centered on the locations of stellar lines instead of to the full spectrum. Also, it is not always clear exactly what information is captured by the representation, and in particular there is always a risk, despite all of our checks, that the model is acting on nonchemical information within the spectra.

6. Conclusion

Large-scale Galactic surveys, the likes of APOGEE, the Large Sky Area Multi-Object Fibre Spectroscopic Telescope, and GALAH, have collected hundreds of thousands of high-quality stellar spectra across the Galaxy. These surveys are vastly broadening our understanding of the Milky Way. How best to analyze these spectra, however, remains an open question. One limitation is that traditional spectral-fitting methods currently do not make full use of the information in stellar spectra, largely because our stellar models are approximations.

In this paper, we developed a fully data-driven, linear metric-learning algorithm that operates on spectra to extract the chemical information shared within stellar families. Through experiments on APOGEE, we demonstrated that our metric-learning model identifies stars within open clusters more precisely than stellar labels do, indicating an improved ability to discriminate between chemically identical stars. We further found that our model's capacity to distinguish open clusters can largely be attributed to a two-dimensional subspace of our final representation, which approximately coincides with metallicity and α-elemental abundances. That our model's capacity at recognizing open clusters plateaus at N ∼ 4 supports the idea that the dimensionality of the chemical space probed by APOGEE is, for Galactic archeology purposes, low in the disk, even at the (quite high) signal-to-noise ratios typical of APOGEE spectra. However, we do find hints of further dimensions potentially containing chemical information.

There are several reasons why our metric-learning approach could be favored over stellar labels. It can be applied to spectra of stars for which we do not yet have good synthetic spectra and that we would otherwise not be able to analyze well. It is completely independent of our theoretical knowledge of stellar atmospheres and so could be used to validate existing astronomical results in a way that is independent of any biases in our synthetic spectra. Finally, and perhaps most importantly, whereas the traditional derivation of stellar labels is fundamentally limited by our inability to generate faithful synthetic spectra, our metric-learning approach suffers from no such limitation: by improving the quality of the training data set and the metric-learning approach used, performance may be improved further.

D.D.M. is supported by the STFC UCL Centre for Doctoral Training in Data Intensive Science (grant No. ST/P006736/1). D.D.M. thanks Serena Viti for helpful discussions.

M.K.N. is supported in part by a Sloan Foundation Fellowship.

Funding for the Sloan Digital Sky Survey IV has been provided by the Alfred P. Sloan Foundation, the U.S. Department of Energy Office of Science, and the Participating Institutions.

SDSS-IV acknowledges support and resources from the Center for High Performance Computing at the University of Utah. The SDSS website is www.sdss.org.

SDSS-IV is managed by the Astrophysical Research Consortium for the Participating Institutions of the SDSS Collaboration including the Brazilian Participation Group, the Carnegie Institution for Science, Carnegie Mellon University, Center for Astrophysics ∣ Harvard & Smithsonian, the Chilean Participation Group, the French Participation Group, Instituto de Astrofísica de Canarias, The Johns Hopkins University, Kavli Institute for the Physics and Mathematics of the Universe (IPMU)/University of Tokyo, the Korean Participation Group, Lawrence Berkeley National Laboratory, Leibniz Institut für Astrophysik Potsdam (AIP), Max-Planck-Institut für Astronomie (MPIA Heidelberg), Max-Planck-Institut für Astrophysik (MPA Garching), Max-Planck-Institut für Extraterrestrische Physik (MPE), National Astronomical Observatories of China, New Mexico State University, New York University, University of Notre Dame, Observatário Nacional / MCTI, The Ohio State University, Pennsylvania State University, Shanghai Astronomical Observatory, United Kingdom Participation Group, Universidad Nacional Autónoma de México, University of Arizona, University of Colorado Boulder, University of Oxford, University of Portsmouth, University of Utah, University of Virginia, University of Washington, University of Wisconsin, Vanderbilt University, and Yale University.

Appendix A: Interstellar Masking

Regions containing interstellar absorption features are identified from the APOGEE data using a data-driven procedure as described below.

The method makes use of two data sets: one containing spectra of stars at low extinction, Xlow, which should not contain interstellar features, and one of high-extinction stars, Xhigh, which should contain strong interstellar features. Xlow is formed through a cut on Xpop in which only stars with AK_TARG < 0.005 are kept; Xhigh preserves only stars with AK_TARG > 0.5.

We apply PCA with 30 principal components to the data set of low-extinction stars, Xlow, to obtain a PCA basis capturing the natural variations among stellar spectra at low extinction. Since this basis only captures variations among low-extinction stars, features in high-extinction stars associated with the interstellar medium will be poorly reconstructed by the projection onto this low-extinction PCA hyperplane.

In Figure 6, we plot the mean residual per wavelength between each stellar spectrum and its projection onto the low-extinction PCA hyperplane, averaged over all stars in the high-extinction data set, Xhigh. High-residual regions correspond to regions poorly captured by the low-extinction PCA hyperplane. Comparing these regions with the locations of known diffuse interstellar bands reveals excellent agreement (Elyajouri et al. 2016, 2017). For this paper, the regions to be censored were selected manually, with the final choice of regions overlaid in yellow in Figure 6.
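A minimal sketch of this diagnostic follows, assuming X_pop is an array of continuum-normalized spectra and ak_targ the matching array of AK_TARG extinction values (both names illustrative); the percentile threshold at the end is arbitrary, since the paper's final mask was chosen by hand from the residual plot. The same recipe, with VHELIO_AVG cuts in place of the extinction cuts, gives the radial velocity diagnostic of Appendix B.

import numpy as np
from sklearn.decomposition import PCA

X_low = X_pop[ak_targ < 0.005]   # spectra that should lack interstellar features
X_high = X_pop[ak_targ > 0.5]    # spectra with strong interstellar features

# Basis of natural spectral variations at low extinction.
pca = PCA(n_components=30).fit(X_low)

# Reconstruct high-extinction spectra from the low-extinction hyperplane;
# wavelengths dominated by interstellar features reconstruct poorly.
residual = np.abs(X_high - pca.inverse_transform(pca.transform(X_high)))
mean_residual = residual.mean(axis=0)

# Candidate wavelength bins to censor (the final mask is chosen manually).
candidate = mean_residual > np.percentile(mean_residual, 99)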

Figure 6.

Figure 6. Mean residual per wavelength between each stellar spectrum and its projection onto the low-extinction PCA hyperplane, averaged over all stars in the high-extinction data set, Xhigh. High-residual spectral bins correspond to wavelengths where interstellar extinction strongly affects stellar spectra. Regions highlighted in yellow are those chosen to be censored to suppress interstellar features from the spectra.


Appendix B: Visualizing Radial Velocity Instrumental Systematics

We apply to radial velocities an approach similar to that taken for extinction in Appendix A. That is to say, we create a data set of low-radial-velocity stars by selecting only stars with ∣VHELIO_AVG∣ < 5 km s−1. We train a PCA model on the low-velocity spectra and visualize its residuals on a data set of high-velocity spectra with VHELIO_AVG > 80 km s−1. As in the previous section, we use a 30-dimensional PCA model.

In Figure 7, we plot the mean absolute residual per wavelength between each stellar spectrum and its projection onto the low-velocity PCA hyperplane, averaged over all stars in the high-radial-velocity data set. High-residual regions correspond to regions poorly captured by the low-velocity PCA hyperplane. In this plot, in addition to the diffuse interstellar bands, there appear to be other regions with weak instrumental systematics correlated with radial velocity.

Figure 7.

Figure 7. Mean residual per wavelength between each stellar spectrum and its projection onto the low-radial-velocity PCA hyperplane, averaged over all stars in the high-radial-velocity data set. High-residual spectral bins correspond to spectral regions with a strong dependence on radial velocity.


It is worth mentioning that a variant of RSCA can be used to remove radial velocity imprints from the spectra. By applying the RSCA algorithm to groups of stars sharing the same radial velocity, instead of to open clusters, one can identify a hyperplane of the stellar spectral space capturing solely spectral features correlated with radial velocity. Subtracting variations within this hyperplane from stellar spectra then yields spectra in which features correlated with radial velocity are selectively suppressed. Here, we do not apply such a preprocessing procedure, as it complicates the analysis without improving over the simpler procedure of keeping only the first three dimensions.
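As a sketch of this variant, suppose V is an orthonormal basis (columns) spanning the radial-velocity-correlated hyperplane identified by running RSCA on same-velocity stellar groups; the suppression step then reduces to an orthogonal projection. The function name and array shapes are illustrative.

import numpy as np

def project_out(X, V):
    # X: (n_stars, n_pixels) spectra; V: (n_pixels, m) orthonormal basis
    # of radial-velocity-correlated directions. Removing each spectrum's
    # component within span(V) suppresses velocity-correlated features.
    return X - (X @ V) @ V.T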

Appendix C: Checking for Instrumental Systematics

It is worthwhile to ascertain that our model, when identifying open clusters, relies only on chemical features within the spectra and not on instrumental systematics that happen to be predictive of open clusters but would not transfer to identifying dissolved clusters. This is especially important because any dependency on instrumental systematics would lead to overly optimistic doppelganger rates.

Because the stars in open clusters are gravitationally bound, they often lie within the same telescope field of view and so are observed simultaneously on nearby fibers of a plate. Being observed together could plausibly introduce shared systematics (due to instrumental imperfections or telluric residuals), which could then be exploited by the metric-learning model when identifying open clusters. Here, we run an experiment to ascertain that this is not an issue for our model.

Since any such shared instrumental systematic will only affect stars observed simultaneously, we can verify that our model is not exploiting shared systematics by comparing the similarity distributions of stellar siblings that were observed together on the same plate with those of stellar siblings that were not. The idea is that a model exploiting instrumental systematics would have a lower doppelganger rate on pairs of stars observed together than on pairs observed separately.

To separate stellar siblings into pairs observed together and pairs observed separately, we went through all pairs of stellar siblings in the open-cluster data set. Using the individual observation dates of the exposures that make up each combined spectrum, as provided by the VISITS field of the allStar file, we categorized pairs of siblings into two groups: pairs of stars observed together, i.e., with the same visit dates, and pairs of stars observed separately. In this analysis, we discarded the small fraction of stellar pairs whose visit dates only partially overlapped.
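A minimal sketch of this categorization, assuming visit_dates is a list with one set of exposure dates per star and that all listed stars are siblings from the same cluster; the parsing of the VISITS field into date sets, and the names used here, are assumptions.

def categorize_pairs(visit_dates):
    together, separate = [], []
    for i in range(len(visit_dates)):
        for j in range(i + 1, len(visit_dates)):
            a, b = visit_dates[i], visit_dates[j]
            if a == b:
                together.append((i, j))    # identical visit dates
            elif a.isdisjoint(b):
                separate.append((i, j))    # no shared visits
            # pairs with partially overlapping visit dates are
            # discarded, as in the text
    return together, separate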

In Figure 8, we show the distributions of chemical similarities for pairs of open-cluster stars observed simultaneously and for pairs observed separately. On the left, we show our metric-learning approach applied to spectra (with a 30-dimensional latent space); on the right, applied to abundances. For both metric-learning models, stars observed together are predicted to be slightly more chemically similar than stars observed separately. Since the approaches applied to spectra and to abundances show similar behavior with visit overlap, we conclude that our method is not making strong use of instrumental systematics when recognizing open clusters. It is nonetheless interesting that both approaches marginally favor stars observed together as being more chemically similar, although given the small number of open clusters this may be due to small sample sizes.

Figure 8.

Figure 8. Investigation into the metric-learning model's dependency on instrumental systematics. "From masked spectra" refers to distances derived from a metric-learning model applied to masked stellar spectra. "Stellar abundances" refers to distances derived from a core set of abundances (see Section 4 for full details).


Appendix D: RSCA Pseudo-code

The pseudo-code for the RSCA algorithm.

Algorithm 1. RSCA Algorithm

Data: $X_{\mathrm{clust}}, X_{\mathrm{pop}}$
compressor = PPCA(data = $X_{\mathrm{pop}}$, n_components = $N_K$);   // Step 1
$Z_{\mathrm{pop}}$ = compressor.transform($X_{\mathrm{pop}}$);
$Z_{\mathrm{clust}}$ = compressor.transform($X_{\mathrm{clust}}$);
spherer = sphere(data = $Z_{\mathrm{pop}}$);   // Step 2
$Z_{\mathrm{pop}}$ = spherer.transform($Z_{\mathrm{pop}}$);
$Z_{\mathrm{clust}}$ = spherer.transform($Z_{\mathrm{clust}}$);
$Z_{\mathrm{intra\hbox{-}cluster}}$ = ZeroCenterClusters($Z_{\mathrm{clust}}$);   // Step 3
reparametrizer = PCA(data = $Z_{\mathrm{intra\hbox{-}cluster}}$, n_components = $N_K$);
$Z_{\mathrm{pop}}$ = reparametrizer.transform($Z_{\mathrm{pop}}$);
$Z_{\mathrm{clust}}$ = reparametrizer.transform($Z_{\mathrm{clust}}$);
for $i = 1$ to $N_K$ do   // Step 4
  $\sigma^2_{\mathrm{clust},i} = \dfrac{\sum_{j=1}^{k}(n_j - 1)\,\sigma^2_{ji}}{\sum_{j=1}^{k}(n_j - 1)}$;   // pooled intracluster variance over the k clusters
  $\sigma^2_{\mathrm{pop},i} = 1$;   // because of sphering
  $\sigma_{\mathrm{r},i} = \dfrac{\sigma_{\mathrm{clust},i}\,\sigma_{\mathrm{pop},i}}{\sqrt{\sigma^2_{\mathrm{pop},i} - \sigma^2_{\mathrm{clust},i}}}$;
  $Z_{\mathrm{pop},i} = Z_{\mathrm{pop},i} / \sigma_{\mathrm{r},i}$;
  $Z_{\mathrm{clust},i} = Z_{\mathrm{clust},i} / \sigma_{\mathrm{r},i}$
end
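For readers who prefer code to pseudo-code, the following is a minimal NumPy/scikit-learn sketch of the four steps, with ordinary PCA standing in for PPCA and with all names illustrative; it is a sketch of the algorithm above, not the released implementation.

import numpy as np
from sklearn.decomposition import PCA

def fit_rsca(X_pop, X_clust, cluster_ids, n_components=30):
    # Step 1: compress spectra onto a PCA basis fit to the field
    # population (the paper uses PPCA; plain PCA is used here for
    # simplicity).
    compressor = PCA(n_components=n_components).fit(X_pop)
    Z_pop = compressor.transform(X_pop)
    Z_clust = compressor.transform(X_clust)

    # Step 2: sphere the representation so the field population has
    # zero mean and identity covariance.
    mean = Z_pop.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(Z_pop, rowvar=False))
    W = evecs / np.sqrt(evals)          # whitening matrix
    Z_pop = (Z_pop - mean) @ W
    Z_clust = (Z_clust - mean) @ W

    # Step 3: reparametrize along the principal directions of the
    # pooled intracluster scatter (clusters zero-centered first).
    Z_intra = np.vstack([Z_clust[cluster_ids == c]
                         - Z_clust[cluster_ids == c].mean(axis=0)
                         for c in np.unique(cluster_ids)])
    reparam = PCA(n_components=n_components).fit(Z_intra)
    Z_pop = reparam.transform(Z_pop)
    Z_clust = reparam.transform(Z_clust)

    # Step 4: rescale each axis; the population variance is 1 along
    # every axis because the rotation in step 3 preserves sphering.
    num = np.zeros(n_components)
    den = 0.0
    for c in np.unique(cluster_ids):
        Zc = Z_clust[cluster_ids == c]
        if len(Zc) > 1:
            num += (len(Zc) - 1) * Zc.var(axis=0, ddof=1)
            den += len(Zc) - 1
    sigma2_clust = num / den
    # Clip guards against axes where the intracluster variance is not
    # smaller than the population variance (no discriminative power).
    sigma_r = np.sqrt(sigma2_clust) / np.sqrt(
        np.clip(1.0 - sigma2_clust, 1e-12, None))
    return Z_pop / sigma_r, Z_clust / sigma_r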


Appendix E: Per-cluster Doppelganger Rates

Figures 9 and 10 show the doppelganger rates for all open clusters in Xclust.

Figure 9.

Figure 9. Histograms of the chemical similarity between open-cluster pairs of stars as predicted by the metric-learning approach. For each open cluster, the distribution of intercluster similarities, calculated between pairs composed of one random cluster member and one random field star, is shown in green, and the distribution of intracluster similarities, between pairs of stellar siblings, is shown in blue. The median intracluster similarity, as used in the doppelganger rate calculations, is marked by dashed vertical lines. The left-most panels display the histograms derived from applying the metric-learning approach to stellar spectra; the right-most panels display the histograms derived from applying it to the "abundance subset" as defined and described in Section 4.5. Doppelganger rates for individual clusters are shown in the top-left corner of each panel.

Figure 10.

Figure 10. Continuation of Figure 9.

