Taxonomy matching between asteroids and meteorites: supervised clustering approach

We propose a data-driven method for investigating taxonomic relations between asteroids and meteorites. The method clusters reflectance spectra of asteroids under the guidance of the relatively reliable taxonomy of meteorites. By comparing the clustering accuracy for asteroids with and without this guidance, we observe that the guidance of the meteorite taxonomy improves the accuracy. This serves as evidence that there is a common taxonomical structure between meteorites and asteroids, confirming a long-standing prediction.


Introduction
Linking meteorites to asteroidal bodies is an important step toward a better understanding of the Solar System. This is, however, not easy for various reasons, such as space weathering. Some strong links between specific types are known so far. A link between V-type asteroids and Howardite-Eucrite-Diogenite (HED) meteorites was confirmed by the Dawn mission [2]. The link between S-type asteroids and ordinary chondrite meteorites was verified by the analysis of the sample returned from Itokawa by the Hayabusa mission [5]. A less certain link is known between C-type asteroids and carbonaceous chondrites. Other, more uncertain links have also been discussed [3]. Several works take data-driven approaches to investigating these links. Britt et al. [1], Tholen et al. [9], and Watabiki et al. [10], for example, used principal component analysis to provide a map that gives an overview of the similarity in reflectance spectra between meteorites and asteroids.
This work proposes a data-driven method for linking the taxonomies of meteorites and asteroids. The proposed method classifies asteroid data with the guidance of the known taxa of meteorites [11], and matches the taxa between the two domains of asteroids and meteorites. In this clustering and matching procedure, the reflectance spectra of asteroids are clustered so that the similarity between the cluster structures of the asteroids and the meteorites is maximized. To represent the cluster structures, a nonlinear method, the so-called kernel method [7], is used to quantify the mutual relations among clusters. As teaching data to guide the clustering of the asteroid data, the reflectance spectra and the element composition data of meteorites are used. We call this analysis procedure supervised clustering.
As a first step, we confine our analysis to three types in each domain: C-, S-, and V-type for asteroids, and carbonaceous chondrites, ordinary chondrites, and HED for meteorites. Reflectance spectra are used for the asteroids, while for the meteorites the reflectance spectra and the element compositions are each used in turn as teaching data.

Method for supervised clustering

Data sets
The asteroid data set contains 365 reflectance spectra covering wavelengths from 0.45 to 2.45 µm, divided into 401 bins of equal size. The data set also includes labels for the taxonomical types, including C-, S-, and V-type, which are given by domain experts. These labels are not used for clustering but only for evaluation. As pre-processing, DeMeo's method [4] is applied. The numbers of C-, S-, and V-type spectra are 19, 95, and 8, respectively, 122 in total.
For meteorites, the data set of reflectance spectra contains 731 spectra, of which 221 carbonaceous chondrites, 245 ordinary chondrites, and 108 HED meteorites, 574 in total, are used for the analysis. The preprocessing is the same as for the asteroid data. The element composition data set of meteorites consists of 481 samples: 30 carbonaceous chondrites, 388 ordinary chondrites, and 63 HED meteorites. The data set includes 12 elements: Fe, Si, Al, Ca, Mg, Na, P, K, Mn, Ni, Cr, and Na$_2$O.

Supervised clustering
Our method for supervised clustering consists of two steps. In the first step, as initialization, spectral clustering [6], a standard non-hierarchical clustering method, is applied to the asteroid data. In the second step, the meteorite data with taxonomic information are used to guide the clustering procedure for the asteroid data. We also refer to the standard clustering of the first step as unsupervised clustering, in contrast with supervised clustering.
We first review spectral clustering, since it serves as the basis of our supervised clustering. Spectral clustering operates on a similarity matrix, each element of which is the similarity between two data points $x$ and $y$. In this paper, the similarity is evaluated by the Gaussian kernel function $\exp(-\|x - y\|^2 / 2\sigma^2)$ with bandwidth parameter $\sigma$. For the asteroid data, an $N_A \times N_A$ data-similarity-matrix $K_A$ is used, which contains the similarity between each pair of spectra, where $N_A$ is the size of the data set. The algorithm computes eigenvectors of the normalized graph Laplacian, defined by
$$L_A = I - T_A^{-1/2} K_A T_A^{-1/2}, \qquad (1)$$
where $T_A$ is the diagonal matrix with $T_{A,ii} = \sum_j K_{A,ij}$, representing the degrees of the nodes in the graph represented by $K_A$. We used the algorithm proposed by Ng et al. [6], which further applies K-means clustering after projecting the data onto the eigenspaces corresponding to the smallest eigenvalues.
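The unsupervised step reviewed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the rows of the eigenvector matrix are normalized to unit length as in the algorithm of Ng et al., and a small K-means routine with a deterministic farthest-point initialization is used only to keep the example self-contained.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """Pairwise similarities exp(-||x - y||^2 / (2 sigma^2)) for the rows of X."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives caused by round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def spectral_embedding(K, n_clusters):
    """Eigenvectors of I - T^{-1/2} K T^{-1/2} for the smallest eigenvalues."""
    degrees = K.sum(axis=1)  # diagonal of T
    inv_sqrt = 1.0 / np.sqrt(degrees)
    laplacian = np.eye(len(K)) - inv_sqrt[:, None] * K * inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues come in ascending order
    U = eigvecs[:, :n_clusters]
    # Row normalization (part of the Ng et al. algorithm).
    return U / np.linalg.norm(U, axis=1, keepdims=True)

def kmeans(Z, n_clusters, n_iter=100):
    """Plain Lloyd iterations; farthest-point initialization is an assumption."""
    centers = [Z[0]]
    for _ in range(n_clusters - 1):
        dist = np.min([np.sum((Z - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            Z[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def spectral_clustering(X, n_clusters, sigma):
    """Cluster the rows of X (e.g. one reflectance spectrum per row)."""
    K = gaussian_kernel_matrix(X, sigma)
    return kmeans(spectral_embedding(K, n_clusters), n_clusters)
```

For the asteroid data, `X` would hold one pre-processed spectrum per row and `n_clusters` would be 3; the bandwidth `sigma` plays the role of $\sigma_A$.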
The proposed method of supervised clustering is also based on similarity matrices. As an overview, the method first constructs a data-similarity-matrix for each of the asteroid and meteorite data sets, and then computes a cluster-similarity-matrix for each domain to represent its cluster structure. A similarity measure between these two cluster-similarity-matrices is introduced for evaluating the similarity of the taxonomies, serving as an objective function for clustering the asteroid data and matching the clusters of the two domains.
In more detail, consider the asteroid data. Given the data-similarity-matrix $K_A$ computed in the same way as for spectral clustering, the cluster-similarity-matrix is calculated by
$$S_A = D_A^{-1} W_A^{\top} K_A W_A D_A^{-1}, \qquad (2)$$
where $W_A$ is the $N_A \times C$ cluster assignment matrix, with the number of clusters $C$ common to the asteroid and meteorite domains, and $D_A$ is the normalizer. More specifically, $W_{A,ia} = 1$ if and only if the $i$-th data point belongs to the $a$-th cluster, with the other components 0, and the $C \times C$ matrix $D_A$ is given by
$$D_A = \mathrm{diag}(N_A^1, \ldots, N_A^C), \qquad (3)$$
where $N_A^a$ is the number of data points in the $a$-th cluster $C_A^a$ of the asteroid data set. The $(a, b)$-component of the cluster-similarity-matrix is then equal to
$$S_{A,ab} = \frac{1}{N_A^a N_A^b} \sum_{i \in C_A^a} \sum_{j \in C_A^b} K_{A,ij},$$
which represents the average similarity between the data points in the two clusters. The cluster-similarity-matrix $S_M$ for the meteorite data set is obtained in a similar way. Note that different data formats, dimensionalities, and data sizes can be handled by the proposed similarity representations; the meteorite and asteroid data sets can have different data formats, such as reflectance spectra and element compositions, as demonstrated in the clustering results.
The objective function of our supervised clustering, Eq. (4), is a similarity measure between $S_A$ and $S_M$ that is maximized with respect to $W_A$. In it, $H$ is the centering matrix defined by $H_{ij} = \delta_{ij} - 1/C$, and $\|M\|_F$ denotes the standard Frobenius (Euclidean) norm of a matrix $M$. To optimize $W_A$, a greedy search is used in a similar way to CLUHSIC [8]. Before the search, $W_A$ is initialized by spectral clustering, $W_M$ is obtained from the labels given by experts, and the cluster order is also optimized so that Eq. (4) is maximized.
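The cluster-similarity-matrix and the greedy refinement can be sketched as follows. The first function follows the definition in the text, where each entry is the average similarity between two clusters. Everything else is an assumption made for illustration: the exact form of Eq. (4) is not available here, so `centered_alignment` uses a normalized Frobenius inner product of the centered matrices as a stand-in objective, and `greedy_refine` is a simple one-point-at-a-time reassignment that may differ from the search actually used. All function names are hypothetical, and the optimization of the cluster order is omitted.

```python
import numpy as np

def cluster_similarity(K, labels, n_clusters):
    """Cluster-similarity-matrix: mean pairwise similarity between clusters."""
    W = np.zeros((len(labels), n_clusters))
    W[np.arange(len(labels)), labels] = 1.0  # W[i, a] = 1 iff point i is in cluster a
    sizes = W.sum(axis=0)                    # cluster sizes, the diagonal of D
    return (W.T @ K @ W) / np.outer(sizes, sizes)

def centered_alignment(S_a, S_m):
    """ASSUMED objective (stand-in for Eq. (4)): alignment of H S_a H and H S_m H."""
    C = len(S_a)
    H = np.eye(C) - np.full((C, C), 1.0 / C)  # centering matrix H_ij = delta_ij - 1/C
    A, M = H @ S_a @ H, H @ S_m @ H
    denom = np.linalg.norm(A) * np.linalg.norm(M)  # Frobenius norms
    return float(np.sum(A * M) / denom) if denom > 0 else 0.0

def greedy_refine(K_a, labels, S_m, n_clusters, max_sweeps=20):
    """ASSUMED search: move one point at a time whenever the objective improves."""
    labels = np.array(labels, copy=True)
    best = centered_alignment(cluster_similarity(K_a, labels, n_clusters), S_m)
    for _ in range(max_sweeps):
        improved = False
        for i in range(len(labels)):
            current = labels[i]
            if np.sum(labels == current) == 1:
                continue  # never empty a cluster
            for c in range(n_clusters):
                if c == current:
                    continue
                labels[i] = c
                score = centered_alignment(
                    cluster_similarity(K_a, labels, n_clusters), S_m)
                if score > best + 1e-12:
                    best, current, improved = score, c, True
            labels[i] = current  # keep the best assignment found for point i
        if not improved:
            break
    return labels, best
```

Because only $C \times C$ matrices are compared, `S_m` can be built from meteorite data of a different format and size, for example element compositions, using its own kernel matrix and the expert labels.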

Clustering results
For evaluation, the cluster purity and the cluster matching accuracy are computed based on the type information labeled by experts for the asteroid data. The cluster purity of a clustering is the fraction of data points whose type agrees with the dominant type of their cluster. The cluster matching accuracy is the fraction of data points assigned to the cluster that corresponds to the correct meteorite type; this is similar to classification accuracy. In the following, results are reported for the best bandwidth parameters $\sigma_A$ (asteroid) and $\sigma_M$ (meteorite); experimental results not shown here indicate that similar results are obtained over a wide range of these parameters. Figure 1 shows the supervised clustering of the asteroid data set with guidance of the meteorite reflectance spectra ($\sigma_A = 3.0$, $\sigma_M = 1.0$) (left), together with the teaching data of meteorites (right) for reference. Both are shown after projection onto the three-dimensional space given by principal component analysis. The cluster purity is 99.1% at best ($\sigma_A = 2.5$, $\sigma_M = 1.0$); this is a significant improvement over the cluster purity of the unsupervised clustering (spectral clustering), which is 92.8% at best ($\sigma_A = 1.5$).
As for the cluster matching accuracy, the supervised clustering gives 99.0%.
Next, we show supervised clustering results with guidance of the element composition data set for meteorites. Figure 2 is made in a similar manner to Figure 1. The cluster purity is 98.8% ($\sigma_A = 2.5$, $\sigma_M = 1.5$), which is again a significant improvement over the unsupervised clustering. The cluster matching accuracy is 98.9% ($\sigma_A = 2.5$, $\sigma_M = 2.0$) at best. It is important to note that the proposed supervised clustering works for teaching data of different types: reflectance spectra and element compositions.
These analyses show that taxonomic guidance from meteorite data does improve the clustering of asteroids, implying that a common cluster structure exists between the meteorite and asteroid domains.
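The two evaluation measures used in this section can be computed as in the following sketch. The function names and the toy labels are hypothetical; `cluster_to_type` stands for the asteroid type implied by the meteorite taxon each cluster was matched to (for instance, a cluster matched to ordinary chondrites counts as S-type), which is an assumption about how the matching is encoded.

```python
from collections import Counter

def cluster_purity(cluster_ids, true_types):
    """Fraction of points whose type is the dominant type of their cluster."""
    counts_by_cluster = {}
    for c, t in zip(cluster_ids, true_types):
        counts_by_cluster.setdefault(c, Counter())[t] += 1
    dominant_total = sum(max(cnt.values()) for cnt in counts_by_cluster.values())
    return dominant_total / len(true_types)

def matching_accuracy(cluster_ids, true_types, cluster_to_type):
    """Fraction of points whose cluster is matched to their expert-assigned type."""
    hits = sum(cluster_to_type[c] == t for c, t in zip(cluster_ids, true_types))
    return hits / len(true_types)

# Toy example with six asteroids (labels are made up for illustration).
clusters = [0, 0, 0, 1, 1, 2]
types = ["S", "S", "C", "C", "C", "V"]
purity = cluster_purity(clusters, types)
good_match = matching_accuracy(clusters, types, {0: "S", 1: "C", 2: "V"})
bad_match = matching_accuracy(clusters, types, {0: "C", 1: "S", 2: "V"})
```

Purity depends only on how the data are grouped, whereas the matching accuracy also depends on which meteorite type each cluster is paired with: in the toy example the same clustering scores 5/6 under the first pairing and 2/6 under the second.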

Conclusion
We have studied a common taxonomical structure in asteroids and meteorites using a data-driven approach. We have proposed supervised clustering to approach this problem: it investigates whether the taxonomic information of meteorites improves the clustering of asteroid data. The numerical studies have demonstrated that the taxonomic information of meteorites does improve the accuracy of the clustering of asteroids. This serves as evidence that there is a common taxonomical structure between these two domains.

Figure 1. Clustering of asteroids with guidance of meteorite reflectance spectra. Three-dimensional PCA plots of the clustering results for the asteroid data (left) and of the teaching data of meteorites (right).