Classification of High-resolution Solar Hα Spectra Using t-distributed Stochastic Neighbor Embedding

Meetu Verma; Gal Matijevič; Carsten Denker; Andrea Diercke; Ekaterina Dineva; Horst Balthasar; Robert Kamlah; Ioannis Kontogiannis; Christoph Kuckein; Partha S. Pal

doi:10.3847/1538-4357/abcd95

1. Introduction

The volume and complexity of data are increasing in solar physics with the advent of new observing facilities and instrumentation. A typical three-dimensional data cube contains one wavelength and two spatial dimensions. However, photon-efficient instruments and fast cameras facilitate recording time series with high cadence, adding time as a fourth dimension. Furthermore, whenever magnetic fields are observed, the polarization state of the light constitutes another dimension. Various linear and nonlinear techniques have been proposed for dimensionality reduction with the aim of preserving the structure of the data. Stochastic neighbor embedding (SNE; Hinton & Roweis 2002) is one of these techniques, which performs well on artificial data but struggles to visualize real-world, high-dimensional data. Therefore, van der Maaten & Hinton (2008) proposed t-distributed stochastic neighbor embedding (t-SNE), which transforms a high-dimensional data set into a matrix of pairwise similarities. Thus, t-SNE captures the local structure and at the same time reveals the global structure of the data set. The mathematical details of the technique (see Section 3) were elaborated in van der Maaten & Hinton (2008), and implementations of the code in various programming languages are publicly available.⁴

t-SNE is a very powerful tool for the initial selection of data, thus reducing their dimensionality. Matijevič et al. (2017) employed t-SNE in three stages to detect metal-poor stars in the complete survey of the radial velocity experiment (RAVE; Steinmetz et al. 2006). They used t-SNE to create a low-dimensional projection of the spectrum space and to segregate the metal-poor star region. The final projection of spectra enabled them to detect classes of stars with similar atmospheric parameters, in particular clusters of metal-poor stars, which were used in the subsequent in-depth spectral analysis.

Traven et al. (2017) used t-SNE to recognize different spectral features and to classify stellar spectra in the Galactic Archaeology with Hermes (GALAH; De Silva et al. 2015) survey. About 77,500 spectra were selected out of 210,000 GALAH spectra by iteratively applying t-SNE and an algorithm for density-based spatial clustering (DBSCAN; Ester et al. 1996). The classification procedure arrived at six classes of stellar spectra. Further work on the GALAH survey was carried out by Kos et al. (2018), who used t-SNE as a tool for chemical tagging and as a way to visualize data in high-dimensional space. From about 187,000 stars with 13 chemical abundances, the t-SNE algorithm retrieved nine clusters in the chemical abundance space. The robustness of the t-SNE algorithm was further demonstrated by Anders et al. (2018), who visualized chemical abundances of stars contained in the HARPS-GTO exoplanet search program (Delgado Mena et al. 2017). The t-SNE algorithm allowed them to define the chemical subpopulations of solar-neighborhood stars more reliably. In addition, t-SNE projected maps revealed many stars with peculiar chemical compositions.

Solar physics, like other branches of astronomy, has entered an era of rapidly growing data volumes, e.g., from high-resolution imaging and spectroscopy, synoptic observations, and sophisticated numerical modeling of solar phenomena. Hence, machine-learning techniques have become mainstream and are used by the community to tackle ever more complex tasks. When analyzing spectral observations with machine-learning techniques, spectral profiles serve as portals to information about the physical state of plasma at various levels of the solar atmosphere.

Carroll & Staude (2001) examined three different strategies for the inversion of spectral lines and Stokes profiles using artificial neural networks (ANNs). They used the photospheric infrared Fe I line and estimated magnetic field and other physical parameters with remarkable speed. Socas-Navarro (2005) carried out similar work with a three-stage approach for spectral-line inversion based on ANNs. First, the network was trained with synthetic spectra from a simplified model. Second, additional preprocessing using autoassociative neural networks projected the observation to the theoretical model space. The third stage included regularization of the neural network (NN). Carroll & Kopf (2008) expanded their previous work using multilayer perceptrons (MLPs) to coordinate the training of their network with a quiet-Sun simulation to obtain three-dimensional information of temperature, velocity, and magnetic field vector. Recently, Asensio Ramos & Díaz Baso (2019) presented a fast inversion code using convolutional neural networks (CNNs). They trained two different CNN architectures, where they used synthetic Stokes profiles from two snapshots of a three-dimensional magnetohydrodynamic numerical simulation containing different structures of the solar atmosphere. For lines originating in the solar transition region as observed with the Interface Region Imaging Spectrograph (IRIS; De Pontieu et al. 2014), Sainz Dalda et al. (2019) presented a faster way to obtain thermodynamic properties by combining the results provided by the traditional methods with machine- and deep-learning techniques.

Although machine learning is commonly used in solar spectral-line inversions, these techniques are not yet extensively used for classification and identification of spectral profiles. One of the few works is by Panos et al. (2018), who employed supervised hierarchical k-means to identify typical Mg II flare spectra. Panos & Kleint (2020) extended their previous work to real-time prediction of solar flares by applying a deep neural network (DNN) to Mg II spectra. In addition to using NNs, they employed principal component analysis (PCA) and t-SNE for low-dimensional representation to examine the behavior of various features. Furthermore, Kuckein et al. (2020) employed the unsupervised machine-learning algorithm k-means to classify He I 10830 Å spectra, with the aim of significantly improving the quality and speed of spectral-line inversions of this triplet.

Various machine-learning techniques have been used in solar physics but employing t-SNE to classify or identify clusters in spectral data is still very new (Verma et al. 2019). In this work, we present an exploratory study of t-SNE-based classification of high-resolution Hα spectra. Data and methods are introduced in Sections 2 and 3, respectively. Results, including a parameter study to find an optimal t-SNE setup, are described in Section 4. Discussions and an outlook are presented in Sections 5 and 6, respectively.

2. Observations and Data

High-spectral resolution Hα (6562.8 Å) spectra were obtained on 2018 September 11 with the Echelle spectrograph of the Vacuum Tower Telescope (VTT; von der Lühe 1998) at Observatorio del Teide, Tenerife, Spain. The details of observations and data processing are presented by Verma et al. (2020), who investigated a surge in active region NOAA 12722. This is the reference publication for the temporal evolution of the active region, including its morphology in the photosphere, chromosphere, and upper atmosphere, its complex velocity fields associated with ejected and returning plasma, and the spectral characteristics of the surging plasma. Starting at 08:05 UT, 21 Hα spatio-spectral scans were acquired in three and a half hours. Each scan took about 9 minutes with a few longer time gaps between some scans. The field of view (FOV) of 100'' × 120'' is covered in each scan with 630 scan steps and 660 pixels along the slit as shown in the slit-reconstructed line-core intensity map in Figure 1. In the center of the FOV resides an arch-filament system connecting two opposite polarities. The size of a scan step is 0.16'' and a pixel along the slit corresponds to 0.18''. The data processing steps were described in Dineva et al. (2020), which included PCA for noise-stripping and computation of cloud model (CM) inversions (Beckers 1964). The final spectra are resampled to 601 wavelength points, which cover a wavelength range of ±3 Å around the Hα line core. These high-spectral and moderate-spatial resolution spectra are the basis of this work.

**Figure 1.** Slit-reconstructed Hα line-core intensity image of active region NOAA 12722 observed at 08:05 UT on 2018 September 11.

The time series of 21 Hα spatio-spectral data cubes contains about 2 × 8.7 million intensity and contrast profiles. The contrast profiles refer to

$\begin{eqnarray}&&C(\lambda )=\displaystyle \frac{I(\lambda )-{I}_{0}(\lambda )}{{I}_{0}(\lambda )},\end{eqnarray} \tag{ 1 }$

where I₀(λ) is the quiet-Sun spectral profile obtained either from observations or model atmospheres. We used the observed quiet-Sun background profiles of David (1961) and interpolated them to μ = 0.86, the cosine of the heliocentric angle, as explained in Verma et al. (2020) and Dineva et al. (2020). The morphology of the spectral shapes and their frequency of occurrence are summarized in Figure 2. The two-dimensional histograms have a bin size of 10 mÅ along the wavelength axis and 2 × 10⁻³ and 5 × 10⁻³ along the intensity and contrast axes, respectively. The histograms were normalized by the number of spectra and are displayed on a logarithmic scale, where light orange refers to about 10 spectral points per bin and the darkest orange indicates that all spectral points are contained in the bin. The Doppler shift of individual spectral-line profiles was not corrected, which results in the extended wings of the two-dimensional histograms. In addition, the logarithmic display may be misleading as the vast majority of intensity profiles stay close to the average profile, which is depicted as a red-white dashed curve. The average contrast profile stays close to zero, while the frequency distribution indicates a preponderance of W-shaped profiles.
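As a minimal illustration of Equation (1), the following Python sketch computes contrast profiles on the 601-point wavelength grid described above. The arrays are synthetic stand-ins rather than the VTT observations, and the function name `contrast_profiles` and the toy line shapes are assumptions made only for this example.

```python
import numpy as np

def contrast_profiles(intensity, quiet_sun):
    """Equation (1): C = (I - I0) / I0, evaluated at each wavelength point."""
    intensity = np.asarray(intensity, dtype=float)
    quiet_sun = np.asarray(quiet_sun, dtype=float)
    return (intensity - quiet_sun) / quiet_sun

# 601 points covering +/-3 Angstrom around the line core (10 mA sampling)
wavelength = np.linspace(-3.0, 3.0, 601)

# Toy quiet-Sun absorption line (assumption, not the David 1961 profile)
quiet_sun = 1.0 - 0.8 * np.exp(-(wavelength / 0.6) ** 2)

# Five toy intensity profiles with different line depths
rng = np.random.default_rng(0)
depth = rng.uniform(0.6, 0.9, size=(5, 1))
intensity = 1.0 - depth * np.exp(-(wavelength / 0.6) ** 2)

contrast = contrast_profiles(intensity, quiet_sun)  # shape (5, 601)
```

A profile identical to the quiet-Sun profile has zero contrast everywhere, which is why the average contrast profile stays close to zero.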

**Figure 2.** Two-dimensional histograms of observed, noise-stripped Hα intensity (*top*) and contrast (*bottom*) profiles. The distributions were divided by the number of profiles (about 8.7 million) and are displayed on a logarithmic scale between 10⁻⁶ and 10⁰, where darker colors refer to a higher number density. The dashed curves in the top and bottom panels refer to the average Hα intensity and contrast profiles, respectively.

To demonstrate the variation in contrast profiles, we plotted four examples in Figure 3. The locations of these four profiles are marked on the slit-reconstructed line-core intensity map in Figure 1. These examples provide a glimpse of the variety of contrast profiles, ranging from those with a strong or weak central component (2 or 3) to those where the central maximum is less pronounced and the contrast is almost everywhere negative (1 and 4). The red-blue asymmetry is indicative of Doppler shifts.

CM inversions are widely used in solar physics to infer physical parameters for dark cloud-like absorption features observed in the Hα line (e.g., Tziotziou 2007). With some fundamental assumptions (see, e.g., Kuckein et al. 2016; Dineva et al. 2020), four parameters are determined that define the radiative transfer and line formation. These four parameters are the optical thickness τ₀, Doppler velocity of the cloud v_D, Doppler width Δλ_D of the absorption profile, and source function S. Profiles that are suitable for CM inversions are characterized by strong absorption profiles. Note that CM inversions cannot reproduce the strong central component with positive contrasts. More details on the implementation and computation of CM inversions are given in Dineva et al. (2020).

In general, the seeing conditions were good at the beginning and deteriorated toward the end of the observing run. The best seeing conditions were encountered for scans No. 1, 2, 3, and 6, whereas the seeing conditions were worst for scans No. 20, 21, 9, and 15. Scan No. 1 with the best seeing conditions is used in the following for benchmarking the t-SNE algorithm. The seeing and image quality were determined from the granular contrast and the median filter gradient similarity (MFGS; Deng et al. 2015; Denker et al. 2018) of slit-reconstructed pseudo-continuum images. The granular contrast covered the range 1.56%–2.14%. The MFGS values reside by definition in the interval [0, 1], whereby higher values indicate better image quality and in turn better seeing conditions. Since the solar surface is scanned by the spectrograph, the seeing varies at each scanned step. However, visual inspection as well as MFGS values show that seeing estimates based on regions with granulation are representative for the full FOV. At the lowest contrast and MFGS values, the granular pattern is completely washed out, and only the dark sunspot and pores as well as bright Hα grains remain visible.

The benchmark spectral data comprise intensity profiles, contrast profiles, and their counterparts based on the superposition of specific eigenfunctions derived from PCA and CM inversions. According to Dineva et al. (2020), ten eigenfunctions are sufficient to construct the observed Hα profiles. In the present study, application of PCA and CM inversions yields noise-stripped contrast profiles, which are used as input for the t-SNE algorithm unless stated otherwise.
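The idea of reconstructing profiles from a small number of eigenfunctions can be sketched as follows. This is a generic PCA reconstruction via singular value decomposition on synthetic data, not the processing pipeline of Dineva et al. (2020); the function name `pca_noise_strip`, the toy line shapes, and the noise level are assumptions.

```python
import numpy as np

def pca_noise_strip(profiles, n_components=10):
    """Reconstruct profiles from their leading principal components."""
    mean = profiles.mean(axis=0)
    centered = profiles - mean
    # Rows of vt are the eigenfunctions in wavelength space
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]
    coefficients = centered @ basis.T            # (n_samples, n_components)
    reconstructed = mean + coefficients @ basis  # (n_samples, n_features)
    return reconstructed, coefficients

rng = np.random.default_rng(1)
wavelength = np.linspace(-3.0, 3.0, 601)

# Toy data: random mixtures of three line shapes plus white noise
shapes = np.stack([np.exp(-(wavelength / w) ** 2) for w in (0.3, 0.6, 1.2)])
clean = rng.normal(size=(200, 3)) @ shapes
noisy = clean + 0.05 * rng.normal(size=clean.shape)

stripped, coefficients = pca_noise_strip(noisy, n_components=10)
```

Either the reconstructed profiles (`stripped`) or the ten coefficients per profile (`coefficients`) could then serve as input for a projection, at 601 and 10 features per sample, respectively.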

3. Methods

Many machine-learning algorithms are available for dimension reduction and visualization of multidimensional data. In our application, i.e., finding clusters or classes of high-resolution chromospheric spectra, t-SNE was chosen because it delivered very good results in classifying stellar spectra (Matijevič et al. 2017). The method was proposed by van der Maaten & Hinton (2008) as a successor of SNE (Hinton & Roweis 2002). In this study, we primarily used contrast profiles of scans with good seeing, which contain n_s = 660 × 630 = 415,800 samples and n_f = 601 features (wavelength points). In terms of a multidimensional space, we have 415,800 points in a 601-dimensional space. Our task is to evaluate the level of similarity between these points.

The following brief description of t-SNE provides the necessary background for the subsequent data analysis (see van der Maaten & Hinton 2008 for details). A Gaussian probability distribution centered on each point in this initially 601-dimensional space (corresponding to the wavelength sampling) can be defined with a variance of $\sigma_i^2$. The similarity between points p_i and p_j (two profiles) is the conditional probability P_j∣i for point p_i to pick point p_j as its neighbor. If neighbors were picked in proportion to their probability density defined by the Gaussian distribution, then

$\begin{eqnarray}&&P_{j|i}=\displaystyle \frac{\exp \left(-\| p_i-p_j\|^2/2\sigma_i^2\right)}{\displaystyle \sum_{k\ne i}\exp \left(-\| p_i-p_k\|^2/2\sigma_i^2\right)}.\end{eqnarray} \tag{ 2 }$

Since our interest is to find only pairwise similarities, the value of P_i∣i can be set to zero. For the low-dimensional counterparts q_i and q_j of the high-dimensional points p_i and p_j, a similar conditional probability Q_j∣i can be computed as

$\begin{eqnarray}&&Q_{j|i}=\displaystyle \frac{\exp \left(-\| q_i-q_j\|^2\right)}{\displaystyle \sum_{k\ne i}\exp \left(-\| q_i-q_k\|^2\right)},\end{eqnarray} \tag{ 3 }$

where the variance is $\sigma_i^2 = 1/2$. This value for $\sigma_i^2$ is chosen for simplicity as it only results in a rescaled version of the final projection. Since we are again interested in pairwise similarity, Q_i∣i can also be set to zero. If the low-dimensional data points q_i and q_j correctly model the high-dimensional data points p_i and p_j, then the conditional probabilities P_j∣i and Q_j∣i will be equal. However, this never happens in practice. The aim is to approach this condition, i.e., to minimize the mismatch between the two distributions. Here, this is achieved by minimizing the sum of the Kullback–Leibler divergences (Kullback 1959) over all data points using a gradient descent algorithm. The minimized cost function is given by

$\begin{eqnarray}&&C=\displaystyle \sum_{i}\displaystyle \sum_{j}P_{j|i}\,\mathrm{log}\displaystyle \frac{P_{j|i}}{Q_{j|i}}.\end{eqnarray} \tag{ 4 }$
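The conditional probabilities and the cost function can be evaluated directly for a small toy data set. The sketch below follows the Gaussian form of Equations (2)–(4) as written here, normalizing each row over all other points. It is not the Barnes-Hut implementation used in this study, and the fixed variance of 200 for the high-dimensional points as well as the function names are assumptions for illustration only.

```python
import numpy as np

def conditional_probabilities(points, sigma2):
    """Row i holds the probabilities of point i picking each point j as neighbor."""
    diff = points[:, None, :] - points[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    affinity = np.exp(-sq_dist / (2.0 * sigma2[:, None]))
    np.fill_diagonal(affinity, 0.0)  # a point never picks itself
    return affinity / affinity.sum(axis=1, keepdims=True)

def kl_cost(p, q):
    """Sum of Kullback-Leibler divergences over all points (Equation (4))."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(2)
high = rng.normal(size=(50, 601))  # 50 toy "profiles" with 601 features
low = rng.normal(size=(50, 2))     # a random two-dimensional layout

p = conditional_probabilities(high, np.full(50, 200.0))  # assumed variance
q = conditional_probabilities(low, np.full(50, 0.5))     # variance 1/2
cost = kl_cost(p, q)
```

A gradient descent would move the rows of `low` to reduce `cost`, which vanishes only if both sets of conditional probabilities agree.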

Details about the mathematical implementation are presented in van der Maaten & Hinton (2008). In the present study, we used the improved version of the SNE, namely the t-distributed stochastic neighbor embedding (t-SNE). In this version (van der Maaten 2014), the Barnes-Hut (Barnes & Hut 1986) algorithm is implemented for faster cost function gradient approximation. We used the multicore implementation of t-SNE.⁵ The t-SNE algorithm has a non-convex objective function, which is minimized using a gradient descent optimization. This leads to different solutions for different computation runs, where the results are similar but not exactly the same. Perplexity p and the Barnes-Hut parameter θ are the two free hyper-parameters, which regulate the generation of the projection. The perplexity p is associated with the number of nearest neighbors that is used in manifold learning algorithms. The variance of the Gaussian distribution in Equation (2) is controlled by p. The basic understanding is that larger data sets require a larger perplexity value. Moreover, θ is the parameter that affects the speed of the Barnes-Hut algorithm. However, its value is a trade-off between speed and accuracy. We discuss these and other parameters in detail in the following section. Using the default values p = 50 and θ = 0.5, it took about 18 minutes on a 48-core computer to perform the two-dimensional projection of 415,800 contrast profiles.
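How the perplexity controls the variance in Equation (2) can be sketched with a bisection search, following the definition in van der Maaten & Hinton (2008), where the perplexity of the conditional distribution of point i is 2 raised to its Shannon entropy in bits. The code below is a simplified illustration on synthetic data, not the multicore implementation; the function names, search bounds, and number of bisection steps are assumptions.

```python
import numpy as np

def perplexity_of_row(sq_dist_row, sigma2):
    """Perplexity 2**H of the conditional distribution around one point."""
    # Subtracting the minimum distance avoids underflow and cancels on normalization
    weight = np.exp(-(sq_dist_row - sq_dist_row.min()) / (2.0 * sigma2))
    prob = weight / weight.sum()
    prob = prob[prob > 0]
    entropy = -np.sum(prob * np.log2(prob))
    return 2.0 ** entropy

def calibrate_sigma2(sq_dist_row, target, n_steps=60):
    """Bisection in log space for the variance that yields the target perplexity."""
    lo, hi = 1e-10, 1e10
    for _ in range(n_steps):
        mid = np.sqrt(lo * hi)
        if perplexity_of_row(sq_dist_row, mid) > target:
            hi = mid  # distribution too broad: decrease the variance
        else:
            lo = mid  # distribution too narrow: increase the variance
    return np.sqrt(lo * hi)

rng = np.random.default_rng(3)
points = rng.normal(size=(200, 601))
# Squared distances from point 0 to all other points
sq_dist_row = np.sum((points[1:] - points[0]) ** 2, axis=1)

sigma2 = calibrate_sigma2(sq_dist_row, target=50.0)
```

In this picture, a perplexity of p = 50 means that each profile effectively weighs about 50 neighbors, so the variance is small in densely populated regions and large in sparse ones.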

In photospheric images, simple intensity-based thresholding often produces satisfactory classification results, identifying quiet-Sun regions, pores, and sunspots including substructures such as penumbra and umbra as well as fine structures such as penumbral grains and umbral dots. However, the highly structured and dynamic chromosphere poses a taxing challenge for spectral classification, which motivated the present study. The chromospheric region depicted in Figure 1 includes quiet-Sun regions, an arch-filament system, surges, and some bright plage regions. The main objective of t-SNE is to distinguish among these features in two-dimensional projections of physical parameters. The main challenge is the large number of samples and features, which may obfuscate clear cluster boundaries. Data aggregation as demonstrated in Figure 4 not only benefits visualization but also makes t-SNE results accessible to statistical tools.

The two-dimensional t-SNE projection using contrast profiles of the scan with the best seeing conditions (Figure 4) is based on n_s = 415,800 samples (locations within the FOV) and n_f = 601 features (wavelength points). The t-SNE projection of all samples is displayed in the left panel of Figure 4. The x- and y-axes of the t-SNE projection are machine-learned reduced dimensions, which have no physical significance, and both axes were normalized so that the coordinates are cast in the range [−1, +1]. The coordinate system is only required to determine distances within and between clusters, whereby the distance should not be taken as a quantitative measure of the classification success. A good classification is characterized by compact, well-separated clusters. However, the large number of samples used for the t-SNE projection results in a point-cloud-like appearance of varying density, with several gaps separating center and periphery. In the absence of clearly distinct clusters, collecting samples in hexagonal bins and labeling the bins with physical properties (middle panel of Figure 4) are the next steps in classifying spectra.

The observed chromospheric scene predominantly covers quiet Sun with an embedded activity cluster. The middle panel of Figure 4 displays as an example a two-dimensional histogram with about 12,000 hexagonal bins, where the suitability of contrast profiles for CM inversions is the dependent variable. This is originally a binary parameter, i.e., it is unity if the linear and rank-order correlation coefficients between observed and CM-inverted contrast are ρ_p > 0.95 and ρ_s > 0.95, respectively. In addition, the CM parameters, i.e., optical thickness τ₀, Doppler velocity of the cloud v_D, Doppler width of the absorption profile Δλ_D, and source function S, have to be within the bounds specified in Dineva et al. (2020). Only after binning and taking the average does the suitability parameter become a floating-point number in the interval [0.0, 1.0]. Binning and taking the average is carried out using the number of contrast profiles per hexagonal bin as shown in the right panel of Figure 4.
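A minimal sketch of the binary suitability parameter and its bin average is given below. It covers only the two correlation criteria and omits the bounds on the CM parameters. Square bins from `numpy.histogram2d` replace the hexagonal bins for simplicity, the rank correlation does not average tied ranks, and all data, function names, and the number of bins are assumptions.

```python
import numpy as np

def pearson(a, b):
    """Linear correlation coefficient."""
    return float(np.corrcoef(a, b)[0, 1])

def spearman(a, b):
    """Rank-order correlation as the Pearson correlation of ranks (ties not averaged)."""
    rank_a = np.argsort(np.argsort(a, kind="stable"), kind="stable")
    rank_b = np.argsort(np.argsort(b, kind="stable"), kind="stable")
    return pearson(rank_a, rank_b)

def suitability(observed, modeled, threshold=0.95):
    """Unity where both correlation coefficients exceed the threshold, else zero."""
    return np.array([
        float(pearson(o, m) > threshold and spearman(o, m) > threshold)
        for o, m in zip(observed, modeled)
    ])

def bin_average(x, y, values, n_bins=20):
    """Mean of values per square bin on the normalized map; NaN for empty bins."""
    edges = np.linspace(-1.0, 1.0, n_bins + 1)
    total, _, _ = np.histogram2d(x, y, bins=[edges, edges], weights=values)
    count, _, _ = np.histogram2d(x, y, bins=[edges, edges])
    with np.errstate(invalid="ignore", divide="ignore"):
        mean = np.where(count > 0, total / count, np.nan)
    return mean, count

rng = np.random.default_rng(4)
wavelength = np.linspace(-3.0, 3.0, 601)
n_profiles, half = 400, 200
amplitude = rng.uniform(0.1, 0.5, size=(n_profiles, 1))
observed = -amplitude * np.exp(-(wavelength / 1.5) ** 2)

# Toy "inversions": a good fit for the first half, a failed fit for the second half
modeled = np.vstack([0.98 * observed[:half], -observed[half:]])

flags = suitability(observed, modeled)
x, y = rng.uniform(-1.0, 1.0, size=(2, n_profiles))  # stand-in map coordinates
mean, count = bin_average(x, y, flags)
```

The binary `flags` become fractional values in `mean` wherever a bin collects both suitable and unsuitable profiles.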

Particularly, the right panel of Figure 4 reveals that the number of the contrast profiles is high in the central hexagonal bins compared to those at the periphery. This is also evident as a higher density of points in the left panel of Figure 4. This arrangement is the most compact representation of a large number of similar quiet-Sun samples. All other profiles with different spectral characteristics are pushed to the periphery, where they form individual clusters. While interpreting the t-SNE maps, this behavior of the algorithm has to be taken into account.

In principle, any other physical property of the projected data set can become the dependent variable with the potential to reveal any clustering in the projection. However, the apparent separation of the red and green colors already indicates two classes of contrast profiles, i.e., those that are suitable for CM inversions (green) and those where CM inversions fail (red). The former profiles belong to dark cloud-like features in Hα line-core intensity maps, whereas the latter are associated with quiet-Sun regions and profiles with enhanced line-core intensities or even emission. The identification of the two classes in the t-SNE projection is striking, especially in the absence of any a priori knowledge of the underlying physics or implementation of the CM inversion algorithm. This provided the motivation for a detailed parameter study of t-SNE with the goal to find a procedure for the bulk classification of Hα spectra, among others, expected from telescopes for high-resolution solar observations.

The output of a t-SNE projection is a two-dimensional coordinate for each contrast profile (left panel of Figure 4), and a one-to-one correspondence exists between these coordinates and those in slit-reconstructed images (Figure 1). Thus, it becomes possible to back-project physical properties of the t-SNE maps to the observed FOV (Figure 5). This also applies to average values (for example, the middle panel of Figure 4) and higher stochastic moments (i.e., variance/standard deviation, skewness, and kurtosis) of physical properties, which can be computed for each hexagonal bin (see Section 4.3). The interplay between t-SNE map and back-projection allows us to associate clusters in t-SNE maps with physical properties of the observed scene on the solar surface. Thus, human inference is still needed to exploit the potential of t-SNE. Figure 5 illustrates such a back-mapping of the data shown in the middle panel of Figure 4. This serves as a sanity check that t-SNE produces meaningful clusters of the input contrast profiles. For clarity of the display, the averaged suitability parameter takes on only three values: zero, unity, and one-half. Compared to the binary map of the suitability parameter, a clearly defined transition zone surrounds the regions with profiles suitable for CM inversions. Within contiguous regions of either unity or zero, other values are rarely encountered and appear as a noise-like pattern. Regions where the suitability parameter is unity belong to dark surges, arch filaments, and mottles.
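The back-projection can be sketched by replacing the value of each sample with the average of its bin and reshaping the result to the FOV. The example assumes that samples are stored in row-major order of the slit-reconstructed image, uses a FOV downscaled by a factor of ten and square bins instead of hexagonal ones, and works on random stand-in data; the function name `back_project` is hypothetical.

```python
import numpy as np

def back_project(x, y, values, fov_shape, n_bins=20):
    """Assign each sample the mean value of its bin and reshape to the FOV."""
    edges = np.linspace(-1.0, 1.0, n_bins + 1)
    ix = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    iy = np.clip(np.digitize(y, edges) - 1, 0, n_bins - 1)
    flat_bin = ix * n_bins + iy
    total = np.bincount(flat_bin, weights=values, minlength=n_bins * n_bins)
    count = np.bincount(flat_bin, minlength=n_bins * n_bins)
    bin_mean = total / np.maximum(count, 1)
    # One-to-one correspondence: sample k belongs to pixel k of the flattened FOV
    return bin_mean[flat_bin].reshape(fov_shape)

fov_shape = (66, 63)  # downscaled stand-in for 660 x 630 pixels
n_samples = fov_shape[0] * fov_shape[1]

rng = np.random.default_rng(5)
x, y = rng.uniform(-1.0, 1.0, size=(2, n_samples))          # stand-in map coordinates
values = rng.integers(0, 2, size=n_samples).astype(float)   # stand-in binary parameter

fov_map = back_project(x, y, values, fov_shape)
```

The same indexing carries over to other per-bin statistics, e.g., the standard deviation, by replacing the bin mean with the corresponding quantity.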

**Figure 5.** Back-projection of contrast profiles suitable for CM inversions (*green*) to the observed FOV (same data as in Figure 4). Regions where CM inversions fail (*red*) represent either the quiet Sun or belong to profiles with enhanced line-core intensities or with emission.

Another way of visualizing t-SNE results is to compute the normalized distance from the center of the t-SNE map and back-project it into the observed FOV (Figure 6). Using the distance from the center of the map is reasonable because the maps tend to be nearly circular, almost regardless of how the hyper-parameters are set. The largest distances are found in a compact region in the center of the FOV that encompasses the active region with distinct absorption and emission features in Hα. This compact region is surrounded by granular-scale clusters of low-distance values, which are typical for the surrounding quiet Sun. Since profiles in quiet-Sun regions outnumber those in active regions, the t-SNE algorithm concentrates the very similar quiet-Sun profiles in the center of the t-SNE map. All other profiles with a broad variety of different shapes are pushed to the periphery. Back-projection of the normalized distance is always possible and does not require any additional information beyond the input contrast profiles. However, a clear separation, as in Figure 6, will not always occur and depends on the observed features and their frequency of occurrence within the FOV.
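The normalized distance only requires the map coordinates. The sketch below assumes that the center of the map is the origin of the axes after scaling them to [−1, +1] and that samples are stored in row-major order of the FOV; the coordinates are random stand-ins and the function names are hypothetical.

```python
import numpy as np

def normalize_axis(v):
    """Scale coordinates linearly to the range [-1, +1]."""
    return 2.0 * (v - v.min()) / (v.max() - v.min()) - 1.0

def normalized_distance(x, y):
    """Distance from the map center: zero at the center, unity at the outermost sample."""
    r = np.hypot(normalize_axis(x), normalize_axis(y))
    return r / r.max()

fov_shape = (66, 63)  # downscaled stand-in for 660 x 630 pixels
rng = np.random.default_rng(6)
x, y = rng.normal(size=(2, fov_shape[0] * fov_shape[1]))  # stand-in projection

distance_map = normalized_distance(x, y).reshape(fov_shape)
```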

**Figure 6.** Back-projection of the normalized distance from the center of the t-SNE projection to the observed FOV (same data as in Figure 4). Zero refers to the center of the t-SNE map, whereas unity indicates the outermost edge.

4. Results

In this section, we present a parameter study to find an optimal t-SNE setup adapted to the input data, i.e., contrast profiles of the chromospheric Hα line. Once the optimal setup is established, we explore back-mapping and re-projection of selected clusters in the t-SNE maps. Furthermore, the classification of profiles based on t-SNE projections is presented and discussed.

4.1. Parameter Study

Frequently, t-SNE projections are used for data dimensionality reduction and visualization. However, to interpret and understand the projection, the user's expertise and domain knowledge of the input data are needed. In addition, an acute awareness of the parameters that control the t-SNE projection of the data is required, in order to avoid misinterpreting the results. Therefore, it is important to grasp how the t-SNE projection depends on perplexity p, Barnes-Hut parameter θ, and number of iterations n. The multicore implementation of t-SNE depends only on these parameters, which are investigated in the following parameter study. A detailed account of these parameters is given by Wattenberg et al. (2016) who presented a graphical description of t-SNE using various combinations of parameters.

The default parameters of t-SNE are the Barnes-Hut parameter θ = 0.5, perplexity p = 50, and number of iterations n = 1000. Meaningful parameters, i.e., parameters that were commonly used in other studies, cover the intervals θ ∈ [0.2, 0.8], p ∈ [10, 100], and n ∈ [200, 4000], resulting in numerous triples of parameters. For example, the selected parameter range of the perplexity is based on van der Maaten & Hinton (2008), who recommended a typical range of 5–50. However, the three-dimensional space spanned by the t-SNE parameters can be truncated to save computing time while preserving the most notable dependencies. Two parameters are kept fixed at the default values while changing the third one. This scheme produces 13 t-SNE maps and allows a comprehensive comparison of the t-SNE projections (Figure 7). The input data are again the contrast profiles of the best scan, and the maps are color-coded as before to discern if the observed contrast profiles are suitable for CM inversions.
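The one-parameter-at-a-time scheme can be written down as a short enumeration. In the sketch below, the default values, the interval endpoints, the perplexities p = 10 and p = 30, and the five iteration numbers are taken from the text. The remaining grid values (θ = 0.35, θ = 0.65, and p = 70) are placeholders assumed only to complete five settings per parameter, and the dictionary keys are hypothetical names.

```python
DEFAULTS = {"theta": 0.5, "perplexity": 50, "n_iter": 1000}

GRID = {
    "theta": [0.2, 0.35, 0.5, 0.65, 0.8],    # 0.35 and 0.65 are placeholders
    "perplexity": [10, 30, 50, 70, 100],     # 70 is a placeholder
    "n_iter": [200, 400, 1000, 2000, 4000],  # values quoted in the text
}

def one_at_a_time(defaults, grid):
    """Vary one parameter while the others stay at their defaults; drop duplicates."""
    setups = []
    for name, values in grid.items():
        for value in values:
            setup = dict(defaults, **{name: value})
            if setup not in setups:
                setups.append(setup)
    return setups

setups = one_at_a_time(DEFAULTS, GRID)
```

With five settings per parameter, the default triple occurs once in each row of panels, so that 15 panels correspond to 13 distinct projections.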

The top row in Figure 7 demonstrates that varying the Barnes-Hut parameter θ only leads to minute changes in the t-SNE maps, which are mainly restricted to the periphery of the maps, in particular for high values of θ. When the perplexity p is changed from low to high values (middle row of Figure 7), profiles that are not suitable for CM inversions become increasingly concentrated in the center of the map. The values in the periphery of the t-SNE map are initially almost randomly distributed but they become more clustered for p > 30, in particular for the profiles that are suitable for CM inversions. Once p = 50 is reached, the difference between the maps is again minute. If the number of iterations is very low (n = 200 in the bottom row of Figure 7), the circular coordinate space of the t-SNE map is only incompletely filled, which changes to the contrary for n = 400. In the latter case, the hexagonal bins cover the entire disk-shaped region but predominant yellow and orange colors indicate a deficient classification of the contrast profiles. A clear clustering of the t-SNE maps becomes apparent at n = 1000, i.e., a large number of iterations is needed for large data sets to optimize the t-SNE projection. Further increasing the number of iterations does not change the morphology of the maps and only slightly boosts the quality of the classification—at the cost of rising computational effort. The computing time using 48 of 64 cores of a compute server (AMD Opteron 6378) takes 18, 32, and 63 minutes for n = 1000, 2000, and 4000, respectively. Thus, the computing time depends almost linearly on the number of iterations n.

Line width and Doppler shift of spectral lines are the most obvious parameters affecting morphology and position of spectra. Therefore, we expect that they have a significant impact on the classification. Using the line-core Doppler shift, which was derived from parabola fitting of the line core, the 13 t-SNE maps are plotted in Figure 8 with the same layout and parameter sets as in Figure 7. Negative and positive plasma flows, i.e., blue- and redshifts of the spectral line, form patterns that are aligned with clusters that are already apparent in Figure 7, regardless of whether these clusters refer to contrast profiles that are suitable or unfit for CM inversions. Exceptions are encountered for low values of the perplexity (p = 10 and p = 30) and small numbers of iterations (n = 200 and n = 400). The morphology of the remaining nine maps is very similar. The Doppler velocity serves as a secondary criterion to separate the flow speed within each cluster. Thus, sharp transitions from positive to negative velocities provide clues where the borders of clusters may be located. In addition, the highest flow speeds are encountered at the periphery of the t-SNE maps because they are by definition absent in quiet-Sun regions at the center of the maps. Therefore, t-SNE correctly performs the coarse classification between profiles characteristic for the active and quiet Sun without a priori knowledge of the underlying physics. In summary, comparing the t-SNE projections and computing time for various combinations of parameters, we conclude that the default parameters θ = 0.5, p = 50, and n = 1000 are already a very good choice for our data set.
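A line-core Doppler velocity from parabola fitting can be sketched as follows for a single absorption profile. The width of the fitting window, the toy Gaussian line, and the function name are assumptions; the sign convention treats redshifts as positive velocities.

```python
import numpy as np

C_KM_S = 299_792.458  # speed of light in km/s
LAMBDA_0 = 6562.8     # H-alpha rest wavelength in Angstrom

def line_core_velocity(wavelength, intensity, half_width=5):
    """Doppler velocity (km/s) from a parabola fitted around the intensity minimum."""
    k = int(np.argmin(intensity))
    lo = max(k - half_width, 0)
    hi = min(k + half_width + 1, len(wavelength))
    a, b, _ = np.polyfit(wavelength[lo:hi], intensity[lo:hi], 2)
    shift = -b / (2.0 * a)  # vertex of the parabola relative to the rest wavelength
    return C_KM_S * shift / LAMBDA_0

# Wavelength offsets from the rest wavelength, 601 points over +/-3 Angstrom
wavelength = np.linspace(-3.0, 3.0, 601)

# Toy absorption line redshifted by 0.2 Angstrom
true_shift = 0.2
intensity = 1.0 - 0.8 * np.exp(-((wavelength - true_shift) / 0.6) ** 2)

velocity = line_core_velocity(wavelength, intensity)
expected = C_KM_S * true_shift / LAMBDA_0  # about 9.1 km/s
```

At this sampling, one wavelength step of 10 mÅ corresponds to roughly 0.46 km/s, so the parabola fit provides sub-pixel precision for the position of the line core.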

**Figure 8.** Parameter study of t-SNE projections as a function of θ (*top row*), perplexity p (*middle row*), and number of iterations n (*bottom row*). The projection with θ = 0.5, perplexity p = 50, and number of iterations n = 1000 refers to the default settings (*third panel, middle row*). The two-dimensional projections are color-coded according to the line-core Doppler velocity.

4.2. Choice of Input Data and Impact of Seeing Conditions

Apart from optimizing the hyper-parameters θ, p, and n of the t-SNE projection, various forms of input data will result in different t-SNE projections, which are scrutinized in this section. Ground-based observations are affected by varying seeing quality. This raises the question of whether or not seeing conditions impact t-SNE projections. In addition, input data can be contrast or intensity profiles, either observed or noise-stripped using PCA, or just the PCA coefficients of the first 10 eigenfunctions. We investigated these issues by creating t-SNE projections for some variants of input data, which are compiled in Figure 9.
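The two PCA-based input variants, i.e., noise-stripped profiles and the coefficients of the first 10 eigenfunctions, can be sketched as follows. This is an illustration under stated assumptions rather than the actual preprocessing: the function name `pca_noise_strip` is hypothetical, and subtracting the mean profile before the decomposition is an assumption.

```python
import numpy as np


def pca_noise_strip(profiles, n_components=10):
    """Return PCA coefficients and profiles rebuilt from the leading eigenfunctions.

    profiles: array of shape (n_profiles, n_wavelengths).
    """
    mean = profiles.mean(axis=0)
    centered = profiles - mean
    # Rows of vt are the eigenfunctions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]
    coeffs = centered @ basis.T          # shape (n_profiles, n_components)
    stripped = coeffs @ basis + mean     # noise-stripped profiles
    return coeffs, stripped
```

Either `coeffs` or `stripped` can then be passed to t-SNE. The coefficient array is much smaller than the profile array and, according to Figure 9(b), carries essentially the same morphological information.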

**Figure 9.** Comparison of t-SNE projections using different input data: (a) noise-stripped contrast profiles for a bad seeing scan, (b) PCA coefficients for the best seeing scan, (c) observed contrast profiles for the best seeing scan, and (d) noise-stripped intensity profiles for the best seeing scan. All projections are color-coded according to their suitability for CM inversions (*top*) and depending on the line-core Doppler velocity (*bottom*).

As noted in Section 3, different computational runs will result in somewhat different t-SNE projections. Hence, it is impossible to carry out a one-to-one comparison of the t-SNE projections in Figure 9. Yet, the overall morphology provides some guidance regarding quality and choice of input data. Red and green values are well separated, and the profiles suitable for CM inversions are in general pushed to the periphery. However, the maps for bad seeing data appear noisier (e.g., scan No. 20). In particular, the regions where the contrast profiles are unfit for CM inversions are not as compact. Interestingly, the t-SNE maps based on just the PCA coefficients (Figure 9(b)) are virtually identical to the reference data set (middle panel of Figure 4 and central maps in Figures 7 and 8), indicating that 10 eigenfunctions are sufficient to capture the essential morphology of Hα contrast profiles. There are subtle indications (darker green clusters in Figure 9(b) as compared to Figure 9(c)) that PCA-based noise-stripping performs better in identifying contrast profiles for CM inversions. A similar trend is evident for the t-SNE projection using noise-stripped intensity profiles (Figure 9(d)). Thus, noise-stripping is a distinct advantage for classifying Hα profiles of chromospheric absorption features. Since the average computing time (about 18 min) was the same for all t-SNE maps, it is not an important criterion for choosing the input data for t-SNE.

4.3. t-SNE Projections of CM Inversions

Much of the chromospheric complexity of the active Sun is contained in absorption profiles of cool plasma that, compared to its surroundings, is suspended by the magnetic field above the solar surface. Before exploiting the results from Sections 3 and 4.1, we apply t-SNE to all Hα contrast profiles of the best scan that fulfill the criteria ρ_p > 0.95 and ρ_s > 0.95, i.e., they produce reliable CM inversions. The results are compiled in Figure 10, which presents t-SNE projections for the four CM parameters, i.e., Doppler velocity of the cloud v, optical depth τ, source function S, and Doppler width Δλ_D.
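The selection criterion can be illustrated with a short sketch, taking ρ_p and ρ_s as the linear and rank-order correlation coefficients between observed and CM-inverted contrast profiles. The helper names are hypothetical, and ties are not rank-averaged in this simplified rank-order coefficient, which may differ slightly from the implementation used for the data.

```python
import numpy as np


def pearson(a, b):
    """Linear (Pearson) correlation coefficient of two 1D arrays."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))


def ranks(a):
    """Ordinal ranks 0..n-1; ties are not averaged in this sketch."""
    r = np.empty(len(a))
    r[np.argsort(a, kind="stable")] = np.arange(len(a))
    return r


def spearman(a, b):
    """Rank-order correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def is_cm_suitable(observed, inverted, threshold=0.95):
    """True if both correlation coefficients exceed the threshold."""
    return (pearson(observed, inverted) > threshold
            and spearman(observed, inverted) > threshold)
```

Applied to every pixel of a scan, this yields the binary suitability flag that is averaged within the hexagonal bins of the t-SNE maps.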

**Figure 10.** Contrast profiles, which are suitable for CM inversions, are used for t-SNE projections of the four CM parameters: cloud velocity v, optical depth τ, source function S, and Doppler width Δλ_D (*from left to right*). The scale bars at the top of each panel correspond to the range of the CM parameters.

Visually, the t-SNE maps of the CM parameters can be characterized by their complexity. The map of the Doppler width Δλ_D is the simplest, where a vertical line is sufficient to roughly separate small and large values. The maps of optical depth τ and source function S both display a wedge-like intrusion from the bottom. The map of the source function S basically consists of three regions, i.e., a large uniform region of low values to the right (light green), a smaller region with medium values to the left (medium blue-green), and the wedge-like intrusion from the bottom (dark blue). Overall, the map of optical depth τ has an almost inverse look compared to the source function S. However, here the wedge-like region is more uniform whereas the other regions reveal gradients. Morphological image processing could be applied to the maps of optical depth τ and source function S to clearly separate the three classes. Finally, the map of the cloud velocity v possesses the highest level of complexity. However, on closer inspection, it is just the velocity gradient that further differentiates the aforementioned three classes. In summary, visualizing patterns or clusters in t-SNE maps is a powerful tool of human inference, enabling a physics-based classification of Hα contrast profiles.
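For orientation, the role of the four CM parameters in shaping a contrast profile can be sketched with the classical cloud model with a Gaussian optical-depth profile. This is a generic textbook form and an assumption here. The inversion code of Dineva et al. (2020) may differ in detail, and the function name, the flat background profile in the example, and the adopted rest wavelength are purely illustrative.

```python
import numpy as np

C_KM_S = 299_792.458   # speed of light in km/s
LAMBDA_0 = 6562.8      # approximate H-alpha rest wavelength in Angstrom (assumption)


def cm_contrast(wavelength, i0, v, tau0, source, dlambda_d):
    """Cloud-model contrast profile C = (I - I0) / I0.

    i0:        background (quiet-Sun) intensity profile
    v:         cloud velocity in km/s (positive = redshift)
    tau0:      optical depth at line center
    source:    source function S in the units of i0
    dlambda_d: Doppler width in Angstrom
    """
    lambda_c = LAMBDA_0 * (1.0 + v / C_KM_S)
    tau = tau0 * np.exp(-((wavelength - lambda_c) / dlambda_d) ** 2)
    return (source / i0 - 1.0) * (1.0 - np.exp(-tau))


# Illustrative values only: flat background, cool cloud at rest.
wavelength = LAMBDA_0 + np.linspace(-2.0, 2.0, 201)
i0 = np.ones_like(wavelength)
contrast = cm_contrast(wavelength, i0, v=0.0, tau0=1.0, source=0.1, dlambda_d=0.4)
```

In this form, a source function below the background intensity produces negative contrast, the velocity shifts the profile, and optical depth and Doppler width control its depth and width.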

We plot the standard deviation, i.e., the square root of the second central moment of the probability distribution for each hexagonal bin, for the four CM parameters in Figure 11. In standard deviation maps of Doppler width Δλ_D, optical depth τ, and source function S, a wedge-like intrusion from the bottom is evident. In the case of source function S and Doppler width Δλ_D, this belongs to large values of standard deviation, whereas for optical depth τ, this trend is inverse. For cloud velocity v, the appearance of the standard deviation map is as complex as the original map. In general, the large values of standard deviation for all four CM parameters belong to the regions with large values in the maps compiled in Figure 10.
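Per-bin statistics of this kind can be computed as in the following sketch, which assigns each projected profile to a hexagonal bin and evaluates the mean and standard deviation of a parameter within each bin. The hexagon orientation (pointy-top), the bin `size`, and all function names are assumptions for illustration. Plotting libraries with built-in hexagonal binning would serve equally well.

```python
import numpy as np


def hex_bin_index(x, y, size=1.0):
    """Axial (q, r) index of the pointy-top hexagon containing each point."""
    q = (np.sqrt(3.0) / 3.0 * x - y / 3.0) / size
    r = (2.0 / 3.0 * y) / size
    s = -q - r
    rq, rr, rs = np.round(q), np.round(r), np.round(s)
    dq, dr, ds = np.abs(rq - q), np.abs(rr - r), np.abs(rs - s)
    # Restore q + r + s = 0 by recomputing the coordinate with the largest rounding error.
    fix_q = (dq > dr) & (dq > ds)
    fix_r = ~fix_q & (dr > ds)
    rq = np.where(fix_q, -rr - rs, rq)
    rr = np.where(fix_r, -rq - rs, rr)
    return rq.astype(int), rr.astype(int)


def hex_bin_stats(x, y, values, size=1.0):
    """Mean and population standard deviation of `values` in each occupied hexagon."""
    values = np.asarray(values, dtype=float)
    q, r = hex_bin_index(np.asarray(x, dtype=float), np.asarray(y, dtype=float), size)
    bins, inverse = np.unique(np.stack([q, r], axis=1), axis=0, return_inverse=True)
    inverse = inverse.ravel()
    counts = np.bincount(inverse)
    mean = np.bincount(inverse, weights=values) / counts
    mean_sq = np.bincount(inverse, weights=values ** 2) / counts
    # Clip tiny negative values caused by floating-point round-off.
    std = np.sqrt(np.clip(mean_sq - mean ** 2, 0.0, None))
    return bins, mean, std
```

Here `x` and `y` would be the two t-SNE coordinates and `values` one of the CM parameters. The same routine applied to the binary suitability flag gives the per-bin mean suitability.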

**Figure 11.** Standard deviation of the four CM parameters, i.e., cloud velocity v, optical depth τ, source function S, and Doppler width Δλ_D (*from left to right*), corresponding to t-SNE projections shown in Figure 10.

4.4. Classification of CM-invertible Contrast Profiles

To go one step further, we exploit the clustering visible in the t-SNE projection. The distinct islands are easily recognizable in the t-SNE projection and comprise spectral profiles with similar shapes. We identify the ten largest contiguous clusters (left panel of Figure 12) by applying a threshold of 0.9 for the suitability parameter (see Section 4.3) to the two-dimensional histogram with hexagonal bins. In the next step, the contrast profiles that belong to the hexagonal bins above the threshold are labeled. The relatively high threshold preserves some of the binary character of the suitability parameter (see Section 3) while adding clustering information from the t-SNE algorithm. The ten largest contiguous clusters were selected from the periphery of the t-SNE projection. They are color-coded and numbered in the clockwise direction. Four out of ten clusters are isolated while the remaining ones appear in pairs (2 and 3, 4 and 5, and 8 and 9), which are separated by narrow lanes. Since the information about the location of the profiles is available, the clusters can be back-projected to the two-dimensional slit-reconstructed map of, for example, the Hα line-core intensity (middle panel of Figure 12).
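The thresholding and grouping step can be sketched as a connected-component search on the hexagonal grid. The sketch assumes that bins are indexed by axial coordinates (q, r), that two bins are contiguous if they share an edge, and that cluster size is measured by the number of bins. All of these choices, as well as the function name, are assumptions for illustration.

```python
from collections import deque

# The six edge-sharing neighbors of a hexagon in axial (q, r) coordinates.
HEX_NEIGHBORS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]


def largest_hex_clusters(bin_means, threshold=0.9, n_clusters=10):
    """Group hexagonal bins above `threshold` into contiguous clusters, largest first.

    bin_means: dict mapping a bin index (q, r) to the mean suitability in that bin.
    """
    selected = {b for b, m in bin_means.items() if m > threshold}
    clusters = []
    while selected:
        seed = selected.pop()
        cluster, queue = [seed], deque([seed])
        while queue:                      # breadth-first flood fill
            q, r = queue.popleft()
            for dq, dr in HEX_NEIGHBORS:
                nb = (q + dq, r + dr)
                if nb in selected:
                    selected.remove(nb)
                    cluster.append(nb)
                    queue.append(nb)
        clusters.append(sorted(cluster))
    clusters.sort(key=len, reverse=True)
    return clusters[:n_clusters]
```

Every profile then inherits the cluster number of the bin into which it was projected, and profiles in bins below the threshold remain unlabeled.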

The result of this back-mapping is compiled in the middle panel of Figure 12, which establishes the link between clusters and different chromospheric absorption features. Almost all clusters are associated with the dark, cloud-like plasma encountered predominantly in the surge and arch filaments but also in regions of mottles surrounding the active region. Cluster 1 profiles mainly refer to arch filaments and the outer edge of the surge. Cluster 2 profiles are associated with footpoint regions of surges and darker regions of arch filaments. Clusters 4 and 5 are related to the upper and middle region of the surge, respectively. A small patch associated with Cluster 4 constitutes a dark region in the Hα line-core map located away from the central FOV. Clusters 6, 7, 8, and 9 are scattered and not clearly linked to specific features. Cluster 10 contains a set of profiles with enhanced line-core intensity near the base of the surge as well as near the footpoints of the arch filaments. So far, we have only referred to clusters as opposed to classes because the term "class" should be reserved for Hα contrast or intensity profiles with the same or at least similar physical origin.
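Because each profile retains its pixel position, the back-mapping itself reduces to an index assignment, as in this minimal sketch. The function name and the convention that 0 marks pixels without a cluster label are assumptions.

```python
import numpy as np


def back_project(labels, rows, cols, shape):
    """Paint cluster labels onto the slit-reconstructed map; 0 marks unlabeled pixels."""
    cluster_map = np.zeros(shape, dtype=int)
    cluster_map[rows, cols] = labels
    return cluster_map


# Three labeled profiles on a 2 x 3 pixel map (illustrative values only).
cluster_map = back_project(labels=np.array([1, 2, 2]),
                           rows=np.array([0, 1, 1]),
                           cols=np.array([0, 0, 2]),
                           shape=(2, 3))
```

The resulting label map can be overlaid on the line-core intensity map to relate clusters to absorption features.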

As a sanity check, we subjected the thresholded and clustered profiles to an additional round of t-SNE to determine if the clusters and their spatial relationship survive. The results are depicted in the right panel of Figure 12. By and large, the initial clusters are also present in the new projection. However, the pairwise relationship remained only for Clusters 4 and 5, whereas a new pair was formed for Clusters 1 and 3. Once more, human inference is required in t-SNE to establish meaningful physical relationships among clustered profiles.

Therefore, in Figure 13, we compiled 300 randomly selected intensity and contrast profiles for each of the ten clusters. In addition, average contrast profiles for each cluster are also displayed to identify characteristic profile shapes. Similarities and differences in the profiles are clearly evident for the ten selected clusters. Profiles belonging to Cluster 1 exhibit enhanced line-core intensities and are shifted to blue wavelengths indicating plasma upflows. Profiles are broad in Clusters 2, 3, and 5 and have line-core intensities close to those of the quiet Sun. They display blueshifted asymmetries, which become stronger from Cluster 3, over to Cluster 5, to Cluster 2. Cluster 4 contains deep, unshifted profiles, while the deep profiles of Clusters 6, 8, and 7 (ordered according to decreasing line depth) extend mostly to the blue. Profiles in Clusters 9 and 10 have significantly enhanced line-core intensities, whereby the profiles belonging to Cluster 9 are broadened and unshifted while those of Cluster 10 are extremely broad and asymmetric, including well-pronounced shoulders indicative of dual-flow components. A similar behavior is observed in the contrast profiles, where the amplitude variations facilitate an easier recognition of spectral characteristics.
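A compilation of this kind can be produced as sketched below. The function name is hypothetical, the random seed is arbitrary, and computing the average over all cluster members (rather than over the 300 displayed profiles) is an assumption.

```python
import numpy as np


def cluster_summary(profiles, labels, n_samples=300, seed=0):
    """Per cluster: a random subset of profiles for plotting and the average profile.

    profiles: array of shape (n_profiles, n_wavelengths)
    labels:   integer cluster label for each profile
    """
    rng = np.random.default_rng(seed)
    summary = {}
    for label in np.unique(labels):
        members = np.flatnonzero(labels == label)
        # Draw without replacement; small clusters contribute all their members.
        pick = rng.choice(members, size=min(n_samples, members.size), replace=False)
        summary[int(label)] = {
            "sample": profiles[pick],
            "mean": profiles[members].mean(axis=0),
        }
    return summary
```

Plotting `sample` as thin lines with `mean` on top gives a quick visual impression of the spread and the characteristic shape within each cluster.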

As previously discussed, some clusters may form classes. Combining the re-projected cluster in the right panel of Figure 12 with the spectral characteristics summarized in Figure 13 yields three classes: (1) contrast profiles with a pronounced central component, i.e., Clusters 3, 1, 9, and 10 ordered according to increasing positive contrast; (2) broad and deep profiles of Clusters 2 and 5, where central maxima and neighboring minima exhibit similar amplitudes in contrast profiles; and (3) contrast profiles, where the central maximum is less pronounced and the contrast almost everywhere is negative, i.e., in Clusters 6, 7, and 8. Only the profiles of Cluster 4 cannot be clearly classified. Their location in the t-SNE projection (right panel of Figure 12) favors Class 2, whereas the predominantly negative contrast in Figure 13 suggests Class 3.

In Figure 14, we present the re-projection of the ten selected clusters in terms of the four CM parameters, i.e., cloud velocity v, optical depth τ, source function S, and Doppler width Δλ_D. The most conspicuous difference to Figure 10 is evident in the Doppler width Δλ_D, which no longer shows a continuous gradient. Instead, a vertical line with a small Doppler width is now located in the center of the t-SNE map with the largest values to the right and moderately large values to the left. Overall, the projection can be separated into two large clusters with high (Class 1) and low (Classes 2 and 3) values of the source function. The distinction between Classes 2 and 3 is mainly due to differences in the cloud velocities, i.e., blue- and redshifts of the spectral line. Closer inspection of the t-SNE projections of CM parameters hints at a subclass of profiles belonging to Clusters 2 and 5, which reside in a wedge-like structure in the bottom-right corner. In summary, the initial ten clusters can be reduced to two or three classes of spectral profiles, which are associated with chromospheric Hα absorption features.

5. Discussion

The visual classification of (solar) spectra has a long history and has led to a variety of terms describing spectra based on their morphology, e.g., line width, line depression, rest intensity, line gaps, line-wing or line-core emission, satellite components, multilobed profiles, etc. Thus, recovering some of these features or combinations thereof with machine learning is expected. However, interpreting the two-dimensional t-SNE maps is still subjective and requires an experienced scientist who brings together the morphology of the spectra and their relation to observed features on the Sun and their underlying physics.

t-SNE is not the only machine-learning algorithm that can be used for clustering or classification of spectral data. We inspected another method, i.e., uniform manifold approximation and projection (UMAP; McInnes et al. 2018),⁶ which belongs to the class of k-neighbor-based graph learning algorithms. It provides nonlinear dimensionality reduction and furnishes visualization of patterns or clusters in the data without having a priori knowledge of the labels in the data. The foundation of UMAP is largely based on manifold theory and topological data analysis. The initial results of UMAP (not presented here) indicate a very good performance in identifying profiles suitable for CM inversions. A full evaluation of UMAP will add another layer of complexity and is thus beyond the scope of this work. However, we see its potential when dealing with larger data sets encompassing observations of different features of the active chromosphere.

Only 27.9% (n_CM = 116,363) of the profiles n_s in the best scan produce reliable CM inversions, i.e., they have high linear and rank-order correlation coefficients between observed and CM-inverted contrast profiles (see Figure 10). The t-SNE projection with hexagonal bins is, in principle, a two-dimensional histogram. Each hexagonal bin contains a certain number of profiles, which are then averaged as demonstrated in Figure 4. Applying a threshold to the suitability parameter (Section 4.4) yields significantly fewer profiles for the ten clusters in Figure 12, i.e., only 8.9% of n_s (n₁₀ = 37,033). Computing the intersection n_CM ∩ n₁₀ = 36,392 indicates that fewer than 1000 profiles, which were previously deemed unsuitable for CM inversions, share properties common to all profiles in a specific cluster. Selection of these profiles should not be considered as a misclassification but a result from the subjective thresholds applied to the correlation coefficients and to the representation of the suitability parameter as a real number (e.g., Figure 4).
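The bookkeeping behind these numbers is spelled out in the short calculation below. Only the three counts and the two percentages quoted above are used. The total n_s is not restated here and is therefore only estimated from the rounded percentage.

```python
n_cm = 116_363     # profiles with reliable CM inversions (27.9% of n_s)
n_10 = 37_033      # profiles in the ten selected clusters (8.9% of n_s)
n_both = 36_392    # profiles contained in both sets

n_new = n_10 - n_both                 # in a cluster but previously deemed unsuitable
n_s_estimate = n_cm / 0.279           # total implied by the rounded percentage
fraction_10 = n_10 / n_s_estimate     # consistency check against the quoted 8.9%

print(n_new)                          # 641
print(round(100 * n_new / n_10, 1))   # 1.7 (percent of the clustered profiles)
```

Thus, 641 profiles, or about 1.7% of the clustered profiles, enter the clusters although they did not pass the correlation criteria.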

Even though it is not possible to use t-SNE directly for CM inversions, because t-SNE performs a nonparametric mapping of input data, reliably identifying quiet-Sun profiles already significantly reduces the computing time of the CM inversions depending on the quiet-Sun fraction within the FOV. However, van der Maaten (2009) suggests "parametric" t-SNE as an enhancement, which implements a regressor that minimizes the t-SNE loss function directly so that new data can be incorporated. Thus, new spectral profiles can be inverted using representative CM parameters from their neighborhood in the two-dimensional t-SNE projections as a starting point. This will speed up the iterative part of CM inversions (see Dineva et al. 2020) and prevent the Levenberg–Marquardt least-squares minimization scheme (Markwardt 2009) from getting trapped in local minima. Currently, the CM inversions as implemented in Dineva et al. (2020) take about five hours on an AMD Opteron 6378 processor for a single spatio-spectral data cube. Only some IDL built-in functions and procedures use multi-threading in the CM inversion, i.e., other means of parallel computing are not employed, in contrast to the multicore implementation of t-SNE. Thus, computing times for t-SNE projections and CM inversions are not directly comparable and are only given for reference. Parametric t-SNE and other approaches of embedding new data into existing t-SNE projections, which promise further computational gains, will be explored in forthcoming studies with even more diverse spectral profiles and larger data sets.
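The idea of seeding an inversion from the neighborhood in the projection can be sketched as follows. This presupposes that map coordinates for the new profile are already available, e.g., from a parametric mapping, which is not part of this sketch. The function name, the number of neighbors `k`, and the use of the median as the representative value are assumptions.

```python
import numpy as np


def initial_cm_guess(embedding, cm_params, new_point, k=10):
    """Median CM parameters of the k nearest already-inverted profiles in the 2D map.

    embedding: (n, 2) map coordinates of profiles with reliable CM inversions
    cm_params: (n, 4) array with columns (v, tau, S, dlambda_D) for those profiles
    new_point: (2,) map coordinates of the profile to be inverted
    """
    dist2 = np.sum((embedding - new_point) ** 2, axis=1)
    k = min(k, len(dist2))
    nearest = np.argpartition(dist2, k - 1)[:k]   # indices of the k smallest distances
    return np.median(cm_params[nearest], axis=0)
```

The returned vector would replace a generic starting point in the least-squares minimization. Whether this actually shortens the inversions remains to be tested.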

In Verma et al. (2020), we used the same data to study the homologous surge activity related to the continuous flux emergence and strong proper motions in the region. We discussed the surge in terms of changes in high-resolution noise-stripped spectral and contrast profiles and followed the changes in higher atmospheric layers using UV and EUV data from the Solar Dynamics Observatory (SDO; Pesnell et al. 2012). We noticed that the surge is not a single entity but consists of regions with different optical thickness and velocity values (see Figure 11 in Verma et al. 2020). Similar structures are revealed in the two-dimensional line-core intensity map with ten back-projected cluster profiles (middle panel of Figure 12). The surge is located at x = [84'', 100''] and y = [34'', 64'']. The different parts of the surge are traced by different clusters. The tip of the surge is traced by Cluster 4, the middle part is marked by Cluster 5 with outside edges covered by Cluster 8, whereas the lower part is traced by Cluster 2 interleaved with Cluster 5. The distinctive shape of contrast profiles for each of the mentioned clusters is indicative of different plasma properties. Nóbrega-Siverio et al. (2016) discerned four different plasma populations in a surge in their 2.5-dimensional experiment simulating the emergence of magnetized plasma through the solar atmosphere. To perform a one-to-one comparison of the four Clusters 2, 4, 5, and 8 indicated in the surge with numerical modeling is beyond the scope of this work. However, the t-SNE projection unveiled the composite nature of surge plasma.

The goal of this study was to establish a framework for classifying high-resolution spectra using t-SNE. Such a framework is particularly relevant for "big data," not only because of the ease of its application but also because of huge databases attached to solar research infrastructures. The next step is to apply the presented framework to space data (e.g., the Hinode Spectro-Polarimeter, SP; Tsuneta et al. 2008). First, space data are not affected by seeing. Second, SP data comprise high-spatial resolution full-Stokes spectra, adding polarization properties as another dimension. Science cases for t-SNE applications include, for example, identifying profiles belonging to shear flows or selecting penumbral filaments carrying the Evershed flow. In addition, synoptic observations with moderate spatial and spectral resolution, e.g., vector-magnetogram data of the SDO, offer another treasure trove yet to be explored by machine learning. Although the full-Stokes spectra cover only a few wavelength points in the spectral line, it should be possible to identify and classify unique profiles belonging to solar features. The upcoming Chinese spacecraft Solar Hα Imaging Spectrometer (SHIS; Chen 2018), which recently received approval, will furnish Hα spectra without seeing contamination. Thus, it is ideally suited for testing and expanding our t-SNE based framework. Commissioning the 4 m Daniel K. Inouye Solar Telescope (DKIST; Tritschler et al. 2016) is imminent, and its comprehensive instrument suite will produce a plethora of data including high-spectral, -spatial, and -temporal resolution spectropolarimetric data covering multiple spectral lines.

Although t-SNE proves to be efficient in clustering high-dimensional data, human inference is required at each step to interpret the results. This study provides a framework and ideas on how to tailor a classification scheme toward specific spectral data and science questions. The level of complexity increases when moving from the simple question of whether spectral profiles can be successfully inverted to problems of relating classes of spectra to certain features of chromospheric activity. This exploratory work not only establishes t-SNE as a suitable tool to cluster high-resolution Hα spectra but also demonstrates that unsupervised machine-learning algorithms provide the means to explore the ever-increasing data volume in solar spectroscopy.

6. Conclusions

In the present work, we exploited the capabilities of t-SNE for classifying high-resolution Hα spectra. We answered the following questions: How does t-SNE depend on the Barnes-Hut parameter θ, perplexity p, and number of iterations n? Do t-SNE projections differ for good and bad seeing data? What is the best choice for input data to be passed to the t-SNE algorithm (i.e., preprocessing and initial dimensionality reduction)? Is t-SNE "as is" sufficient to recover meaningful physical properties or are iterations and combinations with other techniques needed?

As an unsupervised machine-learning algorithm, t-SNE is capable, without any a priori knowledge, of identifying (clusters of) Hα intensity or contrast profiles, which are suitable for CM inversions. Depending on the preponderance of quiet-Sun profiles, profiles characteristic of the active chromosphere were mainly pushed to the periphery of the t-SNE projection. The detailed parameter study revealed that the default values of the Barnes-Hut parameter θ = 0.5, perplexity p = 50, and number of iterations n = 1000 already yield very good results while being computationally efficient. Furthermore, the projections were in general comparable irrespective of the input data used. However, both noise-stripped contrast profiles and their PCA decomposition based on ten eigenvectors performed best. We devised different approaches for the classification of Hα spectra using either the full data set or selected profiles, which were suitable for CM inversions. Extracting clusters can be based on this latter criterion or on spectral-line properties. Spectral classes can be defined based on t-SNE projections of the CM parameters, and they can be validated by back-projection on slit-reconstructed images or maps of other physical parameters. Human inference is an essential part of the classification, which allowed us to reduce the initial number of ten clusters to two or three classes of Hα intensity and contrast profiles. The exact number of classes will depend on the observed scene on the Sun and on the science questions and goals.

In this exploratory study, t-SNE proved to be a valuable tool for spectral classification and spectral inversions. The large volume of spectroscopic data provided by major solar research infrastructures such as Hinode/SP and IRIS, and in the near future DKIST and SHIS, confronts solar physics with a data challenge. Machine-learning algorithms such as t-SNE are essential in overcoming limitations in analyzing bulk data. Developing novel machine-learning codes for solar physics applications may not be the right path, considering that much of the progress in machine-learning and artificial intelligence has arisen in the commercial sector. However, awareness of this rapidly developing sector and adaptation of existing codes promises expedient scientific returns.

The VTT at the Spanish Observatorio del Teide of the Instituto de Astrofísica de Canarias is operated by the German consortium of the Leibniz-Institut für Sonnenphysik in Freiburg, the Leibniz-Institut für Astrophysik Potsdam, and the Max-Planck-Institut für Sonnensystemforschung in Göttingen. This study was supported by grants DE 787/5-1 (CD, CK, IK, MV) and VE 1112/1-1 (MV) of the Deutsche Forschungsgemeinschaft (DFG) and by the European Commission's Horizon 2020 Program under grant agreements 824064 (ESCAPE: European Science Cluster of Astronomy & Particle Physics ESFRI Research Infrastructures) and 824135 (SOLARNET: Integrating High Resolution Solar Physics).

Facility: Vacuum Tower Telescope (Echelle spectrograph).

Software: Interactive Data Language (IDL), MPFIT (Markwardt 2009), Python, SolarSoft (Bentley & Freeland 1998; Freeland & Handy 1998), sTools (Kuckein et al. 2017), t-SNE (van der Maaten & Hinton 2008).
