Capturing the Physics of MaNGA Galaxies with Self-supervised Machine Learning

Regina Sarmiento; Marc Huertas-Company; Johan H. Knapen; Sebastián F. Sánchez; Helena Domínguez Sánchez; Niv Drory; Jesus Falcón-Barroso

doi:10.3847/1538-4357/ac1dac

1. Introduction

Over the past few decades, the sizes of the available data sets in astronomy have grown by orders of magnitude (e.g., Sloan Digital Sky Survey, SDSS, Gunn et al. 2006; COSMOS, Taniguchi et al. 2007), and the trend is expected to accelerate in this new decade (e.g., Vera C. Rubin Observatory Legacy Survey of Space and Time, Ivezić et al. 2019; Euclid, Laureijs et al. 2011; Nancy Grace Roman Space Telescope, Spergel et al. 2015). As data sets grow not only in size but also in complexity, efficient exploratory methods are needed to boost discovery. Dimensionality reduction techniques are vital in this respect (Baron & Ménard 2021; Portillo et al. 2020). However, astronomy data present some specific problems, such as noise and complex selection effects, that challenge the use of standard techniques out of the box. This is a field where recent advances in machine learning can undoubtedly be of major utility.

We explore in this work the application of self-supervised contrastive learning as a tool to explore multidimensional astronomy data. We employ it to analyze maps derived from the SDSS-IV (Blanton et al. 2017) Mapping Nearby Galaxies at APO (MaNGA; Bundy et al. 2015) survey of close to 10⁴ nearby galaxies with dedicated integral field spectroscopic fiber bundles. This is a particularly complex survey in which multiple sampling sizes are used, and it presents a complex selection function.

Contrastive learning has rapidly emerged in the machine-learning community as a powerful solution to finding meaningful representations of data. As opposed to more classical dimensionality reduction approaches, such as principal component analysis (PCA; Pearson 1901) or autoencoders (Rumelhart et al. 1986), which aim at projecting data into a low-dimensional space to minimize the reconstruction error, contrastive learning maximizes the agreement between different transformations of the same data point with no reconstruction involved in the process. This is particularly well adapted to data sets where observational effects can make objects with similar physical properties appear very different (e.g., due to point-spread function variations, sampling, signal-to-noise ratio (S/N), etc.). The representations learned can be used not only for exploration but also for downstream tasks such as anomaly detection or classification with the advantage of reducing the sizes of the training sets. A major breakthrough of last year (i.e., 2020) was indeed that a supervised classification based on contrastive-learning representations achieved better accuracy than a purely convolutional neural network (CNN)–based supervised classification on the ImageNet database (Chen et al. 2020b). Even more recently, Abul Hayat et al. (2021) demonstrated similar behavior for astronomical data. They analyzed multiband optical images from the SDSS and found that photometric redshifts and galaxy morphology classifications can be estimated with the same accuracy as a completely supervised approach but using half the number of labels during training.

In this work, we use contrastive learning to explore the multidimensional kinematic and stellar population maps of galaxies derived from the MaNGA data set. This is the first time contrastive learning has been applied to integral field spectroscopy data. We use prior knowledge on observational biases in order to remove these artificial features and extract physical information from the data. We aim to explore to what extent physically relevant parameters and relations can be reproduced or uncovered automatically. Testing such algorithms on MaNGA data allows us to compare our findings with results obtained within the survey project using more traditional analysis tools. This is an important validation step that must be taken before we can consider using self-supervised analysis on larger, more complicated, or newer data sets.

This paper is organized as follows. In Section 2, we describe the data we used, and in Section 3, we discuss the machine-learning software applied. Section 4 describes our results in terms of visualization, regression, feature decomposition, and clustering. The implications of our work are discussed in Section 5 and summarized in Section 6.

2. Data

We use data products from the MaNGA survey (Bundy et al. 2015; Drory et al. 2015; Aguado et al. 2019). MaNGA is one of the three core surveys in the fourth-generation SDSS (SDSS-IV; Smee et al. 2013; Blanton et al. 2017). It uses 17 fiber bundle integral field units (IFUs) that vary in diameter from 12'' (19 fibers) to 32'' (127 fibers of 2'' diameter) to observe a total of 10⁴ nearby galaxies (Law et al. 2015; Yan et al. 2016a; Wake et al. 2017), constituting a representative sample of all types of nearby galaxies in a redshift range 0.01 < z < 0.15. Data cubes are obtained with IFUs of 19, 37, 61, 91, and 127 fibers, matched to the galaxy size, with spectra covering the range 360−1000 nm in wavelength at a spectral resolution R ∼ 2000 (Law et al. 2016; Yan et al. 2016b).

The data set used in this work consists of postprocessed maps derived from the MaNGA MPL-10 data release using the Pipe3D pipeline (v3.0.1b; Sánchez et al. 2015, 2016). Pipe3D performs a spectral fit of a combination of three synthetic components: stellar populations, dust, and gas. The pipeline produces maps by deriving the properties of the averaged stellar population. To test the performance of the algorithm, we use integrated magnitudes derived from the postprocessed MaNGA data cubes, which are included in the DR15Pipe3D primary catalog.

The input data for which we aim at finding low-dimensional representations consist of five binned (2 × 2) and stacked hexagonal maps for each galaxy, namely, the V-band reconstructed images, luminosity-weighted age and metallicity maps, and radial velocity and velocity dispersion maps. Prior to the binning, the pixels with an S/N lower than 3 in the median intensity maps are set to zero. As the MaNGA IFUs have different numbers of fibers, the sizes of the resulting maps vary as well. Since CNNs typically require input maps of a fixed shape, we zero-padded the hexagonal maps to the same shape and size, 32 × 32. In order to deal with the different dynamic ranges, we scaled all maps linearly to the range [0, 1], except for the V-band image, to which we applied logarithmic and square-root functions prior to the linear normalization. Two different approaches to the normalization will be analyzed. On the one hand, we consider a normalization that is equivalent to a change of units that is uniformly applied to all of the galaxies (we will refer to it as the relative norm), as we aim to analyze the relative physical magnitudes among the galaxies. In this case, all maps have values between 0 and 1 but do not necessarily span the complete range. On the other hand, a normalization that forces each galaxy map to span the [0, 1] range is used to evaluate the effect of the spatial structures of the galaxies (the individual norm).
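The two normalizations and the zero-padding can be illustrated with a minimal NumPy sketch. The helper names (`pad_to_square`, `relative_norm`, `individual_norm`) and the survey-wide bounds `lo` and `hi` are hypothetical and are not taken from the pipeline used in this work; the order of padding and scaling, the centering of the padded map, and the clipping guard are likewise assumptions.

```python
import numpy as np

def pad_to_square(m, size=32):
    """Zero-pad a 2D map to (size, size), keeping it centered (assumed layout)."""
    h, w = m.shape
    top, left = (size - h) // 2, (size - w) // 2
    out = np.zeros((size, size), dtype=float)
    out[top:top + h, left:left + w] = m
    return out

def relative_norm(m, lo, hi):
    """Linear rescaling with the same (lo, hi) for every galaxy: a change of units."""
    # The clip is only a guard for values outside the assumed survey-wide bounds.
    return np.clip((m - lo) / (hi - lo), 0.0, 1.0)

def individual_norm(m):
    """Linear rescaling so that this particular map spans the full [0, 1] range."""
    lo, hi = m.min(), m.max()
    if hi == lo:  # flat map: nothing to stretch
        return np.zeros_like(m, dtype=float)
    return (m - lo) / (hi - lo)

# Toy 2 x 2 "age map" with hypothetical survey-wide bounds of 0 and 10.
toy_map = np.array([[1.0, 2.0], [3.0, 5.0]])
rel = pad_to_square(relative_norm(toy_map, lo=0.0, hi=10.0))  # keeps relative values
ind = pad_to_square(individual_norm(toy_map))                 # keeps only structure
```

In this toy case the relative norm leaves the map within [0.1, 0.5], whereas the individual norm stretches it to the full [0, 1] range, which is the distinction exploited in Section 5.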

Our final data set consists of 9507 MaNGA galaxies, of which 1099, 2235, 2235, 1122, and 2816 were observed with a 19-, 37-, 61-, 91-, and 127-fiber bundle, respectively.

3. Deriving Meaningful Representations of MaNGA Galaxies

3.1. Contrastive Model

We use the Simple framework for Contrastive Learning of visual Representations (SimCLR; Chen et al. 2020a), a contrastive-learning framework designed to extract meaningful representations of the data. More precisely, we implement the model in TensorFlow (Abadi et al. 2015) and follow the findings on optimization by Chen et al. (2020a) with minor changes in the architecture.

This algorithm optimizes representations that keep the semantic, or physical, information of the data by maximizing the agreement between the representations of a data object and its transformed pair (x, x^T). This is particularly beneficial when similar objects appear to be different due to observational biases. By selecting transformations that simulate these observational biases, the model is encouraged to neglect these differences while capturing the common remaining features, which correspond solely to the galaxy. Likewise, agreement is minimized when representations originate in different galaxy maps. This way, the program is self-taught to recognize meaningful patterns present in the data.

More specifically, the model consists of a CNN block followed by a set of fully connected layers, after which the contrastive loss l is computed. The CNN block acts as a base encoder that outputs the representations, while the fully connected layers project the representations to the comparison space. We will refer to these last fully connected layers as the head projection function. While both components are trained together, the head projection is removed to obtain the final galaxy representations. Chen et al. (2020a) found that this configuration yields better representations in the previous layers.

The CNN base encoder consists of four convolutional layers with kernel sizes of 5, 3, 3, and 3 and 128, 256, 512, and 1024 filters each. Max-pooling layers and exponential linear unit activation functions (Clevert et al. 2016) are used between the convolutional layers. The output of this set of layers encodes the array-like representations of the galaxies of 1024 features each.

The projection head is composed of three fully connected layers (512, 128, and 64 neurons each). This nonlinear function takes the galaxy representations to a latent space where the contrastive loss is computed.

The contrastive loss is calculated for each L²-normalized representation z_i in the contrastive space with the following equation:

$$l_{i,j} = -\log \frac{\exp\left(\langle \boldsymbol{z}_{i}, \boldsymbol{z}_{j}\rangle / h\right)}{\sum_{k=1,\,k\ne i}^{2N} \exp\left(\langle \boldsymbol{z}_{i}, \boldsymbol{z}_{k}\rangle / h\right)}, \tag{1}$$

where 〈, 〉 is the dot product, and h is a temperature factor that regulates the distribution of the output representations by decreasing or increasing its concentration (Hinton et al. 2015; Wu et al. 2018); in our case, h is set to 0.5. Here z_j denotes the representation of the augmented view of x_i, while k sweeps the other 2N − 1 representations in the batch.
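As an illustration of Equation (1), the following NumPy sketch evaluates the loss for a toy batch. It is not the TensorFlow implementation used in this work: the function name `nt_xent_loss`, the convention that rows 2k and 2k+1 hold the two views of the same object, and the averaging over all 2N anchors are assumptions made for the example.

```python
import numpy as np

def nt_xent_loss(z, h=0.5):
    """Mean of l_{i,j} in Equation (1) over a batch of 2N projected views.

    z has shape (2N, d); rows 2k and 2k+1 are assumed to be a positive pair.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2 normalization
    sim = z @ z.T / h                                 # <z_i, z_k> / h
    np.fill_diagonal(sim, -np.inf)                    # exclude the k = i term
    # Numerically stable log of the denominator (log-sum-exp over k != i).
    m = sim.max(axis=1, keepdims=True)
    log_denom = m[:, 0] + np.log(np.exp(sim - m).sum(axis=1))
    n = z.shape[0]
    partner = np.arange(n) ^ 1                        # 0<->1, 2<->3, ...
    positive = sim[np.arange(n), partner]             # <z_i, z_j> / h
    return float(np.mean(log_denom - positive))

# Two objects whose views agree perfectly, versus the same vectors mispaired.
aligned = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
mispaired = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
loss_aligned = nt_xent_loss(aligned)
loss_mispaired = nt_xent_loss(mispaired)
```

The aligned batch gives the lower loss, as expected: the numerator of Equation (1) is largest when the two views of an object coincide, while the remaining 2N − 2 terms act as negatives.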

This loss, first introduced in Sohn (2016), does not assume a prior distribution of the representation space and therefore can be used to learn representations of both continuous and discrete data sets. The model benefits from a large number of negative examples and long training times (Chen et al. 2020a). Our model was trained with a batch size of 1024 for 500 epochs, which takes 40 minutes of computing time on one GeForce RTX 2080 Ti GPU with NVIDIA driver version 450.102.04.

3.2. Augmentations

Data augmentation is a common practice in machine learning that consists of applying transformations to the input data during the training step. The goal is to create different views of the objects while keeping their semantic information. In a supervised approach, data augmentation can be used to enlarge the training sample, since the different views of the data will have the same labels. In the contrastive-learning framework, augmentations are essential to create positive pairs of objects; these are the transformed images that originate from the same image.

When generating meaningful representations of the data in our machine-learning approach, we aim for galaxies that are similar in intrinsic physical parameters to appear close to each other in the representation space. A standard but nonoptimal feature decomposition of the data may reproduce features that are intrinsic to the data but not representative of the galaxies. The possible origin of such features is diverse and may be related to the survey design, instrumental effects, the pipeline used to process the data, or to observational constraints, such as orientation or apparent size of the target. Therefore, the augmentation functions we used during the training of the CNN were designed for the MaNGA data set and with the purpose of mitigating nonphysical dependencies as detailed below and illustrated in Figure 1.

1. Model variation. To prevent the model from learning modeling systematics that could be introduced by the pipeline, a transformation that replaces the input set of maps by a set of differently modeled maps is included. With this transformation, the kinematic maps are replaced by those obtained from the MaNGA Data Analysis Pipeline (DAP), which uses Voronoi binning to ensure S/N = 10 in the g-band (Cappellari & Copin 2003) and the MILES template library for the stellar continuum fitting (Sánchez-Blázquez et al. 2006; Falcón-Barroso et al. 2011). The luminosity-weighted age and metallicity maps are replaced by the FIREFLY catalog (Goddard et al. 2017; Neumann et al. 2021) analogs. The V-band reconstructed image is replaced by the scale-matched SDSS r-band photometric image,⁸ which represents a change of view of the galaxy morphological features rather than a modeling variation.
2. Random flip and rotation (0°, 90°, 180°, 270°). Since there is no preference in the orientation of the galaxies, the model should be invariant to the galaxy position. We quantify this dependence with the galaxy's position angle (PA) and rotation angle (RA). While the first is used as listed in the DR15Pipe3D catalog, the second was estimated from each velocity map as the perpendicular angle to the direction defined by the mean positions of the negative and positive pixels separately.
3. Noise perturbation. The data are affected by errors inherent to the instrument and the postprocessing of the observations. This transformation aims to compensate for measurement uncertainties by shifting the values of the map pixels. The transformation consists of multiplying each value by a random normal factor $\mathcal{N}(1,\sigma)$, with σ adjusted to the expected error for each pixel of each map. The σ maps are calculated as the averaged S/N maps for each channel.⁹
4. Resize. Since the model uses a fixed size for the input, IFUs with a smaller number of fibers will be more intensively zero-padded than IFUs with more fibers. Even though the IFU size used for each galaxy is linked to its apparent size, the MaNGA survey has an uneven sampling of galaxy angular sizes to IFU size. The IFU size is chosen for each galaxy by optimizing 1.5R_e angular coverage of the target object in the primary sample and 2.5R_e in the secondary sample (Bundy et al. 2015), which means that the ratio ${D}_{\mathrm{IFU}}[\mathrm{arcsec}]/{R}_{{\rm{e}}}[\mathrm{arcsec}]$ ranges from 1.5 to 2.5 when combining the primary and secondary samples. To prevent the algorithm from considering the IFU size and resolution as relevant features, the resize transformation randomly enlarges or reduces the image size by up to 25% of the side length of the original frame. Cubic interpolation is used.
5. Gaussian blur. A 2D Gaussian kernel of 3 × 3 pixels² with σ [pixels] drawn from $\mathcal{U}[0.1,2)$ is convolved with each layer of the data input. This encourages the model to become invariant to different resolutions and sharp artificial features in the data maps.
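The rotation angle described in item 2 can be sketched as follows. This is one possible reading of the definition, not the code used in this work: the function name `rotation_angle`, the use of unweighted pixel positions, and the angle convention (degrees from the array x-axis, folded into [0°, 180°)) are all assumptions.

```python
import numpy as np

def rotation_angle(vel):
    """Angle (deg, in [0, 180)) perpendicular to the line joining the mean
    positions of the negative and positive pixels of a velocity map."""
    ys, xs = np.indices(vel.shape)
    pos, neg = vel > 0, vel < 0
    if not pos.any() or not neg.any():
        return float("nan")  # no velocity gradient to measure
    dx = xs[pos].mean() - xs[neg].mean()
    dy = ys[pos].mean() - ys[neg].mean()
    kinematic_axis = np.degrees(np.arctan2(dy, dx))
    return float((kinematic_axis + 90.0) % 180.0)

# Toy velocity field receding on the right and approaching on the left.
_, xs = np.indices((5, 5))
toy_vel = (xs - 2).astype(float)
ra = rotation_angle(toy_vel)
```

For this toy field the positive and negative pixels are separated along the x-axis, so the perpendicular direction lies at 90° in the assumed convention.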

**Figure 1.** Augmentations used during training and their effect on the different input channels of an example galaxy. The leftmost column shows the original five maps of the example galaxy in dimensional units; from top to bottom, V-band reconstructed image, luminosity-weighted age and metallicity, radial velocity, and velocity dispersion. The columns to the right show the resulting maps after each transformation has been applied, namely, model variation, rotation, resize, noise perturbation, and blur.

While the noise perturbation, resize, and Gaussian blur transformations have a 50% probability of being applied to the raw data input copy, random flip and rotation is always applied for a better performance. The model variation augmentation is applied with a lower probability, since the Voronoi tessellation of the stellar maps could affect the performance of the kernels of the CNN. Therefore, this augmentation randomly replaces 25% of the copies.
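The way these probabilities combine into one augmented view can be sketched in NumPy. This is a simplified illustration rather than the training code: the function names and the `alt` argument (a differently modeled set of maps) are hypothetical, the order of the transformations and the 50% flip probability are assumptions, and the resize and blur steps are omitted for brevity.

```python
import numpy as np

def random_flip_rotate(x, rng):
    """Rotate an (H, W, C) stack by 0, 90, 180, or 270 degrees; flip half of the time."""
    x = np.rot90(x, k=int(rng.integers(4)), axes=(0, 1))
    if rng.random() < 0.5:  # assumed flip probability
        x = x[:, ::-1, :]
    return x

def noise_perturbation(x, sigma, rng):
    """Multiply each pixel by a factor drawn from N(1, sigma); sigma has the shape of x."""
    return x * rng.normal(loc=1.0, scale=sigma)

def make_view(x, sigma, rng, alt=None):
    """Build one augmented view of the five-channel input x."""
    view = x
    if alt is not None and rng.random() < 0.25:  # model variation, 25% of the copies
        view = alt
    if rng.random() < 0.5:                       # noise perturbation, 50% probability
        view = noise_perturbation(view, sigma, rng)
    return random_flip_rotate(view, rng)         # always applied

rng = np.random.default_rng(0)
maps = rng.random((32, 32, 5))           # stand-in for one galaxy's five maps
sigma = np.full(maps.shape, 0.05)        # stand-in for the per-pixel error maps
view_a = make_view(maps, sigma, rng)
view_b = make_view(maps, sigma, rng)     # (view_a, view_b) would form a positive pair
```

Applying the noise before the rotation keeps the σ maps aligned with the pixels they describe, which is why that order was chosen for this sketch.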

4. Removing Instrumental Biases: Comparison with PCA

We now assess how the learned representations correlate with the known physical properties of the data. For this purpose, the input data normalized to keep relative values between galaxies are used. We first visually explore how galaxies are distributed in the representation space. We then illustrate a clustering application to characterize groups in specific regions of the representation space.

4.1. 2D Visualization of Representations

In order to easily visualize the representation space, we perform a dimensionality reduction of the representations of dimension 1024 to a 2D space using Uniform Manifold Approximation and Projection (UMAP; McInnes et al. 2018) and color-code using different physical and instrumental properties. This is shown in Figure 2. We emphasize that the UMAP representation is only used here for visualization purposes. The actual dimensionality of the representation space is higher than these two dimensions. Our goal is to understand how galaxies are organized in the representation space and, more precisely, whether nonphysical dependencies have been properly removed. We therefore include, on the one hand, the parameters related to instrumental effects (two left panels in Figure 2), which are (top to bottom) the number of fibers in the IFU, galaxy angular size ( ${R}_{{\rm{e}}}[\mathrm{arcsec}]$ ), number of zero pixels, PA and RA of the galaxy, angular coverage ( ${D}_{\mathrm{IFU}}[\mathrm{arcsec}]/{R}_{{\rm{e}}}[\mathrm{arcsec}]$ ), and redshift z, to account for physical resolution. On the other hand, we include a set of integrated physical properties derived from the input maps (two right panels), namely (top to bottom), the effective radius (R_e [kpc]), specific angular momentum within 1.5R_e ( ${\lambda }_{{R}_{{\rm{e}}}}$ ), radial velocity dispersion at the center of the frame (σ_cen), luminosity-weighted age at R_e, and luminosity-weighted metallicity at R_e normalized by the solar metallicity ( ${[{\rm{Z}}/{\rm{H}}]}_{{R}_{{\rm{e}}}}$ ). We also include the slopes of the gradients within 0.5–2R_e of the last two parameters ( ${\rm{\nabla }}{\mathrm{age}}_{{R}_{{\rm{e}}}}$ and ${\rm{\nabla }}{[{\rm{Z}}/{\rm{H}}]}_{{R}_{{\rm{e}}}}$ ).

To better quantify the efficiency of the removal of instrumental dependencies, we contrast representations obtained by a PCA with those obtained with the SimCLR algorithm. For the first set, the five MaNGA maps are decomposed into 1024 principal components to match the SimCLR representation dimensions. In all panels, the distribution of the 2D projection of the PCA feature decomposition is displayed on the left (labeled with "a" in the top right corner), while the representations learned by contrastive learning are on the right (labeled with "b" in the top right corner). These projections are simply obtained by applying the UMAP to the features obtained with PCA on the one hand and SimCLR on the other.
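A minimal sketch of the PCA baseline is given below, assuming that the five maps of each galaxy are flattened into a single vector before the decomposition; the source does not state how the maps were arranged, and the function name `pca_features` is hypothetical. The decomposition is written with a plain singular value decomposition so that the example depends only on NumPy.

```python
import numpy as np

def pca_features(maps, n_components):
    """Project flattened map stacks of shape (N, H, W, C) onto their leading principal components."""
    X = maps.reshape(len(maps), -1).astype(float)
    X = X - X.mean(axis=0)                       # center each pixel/channel feature
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    k = min(n_components, vt.shape[0])           # cannot exceed the available rank
    return X @ vt[:k].T

# Toy stand-in for the data set: 20 "galaxies" with 4 x 4 maps in five channels.
rng = np.random.default_rng(0)
toy_maps = rng.random((20, 4, 4, 5))
toy_features = pca_features(toy_maps, n_components=10)
```

For the real data, the input would have shape (9507, 32, 32, 5) and `n_components` would be 1024. The resulting feature matrix, like the SimCLR representations, could then be passed to a UMAP implementation (for instance `umap.UMAP(n_components=2).fit_transform` from the umap-learn package) to obtain the 2D projections.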

The principal components directly projected via UMAP are primarily organized based on instrumental and observational parameters (Figure 2). The main features that drive the organization of the PCA space are indeed the number of zero pixels, the number of fibers on the IFU, and the angular sizes of the galaxies (panels (a1)–(a3)). Although some slight dependencies on these parameters are recognizable in the SimCLR representation space, they are remarkably reduced (panels (b1)–(b3)). More precisely, we find that the principal components are locally organized according to the orientation parameters (PA and RA) in panels (a4) and (a5), while the representations learned through the contrastive algorithm remove these orientation dependencies efficiently (panels (b4) and (b5)). We notice, however, that the angular coverage dependencies were not completely removed (panel (b6)); an overdensity of the MaNGA secondary sample is visible in the representation space. Panel (b7) also shows a dependence of the representations on redshift; this could be due to the fact that the MaNGA survey has an overpopulation of massive galaxies toward higher z (Wake et al. 2017). When not considering a volume-limited sample, using z to quantify nonphysical dependencies could lead to the conclusion that representations depend on spatial resolution, whereas the features learned actually reflect physical properties. Therefore, even if redshift indicates a physical resolution dependency, it is not a reliable indicator for this particular data set.

The rightmost column of Figure 2 shows that the SimCLR representation space smoothly transitions as a function of physical properties: age, metallicity, and central velocity dispersion (panels (b11), (b12), and (b10)). Furthermore, parameters that are less trivial to obtain from the maps, such as ${\lambda }_{{R}_{{\rm{e}}}}$ (b9) and ${\rm{\nabla }}{\mathrm{age}}_{{R}_{{\rm{e}}}}$ (b13), are also learned by the model and show smooth transitions in the SimCLR representations. In general, the distribution of the principal components also shows a clear dependence on these parameters, but in this space, galaxies that share the same properties tend to appear in separate areas of the diagram due to instrumental biases.

These results confirm that the SimCLR representations efficiently remove the complex instrumental and observational biases present in the MaNGA survey data and primarily organize galaxies based on physical parameters.

4.2. Unsupervised Clustering of Kinematics and Stellar Populations of Galaxies

Based on the encouraging results of the previous subsection, we further investigate how the sample of galaxies is clustered in the representation space by dividing the sample into groups using the K-means clustering algorithm (MacQueen 1967; Pedregosa et al. 2011) on the 1024 dimensions of the representation space. K-means is a standard iterative algorithm to cluster unlabeled data by minimizing the sum of the squared Euclidean distances of the data points within each cluster to their average position, or centroid. The number of clusters provided by the user determines the number of centroids that are initialized. We assume the same distance metric in the representation space as the one used to calculate the contrastive loss in the head projection space and, therefore, normalize the representations with the L² norm such that the Euclidean and cosine (as [1 − 〈a, b〉], where a, b are two array-like representations) distances between representations are equivalent.
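The equivalence invoked here follows from a one-line identity. For two representations with unit L² norm, $\|\boldsymbol{a}\| = \|\boldsymbol{b}\| = 1$, the squared Euclidean distance expands as

$$\|\boldsymbol{a}-\boldsymbol{b}\|^{2} = \|\boldsymbol{a}\|^{2} + \|\boldsymbol{b}\|^{2} - 2\langle \boldsymbol{a},\boldsymbol{b}\rangle = 2\left(1 - \langle \boldsymbol{a},\boldsymbol{b}\rangle\right),$$

so the squared Euclidean distance is exactly twice the cosine distance. The two distances therefore rank pairs of galaxies identically, and minimizing the K-means objective on the normalized representations is consistent with the similarity measure of the contrastive loss.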

The resulting clusters represent a first-order division of representation space and will allow further understanding of the underlying astrophysical properties that drive the space distribution by studying the galaxy groups. Since our representation space is continuous in nature, this exercise will not lead to a sharply defined, discrete classification. Nevertheless, we define a criterion to obtain a repeatable and physically driven classification, allowing further study of the galaxy groups thus identified. To define the number of clusters, we require the division to be robust; i.e., it should be independent from the training run and initialization parameters. Additionally, the clusters must be independent of features that are not physically relevant.

To ensure a robust division, we train the SimCLR framework 10 independent times such that the input maps are perturbed differently in each run. By construction, the features obtained in each run are not exactly the same, since the training has some inherent stochastic behavior. Nevertheless, our assumption is that the representation spaces must encode the same information irrespective of the initialization, such that neighboring galaxies in one run should remain neighbors in the other resulting spaces. We set the first run as the reference run and maximize the matching clusters of the remaining runs with the linear assignment algorithm implemented in SciPy (Crouse 2016; Virtanen et al. 2020). As the clusters of each run will not necessarily be identified under the same labels by the clustering algorithm, the linear assignment algorithm provides a new set of cluster labels such that the number of galaxies that fall in the same groups as the reference run is maximized. An agreement score is defined as the fraction of objects that are consistently placed in the same group in at least 90% of the runs. Although this score might appear too restrictive when a large number of clusters is considered, it shows how reproducible the resulting groups are. We consider the divisions to have good agreement when the agreement score is over 85%.
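The label matching and the agreement score can be sketched with `scipy.optimize.linear_sum_assignment`. The function names below are hypothetical, and the score follows one reading of the definition above: an object counts as consistent when its matched label equals the reference label in at least 90% of the runs, the reference run included.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_labels(ref, labels, n_clusters):
    """Relabel one run so that its overlap with the reference run is maximal."""
    # overlap[i, j] = number of galaxies with label i in this run and label j in the reference
    overlap = np.zeros((n_clusters, n_clusters), dtype=int)
    np.add.at(overlap, (labels, ref), 1)
    rows, cols = linear_sum_assignment(overlap, maximize=True)
    mapping = np.empty(n_clusters, dtype=int)
    mapping[rows] = cols
    return mapping[labels]

def agreement_score(runs, n_clusters, min_frac=0.9):
    """Fraction of objects placed in the reference group in at least min_frac of the runs."""
    ref = runs[0]
    aligned = np.array([ref] + [match_labels(ref, r, n_clusters) for r in runs[1:]])
    frac_same = (aligned == ref).mean(axis=0)
    return float((frac_same >= min_frac).mean())

# Toy example: the second run is the reference with permuted labels,
# and the third run additionally moves the last galaxy to another group.
ref_run = np.array([0, 0, 1, 1, 2, 2])
permuted = np.array([1, 1, 2, 2, 0, 0])
perturbed = np.array([1, 1, 2, 2, 0, 1])
score_perfect = agreement_score([ref_run, permuted], n_clusters=3)
score_partial = agreement_score([ref_run, perturbed], n_clusters=3)
```

In the toy example the pure label permutation is recovered exactly, while the run with one displaced galaxy lowers the score to 5/6.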

Finally, to account for undesired dependencies, e.g., on instrumental parameters, we compute the linear contribution of the nonphysical parameters to the classifications found. We perform a linear regression fit of the parameters to a binary classification for each cluster. One class contains galaxies that belong to the given cluster, while the other class is a random selection of nonmember galaxies. The fit is presented with the same number of examples of both classes, while a spare 20% of the sample is used to obtain the validation accuracy. The resulting coefficient associated with each parameter is then normalized to indicate its contribution percentage to the classification. We reject clusters when the accuracy of this fit is higher than 65%, in which case, we assume that the cluster is the result of instrumental and/or nonphysical effects.
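A minimal sketch of this test, written with an ordinary least-squares fit in NumPy, is given below. It is not the code used in this work: the function name `nonphysical_fit`, the standardization of the parameters (so that coefficients are comparable), the 0.5 decision threshold, and the assumption that nonmembers outnumber members are all choices made for the example.

```python
import numpy as np

def nonphysical_fit(params, member, rng, val_frac=0.2):
    """Fit cluster membership linearly from parameters of shape (N, P).

    Returns the validation accuracy and the percentage contribution of each parameter.
    """
    idx_in = np.flatnonzero(member)
    # Balance the classes with a random selection of nonmember galaxies.
    idx_out = rng.choice(np.flatnonzero(~member), size=len(idx_in), replace=False)
    idx = rng.permutation(np.concatenate([idx_in, idx_out]))
    X = params[idx].astype(float)
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)  # assumed standardization
    y = member[idx].astype(float)
    A = np.column_stack([X, np.ones(len(X))])           # add an intercept term
    n_val = int(round(val_frac * len(idx)))             # spare 20% for validation
    coef, *_ = np.linalg.lstsq(A[n_val:], y[n_val:], rcond=None)
    predicted = (A[:n_val] @ coef) > 0.5
    accuracy = float(np.mean(predicted == (y[:n_val] > 0.5)))
    weights = np.abs(coef[:-1])
    contribution = 100.0 * weights / weights.sum()
    return accuracy, contribution

# Toy case: membership is driven by the first parameter only.
rng = np.random.default_rng(0)
toy_params = rng.normal(size=(400, 2))
toy_member = toy_params[:, 0] > 0.5
toy_accuracy, toy_contribution = nonphysical_fit(toy_params, toy_member, rng)
```

Under the criterion stated above, a cluster whose membership can be predicted from nonphysical parameters with a validation accuracy above 65%, as in this toy case, would be rejected.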

For comparison purposes, we also perform a K-means clustering on a representation space derived through PCA on the five MaNGA maps. We note that the metrics chosen are compatible with the SimCLR framework, while this might not be the case for PCA. Therefore, a more favorable additional test is performed for the PCA decomposition. In this case, only the first 10 PCA parameters are selected, and each parameter is scaled to have a zero mean and a standard deviation of 1.

The agreement score and the most accurate fit to the parameters that are not physically relevant are shown in Table 1 for 14 divisions of the SimCLR representations and four divisions of the 1024 and 10 most relevant PCA features. For comparison, the relevance of the physical and nonphysical parameters in each cluster—computed through linear regression—are also listed in the last two columns.

Table 1. Cluster Division Performance

SimCLR
N clusters	% Agree	${\mathrm{Acc}}_{\max }$ Nonphys (%)	${\mathrm{Acc}}_{\mathrm{mean}}$ Phys (%)
2	89.419 ± 0.006	64 ± 2	93.1 ± 0.5
3	93.63 ± 0.03	60.5 ± 0.5	89.3 ± 0.5
4	62.5 ± 0.1	63 ± 2	86.4 ± 0.3
5	84.43 ± 0.08	68 ± 1	85.5 ± 0.6
6	85.04 ± 0.09	68 ± 1	84.1 ± 0.5
7	60 ± 1	70 ± 1	83.1 ± 0.5
8	58 ± 5	72 ± 1	82.4 ± 0.3
9	56 ± 2	72.9 ± 0.7	82.6 ± 0.5
10	58 ± 2	71 ± 1	81.9 ± 0.5
11	65 ± 3	71 ± 2	81.5 ± 0.9
12	47 ± 2	71 ± 2	80.9 ± 0.7
13	50 ± 3	72 ± 3	80.1 ± 0.7
14	51 ± 5	73 ± 3	79.1 ± 0.8
15	64 ± 8	75 ± 4	78.7 ± 0.9

PCA 1024
N clusters		${\mathrm{Acc}}_{\max }$ Nonphys (%)	${\mathrm{Acc}}_{\mathrm{mean}}$ Phys (%)
2		92.13 ± 0.09	72.28 ± 0.07
3		94.9 ± 0.8	71.3 ± 0.2
4		92.7 ± 0.7	64.7 ± 0.2
5		91.9 ± 0.3	69.5 ± 0.5

PCA 10
N clusters		${\mathrm{Acc}}_{\max }$ Nonphys (%)	${\mathrm{Acc}}_{\mathrm{mean}}$ Phys (%)
2		95 ± 3	72 ± 2
3		93 ± 2	67.4 ± 0.6
4		92 ± 1	68 ± 1
5		94.1 ± 0.5	64.2 ± 0.5

Note. For each division into N clusters, an agreement score is calculated from 10 different runs of the contrastive algorithm. The third column indicates the most accurate fit to the nonphysical parameters among the N clusters, while the fourth column shows the average accuracy of the fit to the integrated astrophysical parameters. The uncertainty for the agreement score is estimated as the standard deviation due to K-means initialization with 10 different seeds, while accuracy uncertainties were estimated as the standard deviation corresponding to five SimCLR independent runs with five different seeds each in the clustering step. The values that are within the agreement score or accuracy required are indicated in bold.


In all cases, the SimCLR representations yield group divisions that are more dependent on physical features than on those that are not physically relevant, thus confirming the qualitative results of the previous section. In contrast, the divisions derived from the PCA are highly dependent on nonphysical features (>90% correlation). In particular, the number of fibers in the IFU and the number of zeros in the maps show the strongest correlation with the group divisions in the PCA features.

The maximum number of clusters that fulfills our agreement and nonphysical accuracy requirements is three. This number also corresponds with the natural divisions of the UMAP maps of Figure 2, which clearly show a large cluster in the bottom left corner and a strong age trend in the right blob.

In the following, we only consider these three clusters and analyze their physical properties. Table 2 shows the correlation of the clusters with several integrated physical parameters. The strongest correlations are with velocity dispersion, age, physical size, and ${\lambda }_{{R}_{{\rm{e}}}}$. For better visualization, we show in Figure 3 four well-studied planes for galaxies: $\mathrm{log}\,{M}_{* }\mbox{--}\mathrm{log}\,\mathrm{SFR}$, ε–λ_R, $\mathrm{log}\,{M}_{* }\mbox{--}[{\rm{Z}}/{\rm{H}}]$, and $\mathrm{log}\,\sigma \mbox{--}\mathrm{log}\,\mathrm{age}$.

**Figure 3.** From left to right: stellar mass vs. SFR plane, ellipticity vs. specific angular momentum, stellar mass vs. LW metallicity, central velocity dispersion vs. LW age. The top row shows integrated parameters color-coded according to the three clusters found in the SimCLR representation space when the relative norm (see Section 2) is applied to the raw data. In the middle row, the cluster division was performed on the PCA 1024 feature space. The bottom row shows the division into two clusters of the SimCLR representation space when the individual norm is applied to the input data. Mean values of each cluster are displayed in solid colors, and error bars account for the standard deviation of the cluster. Groups found with SimCLR representations show less overlap in all four physical planes than groups found by clustering PCA features.

Table 2. Cluster Dependence with Integrated Parameters

Cluster	$\mathrm{log}(\sigma )$	LW ${\mathrm{Age}}_{{R}_{{\rm{e}}}}$	$\mathrm{log}({R}_{{\rm{e}}}[\mathrm{kpc}])$	${\lambda }_{{R}_{{\rm{e}}}}$	log( ${\mathrm{SFR}}_{{{\rm{H}}}_{\alpha }}$ )	LW ${\mathrm{ZH}}_{{R}_{{\rm{e}}}}$	∇LW ${\mathrm{ZH}}_{{R}_{{\rm{e}}}}$	∇LW ${\mathrm{Age}}_{{R}_{{\rm{e}}}}$
(1)	(2)	(3)	(4)	(5)	(6)	(7)	(8)	(9)
1	13.4% ± 3.7%	35.2% ± 1.6%	14.6% ± 1.4%	17.9% ± 1.9%	6.8% ± 0.6%	4.5% ± 0.8%	6.4% ± 0.3%	1.1% ± 0.3%
2	3.7% ± 2.2%	61.5% ± 3.1%	8.9% ± 0.4%	4.5% ± 1.1%	10.9% ± 1.2%	7.4% ± 1.4%	1.9% ± 0.8%	0.9% ± 0.5%
3	43.6% ± 1.6%	14.2% ± 1.6%	11.2% ± 1.2%	16.7% ± 1.5%	1.3% ± 0.5%	4.9% ± 0.7%	3.9% ± 0.4%	4.0% ± 0.3%

Note. Dependence with integrated parameters of the three groups found with the K-means clustering in the SimCLR representation space.


We clearly observe that the three different clusters identified in a purely unsupervised way populate different regions in the different planes. While star-forming disks (with $\langle \mathrm{log}\,\mathrm{age}\ [\mathrm{yr}]\rangle =9.9$ and 〈〉 = 0.36) with low metallicity (〈Z/H〉 = − 0.25) group in cluster 2, quenched fast-rotating galaxies with low-to-intermediate masses ( $\mathrm{log}\,{M}_{* }/{M}_{\odot }\lesssim 10.9$ ) can be found in cluster 1. Lastly, cluster 3 includes the massive ( $\mathrm{log}\,{M}_{* }/{M}_{\odot }\gtrsim 10.4$ ) slow-rotating galaxies. We emphasize that the star formation rate is not included in the maps used for the representation. Nevertheless, most of the main-sequence galaxies are associated with one unique cluster, which naturally suggests that star-forming galaxies share similar kinematic and stellar population properties over several decades of stellar mass. We will discuss this further in Section 6. For comparison, the bottom row of Figure 3 shows the same four physical planes obtained through clustering on the PCA space. Interestingly, we see that there is very little dependence on physical properties, which again highlights the effectiveness in extracting physical information of the SimCLR representations when compared to those of the PCA.

Additionally, we analyze the morphological features of the galaxies in each cluster using the MaNGA deep-learning DR-17 morphological catalog (MDLM-VAC; private communication, H. Domínguez Sánchez et al. 2021, in preparation). This catalog is an extension of the deep-learning DR-15 catalog presented in Fischer et al. (2019), and it was obtained with deep-learning models trained on SDSS-DR7 images (see Domínguez Sánchez et al. 2018 for more details on the methodology and model performance). Figure 4 shows how the galaxies distribute in each cluster according to the morphological parameter T type.

**Figure 4.** The T-type classification of all MaNGA galaxies (gray) and how they distribute across the unsupervised clusters (red, green, and blue). The colors of the clusters match those in the first row of panels in Figure 3. Early-type galaxies (T type < 0) are predominantly in the cluster that includes the massive slow-rotating galaxies. Galaxies belonging to the star-forming main-sequence cluster (blue) mostly distribute around the largest values of T type, while intermediate-mass quenched galaxies show intermediate T-type values.

We find that galaxies that have been classified as early types (T type < 0) belong predominantly to the cluster containing high-mass slow-rotating galaxies, while a smaller fraction belongs to the intermediate-mass quenched galaxy cluster. Galaxies with T types > 0 distribute in the clusters that include less massive galaxies, with the latest T types belonging to the star-forming cluster.

5. Internal Structure

In the previous analysis, the relative values of the observed maps were kept, since the normalization applied works as a uniform change of units across all galaxy maps. We now analyze how the representation space organizes when only the spatial features of the input maps are considered. For this purpose, a linear normalization is applied for each map individually, such that the values of each map span the [0, 1] range (individual norm). With this normalization, information about the absolute values of the different quantities is removed, and the model is forced to focus exclusively on the internal structure of the observed galaxies. The training process is repeated on the newly normalized input data using the same augmentations and model settings. We present the analysis of the representation space performed as in Section 4.
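As an illustration of the individual norm, the following minimal sketch rescales a single map linearly so that its values span [0, 1]. The function name `individual_norm` and the toy maps are hypothetical, and the handling of constant maps is an assumption; masked or invalid pixels, which real MaNGA maps contain, are not treated here.

```python
def individual_norm(galaxy_map):
    """Linearly rescale one 2D map (list of rows) so its values span [0, 1]."""
    values = [v for row in galaxy_map for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant map carries no spatial structure, so return zeros.
        return [[0.0 for _ in row] for row in galaxy_map]
    return [[(v - lo) / (hi - lo) for v in row] for row in galaxy_map]


# Two toy maps with the same spatial pattern but different absolute scales.
small = [[1.0, 2.0], [3.0, 5.0]]
large = [[10.0, 20.0], [30.0, 50.0]]

norm_small = individual_norm(small)
norm_large = individual_norm(large)
print(norm_small)
print(norm_large)
```

Both toy maps lead to the same normalized map, which shows how the absolute scale is removed while the spatial pattern is preserved.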

5.1. 2D Visualization of Representations

As in Section 4.1, the SimCLR representations obtained after training the model are projected to a 2D space with UMAP (McInnes et al. 2018) to visualize how the representation space correlates with known physical parameters. For comparison, the projections of both sets of normalized maps are presented together in Figure 5.

**Figure 5.** As in Figure 2, a 2D UMAP projection of SimCLR representations using relative norm (top row) and individual norm (bottom row). Color-coding corresponds to physical properties derived from the maps; from left to right, effective radius, specific angular momentum, central velocity dispersion, and luminosity-weighted age and metallicity and their gradients. Although the individual norm removes the relative information of these parameters from the maps, the spatial features still correlate with them.

While the relative norm representations show smoother transitions for central velocity dispersion, projected angular momentum, age, and metallicity (panels (R2)–(R5)), the correlation between these parameters and the individual norm representation space is not negligible. Furthermore, panels (I3)–(I5) show that young and metal-poor galaxies with low velocity dispersion tend to populate the left side of the maps, while old metal-rich galaxies with higher velocity dispersion primarily locate on the right-hand side.

5.2. Unsupervised Clustering of the Representation Space

As the representation space distribution changes when applying different norms to the input data, we analyze the dominant drivers of the new space by performing an unsupervised clustering.

In this representation space, we find that only a division into two clusters fulfills the agreement requirement defined in Section 4.2, with an agreement score of 90.89% ± 0.03%, while more partitions yield scores of <78%. The nonphysical parameter fits have comparable accuracy to the ones obtained with the relative norm. Specifically for the two-cluster division, this value is 67.6% ± 1.3%, which is only slightly larger than the 65% threshold imposed in the previous section. To avoid confusion with the clusters of the previous section, we refer to the new groups as clusters A and B.
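The agreement between two clustering runs can be quantified in several ways. As a minimal sketch, and assuming that the score is the percentage of galaxies that receive the same label once the arbitrary cluster indices of the two runs are matched (the exact definition is the one given in Section 4.2), it could be computed as follows. The function name `cluster_overlap` and the toy labels are hypothetical.

```python
from itertools import permutations


def cluster_overlap(labels_a, labels_b, n_clusters):
    """Percentage of objects with equal labels under the best relabeling of labels_b."""
    best = 0
    # Cluster indices are arbitrary, so try every relabeling of the second run.
    for perm in permutations(range(n_clusters)):
        matches = sum(1 for a, b in zip(labels_a, labels_b) if a == perm[b])
        best = max(best, matches)
    return 100.0 * best / len(labels_a)


# Toy labels from two hypothetical K-means runs on the same eight objects.
run_1 = [0, 0, 0, 1, 1, 1, 1, 0]
run_2 = [1, 1, 1, 0, 0, 0, 1, 1]

overlap = cluster_overlap(run_1, run_2, n_clusters=2)
print(f"overlap = {overlap:.1f}%")
```

The exhaustive search over permutations is only practical for a handful of clusters, which is the regime considered here (two or three groups).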

The distribution of the two clusters in the physical planes $\mathrm{log}\,{M}_{* }\mbox{--}\mathrm{log}\,\mathrm{SFR}$ , –λ_R, $\mathrm{log}\,{M}_{* }\mbox{--}[{\rm{Z}}/{\rm{H}}]$ , and $\mathrm{log}\,\sigma \mbox{--}\mathrm{log}\,\mathrm{age}$ is shown in the bottom row of Figure 3. While cluster A is constrained in the planes toward higher-mass, older, and more metal-rich regions (with average values $\langle \mathrm{log}\,{M}_{* }/{M}_{\odot }\rangle =10.6$ , $\langle \mathrm{log}\,\mathrm{age}[\mathrm{yr}]\rangle =9.5$ , and 〈Z/H〉 = −0.11), cluster B spreads across all of the planes with a lower concentration in the overlapping regions (with average values $\langle \mathrm{log}\,{M}_{* }/{M}_{\odot }\rangle =10$ , $\langle \mathrm{log}\,\mathrm{age}[\mathrm{yr}]\rangle =9$ , and 〈Z/H〉 = −0.23). The low-mass and star-forming galaxies are contained primarily in the latter cluster (cluster B). Although there is overlap between the two groups, a physical difference is captured and clearly visible in age, mass, and metallicity. This is interesting because the information about the absolute values has been removed; however, we still see that the main clusters found correlate with integrated properties such as stellar mass and velocity dispersion.

To further visualize the typical spatial features found in each group, we select the five galaxies whose representations are closest to the centroid of each cluster. The example galaxies of each cluster are shown in Figure 6. While the example galaxies from cluster A show an early-type morphology and negative gradients in velocity dispersion and metallicity maps, galaxies in cluster B tend to have gradients that are inverted compared to those in cluster A, and their V-band reconstructed images show structure and disk features compatible with later-type morphology.
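The selection of representative galaxies can be sketched as a nearest-to-centroid search. The example below assumes Euclidean distance in the representation space and computes the centroid as the mean of the cluster members, which coincides with the K-means centroid at convergence; the function name `closest_to_centroid` and the 2D toy vectors are hypothetical.

```python
import math


def closest_to_centroid(representations, n_examples=5):
    """Indices of the n_examples vectors nearest to the mean of all vectors."""
    n_objects = len(representations)
    n_dims = len(representations[0])
    centroid = [sum(r[d] for r in representations) / n_objects for d in range(n_dims)]
    # Assumption: Euclidean distance is used to rank the cluster members.
    distances = [math.dist(r, centroid) for r in representations]
    order = sorted(range(n_objects), key=lambda i: distances[i])
    return order[:n_examples]


# Toy 2D representations of the members of a single cluster.
members = [[-4.0, 0.0], [4.0, 0.0], [0.0, -2.0], [0.0, 2.0], [0.0, 1.0], [0.0, -1.0]]

closest = closest_to_centroid(members, n_examples=2)
print(closest)
```

In practice, the function would be applied separately to the representations assigned to each cluster, with `n_examples=5` as in the text.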

**Figure 6.** Example galaxies from clusters A and B (top and bottom panels, respectively). The V-band reconstructed image, age, metallicity, radial velocity, and dispersion maps (rows) were normalized with the individual norm for each galaxy (columns).

As this division roughly separates early-type galaxies from late types, the morphological features encoded in the V-band reconstructed images could be driving the clustering. Therefore, in Appendix A, we evaluate the influence of the kinematic and composition maps in the division. When the V-band reconstructed images are excluded from the input maps, we find that the division can be recovered with good agreement (88.621% ± 0.007% overlap with the previous clusters), indicating that the information of the four remaining maps is enough to find the two groups.

6. Summary and Discussion

6.1. Unsupervised Exploration of Multidimensional Data Sets

A key property of contrastive learning–based dimensionality reduction, in contrast with other techniques, is that representations are encouraged to become invariant to a set of transformations. We have shown that this feature has a lot of potential when the data set is nonhomogeneous and affected by known biases, since these can be tackled with convenient transformations in the contrastive framework.

Even though MaNGA maps include important instrumental effects, we confirm that the representations obtained in this work efficiently reduce the impact of these effects and focus on physical properties. This allows one to explore the physical correlations existing in the high-dimensional space.

Therefore, the presented framework might constitute an important tool to explore and discover patterns in future data sets. We have illustrated in this work how the representations organize based on five kinematics and stellar population maps. However, the method has no limit on the number of channels in the input data; thus, it is possible to extend the analysis to a larger number of maps, including, for example, gas properties or even complete IFU data cubes without postprocessing.

Additionally, because transformations included in the training procedure can be tuned, one could eventually combine data sets of different origins and selection effects. A natural extension of this work is thus the combination of IFU data from different surveys (e.g., MUSE, Henault et al. 2003; CALIFA, Sánchez et al. 2012; MaNGA, Bundy et al. 2015; SAMI, Croom et al. 2012) represented in a unique low-dimensional space. Combining observations and predictions from recent hydrodynamic cosmological simulations is also a promising avenue to identify possible discrepancies between the physical properties of observed and simulated galaxies.

In addition to visualization, the representations learned can be used in a variety of downstream tasks. Abul Hayat et al. (2021) and Chen et al. (2020b) found that this framework can obtain the same levels of accuracy for classification as a supervised approach but reduce the number of training labels. We present in this work one application that makes use of the representation space as a tool to explore the data set.

6.2. Clustering

Since contrastive learning locates objects with similar physical properties in nearby regions of the representation space, it can be used to identify groups of objects with shared physical properties in a purely data-driven manner. In Sections 4 and 5, we illustrated this procedure with two different normalizations of the MaNGA maps: with and without information about the absolute values of the physical properties. This allows us to study the interplay between the integrated properties of galaxies and their resolved internal structure.

6.2.1. Absolute Physical Properties

When applying the relative norm to the input data, the three groups identified naturally divide galaxies into three previously well-defined categories using integrated properties: main-sequence rotating disks, low-mass quiescent rotating galaxies, and massive quiescent slow-rotating galaxies. This highlights the potential of our proposed framework for future discoveries.

Although no direct star formation rate indicators were included in the input, the clustering algorithm naturally puts most of the main-sequence galaxies in the same group; i.e., the main sequence is rediscovered in a data-driven way. It suggests that galaxies in the main sequence share kinematic and stellar population properties and therefore evolve with similar physical processes despite covering several decades in stellar mass.

The well-known slow-rotating galaxies, which have been put forward by several previous IFU surveys (Emsellem et al. 2007, 2011; Graham et al. 2018; Falcón-Barroso et al. 2019), are primarily included in a single group (cluster 3). The so-called slow rotators are thought to have an assembly history more dominated by major mergers (Cappellari 2016), which therefore increases their stellar mass and velocity dispersion. Our framework naturally isolates these objects.

There is a third interesting group highlighted in Figure 3 that is composed of low-mass quiescent galaxies. Interestingly, these galaxies are rotating fast, as opposed to the slow rotators of the previous cluster. It suggests that the quenching mechanisms for this population are different. This could be predominantly a population of satellite galaxies quenched through strangulation processes (Peng et al. 2015) without a significant impact on the kinematics of stars.

While the three clusters found are well-known groups from previous morphokinematic works, our methodology found this classification with additional stellar population information: the luminosity-weighted age and metallicity maps. In Appendix B, we analyze whether the clusters can be recovered by the framework when only the V-band image and kinematic maps are considered. While the clusters are highly dependent on these input maps, without the age and metallicity maps, the classification cannot be fully recovered (∼25% of the galaxies are misplaced in the new division). Furthermore, not only is an expected smaller dependence with age and metallicity found in the average cluster properties, but a greater overlap of the clusters in T-type classification is noticeable. We note, however, that the massive early-type galaxies tend to be more consistently classified in the same group, again highlighting that the distinctive kinematic features of these galaxies isolate them from the rest. Interestingly, removing the V-band reconstructed image from the input data and including the four remaining maps has no significant effect on the T-type classification per cluster (see Appendix A).

We conclude that the divisions that naturally arise from this framework confirm extensively studied groups of galaxies. The clusters found reflect the dominant differences between three regions of the representation space. The clusters share their main properties with well-known galaxy categories, confirming that these groups of galaxies present intrinsic differences. This again highlights the potential of our proposed framework to discover the physical mechanisms that shape galaxies.

The results might, however, still be affected by selection effects. The MaNGA sample is not a volume-limited sample, and this is not taken into account in the representations. Although a set of weights to obtain a statistically representative sample of a single volume is provided in Wake et al. (2017), limiting the sample would mean a significant reduction of the number of galaxies available to train the deep-learning model. This may conflict with the method, as deep-learning models require a large training set to avoid an overfit of the network's parameters. Although the different groups do exist, their relative fractions or relevance might not be representative of the local universe. Also, small groups of galaxies might not be detected.

6.2.2. Internal Structure

We have repeated the cluster analysis with a different normalization that removes information about the absolute values of the physical properties. This allows us to quantify the relation between the integrated and resolved internal structure of nearby galaxies. Although the previous clusters cannot be completely recovered with this new normalization, it is possible to find two clusters with good agreement that are separated primarily by the spatial features present in the maps. These clusters correlate with the mean and integrated physical values derived from the maps, suggesting that the internal structure of galaxies serves as a footprint that is tightly related to the observed integrated properties. It suggests that these two galaxy types are different not only because they differ in absolute values but also because their internal structure is different, pointing toward distinct assembly histories, as highlighted in previous works.

The first cluster found when only spatial features are considered contains intermediate- to high-mass quenched galaxies. These galaxies show, in general, early-type morphologies and present negative slopes in the velocity dispersion and metallicity maps (Figure 6). The second cluster primarily includes the star-forming main sequence of galaxies. Figure 6 shows that galaxies in this cluster tend to present flat metallicity and velocity dispersion distributions and more disky stellar structure.

The comparison between the clusters obtained with the different normalizations reveals interesting trends. Figure 7 shows that ∼80% and ∼70% of the galaxies that were in different clusters when including the information about the absolute values are also separated by internal structure. It confirms that the distinction between high and low velocity dispersion systems reveals different assembly histories and is not only a reflection of a stellar mass bimodality. Interestingly, the third cluster found in the relative normalization run is distributed among the two main ones, with ∼35% being closer to the star-forming rotating disks and ∼65% being more similar to the massive slow-rotating systems. It suggests that this intermediate-mass quenched population is formed through a mixture of smooth and more violent assembly processes, but there is no particular signature in the internal kinematics of the galaxies that clearly distinguishes them from the two other groups. A fraction of the objects also change cluster when the normalization is modified. We will study these systems in more detail in future work.
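Fractions such as those quoted above correspond to a cross-tabulation of the cluster labels obtained with the two normalizations. A minimal sketch of that bookkeeping is given below; the function name `membership_fractions` is hypothetical, and the toy labels are invented for illustration and are not the MaNGA results.

```python
from collections import Counter


def membership_fractions(relative_labels, individual_labels):
    """Fraction of each relative-norm cluster that falls in each individual-norm cluster."""
    pair_counts = Counter(zip(relative_labels, individual_labels))
    totals = Counter(relative_labels)
    return {
        (rel, ind): count / totals[rel]
        for (rel, ind), count in pair_counts.items()
    }


# Toy labels for ten objects: clusters 1-3 (relative norm) and A/B (individual norm).
relative = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3]
individual = ["A", "A", "B", "A", "B", "B", "B", "A", "A", "B"]

fractions = membership_fractions(relative, individual)
for (rel, ind), frac in sorted(fractions.items()):
    print(f"cluster {rel} -> cluster {ind}: {100 * frac:.0f}%")
```

Normalizing by the size of each relative-norm cluster makes the rows sum to one, so each row can be read as the way one of the original clusters splits between clusters A and B.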

7. Conclusions

We applied a method based on the SimCLR deep contrastive-learning approach for obtaining meaningful representations of resolved galaxy maps from the MaNGA survey. The method produces a low-dimensional representation of the data input while preserving meaningful physical information and reducing the impact of the instrumental effects present in the data.

We showed that the proposed representation learning scheme outperforms, at least for this particular data set, more standard dimensionality reduction algorithms such as PCA, which is more sensitive to nonphysical correlations.

We have shown that the representation space can be efficiently used for visualization, clustering, and exploring the properties of subcategories of objects. We argue that contrastive-learning techniques represent a promising avenue to explore and discover patterns in future high-dimensional astronomical data sets.

We thank Sara Ellison, Mallory Thorp, and David Patton for comments on a previous version of this manuscript. We acknowledge financial support from the European Union's Horizon 2020 research and innovation program under Marie Skłodowska-Curie grant agreement No. 721463 to the SUNDIAL ITN network, the State Research Agency (AEI-MCINN) of the Spanish Ministry of Science and Innovation under the grant "The structure and evolution of galaxies and their central regions" with reference PID2019-105602GB-I00/10.13039/501100011033, and IAC project P/300724, financed by the Ministry of Science and Innovation through the State Budget and the Canary Islands Department of Economy, Knowledge and Employment through the Regional Budget of the Autonomous Community. J.F.-B. acknowledges support through the RAVET project by grant PID2019-107427GB-C32 from the Spanish Ministry of Science, Innovation and Universities (MCIU) and the IAC project TRACES, which is partially supported through the state and regional budgets of the Consejería de Economía, Industria, Comercio y Conocimiento of the Canary Islands Autonomous Community.

This project makes use of the MaNGA-Pipe3D data products. We thank the IA-UNAM MaNGA team for creating this catalog and Conacyt Project CB-285080 for supporting them.

Funding for the Sloan Digital Sky Survey IV has been provided by the Alfred P. Sloan Foundation, the U.S. Department of Energy Office of Science, and the Participating Institutions. The SDSS-IV acknowledges support and resources from the Center for High Performance Computing at the University of Utah. The SDSS website is www.sdss.org. The SDSS-IV is managed by the Astrophysical Research Consortium for the Participating Institutions of the SDSS Collaboration, including the Brazilian Participation Group, the Carnegie Institution for Science, Carnegie Mellon University, Center for Astrophysics ∣ Harvard & Smithsonian, the Chilean Participation Group, the French Participation Group, Instituto de Astrofísica de Canarias, The Johns Hopkins University, Kavli Institute for the Physics and Mathematics of the Universe (IPMU)/University of Tokyo, the Korean Participation Group, Lawrence Berkeley National Laboratory, Leibniz Institut für Astrophysik Potsdam (AIP), Max-Planck-Institut für Astronomie (MPIA Heidelberg), Max-Planck-Institut für Astrophysik (MPA Garching), Max-Planck-Institut für Extraterrestrische Physik (MPE), the National Astronomical Observatories of China, New Mexico State University, New York University, the University of Notre Dame, Observatário Nacional/MCTI, The Ohio State University, Pennsylvania State University, Shanghai Astronomical Observatory, the United Kingdom Participation Group, Universidad Nacional Autónoma de México, University of Arizona, the University of Colorado Boulder, the University of Oxford, the University of Portsmouth, the University of Utah, the University of Virginia, the University of Washington, the University of Wisconsin, Vanderbilt University, and Yale University.

Appendix

To analyze the dependence of the representation space with the input maps, we repeat the clustering analysis with different combinations of maps. In particular, we consider two variations to the proposed methodology that tackle specific questions. On the one hand, we examine the cluster division when morphology information is excluded. For this purpose, we perform K-means clustering on a representation space obtained by excluding the V-band reconstructed image. On the other hand, we analyze whether the three clusters found in Section 4.2 can be recovered solely from the morphokinematic input data.

Appendix A: Age, Metallicity, and Kinematics

We train the SimCLR model on the age, metallicity, radial velocity, and dispersion maps. We later use this trained model to extract the representations of this set of maps and perform K-means on them. When the input data are normalized with the relative norm, the clusters overlap at 92.04% ± 0.03% (error estimated as the standard deviation of 10 different K-means initializations). This value is slightly smaller than the agreement score for the clustering on the representation space obtained when the V-band reconstructed image is included; therefore, this channel does contribute to the three-cluster division. We find that the clusters show more discrepancy in the later-type galaxies (T type > 0, as seen in the left panel of Figure 8). However, we conclude that the overlap is significant, since it is indeed greater than the agreement threshold (85%) defined in Section 4.2.
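The quoted uncertainty can be illustrated with a short sketch that summarizes the overlap scores of several K-means initializations as a mean and a standard deviation. The function name `summarize_overlaps` and the toy scores are hypothetical, and the use of the population (rather than sample) standard deviation is an assumption, since the text does not specify which one is used.

```python
from statistics import mean, pstdev


def summarize_overlaps(overlaps):
    """Mean and population standard deviation of a list of overlap scores (in %)."""
    # Assumption: population standard deviation; swap in statistics.stdev for the sample one.
    return mean(overlaps), pstdev(overlaps)


# Toy overlap percentages from 10 hypothetical K-means initializations.
toy_overlaps = [91.0, 93.0] * 5

average, error = summarize_overlaps(toy_overlaps)
print(f"{average:.2f}% ± {error:.2f}%")
```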

**Figure 8.** As in Figure 4, the gray filled histograms show the T-type classification of all MaNGA galaxies, and the colored histograms correspond to unsupervised clusters. The dotted histograms correspond to clusters obtained with the five maps as input: V-band reconstructed image, age, metallicity, radial velocity, and velocity dispersion. Solid lines show the cluster divisions when the V-band reconstructed image is excluded. Left panel: colors correspond to the three clusters found when the relative norm is applied to the input data. Right panel: colors correspond to the two clusters found when the individual norm is applied. In both panels, the clusters' trends are maintained after removing the V-band image from the input. The discrepancy between clusters is clearer toward later-type galaxies.

When this analysis is repeated on the maps normalized with the individual norm, an overlap of 88.621% ± 0.007% is recovered. After removing the V-band image, the clusters tend to equalize their T-type distributions in later-type galaxies (Figure 8). However, the agreement of the clusters before and after excluding the V-band image is significant, and the overall trends of both clusters are still clear. This indicates that the division does not depend only on the galaxy morphology: the remaining maps have sufficient information to separate the two groups.

Appendix B: Morphokinematics

While the three clusters found in Section 4.2 are well-known groups from previous morphokinematic works, our methodology found these classifications with additional stellar population information: the luminosity-weighted age and metallicity maps. Therefore, we analyze whether the clusters can be recovered by the framework when only the morphokinematic maps are considered.

The clusters that arise when the age and metallicity maps are excluded overlap at 76.95% ± 0.02% with those clusters found in Section 4.2. While the morphokinematic maps greatly influence the clusters found in Section 4.2, we consider that the initial groups cannot be fully recovered because ∼25% of the galaxies are misplaced when the age and metallicity maps are excluded.

We further inspect how the average properties of each cluster vary in Figure 9. While cluster 3 is the least affected of all three groups, clusters 1 and 2 show more overlap in T-type classification, and their mean group properties tend to equalize. However, there are parameters that show larger separations in the physical planes (middle and right panels of Figure 9), like σ_cen and ${\lambda }_{\mathrm{Re}}$ . The fact that the difference between the clusters is enhanced only for these parameters is consistent with these parameters being derived from the maps used in this test while excluding the age and metallicity information. These results are consistent with the correlations analyzed in Table 2, which suggest that the cluster with the most massive galaxies is primarily separated due to the kinematics, while the low- to intermediate-mass galaxies are separated into old/metal-rich and young/metal-poor groups.
