The Four Cosmic Tidal Web Elements from the β-skeleton


Published 2021 November 30 © 2021. The American Astronomical Society. All rights reserved.
Citation: John F. Suárez-Pérez et al. 2021 ApJ 922 204. DOI: 10.3847/1538-4357/ac1fed


Abstract

Precise cosmic web classification of observed galaxies in massive spectroscopic surveys can be either highly uncertain or computationally expensive. As an alternative, we explore a fast Machine Learning-based approach to infer the underlying dark matter tidal cosmic web environment of a galaxy distribution from its β-skeleton graph. We develop and test our methodology using the cosmological magnetohydrodynamic simulation Illustris-TNG at z = 0. We explore three different tree-based machine-learning algorithms to find that a random forest classifier can best use graph-based features to classify a galaxy as belonging to a peak, filament, or sheet as defined by the T-Web classification algorithm. The best match between the galaxies and the dark matter T-Web corresponds to a density field smoothed over scales of 2 Mpc, a threshold over the eigenvalues of the dimensionless tidal tensor of λth = 0.0, and galaxy number densities around 8 × 10⁻³ Mpc⁻³. This methodology yields a weighted F1 score of 0.728 and a global accuracy of 74%. More extensive tests that take into account light-cone effects and redshift space distortions are left for future work. We make one of our highest-ranking random forest models available on a public repository for future reference and reuse.


1. Introduction

The galaxy distribution on large spatial scales follows a structured pattern commonly known as the cosmic web. This web can be described as dense peaks connected by anisotropic filaments and walls woven across vast under-dense voids (Bond et al. 1996). The emergence of this pattern is understood as the evolution of initial density fluctuations growing through gravitational instability (Zel'dovich 1970; White et al. 1987), a picture that can be followed in great detail with N-body cosmological simulations (Schmalzing et al. 1999; Vogelsberger et al. 2014).

Cosmic web classification methods usually strive to classify it into four elements: peak, filament, sheet, or void (Libeskind et al. 2018). However, most of the methods that take observed galaxies as an input focus on the detection of some, not all, cosmic web elements. An example is the great diversity of void-finders (Aikio & Mahonen 1998; Padilla et al. 2005; Platen et al. 2007; Neyrinck 2008; Elyiv et al. 2015; Sutter et al. 2015; Xu et al. 2019) and filament-finders (Novikov et al. 2006; Stoica et al. 2007; Zhang et al. 2009; Aragón-Calvo et al. 2010; Sousbie 2011; Cautun et al. 2013; Chen et al. 2015; Luber et al. 2019; Bonnaire et al. 2020).

The methods that classify observed galaxies into the four cosmic web elements can be roughly divided into two types according to their computational cost and global accuracy. The first type interpolates galaxy positions on a grid to process this number density field in the same way as a dark matter density field (Eardley et al. 2015; Alpaslan et al. 2016; Tojeiro et al. 2017; Alam et al. 2019). This approach has a low computational cost, but it is uncertain how accurate their results are with respect to the expected dark matter cosmic web.

A second type of algorithm first performs a full dark matter density field reconstruction on the observed galaxy distribution. Most of these algorithms use Bayesian statistics coupled with Monte Carlo sampling and N-body simulations (Jasche et al. 2010; Jasche & Wandelt 2013; Bos et al. 2014; Leclercq et al. 2015; Horowitz et al. 2019; Burchett et al. 2020); some others are based on the density distribution around halos that are associated with galaxy groups (Wang et al. 2009; Muñoz-Cuartas et al. 2011). This generic approach can provide accurate dark matter distributions but can be computationally very expensive.

Machine-Learning (ML) methods applied to network statistics represent a middle ground between these two extremes. They promise a reasonable accuracy at a low computational cost. ML methods have already been applied to this problem of finding the four web elements starting from the dark matter halo distribution (Hui et al. 2018; Tsizh et al. 2020) and the dark matter particles (Buncher & Carrasco Kind 2020). For instance, Tsizh et al. (2020) built networks by linking halos within a fixed linking length. From these graphs, they computed 10 different network metrics to feed different ML algorithms and predict the cosmic web for each halo. The true cosmic web environments were quantified using five different cosmic web classifications.

This ML approach avoids the uncertain grid interpolation of tracers and the expensive dark matter density reconstruction. Under this approach of supervised ML classification, the algorithm needs a feature data set (some halo or galaxy properties), the labels to classify the data set (the four cosmic web environments), and an algorithm to link the features to the classes.

The work we present here contributes to this new line of research. Here we explore to what extent the β-skeleton graph (Fang et al. 2019) can be used to predict the T-Web (Forero-Romero et al. 2009) environment classification. We test three different supervised ML algorithms (Classification Trees, Extra Gradient Boosting, and Random Forests) to train from the features and predict the environment. We perform a thorough exploration of parameter space to find the best match between the T-Web and the graph-based features. A crucial improvement of the current study is that we use galaxies as tracers. To this end, we use the Illustris-TNG simulation (Marinacci et al. 2018; Naiman et al. 2018; Nelson et al. 2018, 2019; Pillepich et al. 2018a; Springel et al. 2018) at redshift z = 0. Finally, we also release the best ML model for future reference and reuse.

This paper is organized as follows. In Section 2, we describe the relevant aspects of the Illustris-TNG simulations related to our work, the T-Web algorithm, and the β-skeleton graph. In Section 3, we describe the mechanism to link the β-skeleton to the T-Web using machine-learning algorithms. There we explain the features and meta-parameters we used, the classification algorithms, and the metrics for model evaluation. In Section 4, we present our results. In Section 5, we briefly discuss our results in comparison with Tsizh et al. (2020). In Section 6, we discuss the application of our results to observational data, and we summarize our conclusions in Section 7.

2. Algorithms and Simulations

2.1. T-Web Classification

The tidal web (T-Web) method (Hahn et al. 2007; Forero-Romero et al. 2009) classifies the large-scale structure into four cosmic web types: voids, sheets, filaments, and peaks. This classification is based on the eigenvalues of the deformation tensor Tij computed as the Hessian matrix of the gravitational potential

$T_{ij} = \dfrac{\partial^{2} \psi}{\partial x_{i}\, \partial x_{j}},$ (1)

where ψ is a normalized gravitational potential that follows the equation

$\nabla^{2} \psi = \delta,$ (2)

and δ is the dark matter overdensity. Since the tensor is real and symmetric, it has three real-valued eigenvalues. The cosmic web environment is defined by the number of eigenvalues larger than a threshold value λth: locations with three eigenvalues larger than λth correspond to a peak, two to a filament, one to a sheet, and zero to a void.

Computationally speaking, to define an environment in an N-body simulation, we implement the following seven steps: (1) interpolate the mass particles with a Cloud-In-Cell (CIC) scheme over a grid to obtain the density, (2) smooth it with an isotropic Gaussian filter, (3) compute the overdensity, (4) find the normalized potential with Fast Fourier Transform methods, (5) compute the deformation tensor using finite differences, (6) find the eigenvalues, and finally (7) count the number of eigenvalues larger than the threshold λth.
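As an illustration of these steps, the following minimal sketch (the names are ours, not the paper's released code) classifies every cell of a periodic overdensity grid. It assumes the CIC interpolation of steps (1) and (3) has already produced the overdensity array delta, and it computes the Hessian spectrally rather than with the finite differences of step (5).

import numpy as np
from scipy.ndimage import gaussian_filter

def tweb_classify(delta, box_size, sigma_cells=2.0, lambda_th=0.0):
    """Count of eigenvalues above lambda_th per cell: 0 = void, 1 = sheet,
    2 = filament, 3 = peak. `delta` is a periodic overdensity grid (steps 1, 3)."""
    n = delta.shape[0]
    # Step 2: isotropic Gaussian smoothing with periodic wrapping.
    delta_s = gaussian_filter(delta, sigma=sigma_cells, mode="wrap")
    # Step 4: solve the Poisson equation grad^2 psi = delta in Fourier space.
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=box_size / n)
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    k2 = kx**2 + ky**2 + kz**2
    k2[0, 0, 0] = 1.0                      # avoid division by zero at k = 0
    psi_k = -np.fft.fftn(delta_s) / k2
    psi_k[0, 0, 0] = 0.0                   # zero-mean potential
    # Step 5: Hessian T_ij = d^2 psi / (dx_i dx_j); spectrally, -k_i k_j psi_k.
    kvec = (kx, ky, kz)
    T = np.empty(delta.shape + (3, 3))
    for i in range(3):
        for j in range(i, 3):
            Tij = np.fft.ifftn(-kvec[i] * kvec[j] * psi_k).real
            T[..., i, j] = T[..., j, i] = Tij
    # Steps 6-7: eigenvalues of the symmetric tensor, then the threshold count.
    eigenvalues = np.linalg.eigvalsh(T)
    return (eigenvalues > lambda_th).sum(axis=-1)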

2.2. The β-skeleton Algorithm

The β-skeleton is an algorithm that computes a graph over a distribution of nodes. This algorithm depends on a positive and continuous parameter β that defines an exclusion region between two nodes. Here we use the Lune-based definition for the exclusion region (Kirkpatrick & Radke 1985).

For a node set S in three-dimensional Euclidean space, the β-skeleton defines an edge set such that any two points p and q in S are connected if there is no third point inside an exclusion region. For β ≥ 1, this region is the intersection of two congruent spheres of diameter βd, where d is the distance between p and q, centered at ${\rm{p}}+\beta ({\rm{q}}-{\rm{p}})/2$ and ${\rm{q}}+\beta ({\rm{p}}-{\rm{q}})/2$. Both centers lie on the axis joining the nodes p and q.

For β = 1, the exclusion region is a sphere of diameter d, with the two points under consideration diametrically opposed on its surface. This case is known as the Gabriel graph, while the β-skeleton with β = 2 is known as the Relative Neighborhood Graph (RNG). The results in this paper are based on the Gabriel graph.
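A minimal sketch of the Gabriel graph (β = 1) construction follows, under two simplifying assumptions that are ours, not the paper's: periodic boundary conditions are ignored, and we exploit the fact that Gabriel edges form a subset of the Delaunay edges to avoid an all-pairs test.

import numpy as np
from scipy.spatial import Delaunay, cKDTree

def gabriel_graph(points):
    """Edges of the Gabriel graph (beta-skeleton with beta = 1) of 3D points.
    An edge (p, q) survives if the sphere of diameter |pq|, centered at the
    midpoint, contains no third point."""
    tri = Delaunay(points)
    edges = set()
    for simplex in tri.simplices:          # collect unique Delaunay edges
        for i in range(4):
            for j in range(i + 1, 4):
                a, b = sorted((simplex[i], simplex[j]))
                edges.add((a, b))
    tree = cKDTree(points)
    kept = []
    for a, b in edges:
        mid = 0.5 * (points[a] + points[b])
        radius = 0.5 * np.linalg.norm(points[a] - points[b])
        # the two endpoints sit exactly on the sphere, so ignore them
        if all(k in (a, b) for k in tree.query_ball_point(mid, radius)):
            kept.append((a, b))
    return kept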

More details on the properties of the β-skeleton applied to large-scale structure data can be found in Fang et al. (2019) and Garcia-Alvarado et al. (2020).

2.3. Illustris-TNG

We test our algorithms on a simulation from the Illustris-TNG project. This project is a series of large, cosmological gravo-magnetohydrodynamical simulations that follows the coupled evolution of dark matter, gas, stars, and black holes from redshift z = 127 to z = 0 (Marinacci et al. 2018; Naiman et al. 2018; Nelson et al. 2018, 2019; Pillepich et al. 2018a; Springel et al. 2018).

These simulations are based on the cosmological standard model Λ Cold Dark Matter (CDM) with the following cosmological parameters (Ade et al. 2016): cosmological constant ΩΛ,0 = 0.6911, matter density Ωm,0 = 0.3089, baryonic density Ωb,0 = 0.0486, normalization σ8 = 0.8159, spectral index ns = 0.9667, and Hubble constant H0 = 100 h km s⁻¹ Mpc⁻¹ with h = 0.6774. These simulations used the AREPO moving-mesh code (Springel 2011). The galaxy-formation physics included prescriptions for star formation and associated supernovae, magnetic fields, stellar evolution, feedback with galactic outflows, chemical enrichment, primordial and metal-line radiative cooling of the gas, and black hole accretion and feedback (Weinberger et al. 2017; Pillepich et al. 2018b).

The Illustris-TNG suite includes simulations with boxes of sizes 50, 100, and 300 Mpc at different resolutions. In this work, we use the largest simulation volume with the highest mass resolution, named Illustris TNG300-1, at z = 0. It consists of a cube of 302.6³ Mpc³ in volume, with a baryonic mass resolution of 8.8 × 10⁷ M⊙, a dark matter mass resolution of 4.7 × 10⁸ M⊙, and 2500³ dark matter particles (Nelson et al. 2019). We use the galaxy catalogs from the public release (Pillepich et al. 2018b) to build the β-skeleton graph. Moreover, we use the dark matter particles at z = 0 to obtain the dark matter T-Web classification. More extensive tests that take into account light-cone effects and redshift space distortions (RSD) are left for future work.

3. Linking the β-skeleton to the T-Web

Figure 1 illustrates the correlation between the T-Web classification of galaxies and their corresponding β-skeleton for a sub-box of the TNG simulation with a volume of 150 Mpc × 150 Mpc × 20 Mpc. This spatial correlation suggests that graph-derived features have a good chance of predicting the dark matter web environment. However, a variety of free-parameter combinations has to be explored to find the best match between the β-skeleton and the T-Web: the T-Web classification depends on the smoothing scale, σ, and the threshold parameter, λth; the β-skeleton changes with β; and the number density of tracers used to build the β-skeleton can also vary. In what follows, we explore those aspects.

Figure 1. Comparison between the galaxies classified by their T-Web environment (left) and their β-skeleton computed with β = 1 (right).

3.1. T-Web and β-skeleton Free Parameters

We perform the T-Web computation over a cubic grid with 256³ cells, which corresponds to a cell size of around 1.18 Mpc. We explore five different values for the smoothing parameter σ = 0.5, 1.0, 1.5, 2.0, and 2.5, expressed in units of the cell size. For each smoothing length, we take six different values for the eigenvalue threshold λth = 0.0, 0.1, 0.2, 0.3, 0.4, and 0.5.
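A sketch of this parameter scan, reusing the hypothetical tweb_classify helper from Section 2.1 and a hypothetical overdensity grid delta:

import itertools

sigmas = [0.5, 1.0, 1.5, 2.0, 2.5]          # smoothing, in cell-size units
thresholds = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
tweb_labels = {
    (sigma, lam): tweb_classify(delta, box_size=302.6,
                                sigma_cells=sigma, lambda_th=lam)
    for sigma, lam in itertools.product(sigmas, thresholds)
}  # 30 T-Web classifications, one per (sigma, lambda_th) pair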

For the galaxy selection function, we use a simple stellar mass cut (${\mathrm{log}}_{10}({M}^{* }/{M}_{\odot }\ {h}^{-1})\gt {M}_{\mathrm{lim}}^{* }$) with ${M}_{\mathrm{lim}}^{* }$ = 8, 8.5, 9, and 9.5. These thresholds in stellar mass correspond to 515,334, 349,485, 221,279, and 129,756 galaxies, which translate into number densities of 20 × 10⁻³, 13 × 10⁻³, 8 × 10⁻³, and 4 × 10⁻³ Mpc⁻³, respectively.
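These densities follow directly from the galaxy counts and the box volume (a quick arithmetic check, using the box side from Section 2.3):

# number density = galaxy count / box volume (box side ~302.6 Mpc for TNG300-1)
for n_gal in (515_334, 349_485, 221_279, 129_756):
    print(f"{n_gal / 302.6**3:.1e} Mpc^-3")
# prints ~1.9e-02, 1.3e-02, 8.0e-03, 4.7e-03, matching the quoted
# densities up to rounding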

We find that the β-skeleton features for β = 1 are sufficient to predict the T-Web environment; i.e., adding features computed for other values of β does not significantly improve the results. If we include the features for β = 2, the weighted F1 score of our best model increases by only 0.003; including features for β = 3 increases it by only a further 0.001. Therefore, we keep β = 1 fixed for the results reported in this paper. This choice is supported by a recent study showing that the β-skeleton entropy takes its largest value at β = 1 (Garcia-Alvarado et al. 2020), meaning that the connectivity of this graph is larger than that of β-skeletons with β > 1. For more details on the interpretation of the β-skeleton, we refer the reader to Garcia-Alvarado et al. (2020).

Once the four free parameters β, σ, λth, and ${M}_{\mathrm{lim}}^{* }$ are chosen, the β-skeleton and the T-Web are fixed. Next, we define the input features used to predict the four cosmic web elements.

3.2. Galaxy Features

We use six features for each galaxy to train the ML algorithm:

  • 1.  
    The number of connections minus the global median number of connections, $\eta ={N}_{{\rm{c}}}-{\bar{N}}_{{\rm{c}}}$.
  • 2.  
The average edge length to first neighbors normalized by the global average edge length, $\alpha ={D}_{\mathrm{av}}/{\bar{D}}_{\mathrm{av}}$.
  • 3.  
The logarithm of the pseudo-density over the graph, $\varrho ={\mathrm{log}}_{10}\rho $, where the pseudo-density ρ is defined as the inverse of the volume of the ellipsoid that best fits the distribution of first neighbors over the graph. If a galaxy has fewer than three neighbors, its pseudo-density is fixed to ${D}_{\mathrm{av}}^{-3}$.

We define three more features that we call Δ features. Their values are computed as the difference between the corresponding value in a node and the average value over first-neighbor nodes. They can be loosely interpreted as a divergence computed over the graph (a sketch computing all six features follows this list).

  • 1.  
    Δ over the number of connections, Δη.
  • 2.  
    Δ over the normalized edge length, Δα.
  • 3.  
Δ over the pseudo-density, Δϱ.
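As a concrete reference, here is a minimal sketch of all six features from a graph given as an edge list. The names are ours, and the ellipsoid fit behind the pseudo-density is approximated by the covariance ellipsoid of the neighbor offsets, since the text does not spell out the fitting procedure.

import numpy as np

def graph_features(points, edges):
    """eta, alpha, varrho and their Delta versions for each node of the graph."""
    n = len(points)
    neighbors = [[] for _ in range(n)]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    n_conn = np.array([len(nb) for nb in neighbors], dtype=float)
    d_av = np.array([np.mean([np.linalg.norm(points[i] - points[j]) for j in nb])
                     if nb else np.nan for i, nb in enumerate(neighbors)])
    eta = n_conn - np.median(n_conn)        # connections minus the global median
    alpha = d_av / np.nanmean(d_av)         # edge length over the global average
    rho = np.empty(n)
    for i, nb in enumerate(neighbors):
        if len(nb) < 3:
            rho[i] = d_av[i] ** -3          # fallback quoted in the text
            continue
        cov = np.cov((points[nb] - points[i]).T)
        # ellipsoid volume from the covariance: (4/3) pi sqrt(det C)
        rho[i] = 1.0 / (4.0 / 3.0 * np.pi *
                        np.sqrt(max(np.linalg.det(cov), 1e-30)))
    varrho = np.log10(rho)
    def delta_of(x):                        # node value minus first-neighbor mean
        return np.array([x[i] - np.mean(x[nb]) if nb else 0.0
                         for i, nb in enumerate(neighbors)])
    return eta, alpha, varrho, delta_of(eta), delta_of(alpha), delta_of(varrho)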

Figure 2 shows the correlations among all features. The diagonal panels present the distribution of a given parameter.

Figure 2. Histograms and correlation curves for the six features used to train the classifiers. The parameter η is the number of connections minus the global median number of connections, α is the average edge length normalized by the global average edge length, and ϱ is the logarithm of the pseudo-density ρ, the inverse of the volume of the ellipsoid that best fits the distribution of first neighbors. Δη, Δα, and Δϱ are the differences over the first neighbors of η, α, and ϱ, respectively. The contour lines in the 2D histograms represent the levels that include 39%, 86%, and 98% of the data, corresponding to the 1σ, 2σ, and 3σ reference contours of a 2D Gaussian probability density.

Some of these β-skeleton topological features were studied in previous works. For instance, Fang et al. (2019) used the average connection length to characterize the size and anisotropy of the Large-Scale Structure.

Furthermore, when we include the stellar mass M* as a feature, the weighted F1 score increases by only 0.001. Therefore, we do not include the stellar mass as a feature.

3.3. Classification Algorithms

We arranged experiments with three supervised classification algorithms: Random Forests (RF), Extra Gradient Boosting (XGB), and Classification Trees (CT). We use the Scikit-Learn (Pedregosa et al. 2011) Python implementations of RF and CT, and the Xgboost (Chen & Guestrin 2016) Python implementation of XGB.

Details on the implementation of these machine-learning models can be found in Nordhausen (2009). For CT and XGB, we vary the maximum tree depth MD from 1 to 30. For RF, we fix the maximum tree depth to 10 and vary the number of estimators NE (trees) from 1 to 100.
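A sketch of the corresponding meta-parameter grids (the object names are ours):

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

ct_grid = [DecisionTreeClassifier(max_depth=md) for md in range(1, 31)]
xgb_grid = [XGBClassifier(max_depth=md) for md in range(1, 31)]
rf_grid = [RandomForestClassifier(max_depth=10, n_estimators=ne)
           for ne in range(1, 101)]        # fixed depth, varying tree count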

3.4. Model Evaluation

We partition the simulation box into eight cubic sub-boxes of equal volume. For each sub-box, we compute the β-skeleton and the related feature parameters. Afterward, we discard the galaxies located within 5 Mpc of the sub-box boundaries to avoid edge effects coming from incomplete β-skeleton information. The T-Web information comes from the computation over the full box, which takes advantage of the periodic boundary conditions.

As the training data set, we use four sub-boxes that touch only at one vertex. The validation data set is composed of the features from two more sub-boxes; this data set is used to select the T-Web parameters and algorithm meta-parameters that provide the best results. Once the free parameters and meta-parameters are fixed, we use one sub-box as the test data set to report the final scores and confusion matrix. The remaining sub-box is not used in this paper, but is made available in a public repository together with the best ML model.

As the performance metric to compare our models, we use the F1 score, the harmonic mean of the purity (also called precision, P) and the completeness (also called recall, R): F1 = 2PR/(P + R). We compute the F1 score separately for each web environment. Given the imbalance across classes, we report a global F1 metric as the support-weighted average of the per-class scores. We also use a global accuracy (the fraction of total correct predictions) in order to perform a comparison against the results by Tsizh et al. (2020).
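With scikit-learn these metrics can be obtained directly. A sketch, where y_val and y_pred are hypothetical arrays of true and predicted T-Web labels (0 = void, 1 = sheet, 2 = filament, 3 = peak):

from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(
    y_val, y_pred,
    target_names=["void", "sheet", "filament", "peak"], digits=3))
# per-class precision (purity), recall (completeness), F1, plus the
# support-weighted F1 average and the global accuracy
print(confusion_matrix(y_val, y_pred, normalize="true"))
# rows are true classes; entries are fractions assigned to each prediction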

4. Results

4.1. ML Algorithm Performance

We start by comparing the performance of the different ML algorithms as the T-Web and β-skeleton parameters change. Figure 3 shows the weighted F1 distribution for all the classification experiments, with one histogram per ML algorithm. Of the three models, the RF results are the best: its F1 distribution is skewed toward higher values than the distributions for CT and XGB. From this overall response across all possible models, RF emerges as the best ML algorithm. We confirm this result below with a more detailed look at each cosmic web element separately.

Figure 3. Histograms of the weighted F1 score for the three machine-learning algorithms (Random Forest (RF), Extra Gradient Boosting (XGB), and Classification Trees (CT)). Each F1 value corresponds to a different set of free parameters and meta-parameters.

Table 1 summarizes the F1 performance split by cosmic web element in the first four rows. The mean values and standard deviations are computed over all the ML experiments. From this table, we confirm that the RF classifier performs consistently better than CT: for three of the four web elements, RF improves the F1 score over CT by at least 0.03. The results for XGB and RF are comparable, but the weighted F1 is better in the RF case.

Table 1. Mean Values and Standard Deviations of the F1 Scores over All Models with Different Meta-parameters

F1 score       Classification Trees   Random Forest   Extra Gradient Boosting
F1 peaks       0.55 ± 0.24            0.55 ± 0.28     0.57 ± 0.24
F1 filaments   0.69 ± 0.07            0.72 ± 0.05     0.71 ± 0.05
F1 sheets      0.50 ± 0.12            0.53 ± 0.13     0.53 ± 0.11
F1 voids       0.22 ± 0.18            0.26 ± 0.19     0.27 ± 0.17
F1 weighted    0.62 ± 0.05            0.65 ± 0.04     0.64 ± 0.04

Note. The first four rows differentiate the results for each environment. The fifth row corresponds to the weighted F1 score over all environments.


The improved performance of RF over the other algorithms is an expected, generic outcome when the number of instances is larger than the number of features, as in our case (Ali et al. 2012; Neira et al. 2020). The RF algorithm combines multiple classification trees, making it possible to describe more complex class boundaries in feature space, while the use of a random subset of features makes it robust against over-fitting.

4.2. Performance Across Web Elements

Table 1 also shows that some cosmic web elements are easier to predict than others. Focusing on the RF classifier, the algorithm with the best results, we observe that filaments are the environment with the best mean F1 score (0.72), followed by peaks and sheets (0.55 and 0.53, respectively), and finally voids (0.26).

Void is the hardest class to predict. The reason is class imbalance: there are very few galaxies in voids. Across all models, less than 0.3% of all galaxies are classified as void galaxies. This has two consequences. First, the low number of instances makes it hard for the ML algorithm to delimit the region of feature space occupied by voids. Second, even a small number of misclassified void galaxies translates into large changes in the F1 score.

In comparison, filament is the class with the best score. Its F1 score is the highest across all four web elements, with an average value of 0.72; notably, its dispersion across models is also the lowest, at 0.05. The reason is again class imbalance. Galaxies in filaments are the most represented class: approximately 54% of all galaxies are classified as filament galaxies, which makes it easier for the ML algorithm to learn the characteristics that place a galaxy in a filament. Peaks and sheets have F1 scores intermediate between those for voids and filaments, with average values of 0.55 and 0.53. The fractions of galaxies in peaks and sheets are also intermediate, around 31% and 15%, respectively.

4.3. Best Match Between T-Web and β-skeleton

We now focus on understanding what parameters produce the best match between the T-Web and the features derived from the β-skeleton. In this exploration, we only use the RF algorithm because it provides the best classification results. To do that, we choose the N = 15 best models in terms of their weighted F1 across classes and examine what they have in common. Choosing N = 10 or N = 20 changes the weighted F1 score by only 0.001.

Figure 4 summarizes the parameter and meta-parameter distribution for the best N = 15 RF models. From these 15 best models, we pick as the best T-Web parameters λth = 0.0 and σ = 2.0. These values are close to the expected range for a good visual cosmic web classification (Hahn et al. 2007; Forero-Romero et al. 2009; Bustamante & Forero-Romero 2015). The cosmic web produced by those parameters turns out to be well matched by the β-skeleton produced by galaxies with a stellar mass cut of ${M}_{\mathrm{lim}}^{* }=9$. Galaxy populations with different number densities, i.e., a different threshold for the stellar mass, produce a worse match between the T-Web and the β-skeleton properties.

Figure 4. Parameter and meta-parameter distributions for the 15 RF models with the highest weighted F1 score. From these results we choose λth = 0.0, σ = 2.0, and ${M}_{\mathrm{lim}}^{* }=9$ as the best parameters to match the β-skeleton and the T-Web. NE is the number of estimators (trees) used in the RF algorithm.

Finally, we select NE = 80, the RF meta-parameter value that gives the highest F1 score after all the other parameters have been fixed.

4.4. Confusion Matrix and Feature Importance

Figure 5 shows the confusion matrix computed on the test data set using our preferred model. Each number is the fraction of objects in the "truth" class that are assigned to the "prediction" class. What are the most common misclassifications? In first place, 91% of void galaxies are misclassified as sheet galaxies; in second place, 52% of sheet galaxies are misclassified as filaments; and finally, 30% of peak galaxies are misclassified as filaments. Less than 0.01% of all galaxies end up classified as void galaxies by the RF algorithm. For this model, the weighted F1 score is 0.728 and the global accuracy is 74%.

Figure 5. Confusion matrix for the model {λth = 0.0, σ = 2.0, ${M}_{\mathrm{lim}}^{* }=9$, NE = 80} computed on the test data set.

The most common misclassifications highlight the difficulty of finding void galaxies. This turns out to be an almost impossible task for one simple reason already mentioned: there are very few void galaxies; only 0.3% of the galaxies in the sample are in voids. The difficulty of training ML algorithms to find instances of under-represented classes is a generic result, not only in the cosmic web context (Tsizh et al. 2020) but also in other astronomical problems such as transient classification (Bloom et al. 2012; Neira et al. 2020).

The second kind of misclassification shows that galaxies classified as belonging to a filament may actually belong to the neighboring environments: sheets and peaks.

These misclassifications can also be understood in terms of the topological properties of the cosmic web elements. Voids are surrounded by sheets (hence the misclassification between the two), sheets have filaments inside them, and filaments in turn have peaks inside them (Cautun et al. 2014), explaining the misclassification of peaks and sheets into filaments. Along this progression, the average density increases and the tidal field goes from isotropic (voids) to anisotropic (sheets and filaments) and back to isotropic (peaks) (Bustamante & Forero-Romero 2015).

The ranking of correct classifications (the diagonal of the confusion matrix, which also corresponds to the completeness) naturally follows the F1 trends shown in Table 1. Filaments are the most correctly classified class (83%), followed by peaks (70%) and sheets (47%); void galaxies have the worst results, with zero correct classifications. This ranking also follows the trend in the fraction of instances belonging to filaments (54%), peaks (31%), sheets (15%), and voids (0.3%). A higher number of instances translates into a better chance for the algorithm to produce a correct classification.

Figure 6 shows the feature importance for the classification. As might be expected, the pseudo-density (ϱ) and the average connection length (α) turn out to be crucial for the classification. The other feature with importance close to that of the average connection length is Δϱ, the local change in the pseudo-density.
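The ranking in Figure 6 comes from the impurity-based importances of the trained forest. A sketch, assuming best_rf is the fitted NE = 80 model from Section 4.3 and using our feature names:

feature_names = ["eta", "alpha", "varrho",
                 "delta_eta", "delta_alpha", "delta_varrho"]
for name, score in sorted(zip(feature_names, best_rf.feature_importances_),
                          key=lambda pair: -pair[1]):
    print(f"{name:>12s}  {score:.3f}")     # importances sum to 1 across features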

Figure 6. Feature importance in the RF classifier for the model {λth = 0.0, σ = 2.0, ${M}_{\mathrm{lim}}^{* }=9$, NE = 80} computed on the test data set.

Figure 7 summarizes our results qualitatively. It shows the spatial distribution of the galaxies in the four web elements; the left panels correspond to the predictions from the best model, the right panels to the truth. The visual impression for filament galaxies is the most similar between truth and prediction, as expected from the quantitative results presented here. Sheet galaxies in the prediction appear less clustered than they should. The failure of the void classification is clearly visible as a galaxy deficit in the prediction.

Figure 7. Spatial distribution for the galaxies classified into the four web elements as predicted from the best model (left) and its corresponding truth (right).

4.5. Tests to Compensate for Class Imbalance in Voids

We explore two strategies to improve the results for the void class. In the first strategy, we change the loss function to up-weight galaxies in under-represented classes. To do this, we set the class_weight parameter to "balanced" in the best RF model, which weights each class inversely to its abundance. Figure 8 shows the confusion matrix for this strategy. With this modification, the correct classification for peaks increases from 70% to 80%, for sheets from 47% to 73%, and for voids from 0% to 45%. However, for filaments, the correct classification drops from 83% to 56%. Overall, the weighted F1 score drops from 0.73 to 0.62. For this reason, we discard this strategy.
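In scikit-learn this first strategy amounts to one extra keyword in the otherwise best model (a sketch):

from sklearn.ensemble import RandomForestClassifier

# "balanced" sets each class weight to n_samples / (n_classes * n_in_class),
# i.e., inversely proportional to the class abundance
balanced_rf = RandomForestClassifier(n_estimators=80, max_depth=10,
                                     class_weight="balanced")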

Figure 8. Confusion matrix for the best Random Forest model {λth = 0.0, σ = 2.0, ${M}_{\mathrm{lim}}^{* }=9$, NE = 80} using algorithmic up-weighting to compensate for the class imbalance.

A second strategy consists of subselecting the galaxy-training catalog so that each environment has roughly the same number of galaxies; that is, we downsample every class in the training set to the size of the void class. Figure 9 shows the confusion matrix for this experiment. The correct classification increases for voids and sheets, from 0% to 86% and from 47% to 56%, respectively. However, it drops from 70% to 20% for peaks, and from 83% to 19% for filaments. The weighted F1 score for this case decreases from 0.73 to 0.27. We also discard this strategy.
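A sketch of the downsampling, with hypothetical X_train and y_train arrays holding the features and T-Web labels:

import numpy as np

rng = np.random.default_rng(42)
counts = np.bincount(y_train)
n_min = counts[counts > 0].min()           # size of the rarest (void) class
keep = np.concatenate([
    rng.choice(np.flatnonzero(y_train == c), size=n_min, replace=False)
    for c in np.unique(y_train)])
X_balanced, y_balanced = X_train[keep], y_train[keep]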

Figure 9. Confusion matrix for the best Random Forest model {λth = 0.0, σ = 2.0, ${M}_{\mathrm{lim}}^{* }=9$, NE = 80} after training with classes downsampled to have the same number of instances as the void class.

5. Comparison Against Similar Approaches

To elucidate the strengths of the different aspects of the approach presented here, we focus the discussion of our results on a comparison against the previous work by Tsizh et al. (2020), which uses a similar methodology. Their work combines network analysis with ML to predict the environments in the cosmic web. Their input data come from publicly available files from the Tracing the Cosmic Web comparison project (Libeskind et al. 2018).

That project performed a dark matter only simulation with 512³ particles in a cubic box of 290 Mpc on a side. The halos in the simulation were identified with a Friends-of-Friends algorithm. Furthermore, the catalogs also include the cosmic web classification according to different algorithms. The minimum DM halo mass included in the catalog is 10¹¹ M⊙ h⁻¹. This corresponds to a total of 281,465 halos and a number density of 10 × 10⁻³ Mpc⁻³, which is close to the galaxy number density in our best model with ${M}_{\mathrm{lim}}^{* }=9$.

Tsizh et al. (2020) built networks with fixed linking lengths in the range l = 2–4 h⁻¹ Mpc. From the networks, they computed 10 different metrics, which, together with the mass and the peculiar velocity, made up the features for the ML algorithms. They also explored five different cosmic web classification algorithms. As a score metric, they used the global accuracy (the ratio of correct predictions to the total number of instances). A crucial difference with our work is that Tsizh et al. (2020) did not explore the influence of different smoothing lengths, σ, or eigenvalue thresholds, λth, on the T-Web classification of their data set. Likewise, they did not explore the influence of a dark matter halo mass cut, which produces graphs with different number densities, on their prediction results.

With that setup, Tsizh et al. (2020) found that the best algorithm for predicting the T-Web classification was extra gradient boosting decision trees (xgboost; Chen & Guestrin 2016). They reported a global accuracy of 51%. Unfortunately, they did not report the confusion matrix for the T-Web classification, so we cannot compare F1 scores.

If we use the same catalog they used to build the β-skeleton and then compute the features to find the best RF model, we obtain a global accuracy of 54%, 3 percentage points higher than Tsizh et al. (2020). If we also allow for cuts on halo mass in order to vary the number density, we find that using only halos more massive than 10¹² M⊙ h⁻¹ increases the global accuracy to 58%. Using the full setup we present here, with well-resolved galaxies from the Illustris-TNG simulation, we achieve an accuracy of 74%, 23 percentage points higher than Tsizh et al. (2020).

These tests of increasing complexity tell us that three main factors explain the improved performance of our approach. First, the β-skeleton, thanks to its adaptive nature, seems to provide a slightly better T-Web representation than a network with a fixed linking length. Second, the exploration in number density is fundamental to finding the best match between the graph features and a given T-Web classification. Finally, the galaxies provided by Illustris-TNG add the information needed to improve the graph match to the dark matter T-Web environment.

6. Application to Observational Data

Applying the work presented here to observational data from large spectroscopic surveys has two main caveats to take into account: the light-cone effect and RSD.

The light-cone effect is the time evolution of the cosmic web along the line-of-sight direction. For sufficiently deep surveys, the cosmic web around nearby galaxies is statistically different from the cosmic web in deeper regions of the survey. Fortunately, the cosmic web evolves slowly: quantities such as the mean dark matter density and the volume fraction of each of the four cosmic web environments change by less than 10% over redshift steps of 0.2 (Cautun et al. 2014). A reasonable approximation is therefore that survey slices in the radial direction with widths on the scale of Δz = 0.2 should not present significant light-cone effects.

On the other hand, radial RSDs have two important aspects that are more difficult to overcome. First, neighboring galaxies in redshift space may actually belong to very different environments in real space. In the setup we use in this paper, galaxies share the same environment on the scale of σ; this is no longer the case once observed redshifts are translated into comoving positions. Second, regions that are statistically isotropic, such as clusters, acquire a filamentary shape along the radial direction, and this effect depends strongly on the local density.

Future ML work that seeks to fully account for light-cone and RSD effects will have to be retrained on mock catalogs that properly include them. It is also very likely that new graph features will have to be introduced to account for the additional radial anisotropy. For all these reasons, we advise that applying our best model to observational data should only be considered a zeroth-order approximation to the cosmic web environments.

7. Conclusions

Generally speaking, the methods already available in the literature to estimate the cosmic web environment of observed galaxies fall into two categories: computationally inexpensive but uncertain in their results (i.e., by directly using the galaxy number overdensity as a proxy for the dark matter overdensity); and highly accurate but computationally expensive (i.e., by performing thousands of N-body simulations to build a full dark matter reconstruction in a Bayesian framework).

In this paper, we explored an approach that represents a middle ground in computational cost and accuracy in the final results. We used the T-Web as the cosmic web definition (Forero-Romero et al. 2009) and the β-skeleton graph (Fang et al. 2019) to describe the relative spatial distribution of the galaxies. The link between the T-Web and the β-skeleton was made through three different machine-learning algorithms: Classification Trees, Extra Gradient Boosting, and Random Forests.

We tested the method using data from the Illustris-TNG simulation (Naiman et al. 2018; Nelson et al. 2018, 2019; Marinacci et al. 2018; Pillepich et al. 2018a; Springel et al. 2018) at z = 0. Using the weighted F1 score as a benchmark, we first found that the Random Forest algorithm provides the best results. Then, we showed that the most accurate predictions of the T-Web environment from the graph properties are obtained for a dark matter density field smoothed over scales of 2 Mpc and an eigenvalue threshold of λth = 0.0. The preferred eigenvalue threshold turns out to be in the ballpark of the range favored in previous T-Web studies for the resulting classification to match the visual impression of the cosmic web (Forero-Romero et al. 2009).

For the best model, the weighted F1 score is 0.728 and the global accuracy is 74%. The environments, ranked from higher to lower completeness, are filaments (83%), peaks (70%), sheets (47%), and voids (0%). The difficulty of classifying galaxies or isolated halos inside voids remains an outstanding problem, as discussed in previous publications (Libeskind et al. 2018; Tsizh et al. 2020). In general, high completeness and F1 scores correlate with the number of instances in each class: a larger number of instances makes it easier for the algorithm to find the features relevant for a correct classification.

We compared our results against a similar methodology proposed by Tsizh et al. (2020). They used dark matter halos, fixed linking length network metrics, and extra gradient boosting decision trees to predict the T-Web environment. They reported a 51% accuracy, meaning that our method achieves 23 percentage points more. We showed that this improvement can be explained by the combination of three factors: the β-skeleton provides a slightly better description of the T-Web features; the meta-parameter search finds the best match between the β-skeleton and a given T-Web classification; and galaxies, rather than halos, are used as tracers.

Our results provide the baseline for future work that will have to quantify the effects of RSDs and survey incompleteness when predicting the dark matter T-Web from survey data. In order to facilitate future comparison and reuse of our results, we make our best-trained model public, together with a test data set, at https://github.com/jsuarez314/cosmicweb_bsk.

Data Availability

The data sets were derived from sources in the public domain: the Illustris-TNG project, https://www.tng-project.org/, and the Tracing the Cosmic Web comparison project, https://data.aip.de/projects/tracingthecosmicweb.htm.

The test data set and the best model presented in this paper are available at https://github.com/jsuarez314/cosmicweb_bsk.

This work was supported by National SKA Program of China No. 2020SKA0110401. X.D.L. acknowledges the support from the NSFC grant (No. 11803094), the Science and Technology Program of Guangzhou, China (No. 202002030360).

J.F.S.P. and J.E.F.R. acknowledge the support by the LACEGAL network with support from the European Union's Horizon 2020 Research and Innovation program under the Marie Sklodowska-Curie grant agreement number 734374.

We are thankful to the community developing and maintaining open source packages fundamental to our work: numpy & scipy (Van Der Walt et al. 2011), the Jupyter notebook (Kluyver et al. 2016), matplotlib (Hunter 2007), corner.py (Foreman-Mackey 2016), SciKit-learn (Pedregosa et al. 2011), and Xgboost (Chen & Guestrin 2016).
