Untangling the Galaxy. I. Local Structure and Star Formation History of the Milky Way

Marina Kounkel; Kevin Covey

doi:10.3847/1538-3881/ab339a

1. Introduction

It is typically recognized that stars tend to form in clustered environments (Lada & Lada 2003), although the sizes and longevity of these clusters do vary significantly. In the solar neighborhood, however, there are a number of examples of more diffuse modes of star formation as well. The closest young OB association, Sco OB2, has only a few dense subclusters; most of the young stars belong to various diffuse populations (Damiani et al. 2019). Another nearby massive star-forming complex, Orion, has a number of massive clusters (including the Orion Nebula), but it has a significant distributed component as well (Kounkel et al. 2018). The Taurus Molecular Clouds are notable for not having any major clusters (Luhman 2018).

After star-forming molecular clouds dissipate, stars outside of the most massive clusters are expected to disperse throughout the galaxy. The timescales of this process are not yet fully constrained. However, because a newly born population of stars would form from the same molecular cloud, it would have very similar kinematics (to within a few km s⁻¹), and it may take several tens if not hundreds of megayears for the population to fully lose coherence.

While there is a long-standing history of studying extended populations in young (<10 Myr) star-forming regions, studies of evolved structure are primarily limited to dense clusters. Some evolved nonclustered moving groups have been previously identified, but they were typically found within 100 pc. They include regions such as TW Hya, Tuc-Hor, AB Dor, Ursa Major, and several others (e.g., Bannister & Jameson 2005; Bell et al. 2015; Faherty et al. 2018; Lee & Song 2018). Few such populations are known at larger distances as, until recently, no sufficiently precise data on stellar kinematics were available.

This has recently changed with the release of Gaia DR2 (Gaia Collaboration et al. 2018), which provides unprecedented precision and sensitivity in measurements of stellar parallaxes and proper motions for 1.3 billion stars. On the bright end (G < 15) it can achieve precision to better than 40 μas in parallax and 60 μas yr⁻¹ in proper motions. On the faint end (G > 20) the typical uncertainties are 0.7 mas in parallax and 1.2 mas yr⁻¹ in proper motions. In addition, Gaia provides radial velocity (RV) measurements for 7.2 million stars (Cropper et al. 2018) with R ∼ 11,500 and a typical uncertainty of ∼2 km s⁻¹. Although neither the quantity nor the precision of the RV portion of the survey compares with the astrometric portion, it is (currently) the only large spectroscopic program that provides observations across the entire sky.

Using data from Gaia DR2, a number of studies of both individual clusters and their bulk distributions have been conducted, and a number of new open clusters have been found (e.g., Cantat-Gaudin et al. 2018; Castro-Ginard et al. 2019), showing that the current census of open clusters is still incomplete. Some studies focused on tracing the spiral arm structure and kinematics of the Milky Way (e.g., Bobylev & Bajkova 2019; Dias et al. 2019) using known clusters. There have also been studies of the overall spatial density distribution of young stars in the solar neighborhood (Zari et al. 2018), as well as detailed characterization of the structure and dynamics of individual star-forming regions (e.g., Kounkel et al. 2018; Cantat-Gaudin et al. 2019; Damiani et al. 2019).

In this paper, we identify extended structures, estimate their ages, and analyze their distribution. In Section 2 we describe the data selection process and the clustering that was performed. In Section 3 we discuss the estimation of ages in the identified groups. In Section 4 we describe the classification of groups that correspond to extended large-scale structures and their individual properties. We also focus on the distribution of these extended structures in the context of the Galaxy, and discuss the possible implications for the dynamical evolution of the Milky Way. Finally, in Section 5 we offer our conclusions. This paper specifically focuses on the generation of the catalog of members of comoving groups within 1 kpc and their characterization. Their analysis in the theoretical framework of the Galactic star formation, structure, and dynamics is deferred to subsequent works in the series.

2. Data

We used Gaia DR2 data for the analysis. We focus on the Galactic midplane, where star formation persists today, and the extended solar neighborhood, where extended structures will have the largest angular size. We select sources within $| b| \lt 30^\circ$ of the Galactic plane and π > 1 mas. We further limit our sample to sources with robust astrometric and photometric detections, requiring

1.
σ_π < 0.1 mas or σ_π/π < 0.1,
2.
1.0857/phot_g_mean_flux_over_error < 0.03,
3.
astrometric_sigma5d_max < 0.3,
4.
visibility_periods_used > 8,
5.
astrometric_excess_noise < 1 or(astrometric_excess_noise > 1 andastrometric_excess_noise_sig < 2),
6.
${v}_{\alpha ,\delta }^{\mathrm{lsr}}\lt 60\,\mathrm{km}\,{{\rm{s}}}^{-1}$ .

These cuts ensure a high-quality data set for our hierarchical clustering algorithm, which may perform poorly with data that are very uncertain. The resulting catalog consisted of 19.55 million stars, typically with G < 18 mag. The $| b| \lt 30^\circ$ cut excludes the space that is not expected to contain a significant number of clusters and comoving groups. The ones that are present at higher galactic latitudes would also be more difficult to recover with the chosen algorithm due to the distortions in l from cos b, and having strong outliers may have a negative effect on the recovery of the remaining structures (this is the same motivation for imposing the proper motion limit). The $| b| \lt 30^\circ$ cutoff was set to include everything up to and including the structures considered to be part of the Gould Belt.

We perform a clustering analysis on this data set using the Python implementation of the hierarchical density-based spatial clustering of applications with noise (HDBSCAN; McInnes et al. 2017). It is a hierarchical clustering algorithm adapted from the more commonly used DBSCAN (Ester et al. 1996). DBSCAN identifies clusters as overdensities in a multidimensional space in which the number of sources exceeds the required minimum number of points within a neighborhood of a particular linking length . HDBSCAN does not depend on ; instead it condenses the minimum spanning tree by pruning off the nodes that do not meet the minimum number of sources in a cluster and reanalyzing the nodes that do. Depending on the chosen algorithm, it would then either find the most persistent structure (through the excess of mass method), or return clusters as the leaves of the tree (which results in somewhat more homogeneous clusters). In both cases it is more effective at finding structures of varying densities in a given data set than DBSCAN.

The two main parameters that control HDBSCAN are the number of sources in a cluster and the number of samples. The former is the parameter that rejects groupings that are too small; the latter sets the threshold of how conservative the algorithm is in its considerations of the background noise (even if the resulting noisy groupings do meet the minimum cluster size). By default, the sample size is set to the same value as the cluster size, but it is possible to adjust them separately.¹

The clustering was performed on the 5D data set: Galactic coordinates l and b, parallax π, and proper motions. The conversion from the equatorial to the Galactic reference frame for the positions themselves is necessary, as most of the structure is located along the Galactic plane, and the cos δ term would add nonlinear distortions in α otherwise. However, in terms of the proper motions, the combinations of v_α, v_δ and v_l, v_b are the direct rotational transformation of each other, and thus they produce largely comparable outputs. Because the range of scatter in v_l is typically larger than in v_b, but v_α and v_δ tend to have similar ranges, the latter was chosen. Proper motions were converted to the local standard of rest (lsr) reference frame (using constants from Schönrich et al. 2010) to avoid distortions due to the line of sight from the solar motion, as well as converted to the physical units of km s⁻¹ to avoid the distortion in the distance.

Various scaling factors were considered to normalize each of the five dimensions, but they did not have a strong effect on the resulting groups, and they were left in their native units (i.e., degrees, mas, and km s⁻¹). We also considered additional transformations, such as distance instead of parallax or the XYZ positions, but they were not considered optimal due to a greater degree of noise in the outputs. We required a minimum number of samples to be 25 sources, with the minimum number of stars per cluster to be 40 stars.

While HDBSCAN is able to robustly recover structures of very different densities and sizes, the resulting distance matrix that is computed is strongly correlated with distance. The further a cluster is, the smaller it is going to be in the plane of the sky, the smaller the proper motions are going to be (and they will have more scatter due to uncertainties when transformed to the velocity space), and the smaller the parallaxes would be. Combined with the fact that there is more volume of space to encompass the structures that are more distant than those that are nearby, any clustering runs that include distant sources (i.e., have a smaller cutoff parallax) will be less attuned to the apparent densities of the nearest groups. The cores of the nearby massive clusters (e.g., Pleiades, Hyades) can be recovered in all the runs, regardless of the cutoff parallax, but the sources in their periphery, or some of the lower density structures have a significantly poorer recovery in clustering runs that extend to our full distance limit (1 kpc). We therefore performed several different runs with different cutoffs in parallax—10, 9, 8, 7, 6, 5, 4, 3, 2, 1.5, and 1 mas—using the "leaf" clustering method (Figure 1). The algorithm did not perform effectively for a cutoff parallax >10 mas, and thus, with exception of the Hyades, little of the structure within 100 pc can be recovered.

Compared to the leaf method, the default excess of mass (eom) method better recovers the extended lower density structure in various star-forming groups (e.g., Orion, Sco-Cen, Taurus, Perseus). In most cases, however, it would return only one or two groups for the entire volume of space being probed, merging all of the substructure into a single coherent distribution of sources. This may occur due to a large number of outlying sources that do not belong to physically significant structures. Only a single optimal configuration of the input data was found at a cutoff parallax of 2.0 mas that did not result in this overmerging. It was added as a supplement to the runs performed with the leaf method, which had a significantly more consistent performance in preserving distinct structures.

The Galactic coordinate grid is discontinuous at l = 0 = 360°. Thus, some of the structures that cross that boundary are artificially split (e.g., Figure 1). Thus multiple runs are performed for each parallax cut, spanning from 0° < l < 360°, and from −180° < l < 180° up to 2 mas. Because of the number of sources necessary to process, for computational efficiency, runs at 1.5 and 1 mas are split to run from 0° < l < 190° and from −180° < l < 10°.

Merging outputs of different runs with the same cutoff parallax (i.e., to stitch up the discontinuities in l) is largely a trivial process. In some of the overlapping clusters, a handful of sources might be absent in one run compared to another due to slight inconsistencies in the weights given by HDBSCAN with different input matrices. But they are tracing the same underlying structures, and most of them are identical, with primary differences occurring at the location of the split in l.

On the other hand, merging outputs from different parallax runs is somewhat more complex. With different density sensitivity, it is possible that in a particular run, some sources in neighboring structures might be clustered into one group, whereas these structures might be split in into distinct structures in a run sampling a different volume. All of the outputs from different cutoff parallax runs were examined for self-consistency of their position on the sky, proper motions, parallaxes, and the shape of the H-R diagram. They were merged manually, and split by hand if there were natural divides in any of the aforementioned spaces. Some of the obvious contamination in individual groups was also removed—this was most noticeable in young clusters located far above the main sequence on the H-R diagram that had a small fraction of more evolved stars that appeared to be uniformly distributed in all of the position–velocity parameter spaces. This amounted to ∼3000 stars, ∼1% of the total sample.

It should be reiterated, however that there is a great degree of complexity in the physical and kinematical structure even in the young star-forming regions. While they can be roughly broken apart into smaller components of varying densities that may evolve separately over time, they are fundamentally formed from the same parental cloud complex, and many of these divisions may be somewhat arbitrary. In Orion, for example, the complex can be subdivided into individual clouds, each cloud can then be subdivided into individual clusters and diffuse regions; the clusters themselves may be subdivided into subclusters and other notable features (Kounkel et al. 2018). There is no unique method with which all of the different scales of structure could be split into individual components, and the boundaries between them are fuzzy (e.g., Beccari et al. 2017; Zari et al. 2017; Großschedl et al. 2018; Chen et al. 2019; Kuhn et al. 2019; Getman et al. 2019). This may also the case for many of the individual groups identified in this work, as will be discussed in later sections.

The final catalog totals 1901 individual groups consisting of 288,370 stars (Table 1). Combined, these 288 K sources that were grouped together represent less than 1.5% of the ∼20 million stars in the original Gaia DR2 input catalog. The average properties of the resulting groups (further processed as described in Section 4; hereafter named Theia) are given in Table 2.

Table 1. Clustered Sources

Gaia DR2	α	δ	Theia
ID	(deg)	(deg)	Group ID
2172342682494385024	320.57220379	52.07733956	1
2170224782576556672	315.65202897	52.20035062	1
2168760576695182208	315.71659886	50.22889438	1

Only a portion of this table is shown here to demonstrate its form and content. A machine-readable version of the full table is available.

Download table as: Data Typeset image

Table 2. Identified Structures and Their Average Properties

Theia	Common	Age	A_V	String?	Projected	Projected	String	l	b	π
Group ID	Name	(dex)	(mag)		Width (pc)	Height (pc)	Length (pc)	(deg)	(deg)	(mas)
1	LDN 988e	6.1	1.7		9.7	7.1		90.922	2.794	1.677
9	IC 1396	6.6	1.2	y	47.2	8.9	130.4	102.828	4.276	1.104
16		6.9	1.5		4.2	4.8	0.0	319.605	1.527	1.089

v_l	v_b	v_r	X	Y	Z	U	V	W
(km s⁻¹)	(km s⁻¹)	(km s⁻¹)	(pc)	(pc)	(pc)	(km s⁻¹)	(km s⁻¹)	(km s⁻¹)
−16.354	−3.154	−9.3	−9.6	595.5	29.1	17.2	−9.0	−3.7
−29.085	−0.536	6.9	−207.3	880.8	68.3	27.8	12.7	0.3
−7.130	3.015	−0.6	698.9	−594.7	24.5	−5.4	−5.3	3.2

Only a portion of this table is shown here to demonstrate its form and content. A machine-readable version of the full table is available.

Download table as: Data Typeset image

Comparing this catalog to the catalog of open clusters identified in Gaia DR2 by Cantat-Gaudin et al. (2018), we recover 198 clusters out of 208 that are located within 1 kpc; within those we recover 73% of the sources identified as members of those clusters. However, the catalog presented in this work is significantly more sensitive to the extended structure, recovering not only gravitationally bound clusters, but also a number of comoving groups and associations, most of which have not been previously known (Figure 2).

**Figure 2.** Spatial distribution of sources in the final catalog in Galactic coordinates. Sources in common with Cantat-Gaudin et al. (2018) are shown in yellow.
Download figure:
Standard image High-resolution image

We estimate that as much as 5%–10% of the population inside of the identified groups may be contamination, consisting of field stars that have kinematics similar to those of the underlying cluster. Additionally, while it is difficult to estimate the precise fraction, there may also be some number of groups consisting of unrelated stars that appear to be comoving. Their fraction is expected to increase at larger distances. RVs from Gaia can confirm some of the structures, particularly those that are extended, such as those described in Section 4. However, as Gaia does not have RVs for all of the stars in the Theia catalog and the uncertainties for the existing measurements are large, it is difficult to determine which ones of the more compact groups are real and which ones are spurious. Future spectroscopic follow-up would be needed to distinguish between them based on coherence in RV and chemical composition. Nonetheless, even fake clusters can be used as tracers of the underlying structure and kinematics of the extended solar neighborhood.

3. Ages

In order to estimate the ages of the individual groups, two approaches were considered.

3.1. Machine Learning

To infer ages for each group identified in this analysis, we constructed a convolutional neural network (CNN) using PyTorch (Paszke et al. 2017). The CNN was trained on the cluster populations from Cantat-Gaudin et al. (2018), with the age and extinction labels from Kharchenko et al. (2013), with the combination of the Gaia, 2MASS, and Wide-field Infrared Survey Explorer (WISE) photometry which were obtained from the cross match tables (Marrese et al. 2019). Each individual cluster in the training set was subsampled repeatedly with a random number of sources from 40 to 250 stars, to increase the sample from 1196 clusters to 130,000 realizations of them. Even though the same stars were represented in many different realizations, because their combination was slightly different, it allowed the CNN to more effectively memorize the different features of clusters with various ages than it could have from a smaller set of clusters with more complete sampling of their stellar populations.

Additionally, the training set was supplemented by population drawn from synthetically generated clusters. We generated clusters with a uniform distribution of ages (in log space) from 1 Myr to 10 Gyr with a 2 Myr scatter within a cluster. The clusters' remaining properties were randomly generated as well, with uniform Fe/H from −2.5 to 0.5, uniform distances from 50 to 3000 pc with a 2 pc scatter, uniform A_V from 0 to 2 with 0.05 scatter, and the masses of the individual stars were drawn from the initial mass function. Gaia DR2 (G, BP, RP), 2MASS (J, H, K), and WISE (W1, W2, W3) fluxes were interpolated using the PARSEC isochrones (Marigo et al. 2017), and the typical uncertainties in each band (as well as in the parallax) were applied. We included only the fluxes brighter than G < 19, BP < 20.5, RP < 17.5, J < 17, H < 16, K < 16, W1 < 16, W2 < 16, W3 < 13.5, which are the typical limiting magnitudes in our observed sample. Only the G band is required for all of the synthetic sources; if other bands are undetected they were set to the edge value. A total of 150,000 synthetic clusters were generated to supplement the real observational data, in which the missing fluxes were processed in the same manner.

All of the sources in the individual clusters were ordered based on their M_G, and all of the nine fluxes and the parallax were passed to the CNN which consisted of seven 2D-convolutional and two fully connected layers, architecture of which is shown in Appendix.

3.2. Isochrone Fitting

In addition to the machine learning approach, we estimate the ages of the individual groups through traditional isochrone fitting. We used the Bayesian analysis for stellar evolution with nine variables (BASE-9; von Hippel et al. 2006; Robinson et al. 2016). Inputs for BASE-9 fitting were the absolute photometry from the Gaia filters, incorporating σ_π as part of the photometric uncertainties, with a minimum photometric uncertainty of 0.02 mag. PARSEC isochrones (Marigo et al. 2017) were used for fitting, although their current implementation in BASE-9 cannot estimate ages younger than 7.4 dex. For computational expediency, binaries were not treated in the fitting process, and various combinations of the input parameters and chain lengths were considered. However, due to the volume of data, the resulting chains were not evaluated on whether the output was correlated or not.

3.3. Comparison and the Final Parameters

The resulting isochrones, constructed both from the parameters predicted by CNN and those fitted using BASE-9, were visually examined against a color–magnitude diagram for each of the identified groups. Multiple outputs were averaged together if they produced a reasonable approximation of the data. The most successful approach was the isochrone fitting restricted to solar metallicity, not fitting for a potential offset in the distance modulus due to the systematics of the Gaia parallaxes (e.g., Stassun & Torres 2018), and using the predictions from CNN as the initial estimates for age and A_V. This produced a successful fit in 77% of cases. Machine learning on its own produced reasonable approximation in 44% of cases. BASE-9 fitting with additional parameters such as distance and metallicity could approximate isochrones for only 24% of groups (without additional tuning of the inputs), and 6% of groups struggled to be characterized by either method and required manual fitting (Figure 3).

**Figure 3.** Distribution of success cases of various methods to determine age and extinction.
Download figure:
Standard image High-resolution image

From the scatter in the accepted parameters we estimate the typical uncertainties in age to be ∼0.15 dex and ∼0.2 mag in A_V. This is a sufficient degree of precision for the purposes of the analysis in this work, however, these age estimates do not supersede those from the more detailed studies of individual clusters, and more work will be needed in the future to improve on these estimates, such as through gyrochronology. The primary origin of the uncertainty in our age determination originates from a lack of a distinct turn-off of the top of the main sequence to the red giant branch, as well as over-inflated low-mass stars compared to the isochrones in some of the populations (e.g., Jackson et al. 2018). These uncertainties partially reflect stochastic effects, due to both intrinsic astrophysical fluctuations in the cluster's population and subsampling by the fitting routines, but also competing systematic biases (i.e., where a missing turn-off star will drive the cluster to older ages, whereas the mass dependent offsets from isochrones at lower masses drive to younger age estimates). However, we do note that the ages were derived independently of any analysis of the galactic positions or kinematics (Section 4). The apparent continuous evolution of the populations in time and their consistency with the Galactic kinematics do demonstrate that the derived ages are reliable and self-consistent. Furthermore, the overall position of the sources on the H-R diagram (Figure 4) does also show a smooth gradient as a function of age.

**Figure 4.** Extinction-corrected H-R diagram of all the sources, color coded by their measured age. An interactive version of the figure, with a combination of various photometric bands, is available in the online version.
Start interaction Open in new tab
Download figure:
Standard image High-resolution image Figure data file

While using machine learning to predict cluster parameters is still inferior to the direct fitting method, it does show potential, particularly as it requires fewer computational resources to process large quantities of data. A fully trained network can evaluate the parameters in the entire data set in a matter of seconds. At the same time, while BASE-9 utilizes a Bayesian approach in evaluating the parameters through a Markov chain Monte Carlo, the quality of the fit is dependent on the input parameters. And, in particular, BASE-9 appears to run just a single long chain with a very limited number of walkers, which can be suboptimal compared to multiple walkers in probing the full parameter space (Goodman & Weare 2010; Foreman-Mackey et al. 2013; Bian et al. 2017). While it is possible to run BASE-9 multiple times to produce multiple chains, as mentioned before, due to the volume of data it was impractical to do.

The overall distribution of the measured ages is shown in Figure 5 and further discussed in the subsequent sections. The 3D distribution of the measured A_V is shown in Figure 6; it is largely self-consistent (i.e., increasing with distance along the line of sight) and the overall distribution is consistent on a per cluster basis with what is mapped by the extinction measurements of individual sources from Gaia DR2.

**Figure 6.** Map of the measured extinction; the Galactic center is toward the right of the plot.
Download figure:
Standard image High-resolution image

4. Strings

4.1. Manual Assembly

The distribution of many of the identified groups appears to form filamentary structures that we will refer to as strings. These strings are roughly parallel to the Galactic plane, they appear to be coherent both spatially and kinematically, and the stars in them generally can be characterized with a single isochrone (with a possible scatter of few megayears). In some cases, the entire identified group may correspond to a single such string. In other cases, multiple groups with similar ages appear to form spatially extended but kinematically coherent strings, with linear extension of which was segmented by HDBSCAN.

To identify and manually reconstruct these strings that consist of multiple groups, we have examined the distribution of sources in l versus b, l versus π, l versus v_l, l versus v_b,and over a range of age slices (e.g., Figure 7). Using TOPCAT (Taylor 2005), we selected a structure that appears to be most well-defined and separatable from its surroundings in one of these planes. We compared their coherence in the other planes, removing outlying sources in each one, until the remaining structure was fully continuous coherent in all kinematic and spatial dimensions. We verified that the manual selection was representative of the groups to which these sources belong, in that individual groups were not split into two, and most of their sources were recovered with the selection. These groups were then joined into a common string. If, instead of multiple groups, only a single filament-like group remained after the selection process, its classification was also assigned to a string. Such strings that consisted of a single group typically have comparable filamentary shape to what is found in multiple group strings. The search was continued after removing the sources inside the identified string from the catalog that would be used in the next iteration of the search process. This process was able to untangle multiple structures that were projected on top of each other in any of the dimensions. It should be noted that a union of some individual clusters may appear to form coherent structures in some projections of the phase space, but they tend to become incoherent in the remaining projections. No grouping was applied in such cases.

**Figure 7.** Demonstration of the selection of a string. NGC 2232 is shown as an example. Blue colors correspond to the full catalog for ages of up to 7.5 dex. Sources selected at the current step are shown in yellow, while the sources selected at the previous step and discarded at the current step are shown in red. (A) Rough selection of a string in l vs. b. (B) Refinement in l vs. v_l, removing sources with distinct, noncontinuous v_l kinematics. (C) Refinement in l vs. v_b. (D) Refinement in l vs. π. (E) The resulting selection in l vs. b space, for comparison with panel (A). (F) The remaining catalog, with the groups that compose the string removed.
Download figure:
Standard image High-resolution image

Afterwards, we examined the remaining individual groups that have not been joined into strings, searching for extended structure with a strong eccentric distribution between l and other dimensions, and they were similarly classified as strings.

RVs were not used in identification of the strings due to their smaller sample size and poorer quality, although it should be noted that they tend to show a similar arrangement as a function of l as can be observed in other dimensions. Once a more complete RV survey is available, however, the coherence of all the identified structures should be further revisited with the knowledge of the full 3D space motions.

4.2. Properties of Strings

In total, 138,075 sources (approximately half of the full catalog) are found in 328 strings (Figure 8). The remaining 150,295 sources are located in 1312 groups² (Figure 8). To the best effort, the separation into groups and strings was done self-consistently through the visual examination, however, in some cases the difference between the two can be poorly defined, as most groups do show some degree of extension along l (Figure 9). There is some bias in the identification ability as a function of distance, as nearby short strings would be more easily apparent due to their large angular size compared to those that are further away. The typical length of individual strings when traced along their full extent along their spine is ∼200 pc.

**Figure 9.** Apparent sizes of the groups and the strings shown as a function of their relation of various dimensions. Projected width and length are measured from the standard deviation in l and b converted to the physical units at the typical distance to the structure. The length is traced along the spine of the string end-to-end in the *XYZ* plane. The black line shows the line of equity.
Download figure:
Standard image High-resolution image

Nearby strings in particular appear to span large angular sizes. While it is not physically the largest, one of the most spectacular ones counts the cluster α Per as a part of it, and it extends for over 120° (Figure 10). Another peculiar one is located just north of the Orion Complex, containing NGC 2232, and it is probably the parental population that has formed Betelgeuse. Orion Complex itself can be imagined as a single string-like structure, although it has a more complex structure and star-forming history than what is typical. The Sco-Cen OB association appears to consist of two closely connected strings: one that extends from Upper Sco to Lower Centaurus-Crux (as is typically recognized), and another that extends from Lower Centaurus-Crux to the Corona Australis star-forming region, forming a shape of a wishbone that diverges both spatially and kinematically on one end. Although they are related, they were treated separately during the selection. However, both of them are physically smaller than a typical string.

**Figure 10.** Some examples of notable structures, plotted in the galactic coordinates. See the interactive version of Figure 8 for full projections of all of the individual strings.
Download figure:
Standard image High-resolution image

Recently a similar extended structure has been identified in the solar neighborhood, which is referred to as the Pisces-Eridanus stream (Curtis et al. 2019; Meingast et al. 2019). It is located at a distance of ∼100 pc, spans ∼120°, and has an age of ∼120 Myr. We do not recover it due to its height above the Galactic plane. However, it is likely to be similar to the strings we do find.

Looking at the distribution of all of the sources as a whole, kinematically, there does not appear to be a significant difference between strings and groups, both showing a similar structure in the position–velocity diagrams. However, there is a significant difference in the age distribution between the two (Figure 5). Almost all of the newly formed stars can be collected into various strings. And while some strings may survive up to a few gigayears, they become significantly less numerous at ages beyond 8.5 dex compared to the younger counterparts. On the other hand, the age distribution of non-strung together groups does not peak until 8.6 dex.

While there is no correlation between the size of a structure and its age (Figure 9), the number of sources in individual groups and strings appears to decrease strongly for older structures. There is no characteristic size of the newly born populations, as they can contain anywhere from a few dozen to several thousand stars. On the other hand, the oldest populations rarely contain more than 100 stars. This effect is not caused by the distance-driven incompleteness (Figure 11). Assuming that it is possible to map the most massive young populations to the most massive old ones, the number of surviving sources within a structure can be roughly characterized as a function of age with a power law of ${N}_{* }={N}_{\circ }\,\times {10}^{-(t-{t}_{\circ })/1.5}$ . Regarding the apparent sizes, although there may be some apparent decrease in the projected size in age in Figure 9, it is largely driven by the number of stars in each population; there is no trend in age with the absolute length that is traced in 3D end to end.

**Figure 11.** Number of stars in individual groups and strings in various distance slices. The black line shows an approximate evolution of the group size is the power law with the slope of ∼−1.5.
Download figure:
Standard image High-resolution image

**Figure 12.** Velocity dispersion in three-dimensions (l, b, and radial) for individual groups and strings. The underlying spatial dependence in kinematics as a function of l was subtracted prior to measuring the dispersion.
Download figure:
Standard image High-resolution image

Velocity dispersion for the surviving members of the individual structures that have not been dissolved into the field changes as a function of age as well, with ${\sigma }_{{v}_{l}}$ and ${\sigma }_{{v}_{b}}$ increasing from ∼0.75 km s⁻¹ to ∼1.5 km s⁻¹ between ages of 7 and 9 dex (Figure 12). Because RV measurements from Gaia DR2 have large uncertainties, and they have not been examined for multiplicity, the specific σ measurements are less meaningful, but they do show a similar underlying trend. The measured velocity dispersions are the lower limit for the populations, if it were possible to incorporate all the original members that have long since dissolved into the field. However, it should be noted that populations that form with a smaller number of stars tend to have a smaller velocity dispersion than those that are more massive at any given age, therefore the evolution toward higher velocity dispersion is in part due to a relaxation of the smaller groups over time. However, if it were possible to account for all the members that have formed within any given population and have since dissolved into the field, the absolute velocity dispersion should increase more significantly.

In addition to the strong prominence of stringing in younger populations, many strings do not appear to have an open cluster at their core. Because of this, it is unlikely their shape originates from tidal stretching in comparison to the various streams associated with globular clusters or dwarf galaxies. Rather, for young strings this extended shape is likely primordial, mirroring the shape of the molecular cloud from which they have formed. Possibly, some of them originate from giant molecular filaments (e.g., Ragan et al. 2014; Zucker et al. 2018). While it may be surprising that they can survive as quasi-coherent structures into the gigayear age, these strings do not show a significant evolution beyond slowly dissolving into the field, and the oldest strings are most likely the remnants of some of the most massive structures that have originally formed. While we characterize their present day structure and kinematics in the next section, we defer a deeper exploration of various formation scenarios and tests of dynamical stability in the Galactic potential to future work.

4.3. Large-scale Structure and Dynamics

To place the individual strings into the physical galactic reference frame, we first determine their individual 2D structure. To trace the spine of each string, we average b, π, and the kinematics in 1° bins along l, and then smooth the resulting arrays with the Savitzky-Golay filter (Table 3) using a 20° kernel or half the full width of the string in l, whichever one is the smallest. This is done to avoid strong fluctuations in the averages that could result from too few sources in a given bin. The positions are then converted into the heliocentric XYZ reference frame, and the UVW velocity vectors are computed using GalPy (Bovy 2015).

Table 3. Traces of the Strings

Theia	l	b	π	v_l	v_b	v_r	X	Y	Z	U	V	W
Group ID	(deg)	(deg)	(mas)	(km s⁻¹)	(km s⁻¹)	(km s⁻¹)	(pc)	(pc)	(pc)	(km s⁻¹)	(km s⁻¹)	(km s⁻¹)
9	98.5	3.873	1.102	−8.570	−0.339	8.4	−133.9	895.8	61.3	36.0	14.1	−0.9
9	99.5	3.771	1.097	−8.025	−0.403	8.4	−150.1	897.2	60.0	33.6	14.4	−1.2
9	100.5	3.955	1.097	−7.352	−0.300	9.1	−165.7	894.1	62.9	30.2	15.1	−0.7

Only a portion of this table is shown here to demonstrate its form and content. A machine-readable version of the full table is available.

Download table as: Data Typeset image

Strings with ages <7.9 dex appear to jointly form a coherent structure, Stream 1, that is ∼300 pc wide, extending beyond the boundary probed by this sample (Figure 13). However, older strings do not coincide with it. Instead, those with ages between 8.1 and 8.7 appear to form a separate coherent Stream 2 that is oriented to the younger one by ∼60°. Furthermore at ages beyond 8.7, two other streams (one that is located at Y ∼ 500 pc, Stream 3, which has also recently been identified as a stellar overdensity by Miyachi et al. (2019), and one that is at Y ∼ −700 pc, Stream 4) may be apparent. In all of the cases, the strings appear to be preferentially oriented perpendicularly to these streams.

**Figure 13.** Face-on and edge-on map of the solar neighborhood with the identified strings and groups, color coded by age, in various age slices. Thinner lines show the velocities of the structures over the next 5 Myr, ignoring the galactic potential. An interactive 3D plot is available in the online version.
Start interaction Open in new tab
Download figure:
Standard image High-resolution image Figure data file

Kinematically, the streams are, on average, in the local standard of rest (Figure 14). However, while the lsr corrections from Schönrich et al. (2010) hold on average, there are unaccounted cross terms. While ${V}_{\mathrm{ave}}\sim {V}_{\odot }$ and W_ave ∼ W_⊙, there is a strong dependence between ( $U\equiv {\rm{\Delta }}X$ ) and Y. This dependence can be characterized by

$\begin{eqnarray}&&{U}_{\mathrm{ave}}(\mathrm{km}\ {{\rm{s}}}^{-1})={U}_{\odot }+2.14-0.030\times (Y\mathrm{pc}).\end{eqnarray} \tag{ 1 }$

Previously, Schönrich (2012) has observed a signature of galactic rotation in the combination of the stellar velocities. Upon examining unclustered sources within 1 kpc, two cross terms are apparent: a strong correlation between Y and U as well as one between X and V that do indeed resemble a rotation. On the other hand, clustered sources do not show a correlation between the X position and V velocity, making their global motion in the same volume of space resemble a (local) shear as opposed to rotation, as can be seen in Figure 13.

The scatter in U, V, and W for the clustered structures increases as a function of age (Figure 14), and the scatter of the remaining sources that are not part of the clustered catalog is larger yet. We can assume that most of the unclustered sources are representative of the Milky Way disk population and are older than a few gigayears. Similarly, the scatter in Z also increases with age, although this is less apparent for the younger structures due to a large-scale warp throughout the Stream 1. The Gould Belt, which is formed by the nearby star-forming regions, is tilted relative to the Galactic plane by ∼20°; several mechanisms have been proposed for its formation in the past assuming a circular shape, although recently it has been shown that Gould's Belt does not form a ring (Zari et al. 2018). Instead, it is likely that its tilt is part of a larger warp of the Local Arm itself. Such a warp is not apparent in the older streams, although due to them being thicker, it would be more difficult to confirm.

5. Discussion

In this work we analyzed the distribution of sources in 5D space in Gaia DR2 data. Using HDBSCAN we searched for extended coherent structures. We found a total of 1900 clusters and comoving groups within 1 kpc, consisting of ∼300,000 sources, and we measured ages for these groups using a combination of isochrone fitting and machine learning.

Approximately half of all of the clustered sources are found in various strings that typically span ∼200 pc, and they are oriented in parallel to the galactic plane. In most cases we expect that these strings are primordial and are not formed as a result of a tidal disruption of the clusters. This stringing is most prominent in the youngest populations. However some may survive into gigayear age, dissolving the majority of their sources into the field, and increasing in their velocity dispersion.

The distribution of strings varies significantly as a function of age. Strings of similar ages (to within a few hundred megayears) appear to form four coherent streams, and they are preferentially oriented orthogonal to them.

Stream 1, which is the youngest, appears to coincide with the Local Arm of the Milky Way. Expanding the search radius beyond 1 kpc will be necessary in the future to confirm that a similar stringing occurs along it outside of the immediate solar neighborhood. However, if it is indeed a signature of the Local Arm, then the other streams could also be considered as remnants from the older spiral arm-like structures that can no longer be traced with gas. Because they have well-defined age ranges, this may suggest that the spiral arm pattern of a galaxy undergoes evolution over a period of a couple of orbital periods, thus supporting simulations that produce transient spiral arms (e.g., Kawata et al. 2014; Pettitt et al. 2015).

The arrangement of the strings preferentially perpendicular to the streams is also notable, as it might suggest a major mode of the giant molecular cloud assembly. Detailed comparison to simulations will be necessary to understand the dominant mechanisms behind it.

We thank Tristan Cantat-Gaudin for the wonderful discussion. Additionally, we acknowledge Brian Hutchinson and Richard Olney in regards to the assistance in construction of the CNN, and Ted von Hippel and Elliot Robinson in helping to set up BASE-9. M.K. and K.C. acknowledge support provided by the NSF through grant AST-1449476, and from the Research Corporation via a Time Domain Astrophysics Scialog award (#24217). This work has made use of data from the European Space Agency (ESA) mission Gaia (https://www.cosmos.esa.int/gaia), processed by the Gaia Data Processing and Analysis Consortium (DPAC; https://www.cosmos.esa.int/web/gaia/dpac/consortium). Funding for the DPAC has been provided by national institutions, in particular the institutions participating in the Gaia Multilateral Agreement.

Software: TOPCAT (Taylor 2005), HDBSCAN (McInnes et al. 2017), PyTorch (Paszke et al. 2017), BASE-9 (Robinson et al. 2016), GalPy (Bovy 2015).

Appendix: CNN Architecture

class Net(nn.Module):

def __init__(self, input_shape = (1, 10, 250)):

super(Net, self).__init__()

self.conv1 = nn.Conv2d(1, 8, 3, padding = 1)

self.conv2 = nn.Conv2d(8, 8, 3, padding = 1)

self.conv3 = nn.Conv2d(8, 16, 3, padding = 1)

self.conv4 = nn.Conv2d(16, 16, 3, padding = 1)

self.conv5 = nn.Conv2d(16, 32, 3, padding = 1)

self.conv6 = nn.Conv2d(32, 32, 3, padding = 1)

self.conv7 = nn.Conv2d(32, 64, 3, padding = 1)

self.fc1 = nn.Linear(448, 512)

self.fc2 = nn.Linear(512, 512)

self.fc3 = nn.Linear(512, 2)

def _forward_features(self, x):

x = F.max_pool2d(F.relu(self.conv2(F.relu(self.conv1(x)))), 2)

x = F.max_pool2d(F.relu(self.conv4(F.relu(self.conv3(x)))), 2)

x = F.relu(F.max_pool2d(self.conv5(x), 2))

x = F.relu(F.max_pool2d(self.conv6(x), 2))

x = F.relu(F.max_pool2d(self.conv7(x), 2))

return x

def forward(self, x):

x = self._forward_features(x)

x = x.view(x.size(0), −1)

x = F.relu(self.fc1(x))

x = F.relu(self.fc2(x))

x = self.fc3(x)

return x

Download table as: ASCII Typeset image

The inputs are formatted in a 10 × 250 matrix, with 10 columns sorted into 9 photometric bands and the parallaxes, and the rows containing information on the individual stars, sorted by the M_G flux. The network first performs a 2D convolution, breaking this input into eight channels, passing this output through another 2D convolution, reassembling it in the same shape. Then the matrix is downsampled using max pooling reducing dimensionality, turning the 8 × 10 × 250 matrix into 8 × 5 × 125. Another 2D convolution of a 2D convolution is performed, breaking it into 16 channels, and after max pooling, turning it into a 16 × 2 × 62 matrix. Then 2D convolution followed by max pooling is performed three times, resulting in 32 × 2 × 31, 16 × 1 × 15, and 64 × 1 × 7. The resulting 448 nodes are stacked, and they are fully connected to a first hidden layer with 512 nodes, followed by a second hidden layer with 512 nodes, which is then connects to the two outputs, which are the age and the extinction value for each cluster.

Untangling the Galaxy. I. Local Structure and Star Formation History of the Milky Way

Article metrics

Permissions

Author e-mails

Author affiliations

ORCID iDs

Dates

Abstract

1. Introduction

2. Data