Identifying Physical Structures in Our Galaxy with Gaussian Mixture Models: An Unsupervised Machine Learning Technique

We explore the potential of the Gaussian mixture model (GMM), an unsupervised machine-learning method, to identify coherent physical structures in the interstellar medium. The implementation we present can be used on any kind of spatially and spectrally resolved data set. We provide a step-by-step guide to use these models on different sources and data sets. Following the guide, we run the models on NGC 1977, RCW 120, and RCW 49 using the [C ii] 158 μm mapping observations from the SOFIA telescope. We find that the models identified six, four, and five velocity coherent physical structures in NGC 1977, RCW 120, and RCW 49, respectively, which are validated by analyzing the observed spectra toward these structures and by comparison to earlier findings. In this work we demonstrate that GMM is a powerful tool that can better automate the process of spatial and spectral analysis to interpret mapping observations.


INTRODUCTION
Massive stars (> 8 M ⊙ ) play one of the most important roles in regulating the physics and chemistry of the interstellar medium (ISM).During their lifetime (< 5 Myr), massive stars inject enormous amounts of radiative energy through extreme-ultraviolet (EUV) and far-UV (FUV) photons (Hollenbach & Tielens 1999), and mechanical energy through stellar winds (Castor et al. 1975;Weaver et al. 1977;Pabst et al. 2019) into their surroundings.This causes diffuse and dense gas to form different structures either by disrupting or compressing molecular clouds in the vicinity of massive stars (Elmegreen & Lada 1977;Walborn et al. 2002).These structures can be spatially and spectrally distinct.Their physical conditions are among others dependent on morphology, relative location to the main ionising source and the star formation history of the region (Tiwari * Both authors contributed equally to this work.et al. 2022).Estimations of these physical conditions quantify the role of stellar feedback in the evolution of the ISM.However, the identification of coherent physical structures in the ISM is not straightforward due to its turbulent nature.Recent observational studies have shown that besides high angular resolution data, a detailed spectral and spatial analysis is necessary to identify coherent physical structures in our Galaxy (e.g., Hacar et al. 2013;Henshaw et al. 2019).
Soon balloon missions such as ASTROS (Astrophysics Stratospheric Telescope for High Spectral Resolution Observations at Submillimeter-wavelengths, Pineda et al. 2022) and GUSTO (Galactic/ Extragalactic Spectroscopic Terahertz Observatory, Goldsmith et al. 2022) will carry out large-area surveys of [N II] , [C II] and [O I] fine-structure lines throughout the Galaxy with high spatial and spectral resolutions.Besides these balloon missions, the GEco (Galactic Ecology, Simon et al. 2023) project is an upcoming large scale survey with the CCAT observatory which will observe CO and [C I] lines.Identifying structures which are spatially and spectrally distinct in these large areas is a big challenge.Manually analysing these huge data sets will be time consuming and cost ineffective.Therefore we emphasize the need of an automated technique to assist in this process.
To this end, we explore the usability of the Gaussian Mixture Model (GMM), an automated technique to identify distinct physical structures in three Galactic sources : NGC 1977, RCW 120 and RCW 49.We use the model on [C II] line observations of these regions.Apart from being one of the major coolants of the ISM, the 1.9 THz fine-structure line of ionized carbon, C + ([C II] ), is also among the brightest lines in Photo Dissociation Regions (PDRs) (Crawford et al. 1985;Stacey et al. 1991;Bennett et al. 1994).PDRs are exposed to FUV photons (6 eV < hν < 13.6 eV) where H 2 is dissociated to H (hydrogen) and C (carbon) is ionised to C + .Owing to the lower ionisation potential of C than that of H, [C II] traces the transition from H + to H and H 2 (Hollenbach & Tielens 1999;Wolfire et al. 2022).Recent observational studies have proved [C II] to be a powerful diagnostic in improving our understanding of stellar feedback and massive star formation.It probes dynamic structures like expanding shells, photoradiated surfaces of molecular clouds, and pillars in the ISM (Pabst et al. 2019(Pabst et al. , 2020;;Luisi et al. 2021;Tiwari et al. 2021;Bonne et al. 2022;Kavak et al. 2022).Moreover, Schneider et al. (2023) recently reported the role of [C II] in unveiling the dynamic interactions between cloud ensembles in the Cygnus region of our Galaxy.This makes [C II] a favourable candidate for this study.
In this paper, we provide a step-by-step guide for the community to be able to use the GMM to assist in identifying coherent physical structures (expanding shell/bubbles, interacting molecular clouds, etc) in the ISM.We also discuss the model results on three Galactic sources and evaluate the accuracy of the models in identifying different clusters in a region.We note that throughout this paper, we use the term 'clusters' to refer to groups of similar spectra, as described below, and not to clusters of stars.

GAUSSIAN MIXTURE MODEL
In principle, machine learning techniques require learning through a training set.In contrast, Gaussian Mixture Models (GMM) are a standard statistical regression procedure that is best calculated through maximum likelihood with the EM Algorithm (McLachlan & Peel 2000;Dempster et al. 1977;Bouveyron et al. 2019;Frühwirth-Schnatter et al. 2019).Hence, strictly speak-ing, Gaussian Mixture Models are not a machine learning technique.Nevertheless, in the literature, Gaussian Mixture Models are commonly referred to as an unsupervised machine learning method (Murphy 2013) and, here, we adhere to that choice by the community.GMM aims to describe complex high-dimensional data as a linear combination of multidimensional Gaussian distribution (McLachlan & Peel 2000).These models are well studied and have been improved by e.g.Smyth & Wolpert (1999); Blei & Jordan (2004); Bovy et al. (2011); Melchior & Goulding (2018).Their various applications indicate the importance of this tool in astronomy and other fields (Greenspan et al. 2006;Jones et al. 2019;Riaz et al. 2020;Kabanovic et al. 2022).
Our observed dataset consists of N spectra, each of which give the main beam temperature (T MB ) as a function of velocity (v LSR ).We can interpret each spectrum as a single point x i in D dimensional space where D is the number of velocity channels.The GMM attempts to find clusters of spectra that have similar profiles to each other.It does so by assuming the dataset is generated from a limited number (K) of Gaussian clusters each with mean µ k and covariance, Σ k .The mean of each Gaussian cluster can be interpreted as its average spectrum.By describing the set of all observed spectra as a weighted linear combination of a limited number of clusters, the probabilistic distribution is determined by maximizing the likelihood, defined as: (1) Here N ( x i | µ k , Σ k ) represents the multidimensional Gaussian distribution and α k is the weight attributed to that cluster with k α k = 1.The likelihood is computed by summing over all K Gaussian components for all N spectra individually.The optimization is performed using the Expectation-Maximization (EM) algorithm (Dempster et al. 1977).This method has recently been used by Kabanovic et al. (2022) to recognize several distinct physical structures in velocity resolved observations of Atacama Pathfinder EXperiment (APEX) telescope's 12 CO data towards RCW 120.
In this work we aim to expand upon earlier research and implement the more advanced GMMis (Melchior & Goulding 2018).Unlike a traditional GMM this implementation is capable of dealing with noise by enforcing a minimum covariance on the clusters, and of automatically determining an optimal number of clusters by setting the weight of a less probable cluster to extremely low (∼ 10 −10 ) values.These expansions are achieved through adaptations of the likelihood function and the  1) intensity maps of [C II] towards NGC 1977, RCW 120 and RCW 49.The pink colored stars mark the main ionizing sources in these regions.These plots are made using regrided data cubes for the three sources based on S/N requirement described in Sec. 5.
optimization algorithm.This behaviour of the GMMis can be tuned through a set of hyperparameters.These are different from model parameters (such as each µ k and Σ k ) in that they are set a priori and influence the resulting model.Examples of hyperparameters here are the number of provided clusters, the minimum covariance regularization and the stopping criterion.We discuss the considerations and choices for hyperparameters further in Section 5.2.
The complete algorithms and parameter definitions are described in detail in Melchior & Goulding (2018) and in their PyGMMis code. 1 The method and code used to get the results presented in this work are publicly available2 at the digital repository of University of Maryland.

SOFIA OBSERVATIONS
In this work, we use the [C II] 158 µm (1.9 THz) observations, which were observed using the upGREAT3 (Risacher et al. 2018) heterodyne receiver on board the Stratospheric Observatory for Infrared Astronomy (SOFIA, Young et al. 2012).On-the-fly maps towards NGC 1977, RCW 120 and RCW 49 were observed in February 2017 and June 2019.RCW 120 and RCW 49 were observed as a part of the SOFIA Legacy program, FEEDBACK (Schneider et al. 2020).A map size of 15 ′ × 15 ′ was observed for RCW 120 (Luisi et al. 2021;Kabanovic et al. 2022) and for RCW 49, the map size was 25.1 ′ × 25.1 ′ (Tiwari et al. 2021).For NGC 1977, we cropped out a 36.2 ′ × 32.5 ′ region from a 1.15 square degree map of the Orion Nebula complex (Pabst et al. 2020) centered on the NGC 1977 bubble.The native spatial and spectral resolution of the observations at 1.9 THz were 14.1 ′′ and 0.04 km s −1 .For more observational details see Schneider et al. (2020) and Higgins et al. (2021).

SOURCES
We selected three sources with relatively different complexity in terms of ionising source, morphology and star formation activity.Fig. 1 shows the velocity integrated intensity maps of NGC 1977, RCW 120 and RCW 49.We introduce the sources below: NGC 1977NGC 1977 is the northern-most shell in the Orion Nebula Complex within Orion A, which is ∼ 395 pc away (Großschedl et al. 2018).A B1V type star, 42 Orionis, is the main ionising source in this region (Peterson & Megeath 2008).Pabst et al. (2020) reported a shell in NGC 1977 of radius ∼ 1 pc which is expanding at a speed of ∼ 1.5 km s −1 and has a mass of ∼ 700 M ⊙ .The thermal pressure of the ionised gas is responsible for powering the expansion of the shell in NGC 1977.The expansion timescale is estimated to be 0.4 Myr from the Spitzer (1968) solution of thermal expansion of an H II region.

RCW 120
RCW 120 is a southern H II region, mainly ionized by a single O8V type star, CD −38 • 11636 (LSS 3959) (Georgelin & Georgelin 1970), and is located at a distance of 1.7 kpc.Two molecular clouds are associated Note-The rms is in the units of K, the velocity range is in the units of km s −1 , ninput is the number of clusters we give in the models as input, noutput is the number of clusters identified by the models and n output final is noutput excluding noise artefacts and background emission.
with RCW 120.One is blue-shifted and bright within a velocity range of -37 to -20 km s −1 , while the other is red-shifted and bright within a velocity range of -20 to 10 km s −1 .The near-perfect circular morphology with a prominent opening to the north of RCW 120 as seen in far and mid-infrared wavelengths belongs to the red-shifted cloud component (Zavagno et al. 2007;Deharveng et al. 2009;Luisi et al. 2021;Kabanovic et al. 2022).The stellar wind of the O-type stars drives the expanding shell with a velocity of 15 km s −1 and a mass of up to 520 M ⊙ (Luisi et al. 2021).The fast expansion speed of the bubble constrains the lifetime of the region to ∼ 0.15 Myr.X-ray observations suggest that hot plasma is venting out to the east and through an opening in the north, such that 20% of the thermal energy is lost (Luisi et al. 2021).

RCW 49
RCW 49 is one of the most luminous star-forming regions in the southern Galaxy and is located at a distance of 4.16 kpc (Vargas Álvarez et al. 2013;Zeidler et al. 2015;Tiwari et al. 2021).A compact stellar cluster, Westerlund 2 (Wd2), comprises 37 OB stars and ∼ 30 early-type OB star candidates around it.Moreover, a Wolf-Rayet (WR) binary (WR20a) is part of the central Wd2 cluster, while another WR star (WR20b) and an O5V star are a few arcminutes away from the cluster center (Ascenso et al. 2007;Tsujimoto et al. 2007;Rauw et al. 2011;Mohr-Smith et al. 2015;Zeidler et al. 2015).The stellar winds of the Wd2 cluster and WR20a have swept up a shell of radius 6 pc which expands at a speed of ∼ 13 km s −1 and has a mass of 2.5 × 10 4 M ⊙ .Besides the shell, this region also hosts two large scale molecular clouds whose collision led to the formation of the Wd2 cluster (Furukawa et al. 2009), namely: the ridge and the northern & southern clouds (Tiwari et al. 2021(Tiwari et al. , 2022)).

STEP-BY-STEP GUIDE TO USE THE GAUSSIAN MIXTURE MODELS
For a given data cube, the Gaussian Mixture Model identifies various physical structures associated with unique spectral profiles depending on the different input parameters we give.The use of different combinations of input parameters can lead to significantly different results.Below, we provide a recipe that users can follow when applying these models to the data on their sources.
In short, we have a data cube with three dimensions (two spatial, one spectral).We fit a Gaussian Mixture Model with K clusters, decided based on the first breaking point in the n input versus n output diagram.Each spatial pixel is then assigned to one of these clusters based on its probability of being generated by that cluster.

Signal-to-noise ratio of the data cube
It is preferable that the data cube on which the Gaussian Mixture Model is run has a signal-to-noise (S/N) ≳ 10.This ensures that the model does not identify noisy pixels as 'real' and independent clusters.The noisy pixels mentioned here do not only correspond to the observed map edges that are under-sampled, but rather are spread out all over the map.They seem to have high intensity (or brightness) but are actually artefacts.
The S/N of a cube is determined by: where I avg (in K) is the average intensity of the [C II] emission in the entire cube area and the rms (in K) of the cube is calculated in an emission free velocity window following the method given in Higgins et al. (2021).
For cases where the S/N of a cube is < 10, we suggest the users to smoothen the cube to a lower spatial resolution.This can be achieved by gradually increasing the beam size and pixel size following the Nyquist theorem.The cube can also be spectrally regrided to a lower velocity resolution upto a limit of F ull width half maxima 3 .Moreover, if it is not feasible to improve the S/N of a given data set, follow the process described in Sec.5.2 point 4. For NGC 1977, we keep the cube at the spatial resolution of 18 ′′ (following the data reduction technique given in Higgins et al. 2021) with a velocity resolution of 0.5 km s −1 .We find the I avg ∼ 12 K and an rms = 0.85 K such that the S/N ∼ 14.
For RCW 120, we regrided the cube to a 20 ′′ spatial resolution with a velocity resolution of 0.5 km s −1 .We find the I avg ∼ 5 K and an rms = 0.46 K such that the S/N > 10.
For RCW 49, we regrided the cube to a 25 ′′ spatial resolution with a velocity resolution of 1 km s −1 .We find the I avg ∼ 3 K and an rms = 0.29 K such that the S/N ∼ 10. emission is constrained to 50 channels.Therefore, we constrain the algorithm to only fit its models on these channels or velocity range.The velocity ranges for the three sources are given in Table 1.

Hyperparameter variations
2. rms threshold: Any velocity channel which has an rms evaluated over the entire source area lower than this limit is discarded before we fit the model.We essentially throw out any channels that contain little information to decrease computing time.A too low value can cause the model to be unable to converge within a reasonable time, and a too high value might result in real clusters disappearing.After extensive experimentation we set this value to 3 × rms for all sources presented in this work.
3. Normalization: Using the Gaussian Mixture Models, we want to identify the main coherent physical structures in a source.Depending on the scales of the data, machine learning techniques might require normalisation of the dataset to increase model performance.We have tested the models using several normalisation techniques, but it is important to note that the choice of normalization technique can bias the resulting model.When implementing such a method, its impact needs to be carefully considered based on both the data and the desired result.Kabanovic et al. ( 2022) opted for min-max normalization which places a larger emphasis on the spectral shape over the peak intensities.We have investigated the results after applying mean normalization (see Appendix A) but opt for no normalization to limit the amount of bias we introduce.
4. Covariance regularisation, ω: This is essentially a spectral intensity limit.ω describes a minimum T mb below which the model is not allowed to explain variances in the data.This technique is implemented by Melchior & Goulding (2018) based on the work by Bovy et al. (2011) and equips GMM with the power to suppress the effects of noise.
The traditional GMM (used in e.g.Kabanovic et al. 2022) does not include ω as an hyperparameter.We use ω = rms where S/N ≳ 10.In cases where the S/N < 10 for a source and the data quality cannot be enhanced through various reduction techniques, we suggest to use ω equal to the standard deviation of the data points of an rms map of a source. .This means that n output ≤ n input .Nevertheless, an optimal n input still needs to be determined.Too few will lead to loss of information, while too many can lead to assignments of insignificant clusters and increasingly long computing times.Therefore, we propose a new method of determining this optimal value.
The value n output generally increases as n input increases, i.e. the models identify more coherent physical structures in a source.However, this trend breaks for certain n input and we define the first occurrence of this as the 'first breaking point'.This can be seen in Fig. 2, where n output versus n input is plotted for all three sources.Based on the quantitative increase in n output when increasing n input linearly, we incremented n input as a multiple of 2 for all sources.Quantitatively, a breaking point occurs at the minimum of the first derivative of n output versus n input i.e. minima of ∆n output /∆n input .Models were run multiple times for every n input , and the standard deviations of the resulting n output are shown as error bars (in gray) in Fig. 2. For NGC 1977, RCW 120 and RCW 49, the first breaking point occurs at n input = 18, 14 and 16 clusters, respectively.

GAUSSIAN MIXTURE MODEL RESULTS
Based on the input parameters (set following the discussion in Sect.5.2 and given in Table 1), the models identify a fixed number of spectral profiles of [C II] emission for a source.Every pixel of a data cube has 0 to 1 probability to be assigned to one of these spectral profiles.We choose to assign it to the cluster k where this probability is largest.For any individual pixel the probabilities over each cluster k ∈ K sum up to 1.
Using these cluster assignments we created weights maps, which show α k for each cluster.This is analogous to the fraction of data points assigned to cluster k.We also made domain maps which show the spatial distributions of the clusters.These results are presented in Figures 3, 4 and 5. Tables 2, 3 and 4 list the estimated physical quantities of the identified clusters.The size was determined by counting the number of pixels per cluster and then converting it to a physical scale (kpc) using the distance to NGC 1977, RCW 49 and RCW 120.The flux T peak dv, position, width (dv) and peak temperature (T peak ) are derived by fitting Gaussians to the average spectra of every cluster (Fig. 6).
In cases where more than one velocity components are seen, multiple Gaussians were fitted.
For NGC 1977 (Fig. 3), 8 clusters were identified in total, but we excluded two clusters (black space in the domain map) as one of these traced striped noise artefacts in the top right corner of the cube, and the other contained background emission.The extended shell described by Pabst et al. (2020) is traced by the purple and green clusters in Fig. 3.The northern and southern halves of the shell are assigned to these two separate clusters because the models identified a velocity gradient throughout the shell which red-shifts the northern half by a few km s −1 .The shell and its velocity gradient can be seen in the [C II] velocity channel maps shown and discussed in Fig. 9 and Appendix C, respectively.The region towards the molecular core, OMC3, and the PDR created by interaction with the integral-shaped filament (Young Owl et al. 2002) is mostly outlined by one cluster (blue).The two emission peaks are each assigned to their own separate cluster (red and orange), and these clusters have distinctly different velocity structures.Additionally, through the absence of clusters, we can clearly see that the bubble is very thin on the Eastern side.
For RCW 120 (Fig. 4), 5 clusters were identified in total, but we excluded one cluster which only described noise and was mainly localised at the edges of the map.The circular structure (green colored cluster) consists of the expanding shell and the outer ring emission of RCW 120.The ring dynamics can also be seen in the [C II] velocity channel maps shown Fig. 10.Luisi et al. (2021) reported that the shell is broken open in the north and in the east.The green cluster in the domain map has an extension in the north-west corresponding to this break in the shell.The red colored cluster mainly traces the dense, ring-shaped PDR/torus of RCW 120, from which most of the [C II] emission originates.Within the boundary of this red cluster is the expanding shell of RCW 120.In the domain map (Fig. 4), the expanding shell is not clearly separated since the bulk emission (green cluster) dominates the spectral shape.However, the expanding shell is somewhat better separated/localized in Fig. 8 using the traditional GMM, which suggest an additional cluster.We find an opening towards the east where the PDR is disrupted.This aligns well with the X-ray emission from RCW 120 (Fig. 3 of Luisi et al. 2021).Moreover, Kabanovic et al. (2022) refers to this feature as a 'narrow opening' in the east.Local density enhancements along the PDR/torus are identified as additional clusters by the model.Thus, we expect these clusters to have different physical conditions.This is also validated by the significant differences in the observed intensities towards different clusters.The orange and blue clusters Note-Columns are from left to right the cluster number, size of the cluster, velocity component of the average spectra, velocity integrated intensity, centroid LSR velocity, full width half maxima of the spectra, peak temperature.The error bars for the flux, position and width are given in brackets.
have higher intensities compared to the other clusters, as seen in the observed average spectrum within each cluster in the middle panel of Fig. 6.
For RCW 49 (Fig. 5), 10 clusters were identified in total.Similar to RCW 120, we excluded five that corresponded to noise and were mainly localised on the edges of the map.Observational studies (Furukawa et al. 2009;Tiwari et al. 2021Tiwari et al. , 2022) ) identified 4 different physical structures in RCW 49.We identify three of these.The domain map (bottom panel of Fig. 5) shows the shell of RCW 49 in orange, which is broken open in the west (reported in Tiwari et al. 2021).It also identifies the 'ridge', which corresponds to the purple and blue colored clusters.The 'northern and southern clouds' reported in Fig. 7 of (Tiwari et al. 2021) are identified as the red colored cluster.The 'pillar', which is reported as an independent structure in Tiwari et al. (2022), is identified as a part of the southern cloud here (Fig. 5).In previous studies (Tiwari et al. 2021(Tiwari et al. , 2022)), the pillar is reported to be most intense in the velocity range of 5 to 15 km s −1 , which overlaps with the southern cloud (most intense within the velocity range of 2 to 8 km s −1 ).Morphologically, this pillar can be recognized as a separate structure, but kinematically, it is part of the southern cloud.Despite the rather complex nature of RCW 49, GMM recognizes all relevant structures identified in the in-depth study by Tiwari et al. (2022).These structures in RCW 49 can also be seen in the [C II] velocity channel maps shown in Fig. 11 and their dynamics are described in Appendix C. The complexity in the velocity structure of RCW 49 can be seen in the [C II] spectra, where the entire emission spans from -30 to 30 km s −1 (bottom panel, Fig. 6) compared to RCW 120 (-20 to 5 km s −1 , middle panel, Fig. 6) and NGC 1977 (6 to 16 km s −1 , top panel, Fig. 6).Not only is the velocity range of  Note-Columns are from left to right the cluster number, size of the cluster, velocity component of the average spectra, velocity integrated intensity, centroid LSR velocity, full width half maxima of the spectra, peak temperature.The error bars for the flux, position and width are given in brackets.
the emission larger but the number of different velocity components within this emission is largest in RCW 49.

EVALUATING THE MODELS
In the previous sections, we described the steps required to use the models and reported the model results for three relatively different sources.The next natural task is to examine the performance of these models.For this, it is critical to analyze the observed spectra of these regions.Fig. 6 shows the observed average spectra of all identified clusters in NGC 1977, RCW 120 and RCW 49 overlaid with dashed curves corresponding to the mean µ k of each cluster in each velocity channel.Additionally, the dispersion in each cluster-averaged spectrum is shown in Figs. 12 and 13.We find that the dispersion for every spectral profile is small.For more details, see Appendix D.
In NGC 1977 the shell is seen in the velocity range 9 to 15 km s −1 (Pabst et al. 2020).The spectra belonging to clusters 3, 5 and 6 (Fig. 6, top panel) fit well into this range.These are the green, purple and brown clusters respectively in the domain map which trace the majority of the limb-brightened shell.Spectra 1, 2 and 4 belong to the blue, orange and red clusters towards OMC3 in the domain map.They are visibly brighter and show a double peaked structure which is possibly caused by [C II] self-absorption (Pabst et al. 2020 and see Guevara et al. 2020;Kabanovic et al. 2022 for details on optical depth effects).
In RCW 120, the circular ring (bulk emission of RCW 120) is localised within the velocity range of -12 to 2 km s −1 and peaks at -7.5 km s −1 (Kabanovic et al. 2022).The spectra belonging to clusters 3 and 4 (Fig. 6, middle panel) fit this description.They correspond to the green and red clusters, respectively, in the domain map of RCW 120 (Fig. 4), which correctly represent the ring.Also, the spectrum belonging to cluster 1 (blue colored cluster in the domain map of RCW 120) is similar to the spectra presented by Kabanovic et al. (2022) (their Fig. 5) towards the brightest emission region in In RCW 49, the expanding shell broken open to the west lies within the velocity range of -25 to 0 km s −1 (Tiwari et al. 2021).The average spectrum belonging to cluster 2 (Fig. 6, bottom panel) has its brighter component within this velocity range.It corresponds to the orange cluster in the domain map of RCW 49 (Fig. 4), which outlines the broken shell's structure.The northern and southern clouds of RCW 49 are brightest within 2 to 8 km s −1 (Tiwari et al. 2021).This matches with the red colored cluster in the domain map of RCW 49 that corresponds to the average spectrum of cluster 4 in Fig. 6, bottom panel.The ridge of RCW 49 has the brightest [C II] emission within the velocity window of 16 to 22 km s −1 (Tiwari et al. 2021).This matches with the brighter components of the average spectra belonging to clusters 1 and 5.These correspond to the blue and purple clusters in the domain map of RCW 49, which spatially match with the location of the ridge as described in (Tiwari et al. 2021(Tiwari et al. , 2022)).Furthermore, the average spectrum of cluster 5 in purple has another blue-shifted velocity component, which corresponds to the shell of RCW 49 and this also agrees well with representative spectrum of position p3 in Tiwari et al. (2022) (and the Fig. 3 within), where both the ridge and shell components overlap.
It is important to note that the clusters identified by the GMM do not have strict boundaries.While most of the pixels of an observed [C II] map corresponding to a given source are assigned to a specific cluster with probability > 0.9 (1 being the highest), at the boundaries of the identified clusters the probabilities can be lower leading to more spurious assignments.Thus, spatial transition from one cluster to another can be more gradual than the strict boundaries illustrated by the domain maps (Figs. 3,4 and 5).As illustrated in Appendix E, the importance of such ill-defined assignments can be assessed by comparing the map of lower probability pixels in a cluster to the map of the cluster.For NGC 1977, this comparison clearly illustrates that pixels with illdefined cluster status are relegated to the boundary of the cluster and indicate a somewhat more gradual transition from one cluster to the next.

CONCLUSIONS
The objective of this study was to identify coherent physical structures in the ISM through an automated technique, which will assist the astronomical community in analysing large scale observations.We investigated the capabilities of the GMM in achieving this goal by testing the models on three different Galactic sources.
We ran the GMM on NGC 1977, RCW 120 and RCW 49.From expressing the need of a S/N > 10 for the input data cube to choosing the right number of input clusters into the models, we prepared a stepby-step guide for the users to try GMM on their data sets.The models identified 6, 4 and 5 clusters (excluding noise and background clusters) in NGC 1977, RCW 120 and RCW 49, respectively, and these results were illustrated using weight and domain maps.In NGC 1977, the models identified the expanding shell as well as the dense PDR tracing the interaction of this shell with the OMC-3 core and the Integral-Shaped Filament, consistent with previous work on this source.The domain map also revealed a break in the shell towards the east.In RCW 120, the models identified the bulk emission and the shell, which is broken in the north, and also confirmed another leak in the east, which was identified in the X-ray data earlier.In RCW 49, the models identified the broken shell, the ridge, and the northern & southern clouds, which agrees with the previous observational studies towards this region.
We also validated the models by examining the average spectrum of each cluster and found that the models assign the data cube pixels to the 'right' cluster with high accuracy.Successful comparisons between the results of our work and those in literature serve to validate the precision of these models in identifying major physical structures within a given region.This, in turn, reveals the promise of this powerful and easy-to-use method for analysing and interpreting large [C II] data sets, such as those to be delivered by NASA's GUSTO and ASTHROS balloon missions.

A. MEAN NORMALISATION
We tested the models using several normalisation techniques and found that the mean normalization works better than the other techniques.For a single spectrum, the mean normalisation is described as: where T mb (norm) is the normalized spectral intensity per channel, T mb is the observed main beam spectral intensity and T mb (mean) is the mean of the main beam spectral intensities of all the spectra towards the entire source area.All temperatures are in K.
We ran the models with mean normalisation on RCW 120 and Fig. 7 shows the domain map of RCW 120, where two clusters are identified (excluding the noise).The blue colored cluster depicts the broken (from the north and the east) shell of RCW 120 and the orange colored cluster comprises relatively brighter pixels in the region.Normalising the data set causes a decrease in the difference in intensities between various pixels.Thus, a lot of internal structure is lost in the mean normalised results when compared to the domain map (Fig. 4) obtained without normalising the data.It is possible to achieve the same level of internal structure in the normalised data set, however, one needs to increase the input number of clusters.This will make the entire process slower when compared to running the models without using any normalisation.

B. TRADITIONAL GMM
To investigate the difference between the 'traditional' GMM and the models presented in this work, we ran the models described in Kabanovic et al. (2022) on NGC 1977, RCW 120 andRCW 49. Fig. 8, bottom panel, shows the  BIC for the three sources, with the minimum BIC found at n input = 33, 6, 9 for NGC 1977, RCW 120 and RCW 49, respectively.These values match well with the n output reported in Table 1 for RCW 120 and RCW 49.
After excluding a single cluster that corresponds to only noise, the domain map of RCW 120 (Fig. 8) is very similar to the one shown in Fig. 4. It shows clear signs of shell leakage in the north-west and in the east (breaking in the orange cluster).The models also identify the higher intensity pixels as an independent cluster (red).The blue cluster, however, encompasses pixels with low intensity (surrounding the shell shown in green).
In RCW 49, apart from the four (red, pink, brown and dark blue) clusters that correspond to the noise artefacts, the domain map (in Fig. 8) identifies the major physical structures in RCW 49.Gray cluster corresponds to the broken shell.The green and orange clusters correspond to the northern and southern clouds.The light blue cluster corresponds to the ridge and also the shell in the north-east.However, unlike the results presented in Sect.6, the traditional GMM models are unable to identify the pillar in RCW 49.
Unlike RCW 120 and RCW 49, the minimum BIC in NGC 1977 occurs at a significantly high value leading to a domain map with 33 clusters.One of the possible reasons for the traditional models to converge at such a large number of clusters could be the large fluctuations in the spectral profiles along the shell which are more pronounced in NGC 1977 than in the other sources.Additionally, the relatively higher noisy pixels in the NGC 1977 data cause further artefacts in the generated domain map.In comparison, the GMMis model results for NGC 1977 presented in Sect.6 deal better with these variations introduced by velocity gradients or noise.The GMMis model converges at 8 clusters for NGC 1977, clearly identifying the major physical structures in the source.Thus, emphasizing the potential in GMMis in analysing large data sets.2022), Kabanovic et al. (2022) and Tiwari et al. (2021).
The shell of NGC 1977 can be seen in Figure 9.The southern part of the shell is most intense within 9 to 12 km s −1 , while the northern part is most intense within 11 to 13 km s −1 .This velocity gradient is the reason that the northern and southern parts of the shell are identified as separate clusters by the models (Fig. 3).In RCW 120, the [C II] velocity channel maps (Fig. 10) illustrate the ring dynamics from -12 to -2 km s −1 .The break in the north-western break in the shell is also visible in the channel maps.The different structures of RCW 49 can be seen in the velocity channel maps (Fig. 11).The broken expanding shell is most intense from -14 to 0 km s −1 .The northern and southern clouds are most intense from 2 to 8 km s −1 .The pillar is brightest within 4 to 12 km s −1 , while the ridge is brightest within 16 to 22 km s −1 .The domain map of RCW 49 (Fig. 5) identifies all these structures as independent clusters except the pillar, which is part of the southern cloud.

D. AVERAGE SPECTRUM OF EACH CLUSTER
Figure 12 shows the average spectra of all clusters identified by GMM for NGC 1977, RCW 120 and RCW 49.It can be seen that all pixels assigned to a given cluster have similar spectral profiles.Most of the average spectra have relatively small dispersion which means that all the spectra corresponding to that cluster have very similar intensities and line shapes.The relatively larger dispersion seen in some of the average spectra is mainly because of the difference in the line intensity, while the line profiles corresponding to all the pixels associated with one cluster are similar.To illustrate it, we show the normalised average spectra of all clusters in Fig. 13 and it can be seen that the dispersion is overall less in this case.Thus, the models are able to classify spectra based on their line profiles very well, further validating their accuracy.

E. THE CASE OF CLUSTER ASSIGNMENT WITH LOW PROBABILITY
Figure 14 highlights the pixels in NGC 1977 domain map that have low probabilities (<0.9) of getting assigned to their respective clusters as shown in Fig. 3.This is an example to understand the spatial distribution of these pixels.Firstly, we found that the map contains relatively few of these pixels.Also, they are more localised to the boundaries of the identified clusters implying a more gradual transition from one cluster to another in contrast to what is seen in the domain maps themselves.3, however instead of assigning each pixel to the cluster to which it has the largest probability of belonging, we only plot points with a probability between 0.1 and 0.9.The color opacity of each pixel is proportional to this probability.

Figure 1 .
Figure1.Velocity integrated (within the ranges given in Table1) intensity maps of [C II] towards NGC 1977, RCW 120 and RCW 49.The pink colored stars mark the main ionizing sources in these regions.These plots are made using regrided data cubes for the three sources based on S/N requirement described in Sec. 5.

Figure 2 .
Figure 2. Number of clusters identified by GMM (noutput) versus number of clusters given as input (ninput) for NGC 1977, RCW 120 and RCW 49.Models were run 10, 50 and 10 times for each ninput in NGC 1977, RCW 120 and RCW 49, respectively.Their mean noutput is shown with blue, while the standard deviation is displayed in gray.
5. Number of clusters: When running any clustering algorithm, a number of clusters needs to be provided a priori.The traditional GMM (Appendix B), as used by e.g.Kabanovic et al. (2022), must use all provided clusters (n input = n output ), and an information criterion is required to determine the optimal n input .Examples include the BIC and AIC (Bayesian and Akaike Information Criterion; Schwarz 1978) which penalize the likelihood of a model by the number of clusters it uses.This is not required in the GM-Mis method used here, since an adjustment in the Expectation-Maximization algorithm allows arbitrarily low cluster weights (See Bovy et al. 2011; Melchior & Goulding 2018, for mathematical details)

Figure 3 .
Figure 3. GMM results of NGC 1977: weights map (top) and domain map (bottom) displaying the identified clusters.

Figure 4 .
Figure 4. GMM results of RCW 120: weights map (top) and domain map (bottom) displaying the identified clusters.

Figure 5 .
Figure 5. GMM results of RCW 49: weights map (top) and domain map (bottom) displaying the identified clusters.

Figure 6 .
Figure 6.Average spectra from all clusters shown in the domain maps of NGC 1977 (top), RCW 120 (middle) and RCW 49 (bottom).The color of every cluster shown here is the same as shown in Figs. 3, 4 and 5.The dotted lines indicate the mean of each GMM cluster (µ k in eq. 1).Channels without dots are those discarded by the RMS threshold.Large differences between the GMM means and the averaged spectra can be an indication of large variance within a single cluster.

Figure 9 .
Figure 9. Velocity channel maps of [C II] emission towards NGC 1977.The beam size shown in the lower right.

Figure 10 .
Figure 10.Velocity channel maps of [C II] emission towards RCW 120.The beam size shown in the lower right.

Figure 11 .
Figure 11.Velocity channel maps of [C II] emission towards RCW 49.The beam size shown in the lower right.

Figure 12 .
Figure 12.Average spectra of all GMM identified clusters for NGC 1977 (left column), RCW 120 (middle column) and RCW 49 (right column).The dashed region represents the standard deviation depicting the variation in spectral profiles for all spectra per pixel corresponding to a cluster.

Figure 13 .
Figure 13.Normalised average spectra of all GMM identified clusters for NGC 1977 (left column), RCW 120 (middle column) and RCW 49 (right column).The dashed region represents the standard deviation depicting the variation in spectral profiles for all spectra per pixel corresponding to a cluster.
Figures 9, 10 and 11 show the velocity channel maps of [C II] emission towards NGC 1977, RCW 120 and RCW 49.Similar maps have been previously presented in Pabst et al. (2022),Kabanovic et al. (2022) andTiwari et al. (2021).The shell of NGC 1977 can be seen in Figure9.The southern part of the shell is most intense within 9 to 12 km s −1 , while the northern part is most intense within 11 to 13 km s −1 .This velocity gradient is the reason that the northern and southern parts of the shell are identified as separate clusters by the models (Fig.3).In RCW 120, the [C II] velocity channel maps (Fig.10) illustrate the ring dynamics from -12 to -2 km s −1 .The break in the north-western break in the shell is also visible in the channel maps.The different structures of RCW 49 can be seen in the velocity channel maps (Fig.11).The broken expanding shell is most intense from -14 to 0 km s −1 .The northern and southern clouds are most intense from 2 to 8 km s −1 .The pillar is brightest within 4 to 12 km s −1 , while the ridge is brightest within 16 to 22 km s −1 .The domain map of RCW 49 (Fig.5) identifies all these structures as independent clusters except the pillar, which is part of the southern cloud.

Figure 14 .
Figure 14.Domain map of NGC 1977 from an identical model to the one presented in Fig.3, however instead of assigning each pixel to the cluster to which it has the largest probability of belonging, we only plot points with a probability between 0.1 and 0.9.The color opacity of each pixel is proportional to this probability.

Table 1 .
Input and output parameters of the models

Table 2 .
Computed physical quantities for GMM clusters of NGC 1997 cluster size (pc 2 ) velocity component flux (K km s −1 ) position (km s −1 ) width (km s −1 ) T peak (K) Note-Columns are from left to right the cluster number, size of the cluster, velocity component of the average spectra, velocity integrated intensity, centroid LSR velocity, full width half maxima of the spectra, peak temperature.The error bars for the flux, position and width are given in brackets.