PHANGS-ML: Dissecting Multiphase Gas and Dust in Nearby Galaxies Using Machine Learning

The PHANGS survey uses Atacama Large Millimeter/submillimeter Array, Hubble Space Telescope, Very Large Telescope, and JWST to obtain an unprecedented high-resolution view of nearby galaxies, covering millions of spatially independent regions. The high dimensionality of such a diverse multiwavelength data set makes it challenging to identify new trends, particularly when they connect observables from different wavelengths. Here, we use unsupervised machine-learning algorithms to mine this information-rich data set to identify novel patterns. We focus on three of the PHANGS-JWST galaxies, for which we extract properties pertaining to their stellar populations; warm ionized and cold molecular gas; and polycyclic aromatic hydrocarbons (PAHs), as measured over 150 pc scale regions. We show that we can divide the regions into groups with distinct multiphase gas and PAH properties. In the process, we identify previously unknown galaxy-wide correlations between PAH band and optical line ratios and use our identified groups to interpret them. The correlations we measure can be naturally explained in a scenario where the PAHs and the ionized gas are exposed to different parts of the same radiation field that varies spatially across the galaxies. This scenario has several implications for nearby galaxies: (i) The uniform PAH ionized fraction on 150 pc scales suggests significant self-regulation in the interstellar medium, (ii) the PAH 11.3/7.7 μm band ratio may be used to constrain the shape of the non-ionizing far-ultraviolet to optical part of the radiation field, and (iii) the varying radiation field affects line ratios that are commonly used as PAH size diagnostics. Neglecting this effect leads to incorrect or biased PAH sizes.


Introduction
Over the past several decades, astronomy has been going through a data revolution.It was pioneered by the Sloan Digital Sky Survey, which imaged roughly one-third of the sky in five photometric bands, providing measured photometry for billions of objects and spectroscopy for millions (York et al. 2000;Eisenstein et al. 2011;Dawson et al. 2013).Since then, past and ongoing surveys have been producing massive datasets that include millions to billions of objects with astrometric, photometric, spectroscopic, or time-series observations (e.g., Pan-STARRS; ZTF; Gaia; APOGEE; MANGA; DESI; Kaiser et al. 2010;Bellm 2014;Bundy et al. 2015;DESI Collaboration et al. 2016;Gaia Collaboration et al. dalyabaron@gmail.com2016; Majewski et al. 2017;Abdurro'uf et al. 2022;DESI Collaboration et al. 2023;Gaia Collaboration et al. 2023).These observations, along with numerous derived products from them, were made publicly accessible through efficient and convenient interfaces and changed the way astronomers interact with observations, marking the beginning of the big data era in astronomy.In the near future, surveys by the Vera Rubin Observatory, Roman Space Telescope, Euclid, SDSS-V, and the Square Kilometre Array (e.g., Dewdney et al. 2009;Kollmeier et al. 2017;Ivezić et al. 2019;Euclid Collaboration et al. 2022), to name a few, are expected to make another order-of-magnitude increase in data volume.
The big data era in astronomy is not only characterized by an increase in data volume, related to the total number of observed sources, but is also characterized by an increase in data complexity, which is related to an increased information content of a single astronomical source.With numerous surveys conducted using different telescopes and instruments, astronomical sources today often have multi-wavelength observations, from radio to X-ray, and in some wavebands, also as a function of time.In the nearby Universe, over the past several decades, surveys have mapped tens to hundreds of nearby galaxies from ultraviolet (UV) to radio using imaging and spectroscopy (e.g., SINGS; KINGFISH; THINGS; xCOLD GASS; Kennicutt et al. 2003;Dale et al. 2007;Walter et al. 2008;Moustakas et al. 2010;Kennicutt et al. 2011;Saintonge et al. 2011;Dale et al. 2017;Saintonge et al. 2017).
The increase in data complexity raises some challenges, but also presents some new opportunities.On the one hand, it raises the question of how to incorporate the different types of observations, each with different sensitivities, spatial and spectral resolutions, and noise properties, within the same framework, in a sufficiently general manner to be applied to a variety of astronomical objects.In addition, the highdimensionality of the data makes it challenging to identify trends, especially when they tie observables from different wavelengths or instruments.On the other hand, the increase in information content offers the unique opportunity to use the data itself to form novel hypotheses, a core approach in the field of data science.
In the more traditional, model-driven or physics-driven approach, a scientific study starts with a hypothesis.Observations are planned and conducted to test the hypothesis, and their analysis leads to new insights, often resulting in new hypotheses.In data science, various statistical tools are used to visualize and dissect the high-dimensional space spanned by the dataset.When the information content of the data is large, such tools may uncover previously-unknown trends or groups of objects, allowing one to form hypotheses directly from the data.Since this process relies less on a physical model or on prior knowledge, it may lead to unexpected discoveries.
With the advent of multi-waveband opportunities, surveys have been producing larger and more complex datasets.The Physics at High Angular resolution in Nearby Galax-ieS (PHANGS; Leroy et al. 2021a;Emsellem et al. 2022;Lee et al. 2022Lee et al. , 2023) ) survey is an example of a modern astronomical survey that pushes the limits of data complexity.With the goal of constraining the physics near or at the molecular cloud scale, the survey has been making high-resolution observations of nearby galaxies across the electromagnetic spectrum, utilizing various telescopes and instruments.Nineteen of the PHANGS galaxies have high-resolution maps obtained with the following telescopes: ALMA (mm interferometry; Leroy et al. 2021a), HST (UV and optical imaging; Lee et al. 2022), JWST (near-infrared and mid-infrared imaging; Lee et al. 2023), and VLT (optical integral field spectroscopy with MUSE; Emsellem et al. 2022).Each of the galaxies has 10 5 -10 7 independent spatial resolution elements1 , with each pixel/spaxel probing the conditions of multiphase gas, dust, and stars on scales of 5-100 pc (see figure 1).The unique combination of spectral coverage and high spatial resolution makes the information content of a single PHANGS galaxy comparable to that of the Legacy SDSS spectroscopic survey when considering the number of spectra and spectral resolution elements.
The high information content of the PHANGS galaxies makes it an ideal dataset for applications of data-science tools.In this pilot study, our goal is to test whether unsupervised machine learning algorithms can be used to identify previously unknown trends or groups in the data.Such tools have been applied to astronomical datasets in various contexts (see reviews and references in Baron 2019, hereafter B19;and Fluke & Jacobs 2020), with the most relevant and recent examples being (i) identification of underlying correlations in the high-dimensional data of molecular cloud populations constructed from multi-wavelength observations from PHANGS (Sun et al. 2022), and (ii) the application of a clustering algorithm to JWST observations of PAH emission in the Orion Bar (Pasquini et al. 2023).
In this work, we estimate properties related to the stellar populations, multiphase gas conditions, and dust properties, on scales of 150 pc.We then use dimensionality reduction and clustering algorithms to divide these 150 pc-sized regions into groups.In the process, we identify new relations between PAHs and the warm ionized gas, and use our defined clusters to interpret these relations.This work is therefore complementary to recent studies that explicitly study the connection between PAHs, the ionized gas, and the radiation field, in the PHANGS-JWST galaxies, using a more physicsdriven approach.In particular, Egorov et al. (2023) study the PAH-to-total dust fraction and its relation to the strength of the radiation field, parameterized using the gas ionization parameter, in HII regions.Dale et al. (2023) constrain different PAH properties, such as their size and charge distribution, in stellar clusters, while exploring the impact of changing radiation fields.Chastenet et al. (2023a) and Sutter et al. (in prep.)study how the PAH fraction depends on local and global conditions, including the phase of the interstellar medium, specific star formation rate, metallicity, and stellar mass.
In section 2 we describe the data we use in our analysis.In section 3 we describe our approach, which consists of three main steps: feature construction (section 3.1), dimensionality reduction (section 3.2), and clustering (section 3.3).The resulting clusters, their interpretation, and the new relations found between PAHs and the warm ionized gas are presented in section 4. The results section stands on its own and does not require a deep understanding of the methods.Therefore, readers who are interested in the results may skip section 3 and go directly to section 4. In section 5 we discuss possible extensions and generalizations of our methodology.We summarize and conclude in section 6.

Data
To study the interplay between ISM gas and dust, star formation, and the observed stellar populations, we use multiwavelength observations by ALMA, VLT-MUSE, JWST-NIRCam, and JWST-MIRI, obtained as part of the PHANGS survey (Leroy et al. 2021a;Emsellem et al. 2022;Lee et al. 2023).The first PHANGS-JWST data release includes fullyreduced and calibrated NIRCam and MIRI broad-band images of the three galaxies: NGC 0628, NGC 1365, and NGC 7496 (e.g., Lee et al. 2023 and references therein).The broad-band images have been extensively tested and analyzed in a series of recent works (e.g., Belfiore et al. 2023;Chastenet et al. 2023b;Dale et al. 2023;Egorov et al. 2023;Leroy et al. 2023;Sandstrom et al. 2023a,b), making them an ideal benchmark for applications of machine learning algorithms.We therefore focus on these three galaxies here.
Since our goal is to combine information from different instruments (ALMA, MUSE, and JWST NIRCam and MIRI), each with a different spatial resolution, and to consider pixels from different galaxies within the same analysis, we have to ensure that the pixels trace information originating from the same physical scale.We therefore use PHANGS data products that are based on maps convolved to a resolution of 150 pc, as described in each of the subsections below.We list the effective angular resolution ( ′′ ) of the convolved maps for each of the galaxies in table 1.We project all the convolved maps into the world coordinate system (WCS) defined by the ALMA observations using REPROJECT.EXACT by ASTROPY (Robitaille et al. 2020;Astropy Collaboration et al. 2022).We then downsample the RA-DEC coordinate grid to have two pixels per resolution element of 150 pc in each of the galaxies2 .We list the initial number of pixels in the ALMA observation grid and the number of pixels after the downsampling in table 1.
As an illustration, we show different properties of NGC 1365 derived from the ALMA, MUSE, and JWST maps in figure 2. These include properties pertaining to the cold molecular gas, warm ionized gas, stellar populations, PAHs, and large dust grains.In the rest of the section we describe the data products we use and their analysis.

ALMA
To trace the cold molecular gas properties, we use the PHANGS-ALMA survey (Leroy et al. 2021a,b).The survey mapped the 12 CO(2 − 1) (CO hereafter) line emission at a spatial resolution of ∼ 1 ′′ ∼ 100 pc in 90 nearby galaxies.After imaging, the cubes are convolved to a succession of physical resolutions.We use the PHANGS-ALMA data products obtained after convolving the cubes to a spatial resolution of 150 pc.
The catalog includes two main types of products.The "strict" mask products include two-dimensional maps generated from the data cubes after applying stringent signal identification criteria.Since these require a detection of the signal with high confidence, these maps have low noise, but they include less of the total CO flux and are thus somewhat incom-plete.The "broad" mask products include two-dimensional maps generated using all the sight lines where signal is identified at any resolution, making them more complete than the "strict" mask products.However, since these include regions with faint emission, they appear noisier and can contain false positives.In figure 2 we show the CO moment 0 (total intensity) derived using the "strict" and "broad" masks, as well as the CO effective width (W ef f , a line width measure) calculated using the "strict" masks.Since our analysis requires high completeness and can tolerate some additional noise, we use the CO flux derived using the "broad" mask in our feature extraction (see details in section 3.1).

MUSE
To trace the warm ionized gas conditions and the stellar population properties, we use the PHANGS-MUSE survey (Emsellem et al. 2022).This survey mapped 19 of the PHANGS-ALMA galaxies with the integral field spectrograph MUSE at a spatial resolution of ∼ 1 ′′ .The data analysis pipeline of the survey includes (i) mosaicking and homogenization of individual MUSE pointings for a given galaxy, (ii) performing spatial binning of the spectra to reach sufficient signal to noise ratio (SNR) for stellar population synthesis modeling, (iii) fitting the stellar continuum to extract the stellar kinematics, reddening, and stellar populations, and (iv) fitting the optical emission lines to extract gas kinematics and line fluxes.In this work, we use various properties (see below) derived from the MUSE cubes after they have been convolved to a resolution of 150 pc (PHANGS DR2.2 release).
For the warm ionized gas properties, we use the surface brightness maps of the Balmer lines Hβ and Hα, and the following emission lines: (see details in Emsellem et al. 2022).To estimate the dustcorrected Hα surface brightness, we assume case-B recombination (T = 104 K), a dusty-screen, and the Cardelli et al. (1989) extinction law, with the color excess given by: where (Hα/Hβ) obs is the observed Hα/Hβ surface brightness ratio.We then correct the observed Hα surface brightness using the derived E(B − V ) values.
In our analysis, we consider the dust-corrected Hα surface brightness, the Hα gas velocity dispersion, and the line ratios log([O III]/Hβ), log([N II]/Hα), log([S II]/Hα), and log([O I]/Hα), which are typically used to constrain the main source of ionizing radiation.These line ratios are based on lines close in wavelength, so they are nearly reddening independent.We thus do not correct the lines for reddening prior to the line ratio estimation.We do correct the Hα surface brightness for reddening since our derived features in section 3.1 include ratios of Hα to CO or PAH emission.We propagate the surface brightness uncertainties to the feature uncertainties, and pixels in which these properties are measured with SNR < 3 are masked out (see discussion about feature missingness in section 5.2).
For the stellar population properties, we use maps of the stellar velocity dispersion, stellar mass density, massweighted stellar age, mass-weighted stellar metallicity, and reddening towards the stars.These properties were derived using stellar population synthesis fits of binned stellar spectra, using Voronoi bins of the convolved MUSE cubes (Emsellem et al. 2022;Pessa et al. 2023).

JWST
To trace PAH properties and dust-reprocessed stellar or AGN light, we use the Cycle 1 PHANGS-JWST survey data (Lee et al. 2023;Williams et al. 2024).It is a JWST treasury survey aimed at collecting imaging data in eight bands from 2 to 21 µm of the 19 galaxies observed as part of PHANGS-ALMA, PHANGS-MUSE, and PHANGS-HST.The broad band images include the four NIRCam bands F200W, F300M, F335M, and F360M, probing near-infrared emission at 2 µm, 3 µm, 3.35 µm, and 3.6 µm respectively, and the four MIRI bands F770W, F1000W, F1130W, and F2100W, probing mid-infrared emission at 7.7 µm, 10 µm, 11.3 µm, and 21 µm.These filters are designed to cover different PAH emission bands, which are expected to be sensitive to different grain size and charge distribution, to capture dust continuum emission, and the silicate 9.7 µm absorption feature.
We use the same version of the surface brightness maps presented and described in the PHANGS-JWST Cycle 1 Fo-cus Issue3 (version 0.8; see e.g., Belfiore et al. 2023;Chastenet et al. 2023b;Dale et al. 2023;Lee et al. 2023;Sandstrom et al. 2023a,b).These maps are convolved to a spatial resolution of 150 pc using the approach described in Aniano et al. (2011).Williams et al. (2024) describes the full data reduction pipeline.
To study the PAH properties, we use the MIRI F770W and F1130W broad-band images, which are expected to be dominated by the 7.7 and 11.3 µm PAH features.We also use the NIRCam bands F300M, F335M, and F360M, and follow the procedure outlined by Sandstrom et al. (2023a) to estimate the 3.3 µm PAH flux (F335M PAH hereafter).This procedure uses the F300M and F360M bands to subtract the starlight contribution from the F335M band, thus isolating the flux from the 3.3 µm PAH feature.
For the dust continuum emission, we considered both the MIRI F1000W and F2100W filters, though both suffer from different limitations.The F1000W filter shows strong correlations with the PAH-tracing filters F770W and F1130W, and a weaker correlation with the hot dust-tracing filter F2100W, suggesting that the filter is in fact dominated by PAH emission in a large fraction of the pixels (e.g., Belfiore et al. 2023;Leroy et al. 2023).The F2100W filter traces only dust continuum emission, but it is saturated in the central pixels of NGC 1365 and NGC 7496, presenting significant diffraction spikes (see figure 2 and Chastenet et al. 2023b).Since using the F2100W filter would require us to mask out the saturated galaxy centers and the diffraction spikes, we do not include it in our feature construction.We do include the F1000W filter, but note that it traces PAH emission much more than the hot dust continuum.We use both F1000W and F2100W in our interpretation of the resulting clusters and trends.
Following the papers in the Focus Issue, we use the convolved maps of NGC 7496 to estimate the noise level in all the relevant bands (F335M PAH , F770W, F1130W, F1000W, F2100W) 4 .NGC 7496 is the only target with sufficiently empty space that is not contaminated by emission from the source.The estimated noise root mean square (RMS) levels, in units of MJy sr −1 , are 0.0071, 0.021, 0.023, 0.013, and 0.082, for the F335M PAH , F770W, F1130W, F1000W, F2100W bands, respectively.We use these values to mask out pixels in which the measured flux is lower than 3 times the noise RMS in all three galaxies.

Methods
In this section, we apply unsupervised machine learning algorithms to the maps constructed from the ALMA, MUSE, and JWST observations.These algorithms are used to divide the pixels from the different galaxies into groups according to their multi-wavelength properties, where pixels in a given group show distinct values or correlations between their stellar, gas, and dust properties, from pixels in other groups.This allows us to explore the complex multiwavelength PHANGS dataset without an initial model-driven hypothesis.Instead, we form data-driven hypotheses by inspecting the resulting groups and identifying previously unknown trends and correlations within them.
In our analysis, each pixel from each of the galaxies is considered as a separate object with a set of measured features.The measured features are constructed from the multiwavelength maps, and they trace the stellar, gas, and dust properties within each of the pixels.The features do not include information related to the galaxy a pixel belongs to, or its location within the galaxy, though both are used when interpreting the results.Once a final list of objects with measured features is constructed, we apply a dimensionality reduction algorithm to this data.The output of the dimensionality reduction is an embedding of the high-dimensional data into a two-dimensional space, where every object (pixel) is represented by two numbers.We then apply a clustering algorithm to the distribution of the objects in the twodimensional space, which allows us to assign a class to each of the objects.Therefore, this three-step procedure divides the PHANGS pixels into groups according to their multiwavelength observations.We start by describing our feature construction scheme in section 3.1, which includes our definition of features, and their scaling and normalization.Since the data contains a non-negligible fraction of non-detections, the section also describes our adopted pixel masking strategy aimed at minimizing missing feature values, and our strategy to handle the remaining missing values.In section 5.2 we discuss the issue of missing values more generally, and propose methods to handle non-detections and upper limits in future works.In section 3.2 we apply the dimensionality reduction algorithm UMAP and obtain a two-dimensional embedding of the input dataset.We also discuss the impact of different hyperparameter choices on the resulting two-dimensional embedding.In section 3.3 we apply several different clustering algorithms to the two-dimensional embedding and define the final clusters that will be used to group the pixels.

Feature Extraction
We use the convolved, reprojected, and downsampled maps obtained from the ALMA, MUSE, and JWST observations described in section 2. Since our goal is to study the interplay between stellar, gas, and dust properties, we require > 3σ detection of the Hα and CO emission lines, the PAH bands, and the dust continuum emission5 .For that requirement, we construct a pixel mask map for each of the galaxies.In each pixel mask map, a pixel value is set to 1 if this pixel is not masked out in every one of the following maps: dust-corrected Hα surface brightness, CO intensity using the "broad" mask, F335M PAH , F770W, and F1000W.Otherwise, the pixel value is set to 0. Since the Hα surface brightness is dust corrected, requiring a 3σ detection ensures the detection of the weaker Hβ line.While this requirement does not ensure the detection of the [O III], [N II], [S II], and [O I] lines, their detection fraction is quite high (see missing fractions in table 2).
We list the number of pixels in the pixel mask map in table 1.The included pixels are also marked with colors in figure 3 in the results.They constitute only 15%, 32%, and 20%, of all the pixels in the downsampled maps of NGC 0628, NGC 1365, and NGC 7496, respectively.There are two main factors contributing to the large number of masked-out pixels.The first is the incomplete overlap between the fields of view of the different instruments (see e.g., figure 1 in Lee et al. 2023).The second factor is our requirement of CO and PAH detection, which masks out around 30% of the pixels (see e.g., figure 3).These pixels are masked out because the CO and PAH 3.3 µm emission are below the sensitivity limit.
Among the maps we consider in our masking, the dustcorrected Hα and the 7.7 µm and 10 µm surface brightness maps are quite complete, with a small fraction of masked-out pixels.The CO and 3.3 µm maps have a larger fraction of masked-out pixels, and they typically have the same pixels masked-out in both.Since we are interested in studying multiphase gas and dust, which cannot be done without a CO and 3.3 µm PAH emission detection 6 , we choose to exclude such pixels from the analysis.Our choice to use the CO emission identified using the "broad" mask is motivated by the CO detection requirement, since the CO line is detected in a larger fraction of the pixels compared to that of the "strict" mask.If instead we had used the CO intensity identified using the "strict" mask, we would have had to mask out approximately 80-90% of the pixels in each galaxy, excluding most of the diffuse ISM.In section 5.2 we discuss the implications of excluding such pixels from the analysis and propose possible methods to include upper limits and non-detections in future analyses of this kind.
Table 2 summarizes the features we considered, most of which are surface brightness ratios.Although the machine learning algorithms we use can in principle be applied to measured surface brightness values directly, we choose to work with ratios as we find them more easy to interpret.The first three features, Hα to CO, Hα to 10 µm, and CO to 7.7 µm PAH, trace the interplay between star formation (traced by Hα), different gas phases (traced by Hα and CO), and dust grains.We also consider the warm ionized gas line ratios log([N II]/Hα), log([S II]/Hα), log([O I]/Hα), and log([O III]/Hβ) as features, as these are known to be sensitive to the source of ionizing radiation through standard BPT diagrams (Baldwin et al. 1981;Veilleux & Osterbrock 1987;Kewley et al. 2001;Kauffmann et al. 2003), as well as to metallicity, ionization parameter, and more (e.g., Kewley et al. 2019).
We consider properties pertaining to the dynamics and stellar population traced by the MUSE observations: the ionized gas velocity dispersion measured with the Hα, the gasto-stellar velocity dispersion ratio, the stellar mass surface density, the age of the stellar population, and the reddening towards the line-emitting gas (see equation 1).Three of these features were later excluded from the analysis -(i) the Hα velocity dispersion, (ii) and the stellar mass surface density, since they dominated the dimensionality reduction and resulted in a trivial embedding where the pixels were divided into three clusters according to the galaxy they belong to7 , even after correcting for galaxy inclination, and (iii) the gas to stellar velocity dispersion ratio, which had a large fraction of missing values and had little impact on the lowdimensional embedding.
We include several flux ratios that trace different PAH properties.The 3.3 µm PAH feature primarily traces small and neutral PAHs, while the 7.7 µm feature traces larger and positively-charged ions, and the 11.3 µm feature represents grains that are larger and neutral (e.g., Boersma et al. 2016Boersma et al. , 2018;;Maragkoudakis et al. 2020;Draine et al. 2021;Rigopoulou et al. 2021;Maragkoudakis et al. 2022).Therefore, the 3.3/11.3µm and 3.3/7.7 µm PAH ratios are sensitive to the PAH size distribution, though they are also sensitive to the shape of the FUV-optical radiation field (see e.g., Draine et al. 2021 and Appendix C.1).The 11.3/7.7 µm PAH ratio is primarily sensitive to the PAH charge distribution, though also to the shape of the radiation.To trace the PAH abundance, several recent studies defined R PAH = (F770W + F1130W)/F2100W, which is a ratio of PAH to dust-continuum flux (e.g., Chastenet et al. 2023a,b;Egorov et al. 2023;Sutter et al., in prep).Since the F2100W is saturated in the centers of NGC 1356 and NGC 7496, we use an alternative feature, (F770W + F1130W)/F1000W, where F1000W is used to trace the hot dust continuum.However, we found that the (F770W + F1130W)/F1000W feature does not correlate with R PAH , most-likely since the F1000W band is dominated by PAH emission, rather than by hot dust continuum emission.In addition, the band may also be affected by 9.7 µm silicate absorption (e.g., Smith et al. 2007).
Most of the features we consider are distributed over several orders of magnitude.In such a case, any dimensionality reduction algorithm that starts by measuring pairwise distances between the features will be dominated by features with larger values (see e.g., B19).To give even weights to small and large feature values, we use a logarithmic scaling of the features.That is, the Hα to CO flux ratio feature is represented by log[f(Hα)/I(CO)] rather than f(Hα)/I(CO).The only exception is the feature E(B − V ), which is not distributed over several orders of magnitude.
We then apply clipping and normalization to the scaled features.For each feature, we clip its values to be between the 0.5th and 99.5th percentiles of the distribution.The clipping ensures that catastrophic outliers do not have a significant impact on the normalization of the feature, which is the next and final stage of the feature construction.These outliers are mostly the result of problems in observations, their reduction, or the estimation of a feature value from them.By definition, the fraction of objects with clipped feature values is very low, and thus the clipping of their values does not have an impact on the two-dimensional embedding of the rest of the objects.Indeed, we found the two-dimensional embedding by UMAP to be stable to different clipping schemes, ranging from (0.1, 99.9)% to (1, 99)%.However, even a low fraction of catastrophic outliers can have a significant impact on the estimated standard deviation of a feature, which can affect the normalization of the whole feature, and thus the UMAP embedding.It is therefore essential to perform some clipping before the normalization.
Finally, each scaled and clipped feature x is normalized as (x − µ x )/σ x , where µ x is the average feature value and σ x is the standard deviation.The normalization is done to ensure that the dimensionality reduction will not be dominated by features with a large dynamical range (e.g., see discussion in B19).
Since we exclude pixels where the CO and/or 3.3 µm PAH are not detected, our analysis is complete in HII regions and 12.4% NOTE-Summary of the selected features, their units, and their missing fraction after applying the pixel mask.The data was first scaled, then clipped, and then normalized.All features except the gas E(B − V ) are scaled logarithmically.Then all features are clipped to be between the 0.5th and 99.5th percentiles of the distribution.All the features are then normalized by subtracting the mean and diving by the standard deviation of the clipped distribution.The last three features were excluded from the data after some initial tests: * : these features dominated the two-dimensional embedding and resulted in a trivial embedding with three clusters corresponding to the three different galaxies, * * : high-fraction of missing values and little impact on the resulting embedding.
much of the dense ISM as these show brighter CO and midinfrared emission, but is incomplete in the most diffuse part of the ISM, where CO and mid-infrared emission are much fainter.Therefore, our results may not be applicable to the most diffuse parts of local galaxies.In table 2 we list the fraction of missing values in the features we consider after applying the pixel mask.Due to our pixel mask, the fraction of missing values is very low.We compared two different imputation methods to replace these missing values (see additional details in section 5.2 in the discussion): KNN search (Hruschka et al. 2003;Jonsson & Wohlin 2004) and Random Forest Regression (Stekhoven & Bühlmann 2011;Shah et al. 2014) by SKLEARN (Pedregosa et al. 2011).The two methods gave comparable results, and neither had a significant impact on the resulting two-dimensional embedding.The results shown in section 4 are based on a dataset where the missing values have been replaced using the KNNImputer8 .

Dimensionality Reduction with UMAP
In this section we use the dimensionality reduction algorithm UMAP (Uniform Manifold Approximation and Projection; McInnes et al. 2018) to embed our input dataset into a two-dimensional space.
UMAP is a non-linear dimensionality reduction algorithm that aims to represent high-dimensional data in a lowerdimensional space while preserving the underlying structure and relationship among data points.It operates on the assumption that the data lie on a manifold, which is a low-dimensional curved surface embedded within the highdimensional space.It attempts to learn an approximation of this manifold and then project the data onto the lowerdimensional space.UMAP has been shown to preserve local as well as global structures in the data, and has been widely used in a variety of fields (see e.g., Becht et al. 2018;Ali et al. 2019;Cao et al. 2019;Carter et al. 2019;Packer et al. 2019).
The two main use cases of UMAP are: (i) visualization of complex high-dimensional datasets, and (ii) dimensionality reduction prior to the application of clustering algorithms.The latter, which is also the use case in this work, is done because many clustering algorithms do not scale well with a large number of features (e.g., Xu & Tian 2015), and thus The left panels show the three galaxies NGC 628, NGC 1365, and NGC 7496, where pixels that are included in the analysis are marked with colors.In these panels, the grayscale background represents the Hα surface brightness, and the black contours represent the CO intensity.The right panel shows our adopted UMAP embedding.Every point corresponds to an object (pixel) from our input dataset, which includes 24 007 objects with 13 features each (features trace 150 pc-size regions).The two-dimensional embedding was obtained using the following hyper-parameters: metric=correlation, n neighbors=10, and min dist=0, though we show in Appendix A that the global structure in the two-dimensional space remains stable when changing the metric and the n neighbors parameter.Each pixel is colored according to the galaxy it belongs to, information that is not included in the input dataset.The embedding shows several over-dense regions that may be interpreted as clusters, connected through filamentary structures of points, suggesting that the features form continuous relations in the complex high-dimensional space.
UMAP is used as an intermediate stage to reduce the initial dimensions of the dataset.This intermediate step also improves the interpretability of the final clustering output, as the clusters and their properties can be easily visualized in the two-dimensional space given by UMAP.
UMAP has several hyper-parameters, and setting different values of the hyper-parameters can change the resulting embedding significantly.The first, and probably most important, hyper-parameter is metric, which defines the distance metric to be used when estimating distances between the objects in the high-dimensional space.Different metrics are sensitive to different aspects and scales in the feature space.As a result, while one metric may suggest proximity between two objects, another may suggest a considerable separation (see discussion in B19).Given that UMAP relies on pairwise distances between objects for the embedding process, the selection of an appropriate metric becomes crucial.
The second most important hyper-parameter of UMAP is n neighbors, which controls how the algorithm balances local versus global structure when learning the manifold of the data (see examples in McInnes et al. 2018).Setting a low value of n neighbors will result in an embedding that is more sensitive to local structure of the data, sometimes at the expense of accurately representing the global structure.On the other hand, increasing the value of n neighbors expands the neighborhoods considered when estimating the manifold, which may result in the loss of fine-grained details.
The third hyper-parameter, min dist, controls how tightly points can be packed in the low-dimensional space.Larger values of min dist will force even close neighbors to be separated in the low-dimensional embedding, while lower values will result in clumpier embeddings.Following the feature construction scheme in section 3.1, the input dataset has 24 007 objects with 13 features each.Each object represents a pixel in one of the three galaxies we consider, and the 13 features (listed in Table 2) trace different properties related to the stellar population, gas, dust, and star formation, over a 150 pc scale.
We applied UMAP to our input dataset while exploring a wide range of hyper-parameter choices.In particular, we examined the two-dimensional embedding using 11 different distance metrics, and using a wide range of n neighbors values (see Appendix A) 9 .We find that the global structure of the data in the two-dimensional space remains stable when changing the metric and/or n neighbors.We reach a similar conclusion for the min dist parameter, which primarily changes the density of points.This suggests that the clusters that would have been identified in the two-dimensional embeddings by UMAP with different hyper-parameters should be roughly the same.Therefore, the results presented in section 4 should not depend significantly on the assumed hyperparameters.
For the rest of the paper, we adopt the twodimensional UMAP embedding obtained using metric=correlation, n neighbors=10, and min dist=0.This set of hyper-parameters was adopted after a visual inspection of the resulting embeddings, where we selected an embedding where we expect cluster identification to be less challenging technically, as we describe below.However, since different sets of hyper-parameters result in embeddings that satisfy the criteria described below, the selection of this particular set of hyper-parameters is somewhat arbitrary.
To select an embedding where identifying clusters is expected to be less challenging, we favor embeddings where clusters that we identify by eye are more separated from each other, and are separated by roughly the same distances one from the other.In addition, we prefer embeddings where the density of points in different clusters does not vary significantly, and try to avoid embeddings with a large number of filamentary structures, as these tend to result in clusters that do not match our visual perception (many clustering algorithms are designed to work effectively in a flat geometry, and filamentary structures are a strong departure from this assumption).
In figure 3 we show our adopted two-dimensional UMAP embedding, where every pixel is colored according to the galaxy it belongs to.In figure 4 we show our adopted embedding color-coded by the features in the input dataset, where strong gradients can be seen in many of the features throughout the embedding.

Clustering
This section includes the final phase of our three-step process, where we apply a clustering algorithm to the twodimensional embedding by UMAP to divide the PHANGS pixels into groups.To partition the two-dimensional distribution, we examine several different clustering algorithms: K-means, Birch, OPTICS, DBSCAN, and Hierarchical Clustering (see Xu & Tian 2015 and B19 for reviews about clustering algorithms).The different algorithms have different optimization objectives and different stopping criteria, making them sensitive to different aspects of the data.For example, by construction, K-means identifies evenly-sized clusters and can only operate in a flat geometry, while OPTICS and DBSCAN may identify unevenly-sized clusters and can operate in a non-flat geometry.The output of Hierarchical Clustering depends on the assumed linkage method, which defines how clusters are linked to different clusters.We considered the four linkage methods: Ward, complete, average, and single, each being sensitive to different types of structures (see figure 11 in B19 and related text).
Each of the clustering algorithms has different hyperparameters, and changing the values of these parameters can have a significant impact on the resulting clusters (see e.g., B19).For example, in K-means, Birch, and Hierarchical Clustering, the number of clusters is a hyper-parameter of the algorithm and is set by the user.When applying clustering algorithms directly to high-dimensional datasets, it may be challenging to select the ideal number of clusters10 , since it is not straightforward to visualize the detected clusters in the high-dimensional space.In our case, since the clustering algorithms are applied to the two-dimensional embedding by UMAP, we were able to visualize the resulting clusters, colorcoded by different features (figure 4), and select a suitable number of clusters.The chosen numbers, which were selected to not be larger than ∼10, so we can inspect the clusters manually and compare between their properties, and not smaller than ∼4, so that objects with different properties will not be grouped together, are 6 for K-means and Hierarchical Clustering, and 7 for Birch.While K-means, Birch, and Hierarchical Clustering divide all the points into clusters, OPTICS and DBSCAN may divide only some points into clusters, leaving others unclustered.The hyper-parameters of OPTICS and DB-SCAN primarily control the minimum cluster membership and neighborhood size (see e.g., Xu & Tian 2015), and we set these parameters to result in roughly 6-8 clusters.For OPTICS, we used the hyper-parameters (min samples, xi, min cluster size) to be (300, 0.001, 0.04).For DBSCAN, we set the hyper-parameters (min samples, min cluster size) to be (10, 300).We used the python SKLEARN library to apply these different clustering algorithms (Pedregosa et al. 2011).
Figure 5 shows the different clustering algorithms applied to our adopted two-dimensional embedding by UMAP.One of the methods, Hierarchical Clustering with the single linkage, can be excluded immediately as it clusters the majority of the points into a single cluster, and marks five very small outlier groups as the remaining clusters.We also exclude OPTICS and DBSCAN as they leave a significant fraction of points unclustered, and we wish to include as many objects as possible in the analysis.Next, we exclude K-means, Birch, and Hierarchical clustering with the complete linkage as they cluster together groups that appear separate upon visual inspection (yellow group in K-means, yellow group in Birch, and pink and orange groups in Hierarchical clustering).This leaves Hierarchical clustering with Ward linkage versus the average linkage, both of which roughly agree on three of the clusters (top-most, right-most, and bottom-most), but disagree about the division of points in the left-most overdensity of points.Our choice to adopt the average linkage was motivated by the distributions of features seen in figure 4, where it is apparent that the average linkage divides the points in the left-most region into points with lower PAH 3.3/11.3µm ratios and points with higher ratios.The results reported in section 4 are therefore based on the clusters identified with Hierarchical clustering with the average linkage method.

Results
We apply dimensionality reduction and clustering algorithms to the PHANGS multi-wavelength dataset of NGC 0628, NGC 1365, and NGC 7496, and divide the pixels into six groups.This dataset is constructed from maps .PHANGS multi-wavelength pixels partitioned into 6 groups using dimensionality reduction and clustering algorithms.The top left panel shows the two-dimensional embedding by UMAP of the input dataset, which includes 24 007 PHANGS pixels with 13 measured features that trace the stellar population, gas, dust, and star formation properties.Each point in the two-dimensional space represents a pixel from one of the galaxies we consider.The objects are divided into groups using the Hierarchical Clustering algorithm with the average linkage, and each point is color-coded according to its assigned group.The other three panels show the spatial distribution of the six groups, which map to large-scale coherent structures within the galaxies.In these panels, the grayscale background represents the Hα surface brightness, and the black contours represent the CO emission.
extracted from ALMA, MUSE, and JWST observations, and it traces the properties of the stellar population, multiphase gas, dust, and star formation, on a scale of 150 pc in these galaxies.
The top left panel of figure 6 shows our adopted twodimensional embedding by UMAP (section 3.2), where every point represents a pixel from one of the three galaxies.The distribution of points in the two-dimensional space shows several regions with an over-density of points, which may be interpreted as separate clusters, connected to each other through filamentary structures.This highly connected filamentary structure11 , seen for different UMAP hyper-parameter choices (see Appendix A), indicates that the multi-wavelength features of the PHANGS pixels form continuous relations in the high-dimensional space they span.It suggests that the PHANGS pixels do not represent different entities with distinct physical properties (e.g., the difference between a star and a quasar), but rather the same entity with varying physical conditions (e.g., optical spectra of stars with different temperatures).This is not surprising given that the features we constructed trace different physical properties of gas and dust, averaged over a 150 pc scale.
The points in the two-dimensional space are color-coded according to their assigned group using our adopted clustering algorithm (section 3.3).Since the points do not form well-separated clusters in the two-dimensional space, the resulting groups are somewhat arbitrary, with different clustering algorithms and different hyper-parameter choices changing the resulting groups.Figure 5 and the top left panel of figure 6 demonstrate the arbitrary nature of partitioning objects that form a continuous sequence into distinct groupsthere is more than one way to divide the objects into groups, each resulting in groups that are distinct in different aspects.Despite this ambiguity, dividing objects into groups is a common practice in science and in astronomy in particular, with examples ranging from classifying the stellar sequence into distinct stellar types, (O, B, A, F, G, K, M), the classification of core collapse supernovae according to their light-curves or spectra, classification of AGN into type I and II, and more.The practice of dividing objects, even when they form a sequence, into groups is useful as it allows one to compare the properties of objects in different groups, and by that, gain insight into the physics that drive the continuous variation in their properties.The difference of this work is the use of statistical tools to dissect the high-dimensional space, rather than using predefined physics-motivated properties, such as line ratios, metallicity, mass, luminosity, temperature, etc.
In this section, we describe the properties of the adopted groups, showing that they each have distinct gas ionization and PAH properties (section 4.1).We then present newly identified galaxy-wide correlations between PAH band and optical line ratios and use the adopted groups to interpret them (section 4.2).

Distinct gas ionization and PAH properties in the different clusters
In this section we interpret the six groups using different observables.We use the spatial distributions of pixels in different groups, as well as the feature values and their relation to other features.In particular, we use optical line diagnostic diagrams, PAH band ratios, and the relations between the Hα, CO, and 10 µm emission.
Figure 6 shows the spatial distribution of the adopted groups.Although the input dataset did not include information regarding the galaxy a region belongs to, nor information regarding the relative location of a region within the galaxy, the groups map onto large-scale coherent structures within the galaxies.
In Figure 7 we show the distribution of the groups in standard optical line diagnostic diagrams (BPT diagrams; Baldwin et al. 1981;Veilleux & Osterbrock 1987;Kewley et al. 2001).These diagrams are used to constrain the main source of ionizing radiation, by classifying a set of emission lines into one of the three classes: (i) HII regions, where the ratios are consistent with ionization by O and early B-type stars, (ii) Seyfert, where the ratios are consistent with ionization by AGN, and (iii) LINER/LIER, where the ratios may be consistent with either AGN ionization, photoionization by hot and evolved stars, or shock-excited gas (e.g., Kewley et al. 2001;Kauffmann et al. 2003;Kewley et al. 2006;Allen et al. 2008;Cid Fernandes et al. 2010;Rich et al. 2011Rich et al. , 2015;;Belfiore et al. 2022).
In figure 8 we show the distribution of the groups in the PAH 11.3/7.7 µm versus 3.3/11.3µm plane (hereafter PAH 11.3/7.7 and 3.3/11.3ratios).These bands ratios are sensitive to the ionized fraction of PAHs, the PAH size distribution, and the shape of the incident FUV-optical radiation (e.g., Draine et al. 2021;Rigopoulou et al. 2021 and section C.1).
Figures 7 and 8 show that the different groups have distinct ionized gas and PAH properties, as probed by the different optical line and PAH band ratios.Below, we describe these general properties12 : (1) AGN-photoionized gas (turquoise group): These pixels can primarily be found in NGC 1365 and NGC 7496 13 , both of which host AGN in their center (e.g., Morganti et al. 1999).The optical line ratios place most of the pixels in the Seyfert/LINER region in the BPT diagram, with log([O III]/Hβ) ratios too high to be powered solely by hot and evolved stars (e.g., Cid Fernandes et al. 2010;Byler et al. 2019;Belfiore et al. 2022).The pixels are located primarily along the AGN ionization cones, suggesting that the gas is photoionized by the AGN.Some of the pixels in this group are in the bars of NGC 1365, and these pixels show properties intermediate between the AGN-photoionized group and the CMZ group below.Interestingly, other clustering algorithms (e.g., BIRCH and OPTICS in figure 5) divide this group into two -one that corresponds to pixels in the bars and the other to pixels within the AGN ionization cones.From here forward, when referring to this group, we will focus on the pixels that are within the AGN ionization cones, located on kpc scales, in the bulge that is dominated by an older stellar population.The pixels in this group show the highest PAH 11.3/7.7 band ratio, and the lowest PAH 3.3/11.3band ratio.Importantly, while the AGN seems to be dominating the ionizing radiation, resulting in the Seyfert-like line ratios, the old stellar population in the bulge probably dominates the FUV-optical radiation, and is thus responsible for the PAH heating.(2+3) Diffuse ionized gas (yellow and green groups): These pixels are seen in all three galaxies, and they primarily probe regions with faint Hα and CO emission (e.g., figures B4 and B5 in the appendix).The optical line ratios are consistent with LINER/LIER-like emission.In addition, these pixels spatially-coincide with regions identified by Belfiore et al. (2022) as regions dominated by diffuse ionized gas, where the gas is ionized by a combination of radiation leaking from HII regions along with emission from hot and evolved stars, called the HOLMES mixing sequence.Both of the groups show quite high 11.3/7.7 PAH band ratio.The two groups differ in their 3.3/11.3PAH band ratio, with the yellow group showing significantly larger values than those of the green group.(4) Central molecular zone (CMZ; slate blue group): Most of the pixels in this group belong to the central part of NGC 1365, a region known for hosting extreme star formation that is powered by a massive molecular gas reservoir (see Schinnerer et al. 2023; and an overview by Hen-shaw et al. 2023).Interestingly, some of the central pixels of NGC 0628 and NGC 7496 are also assigned to this group.The optical line ratios place the pixels in the HII region of the BPT diagram, suggesting ionization by young massive stars.This group differs from the two SF+ISM groups below primarily due to its very bright CO emission (figure B5 in the appendix), in particular with respect to the observed Hα, and significant dust extinction.It shows quite a low 11.3/7.7 PAH band ratio, consistent with those observed in the SF+ISM groups below, and an intermediate 3.3/11.3ratio, in between those of the SF+ISM groups.Some of the pixels of this group deviate from the strong correlation between 3.3/7.7 and 3.3/11.3(see figure B6 in the appendix), which may suggest the presence of Silicate 9.7 µm absorption that reduces the observed flux in F1130W.(5+6) Star-forming regions and ISM (orange and pink groups): These pixels are primarily seen in the spiral arms of the three galaxies.Their Hα, CO, and 10 µm emission show strong correlations, suggesting standard star formation that is powering the observed Hα and 10 µm emission.The optical line ratios are consistent with ionization by young massive stars.Similarly to the CMZ group, they show comparable and low 11.3/7.7 PAH band ratios.The first difference between the two groups is in their 3.3/11.3PAH band ratio, which is significantly larger in the orange group than in the pink group.In addition, the orange group has a higher Hα/CO ratio than the pink group, which may suggest that it corresponds to more dispersed and evolved clouds compared to those of the pink group.
The three galaxies we consider dominate different regions of the two-dimensional embedding by UMAP (see figure 3).This can be explained in the context of the groups we identify -(i) NGC 1365 is unique in having a large number of pixels that trace the CMZ, the bar, and AGN-photoionized gas on kpc scales.It therefore dominates two out of the six groups.(ii) The diffuse ionized gas in NGC 7496 differs from that of NGC 0628 and NGC 1365 in PAH band ratios, thus dominating one of the groups.(iii) The star forming and ISM regions of NGC 0628 differ from from those NGC 1365 and NGC 7496 in PAH band ratios.It therefore dominates another group.At this point, it is unclear whether adding the rest of the 19 PHANGS galaxies would fill-in the space, making pixels that originate from a given galaxy indistinguishable from pixels of other galaxies.For that, it may be necessary to correct some of the features for inclination.We plan to address this question in a future publication where we plan to include all 19 PHANGS galaxies.

Close connection between the heating of PAHs and the ionization of the warm ionized gas
We identify significant and tight correlations between different PAH band and optical line ratios (see the full correlation matrix in Figure B3 in the appendix).These correlations are seen across the entire dataset, extending from the star-forming regions and the ISM, through the diffuse ionized gas, to the AGN-photoionized gas.The correlations are also detected in individual groups in which the dynamical range is large enough.Ionizing radiation is expected to destroy PAHs, and observations suggest much weaker PAH emission in regions dominated by ionized gas, such as HII regions (e.g., Chastenet et al. 2019Chastenet et al. , 2023a;;Chown et al. 2023;Egorov et al. 2023;Lee et al. 2023;Peeters et al. 2023;and reviews Tielens 2008;Li 2020), though at our spatial resolution, the PHANGS pixels include contributions from both.These correlations suggest a strong connection between the heating of PAHs and the ionization of the warm ionized gas on 150 pc scales.
In Figure 9  Since the PAH band ratios are based on the broad band filter ratios F1130W/F770W and F335M PAH /F1130W, we first ensure that these correlations are not due to varying contributions of dust continuum emission to the F1130W filter flux.Since the F1000W filter likely has a large contribution from PAHs under most conditions in the PHANGS galaxies (e.g., Leroy et al. 2023), we use the F2100W filter flux to trace the dust continuum emission.If the observed correlations are due to a varying contribution of hot dust emission, we expect to find significant correlations between F2100W/F770W and F335M PAH /F2100W and the optical line ratios (F2100W replaces F1130W).We find no such correlations.Therefore, in what follows, we assume that the change in F1130W/F770W and F335M PAH /F1130W band ratios trace changes in PAH emission.
Various PAH band ratios, including those we consider here, have been used to place constraints on the PAH size and charge distribution in different environments (see reviews by Tielens 2008; Li 2020).Theoretical calculations and laboratory measurements suggest that neutral PAHs have significantly different infrared spectra than ionized PAHs, with the former showing strong 3.3 µm and 11.3 µm bands compared to the 7.7 µm band, and the latter showing stronger 7.7 µm band compared to the 3.3 µm and 11.3 µm bands (e.g., Allamandola et al. 1999;Tielens 2008).Therefore, the 11.3/7.7 PAH band ratio has been used extensively as a PAH ionization diagnostic, and to a lesser extent, as a PAH size diagnostic (see below; e.g., Kaneda et al. 2005;Smith et al. 2007;Galliano et al. 2008;Diamond-Stanic & Rieke 2010;Vega et al. 2010;Lai et al. 2022;Chastenet et al. 2023b;Dale et al. 2023).Since smaller PAHs have smaller heat capacities than larger PAHs, stochastic heating by single photon absorption raises their peak temperature to higher values (e.g., Draine & Li 2001;Draine 2011), leading to higher luminosity in shorter wavelength bands such as 3.3 µm compared to longer wavelength bands such as 11.3 µm.Therefore, PAH band ratios such as 3.3/11.3,and to a lesser extent, the 11.3/7.7,have been used as PAH size diagnostics (e.g., Smith et al. 2007;Galliano et al. 2008;Lai et al. 2023;Chastenet et al. 2023b;Dale et al. 2023;Ujjwal et al. 2024).Draine et al. (2021) and Rigopoulou et al. (2021) showed that various PAH band ratios are also sensitive to the shape of the incident radiation field.The same population of PAHs may show significantly different 11.3/7.7 and 3.3/11.3band ratios when heated by radiation that is dominated by young versus older stars.Harder far-ultraviolet (FUV)-optical radi-

Ionized gas PAHs
Figure 11.PAHs and ionized gas are sensitive to different parts of the illuminating radiation field.The lines represents two example SEDs calculated using the flexible stellar population synthesis code by Conroy et al. (2009).They correspond to single stellar populations of age 2 Myr and 10 Gyr (see Appendix C.2 for additional details), where the flux density of the 10 Gyr star has been scaled to higher values for representational purposes.The black vertical lines represent the wavelengths that correspond to photon energies that can ionize Hydrogen (H + ; dashed) and Oxygen twice (O ++ ; dotted).The gray bands correspond to the wavelength ranges 1350-1780 Å and 3000-4000 Å, which are our adopted definitions for FUV and optical luminosities.The conditions in the warm ionized gas primarily depend on the flux densities at λ ≤ 912 Å.The PAHs are exposed to non-ionizing radiation, and their heating depends on the shape of the FUV-optical radiation field, which we parametrize using νLν (FUV)/νLν (optical) (see Appendix C.1 for details).
ation field, typical of young and massive stars, is expected to lead to hotter PAHs, resulting in higher luminosity in the 3.3 µm feature, and to a lesser extent, the 7.7 µm feature, compared to the 11.3 µm.Since the PHANGS 150 pc-sized pixels trace a variety of radiation fields, this effect must be taken into account when interpreting PAH band ratios.Dale et al. (2023) demonstrated that, in the PAH band ratio plane available for the PHANGS galaxies (3.3/11.3µm versus 3.3/7.7 µm; see their figure 3), the impact of varying PAH size distribution and a varying radiation field are degenerate with each other.In this work, we use the observed optical line ratios to break this degeneracy.

A varying radiation field as the main driver of the correlations
To interpret the observed correlations, we use our adopted groups, as well as the PAH emission models by Draine et al. (2021) and a range of spectral energy distributions (SEDs) of the incident radiation field (see details in Appendix C).Draine et al. (2021) calculated the infrared emission spectra of PAHs for different illuminating radiation SEDs, radiation intensities, and PAH size and charge distributions.To remove the underlying continuum emission from starlight or other hot, small grains, they employed a clipping scheme that focuses on strong PAH emission features.Using this clipping scheme, they studied the sensitivity of PAH band ratios to the SED of the illuminating radiation and the PAH size and charge distribution.
We use the infrared spectra predicted by Draine et al. (2021).In particular, we consider models calculated using the small, standard, and large PAH size distributions, and with low, standard, and high ionization distributions.For the illuminating radiation SED, we consider 12 different templates, covering a range of stellar population ages and FUV luminosities.The PAHs are believed to be located in regions that are shielded from ionizing radiation (e.g., Chastenet et al. 2019Chastenet et al. , 2023a;;Egorov et al. 2023;Lee et al. 2023;and reviews Tielens 2008;Li 2020), and their heating is dominated by non-ionizing FUV-optical radiation (see figure 11).To parameterize the dependence of PAH band ratios on the incident radiation field, we define the FUV-to-optical luminosity ratio to be νL ν (1350 − 1780 Å)/νL ν (3000 − 4000 Å), as illustrated in figure 11.
We use the clipping scheme by Draine et al. (2021) to estimate the PAH band ratios 11.3/7.7 and 3.3/11.3,and parameterize in Appendix C.1 how they vary for different FUV-tooptical ratios, PAH size distributions, and PAH charge distributions.Since the JWST photometry includes contributions from non-PAH emission sources, and since the photometric bands do not coincide with the clipping points defined by Draine et al. (2021), we do not expect the absolute PAH band ratios to match those predicted by the models.Instead, we use the clipping scheme only to compare the observed trends in ratios with those predicted by the models.Dale et al. (2023)  the model.The observed PAH band ratios in the regions they consider (for the same three PHANGS galaxies) is within the ranges spanned by the models.In particular, they find that the PAH band ratios are more aligned with models of larger and ionized PAHs in compact stellar clusters and stellar associations.
In figure 12 we show the PAH band ratios 11.3/7.7 and 3.3/11.3versus [O III]/Hβ ratio, using our adopted groups.We indicate on the figure the expected changes in the PAH band ratios when changing the PAH size or charge distribution.Since these changes affect only the PAH bands and have no effect on the optical line ratios, they form arrows in the vertical direction only.The relative sizes of the arrows represent the expected change in PAH band ratios when changing to size/ionization from low to standard, or from standard to large/high.We also show the expected mutual variation in the PAH band and optical line ratios under the assumption that the PAHs and ionized gas are exposed to different parts of the same spatially varying radiation, as described later in this section."Non-varying radiation field" interpretation: Under the assumption that the radiation field spectrum heating the PAHs is not varying, figure 12 suggests that PAHs are more ionized in regions with low optical line ratios.In particular, using our groups, PAHs are more ionized in star-forming regions and ISM, and are more neutral in the diffuse ionized gas and in the AGN-photoionized gas.The right panel of the figure shows a negative correlation between 3.3/11.3and the optical line ratio, suggesting PAHs are smaller in regions with low optical line ratios.This would suggest smaller PAHs in the star-forming and ISM regions, and larger PAHs in the diffuse ionized gas.These are in line with the conclusions of Ujjwal et al. (2024) for the three PHANGS galaxies we consider.However, the 3.3/11.3shows a strong positive correlation with the Hα/CO ratio (see left panel of figure 14.According to this interpretation, it would suggest that PAHs are smaller in regions with larger fraction of ionizedto-molecular gas.This is the opposite of what would be expected if smaller PAHs are destroyed by ionizing radiation more efficiently than larger PAHs."Common varying radiation field" interpretation: The left panel of figure 12 shows that the tight correlation between the PAH band ratio 11.3/7.7 and the optical line ratios is in fact a sequence in ionized gas conditions, with the different groups occupying distinct ranges within the correlation.It may suggest that the PAHs and the ionized gas are exposed to different parts of the same radiation field.That is, while the ionized gas is exposed to the entire radiation field, and its properties are set by the ionizing part of the radiation, the PAHs are located in regions shielded from the ionizing radiation, and are exposed only to the non-ionizing FUV-optical part of the radiation (see figure 11).Under this interpretation, the radiation field varies over kpc scales and its variation drives the changes in both PAH band and optical line ratios.The PHANGS pixels show significant variations in their stellar and/or AGN properties, with some regions dominated by young stellar populations, while others dominated by old stars or AGN radiation.Therefore, we argue that this scenario is unavoidable.The only question is its impact on the PAH band ratios compared to those of varying PAH size and charge distributions.
The gray arrows in figure 12 represent the expected relations between the PAH band and optical line ratios under the "common varying radiation field" interpretation.The expected relation depends on several factors: (i) the dependence of PAH bands on the FUV-to-optical luminosity ratio, (ii) the SED shape, in particular, the connection between the FUVoptical and ionizing parts of the radiation field, and (iii) the dependence of optical line ratios on the shape of the ionizing radiation.
We explore the dependence of PAH bands on the FUV-tooptical luminosity ratio using the Draine et al. ( 2021) models in Appendix C.1.In particular, we find that varying the radiation field from old (1-10 Gyr) to young (2 Myr) stellar populations increases the FUV-to-optical luminosity ratio by 2.5 dex, and at the same time, changes the PAH band ratios 11.3/7.7 and 3.3/11.3by -0.1 dex and 0.55 dex respectively.Interestingly, these are quite similar to the ranges spanned by 11.3/7.7 and 3.3/11.3 in the PHANGS pixels.
We explore the connection between the FUV-optical and ionizing parts of the radiation field in Appendix C.2.We use several different stellar libraries, covering single stellar populations with ages of 2 Myr to 10 Gyr, while varying the metallicity, stellar isochrones, and stellar templates.Since figure 12 suggests a sequence that covers HII regions, the diffuse ionized gas, and AGN-photoionized gas, we also consider models that are constructed to reproduce the optical line ratios in the diffuse ionized gas and AGN-photoionized gas.In particular, we consider the HOLMES mixing sequence introduced by Belfiore et al. (2022) to explain the observed optical line ratios in the diffuse ionized gas in PHANGS galaxies.In this model, the SED is a combination of radiation from a young stellar population leaking from HII regions and from hot and evolved stars, where the latter is used as it contributes the hard ionizing radiation required to power LINER-like emission line ratios.We also construct AGN+SF mixing sequences.We use a standard Shakura & Sunyaev (1973) accretion disk SED with several improvements (gen-eral relativistic corrections and radiative transfer in the disk atmosphere; e.g., Slone & Netzer 2012), which results in Seyfert-like line ratios in standard line diagnostic diagrams (see Appendix A in Baron & Netzer 2019).Similarly to the HOLMES mixing sequence, the AGN SED is mixed with a young (2 Myr) stellar template by varying the relative contribution of each.
Using these different SEDs, in Appendix C.2 we confirm that in all of these cases, i.e., stellar age sequence from young to old; HOLMES mixing sequence with varying contribution of old versus young stars; and AGN+SF mixing sequence with varying contributions from the stars versus the AGN, the general observed trend is a harder ionizing slope for a softer FUV-optical slope, matching the direction required to explain the relation in the left panel of figure 12.Moreover, we find that a decrease of 2.5 dex in νL ν (FUV)/νL ν (optical) results in an increase of ∼1 dex in νL ν (O ++ )/νL ν (H + ), depending on the mixing sequence used.Therefore, the expected slope as indicated by the gray arrow is roughly consistent with the observed slope in the left panel of figure 12.
The exact slope expected under the "common varying radiation field" interpretation also depends on the relation between the ionizing radiation and optical line emissivities.In particular, the optical line ratios we consider are measured with collisionally excited transitions whose line emissivity depends on the gas temperature (e.g., Osterbrock & Ferland 2006).Harder ionizing radiation leads to higher electron temperature (e.g., Garnett 1992;Byler et al. 2017).Both of these are expected to increase the expected optical line ratios, where the line emissivity depends linearly on ionic abundances (O ++ , S + , etc) and exponentially on the electron temperature (e.g., Osterbrock & Ferland 2006).Indeed, we find equally-strong correlations between 11.3/7.We use the PAH models by Draine et al. (2021) and a wide range of assumptions about the variation of the incident radiation field to show that the expected slope of the relation is roughly consistent with that observed.
The main implications of this scenario are: (I) The very small scatter (∼ 0.03 dex) in the relation between 11.3/7.7 and [O III]/Hβ (or any other optical line ratio) implies that the ionized PAH fraction is quite uniform on scales of 150 pc across different environments in local galaxies.Since the fraction of ionized PAHs is set by the balance between ionization (by photons or collisions) and recombination, this uniformity suggests a strong self-regulation of the ISM that limits variations in gas temperature and electron density.In particular, the property U √ T /n e , where U is dimensionless intensity parameter, could, in principle, vary by a factor of a few (see Tielens 2005;Galliano et al. 2008;Draine et al. 2021).According to our interpretation, this property can vary by no more than ∼100% across different environments (SF+ISM and the diffuse ionized gas).In a future study, we plan to use all 19 PHANGS galaxies to place constrains on U √ T /n e in different environments.(II) The 11.3/7.7 PAH band ratio may potentially be used to trace the shape of the FUV-optical parts of the radiation field across nearby galaxies.The combination of the 11.3/7.7 and optical line ratios may therefore be used to constrain simultaneously the ionizing and non-ionizing parts of the radiation field.(III) The varying radiation field is expected to impact PAH band ratios that are typically used as PAH size indicators (like the 3.3/11.3band ratio).To constrain the PAH size distribution, the variation of the radiation SED must be accounted for.
Finally, it is worth noting that previous Spitzer-based studies found surprisingly high 11.3/7.7 PAH band ratios in AGN-dominated systems and in elliptical galaxies (e.g., Kaneda et al. 2005;Smith et al. 2007;Kaneda et al. 2008;Galliano et al. 2008;Diamond-Stanic & Rieke 2010;Vega et al. 2010, which in some cases, were found outside the range of ratios predicted by models.There are two main differences between the variation of the 11.3/7.7 PAH band ratio found here and those reported by the studies mentioned above.First, while the Spitzer-based studies found a 11.3/7.7 PAH band ratio that may be larger by ∼one order of magnitude compared to the ratio observed in normal environments, the variation we observe is much smaller, of the order of 0.1 dex.While a change of 1 dex cannot be explained solely by a change of the hardness of radiation field, a change of 0.1 dex can.Secondly, our work focuses on 150 pc-sized regions, where the wealth of multi-wavelength observations allows us to constrain separately the dominant source of ionizing radiation and the dominant source of FUV-optical photons.In our case, although we classify the first group as "AGNphotoionized gas", these pixels are very different from the AGN-dominated systems presented and discussed by, e.g., Smith et al. (2007); Diamond-Stanic & Rieke (2010); Lai et al. (2022), since in our case, old stars probably dominate the PAH heating.

Constraining the PAH size distribution
The right panels of figures 12 and 13 show the PAH band ratio 3.3/11.3versus the optical line ratios [O III]/Hβ and [S II]/Hα.Although the 3.3/11.3had been used as a PAH size indicator (e.g., Lai et al. 2023;Ujjwal et al. 2024) also sensitive to the shape of the incident radiation.In particular, varying the stellar age from 2 Myr to 1 Gyr can change the ratio by 0.55 dex, similar to the entire range spanned by 3.3/11.3 in figures 12 and 13.This has been pointed out by Chastenet et al. (2023b) when interpreting the PAH band ratios observed in the first three PHANGS-JWST galaxies, where, similarly to Dale et al. (2023), they argued that the hardness of the radiation field and PAH size may both affect the observed 3.3/11.3ratio.In this section, we use the observed correlations between the PAH band and optical line ratios to break this degeneracy and disentangle the impact of the radiation field from that of the PAH size distribution on the observed 3.3/11.3PAH band ratio.Under the "common varying radiation field" interpretation, both 11.3/7.7 and 3.3/11.3are expected to change with the optical line ratios, with the two slopes (11.3/7.7 vs. optical line ratio) and (3.3/11.3 vs. optical line ratio) connected to each other through the shape of the radiation field.In section 4.2.1 above we suggest that the relation between the 11.3/7.7 and the optical line ratios can be explained entirely using a varying radiation field.We can use the Draine et al. (2021) models to propagate this relation to the expected relation between 3.3/11.3and the optical line ratios.
In the left panels of figures 12 and 13, we show two polynomial fits to the observed relation between 11.3/7.7 and the optical lines.Since the 11.3/7.7 is a function of the FUV-to-optical luminosity ratio of the radiation field, these bestfitting polynomials connect the FUV-to-optical luminosity ratio with the observed [O III]/Hβ or [S II]/Hα line ratios.Using the relation between 3.3/11.3and the FUV-to-optical luminosity ratio (Appendix C.1), we can therefore calculate the expected relation between 3.3/11.3and [O III]/Hβ or [S II]/Hα, under the "common varying radiation field" interpretation.We show these lines in the right panel of figures 12 and 13.
The right panels of figures 12 and 13 suggest that a significant part of the observed 3.3/11.3variation is in fact due to a varying radiation field, rather than a varying PAH size distribution.Significant deviations from the expected slopes may be attributed to changes in the PAH size distribution.For example, among the two SF+ISM groups, the orange group is consistent with hosting smaller PAHs.Among the two diffuse ionized gas groups, the yellow group corresponds to regions with smaller PAHs.Finally, the figures suggest smaller PAHs in the AGN-photoionized gas group.
To further illustrate the impact of the different PAH heating interpretations on the derived PAH sizes, in figure 14 we show the 3.3/11.3PAH band ratio versus the Hα/CO ratio.The left panel shows the measured 3.3/11.3ratio, and assuming that the radiation field does not vary, the ratio can be used directly to constrain PAH sizes.According to this "Non-varying radiation field" interpretation, the 3.3/11.3in- creases with the Hα/CO ratio, suggesting that regions with a larger fraction of ionized gas tend to host smaller PAHs.The positive trend in the left panel is surprising, given that smaller PAHs are believed to be more easily destroyed by ionizing radiation compared to larger PAHs (reviews by Tielens 2008; Li 2020).The right panel of figure 14 shows ∆3.3/11.3,which is the expected 3.3/11.3after accounting for the variation due to the varying illuminating SED.We calculate ∆3.3/11.3 by subtracting the black dotted line in the right panel of figure 13 from the measured 3.3/11.3band ratio 14 .According to this "common varying radiation field" interpretation, PAHs are generally larger with increas- 14 We reach the same conclusion if we use the relation with the [O III]/Hβ (figure 12) instead.
ing Hα/CO, consistent with the idea that larger PAHs survive in regions with more ionizing radiation.The right panel of figure 14 shows a significant scatter that is dominated by the 3.3/11.3ratios of the yellow and orange groups, both show significantly larger 3.3/11.3ratios than those of their equivalent groups (the green and pink; diffuse ionized gas and SF+ISM respectively).Even if the "common varying radiation field" interpretation describes more accurately the observed PAH band ratios, the scatter in the diagram suggests that a simple picture of 'larger PAHs in regions more dominated by ionizing radiation' is too simple to describe the observed variations in the 3.3/11.3band in the PHANGS galaxies, and that additional factors may be at play.We plan to revisit this question in a future work, where we plan to apply this analysis to the full PHANGS+JWST sample of 19 galaxies.

PAH bands vs. optical line ratios in individual groups
We find significant correlations between the PAH band and optical line ratios in a few groups where the dynamical range is large enough (see table B3 in the appendix for the full correlation matrix).In figure 15 we show these relations with the [S II]/Hα ratio for the three groups: AGN-photoionized gas, and the two SF+ISM groups.The top row shows the 11.3/7.7 PAH band ratio.The observed slopes of the 11.3/7.7 versus [S II]/Hα relations observed for the different groups are comparable to each other (0.16, 0.13, and 0.15 for AGN, SF+ISM 1, and SF+ISM 2, respectively) and to the best-fitting slope found for all PHANGS pixels in figure 13 (0.22).
We find that the observed trends between the 11.3/7.7 and [S II]/Hα ratio are all in line with the "common varying radiation field" interpretation, with different types of radiation dominating in each of the groups.For the AGN-ionized gas group, the radiation may be a sequence of varying contribution of AGN and stellar light.Importantly, while the AGN may dominate the ionizing radiation in this group, resulting in the Seyfert-like line ratios we observe in figure 7, it probably does not dominate the FUV-optical part of the radiation field.This is because these regions are at distances of a few kpc from the center, where the old stellar population probably dominates the PAH heating.
For the SF+ISM groups, the variation can be driven by a sequence in stellar ages (or equivalently, a mixing of younger and slightly-older stellar populations).Since these clusters are consistent with HII ionization in the BPT diagrams, the relevant stellar ages are 1-10 Myr.For gas metallicity of log(Z/Z ⊙ ) ∼ −0.3, which is roughly the metallicity in the three PHANGS galaxies we consider (e.g., Williams et al. 2022), a variation of the stellar age from 1 Myr to 10 Myr results in a variation of gas electron temperature from 12,000 to 7,000 K (Byler et al. 2017), which can explain the varying [S II] emission.In particular, the radiation of a 1 Myrold star has harder ionizing radiation and softer FUV-optical slope.The harder ionizing radiation results in higher electron temperature and thus brighter [S II] emission (Byler et al. 2017).At the same time, the softer FUV-optical slope results in larger 11.3/7.7 PAH band ratio (Draine et al. 2021), which explains the positive relation between 11.3/7.7 and [S II]/Hα.
In the above discussion, we assume that the optical line ratios in the two SF+ISM groups vary due to a varying sequence in stellar ages.We disfavor the interpretation that the optical line ratios are driven by a varying ionization parameter, a property that is typically invoked to explain the optical line ratios in HII regions (see review by Kewley et al. 2019).For a varying ionization parameter, we expect a negative correlation between [O III]/Hβ and [S II]/Hα, while for a varying stellar age we expect a positive one (e.g., Blanc et al. 2015;Byler et al. 2017;Kewley et al. 2019).Since we observe that [O III]/Hβ increases with increasing [S II]/Hα, we favor the varying stellar age sequence.
The bottom panel of figure 15 shows the 3.3/11.3PAH band ratio versus [S II]/Hα.Since the 3.3/11.3band ratio is also sensitive to variations in the incident radiation, we use the best-fitting relation between 11.3/7.7 and [S II]/Hα and propagate it to the expected relation between 3.3/11.3and [S II]/Hα.In all the three groups, figure 15 shows that the observed relation between log(3.3/11.3)and log([S II]/Hα) is shallower than that expected.This may suggest small variations in the PAH size distribution within each of the clusters.We intend to further investigate it in future works.

Omitted observations: rationale, implocations, and future considerations
For this pilot study, we choose to concentrate on a limited set of properties derived from the PHANGS observations, leaving out many potential properties of interest that could be included in future works.The choice to limit the number of considered properties simplifies the problem significantly but also limits the discovery space to that spanned by the properties we focus on.In this section we briefly discuss some of the omitted properties and the reasons for excluding them, and describe additional properties that may be included in future analyses of this type.
The current analysis is based on MUSE, JWST, and ALMA observations, and it does not use HST photometry.In particular, the PHANGS-HST galaxies have been observed with the filters F275W, F336W, and F438W, covering UV and blue optical wavelengths not observed by MUSE (Lee et al. 2022).Including these wavelengths may add information related to the young stellar populations, burstiness of star formation, star formation histories, and reddening towards the stars.
For the ALMA data, we use the integrated CO intensity obtained through the "broad" mask and leave out other properties, such as the width of the line.The decision to use only the integrated CO intensity, and not the line width, was motivated by our attempt to include as many pixels as possible in the analysis.Since the CO line width is available only in the "strict" mask products, it would have required us to exclude ∼90% of the pixels.Even with the "broad" mask, the CO detection requirement (along with the 3.3 µm PAH detection) excludes ∼30% of the pixels, restricting our analysis to regions hosting molecular gas masses equivalent to those of giant molecular clouds within 1 ′′ .In a future work, we plan to explore the use of the "flat" CO intensity maps presented by Leroy et al. (2023) and used by Belfiore et al. (2023), where CO flux measurement is available in every pixel.
For the MUSE data, we use derived properties that are based on high-SNR observables such as the integrated stel-lar continuum and strong emission lines.Additional properties that can be derived from the MUSE observations include the gas metallicity, ionization parameter, electron density, and temperature in the warm ionized phase; and the neutral atomic column density, traced by NaID absorption.Some of these properties require the detection of weaker lines, making maps derived with them incomplete 15 .In addition, the derivation of some of these properties is often based on physical models that may not be general enough to describe the variety of gas conditions in the PHANGS galaxies.For example, the metallicity is estimated using calibrations that are valid in HII regions, but not in the diffuse or AGNphotoionized gas (e.g., Pilyugin & Grebel 2016;Kreckel et al. 2020;Williams et al. 2022).
For the JWST data, our selected features trace primarily PAH emission, and we do not include features that trace the continuum emission from hot and larger dust grains (21 µm).Several PHANGS-JWST studies show that the F2100W filter behaves differently from the JWST filters we consider, in that it traces both gas column density and heating, showing connection to the Hα and CO emission (e.g., Belfiore et al. 2023;Hassani et al. 2023;Leroy et al. 2023;Pathak et al. 2024).In addition, by excluding the F2100W filter, we also exclude R PAH , a feature that traces the PAH-to-total dust mass, and shows galaxy-wide variations that are related to the ionization parameter and metallicity (e.g., Chastenet et al. 2023a;Egorov et al. 2023;Sutter et al. in prep.).In the first three PHANGS-JWST galaxies studied here, the F2100W filter is saturated in the centers of two (NGC 1365 and NGC 7496), showing significant diffraction spikes on kpc scales.Including this feature in the analysis would have required us to mask-out these pixels, leaving out the two galaxy centers that host AGN.We plan to revisit this choice in the future as more galaxies are included in the analysis.
Finally, it is worth noting that unsupervised machine learning algorithms can, in principle, be applied directly to the raw data (illustrated in figure 1).In the PHANGS survey case, these include the MUSE spectra, the photometric bands by HST and JWST, and the clean CO spectra reconstructed from the ALMA interferometric observations.Working with the raw data allows a more general analysis that is not limited by our prior physical knowledge, and may have a bigger potential for unexpected discoveries.It also provides a clear way to account for non-detections.However, working with the raw data also presents some challenges, including (i) treatment of highly correlated features, (ii) different features carrying different amounts of information, requiring some physicallymotivated weighting scheme, (iii) presence of catastrophic outliers due to problems in observations and reduction, and most importantly, (iv) the challenge of interpreting the output of the unsupervised learning algorithms when applied to highly-complex datasets (see discussion in B19).These challenges are also present when applying machine learning algorithms to derived features, but the usage of raw data can make them considerably more difficult to address.

Treatment of missing measurements
In astronomy, as well as in other fields, it is typical to find missing values in datasets.The values could be missing due to different reasons.Values may be missing if observations were not conducted.This can occur, for example, when certain astronomical objects are included in one survey but not in another.Values may be derived from fitting observations with theoretical models, where missing values might occur if the fitting failed, or the observations are outside of the parameter space allowed by the theoretical models.Perhaps the most common type of missing values in astronomical datasets are non-detections, where observations were performed but the astronomical object was too faint to be detected.These are also referred to as upper limits, and their statistical properties are significantly different from those of features missing because observations were not conducted.It has been known for several decades that excluding objects with upper limits from a dataset may lead to significant biases in estimating summary statistics such as the mean, median, or standard deviation of a distribution, as well as biases in the output of regression and/or correlation analyses (e.g., Feigelson & Nelson 1985;Isobe et al. 1986).
In the machine learning and data science literature, excluding objects with missing values has also been shown to lead to significant biases in estimated parameters and to an increasing prediction error.Interestingly, studies suggest that using a noisier but more complete dataset leads to smaller prediction errors compared to using cleaner but less complete datasets (e.g., Niederhut 2018).This has been shown specifically in the context of supervised machine learning, where a model is trained to predict a certain variable, and the prediction can be compared to some ground truth value (regression or classification task).In unsupervised machine learning tasks, such as dimensionality reduction or clustering, the effect of excluding objects with missing values is more ambiguous, as there is typically no definitive reference point for comparison.Nevertheless, we choose to construct a noisier and more complete dataset by using the CO intensity derived using the "broad" mask.
Several approaches have been suggested to handle missing values in a dataset (e.g., Little & Rubin 2002;Schafer & Graham 2002;Newman 2014;Niederhut 2018).The simplest one is to replace the missing values by the mean value within a given feature.However, this simple approach of imput-ing the mean has been shown to lead to biased results and to underestimate the prediction variance (Niederhut 2018).Additional approaches include more sophisticated imputations based on K nearest neighbor (KNN) search, Random Forest Regressor, and multiple imputation (e.g., Hruschka et al. 2003;Jonsson & Wohlin 2004;van Buuren & Groothuis-Oudshoorn 2011;Stekhoven & Bühlmann 2011;Shah et al. 2014;Niederhut 2018), or approaches that are not based on imputation but rather on estimating distances between features with missing values, and then applying the algorithm to the distance matrix directly (e.g., Eirola et al. 2013 and references therein).All of these approaches assume that the missing features are "missing at random", which means that the likelihood of a particular measurement being missing is independent of the value it could have had.This in turn assumes that any object with a missing value is well-represented in the dataset, and by fitting a model to the features of other objects, the missing feature value can be predicted.Once the missing values are replaced by predicted values, standard machine learning algorithms can be applied to the data without the need to exclude objects with missing features.
In section 3.1 we examine two different methods for missing value imputation: KNN search and Random Forest Regression, both assume that the features are "missing at random".The missing features are optical line ratios that are based on strong emission lines: [N II], [O I], and [O III].Our manual inspection of the pixels shows that when one of the ratios is missing, the others are measured and can be used to predict the value of the missing ratio.In this particular case, the assumption of "missing at random" approximately holds, since we are working with line ratios that have been well-measured throughout most of field of view.
In many cases, astronomical upper limits cannot be considered as "missing at random", since their missingness directly depends on their (low) value.Furthermore, objects with missing values are not necessarily well-represented by objects that were detected, as the two may represent different populations of objects (e.g., star-forming versus passive galaxies that show significantly different correlations between their properties).In a future work where we plan to include all 19 Cycle 1 PHANGS-JWST galaxies, we intend to examine whether the assumption of "missing at random" may hold approximately.The Cycle 1 PHANGS-JWST galaxies were observed to the same depth, but they span a distance range of 5-20 Mpc (see table 1 in Lee et al. 2023).Therefore, convolving all the maps to the same physical scale of 150 pc, and working in units of luminosity per physical scale of 150 pc, would result in lower noise levels for the closer galaxies.Then, the CO and 3.3 µm PAH luminosities in brighter regions corresponding to the closer galaxies, along with other measured features, may be used as an approximate model for the undetected CO and 3.3 µm PAH luminosities in farther away galaxies.It may therefore be possible to build a regression model that predicts the undetected CO and 3.3 µm luminosities from the detected Hα, 7.7 µm, 11.3 µm, and 10 µm luminosities using pixels where all the features are measured.This, of course, assumes that the CO luminosity distributions are similar, and that the physical properties of pixels with undetected CO and 3.3 µm can be well-represented by pixels with detections.
An alternative solution for including upper limits in the dataset is to represent all features as probability distribution functions (PDFs).Features that are not missing may be represented by a normal distribution whose mean is the measured value and its standard deviation is the measurement uncertainty.Upper limits may be represented by a step or a box function whose limits are defined by the survey depth and other physical considerations (e.g., flux is non-negative).Then, instead of estimating pairwise distances between numbers, distances can be estimated using metrics that estimate distances between PDFs, such as the Wasserstein distance (e.g., Rubner et al. 2000;Ramdas et al. 2015) or the Kullback & Leibler (1951) divergence.With the pairwise distances estimated, traditional unsupervised learning algorithms can be used without excluding objects with missing values.We have developed and tested such an approach for the Random Forest algorithm (Reis et al. 2019), though additional tests must be performed in the case of unsupervised learning.We intend to compare different solutions for including upper limits for the 19 PHANGS-JWST galaxies in a future work.

Summary
The PHANGS survey has been making high-resolution observations of nearby galaxies across the electromagnetic spectrum (figure 1).This complex multi-wavelength dataset offers the opportunity to explore the interplay between different processes operating on scales as small as the molecular cloud scale.This interplay is what controls the baryon life cycle in galaxies as well as their star formation, and thus is key to their overall evolution.In this work, we use unsupervised machine learning algorithms to dissect the complex high-dimensional space spanned by the PHANGS multi-wavelength observations.With these tools, we identify groups and previously-unknown trends in the data, allowing us to form data-driven hypotheses from this rich dataset.
We extract properties of interest from the ALMA, MUSE, JWST NIRCam, and JWST MIRI observations of the three PHANGS galaxies: NGC 0628, NGC 1365, and NGC 7496 (section 2), from ∼24 000 pixels that are half of the 150 pc resolution.These properties pertain to the stellar populations; warm ionized and cold molecular gas properties; and PAH and dust conditions (section 3.1 and table 2); on scales of 150 pc.We apply the dimensionality reduction algorithm UMAP to embed the high-dimensional dataset into a two-dimensional plane (section 3.2).We then use the Hierarchical clustering algorithm to divide the pixels into groups (section 3.3), each group showing distinct properties.In the process, we identify novel galaxy-wide correlations across the different regions, and crucially, use our defined groups to interpret them.Our results and their broader implications are summarized below.(I) The identified groups have distinct multiphase gas and PAH properties (section 4.1 and figure 6).The 150 pcsized regions are divided into 6 groups, where each maps to large-scale (∼kpc) and coherent structures within the galaxies.Using optical line ratios such as [O III]/Hβ, [N II]/Hα, [S II]/Hα, and [O I]/Hα, the Hα-to-CO flux ratio, the PAH band ratios 11.3/7.7 and 3.3/11.3,and additional properties, we interpret the groups, finding that they mostly differ in their multiphase gas and PAH properties.The different groups trace gas photoionized by the AGN; by standard star formation in spiral arms; extreme star formation fueled by galactic bars; and diffuse gas ionized by a combination of radiation leaking from HII regions with harder radiation from hot and evolved stars.The different groups also show different PAH properties, with some showing significantly smaller PAH size distributions than others.(II) There is a close connection between the heating of PAHs and the ionization of the warm ionized gas (section 4.2).We identify significant and tight correlations between different PAH band and optical line ratios (figures 9 and 10).These correlations are seen across the entire dataset, covering the star-forming regions and the ISM, the diffuse ionized gas, and the AGN-photoionized gas.They suggest a strong connection between the heating of PAHs and the ionization of the warm ionized gas on 150 pc scales.
• The observed correlations between the PAH band 11.3/7.7 ratio and the optical line ratios  12).Since the PHANGS pixels trace regions with widely varying radiation fields that are a combination of young stars, old stars, and/or AGN, a variation of PAH band ratios due to the changing radiation SED is unavoidable.We combine PAH models with a wide range of assumed radiation fields to show that the observed slope of the relation is roughly consistent with that observed.
• The scatter in the PAH 11.3/7.7 -optical lines relations is small, ∼0.03 dex, and suggests that the fraction of ionized PAHs is quite uniform on 150 pc scales in nearby galaxies.It suggests a significant selfregulation in the ISM across different regions.
• The 11.3/7.7 PAH band ratio may potentially be used to trace the shape of the non-ionizing far-ultraviolet to optical parts of the radiation field.Combining it with optical line ratios such as [O III]/Hβ and [S II]/Hα may therefore serve as a powerful diagnostic of the local radiation field, including its ionizing and non-ionizing parts simultaneously.
• The varying radiation field is expected to also impact PAH band ratios that are typically used as PAH size diagnostics (section 4.2.2 and figure 12).We show that the 3.3/11.3PAH band ratio is strongly impacted by the varying radiation, and we combine empirical fits of the 11.3/7.7 vs. optical line ratios relations with PAH models to correct the 3.3/11.3band ratio for the varying radiation field.Once the varying radiation field is accounted for, we find that PAHs tend to be smaller in regions with low Hα/CO ratios, and larger in regions with high Hα/CO ratios.This is in line with the picture that smaller PAHs are more easily destroyed by ionizing radiation than larger PAHs.We also show that using the 3.3/11.3band ratio directly to infer the PAH size distribution, which implicitly assumes that the incident radiation field is constant throughout, leads to completely different conclusions regarding the PAH size distribution in different regions in the galaxies (figure 14).
• We identify the same correlations between PAH band and optical line ratios in individual groups where the dynamical range is large enough (section 4.2.3).The observed slopes of the correlations are comparable between the groups and comparable to the slope we find in the galaxy-wide correlations.Our analysis suggests that the variation in the 11.3/7.7 PAH band can be completely attributed to a variation in the radiation field.On the other hand, the 3.3/11.3PAH band ratio requires an additional component, for example, a small variation of the PAH size distribution within each group.
large, the data itself can be used to form new hypotheses, an approach that is at at the heart of data science.This work illustrates that unsupervised machine learning algorithms can be used to mine for novel information in complex multiwavelength datasets such as that of the PHANGS survey.To expedite the discovery process, we used simple, yet powerful, machine learning algorithms and applied them to a set of physically-motivated features.This allowed us to quickly interpret the results and focus on the scientific implications of the newly-discovered correlations.On the other hand, this decision limited the discovery space to that spanned by the features we chose to consider.In future works, we plan to examine more sophisticated algorithms that can be applied directly to the raw data.While these may be more challenging to interpret, they can also potentially lead to new unexpected discoveries.
Appendix A UMAP hyper-parameter exploration We examine the two-dimensional embeddings by UMAP for a range of distance metrics and n neighbors parameters.For the distance metric, we examine 10 metrics from the list of proposed metrics in the UMAP python library 16 : Euclidean, Manhattan, Chebyshev, Minkowski, Canberra, Braycurtis, Mahalanobis, WMinkowski, Cosine, and Correlation distance.In addition to these, we estimate the unsupervised Random Forest (URF) distance matrix, used in Baron & Poznanski (2017) to perform anomaly detection on galaxy spectra.The motivation to examine the URF-based distance is that the Random Forest algorithm can be generalized to handle missing values, upper limits, and measurement uncertainties (Reis et al. 2019), and can thus be used to include objects with missing features in our future work of the full PHANGS-JWST sample.
Figure A1 shows the two-dimensional embeddings obtained with UMAP for different distance metrics, and using n neighbors=25 and min dist=0.These are compared to the embedding adopted in this paper (hyper-parameters: metric=correlation, n neighbors=10, and min dist=0).The points are color-coded according to their location in our adopted embedding, using rectangles we defined manually (top left panel).The figure shows that objects that are within a given neighborhood in one of the two-dimensional embeddings are also in the same neighborhood in another.This suggests that the global structure of the data does not depend significantly on the distance metric, and that the clusters that would have been identified using different metrics should be roughly similar to those we identify using our adopted embedding.
Figure A2 compares the two-dimensional embeddings by UMAP for different n neighbors parameters ranging from 10 to 500.The points are color-coded according to their designated clusters we defined manually in figure A1 (top left panel).As expected, the general structure in the twodimensional embedding becomes more connected and less clustered with increasing n neighbors.However, similarly to the previous figure, the figure shows that objects remain in the same rough neighborhood for different choices of n neighbors, suggesting that roughly the same clusters would have been identified using different values of n neighbors.

Appendix B UMAP and clustering interpretation
Figure B3 shows the Pearson correlation coefficient between different pairs of features.It includes the correlations calculated considering all the 24 007 PHANGS pixels, as well as those obtained when considering pixels from individual clusters.The correlation coefficients are listed in the figure only if their associated p-value is smaller than 0.01.The figure shows significant correlations between different PAH band and optical line ratios.In figures B4, B5, B6, B7, and B8 we show the distribution of the clusters in some of the features we considered in our analysis.We used these figures to interpret the clusters we identified.
Appendix C Models used to interpret the PAH-ionized gas connection

C.1 PAH models
We make use of the models presented in Draine et al. (2021).These models predict the infrared emission from dust under a wide range of assumptions, including incident radiation with varying SEDs, PAH size distributions, and PAH charge distributions.We use models with solar metallicity.For the SED, we use two sets of single-age stellar population models: BC03 (Bruzual & Charlot 2003) and BPASS (Eldridge et al. 2017), with stellar ages of 3 Myr, 10 Myr, 100 Myr, 300 Myr, and 1 Gyr.We also consider the models calculated with the mMMP and m31bulge SEDs (see Draine et al. 2021 for additional details).Therefore, we consider 12 different SEDs in our analysis.We consider the three different options for the PAH size distribution: small, standard, and large, and the three options for the PAH charge distribution: low, standard, and high.
To estimate the 11.3/7.7 and 3.3/11.3PAH band ratios, we follow the procedure outlined in Draine et al. (2021).We parametrize the radiation field SEDs using their FUVto-optical luminosity ratio, where the FUV luminosity was defined between λ = 1350 Å and λ = 1780 Å, and the optical between λ = 3000 Å and λ = 4000 Å. Figure C9 shows how the PAH band ratios 11.3/7.7 and 3.3/11.3change with the FUV-to-optical luminosity ratio, the PAH size distribution, and the PAH charge distribution.
We estimate the expected change in the PAH band ratios, ∆ log(11.3/7.7) and ∆ log(3.3/11.3),for varying PAH size and charge distributions.These are represented as red and blue arrows in figures 12 and 13 in the main text.
For the variation with the PAH charge distribution, we use models with the standard PAH size distribution, and for each FUV-to-optical ratio, calculate ∆ log(11.3/7.7) and ∆ log(3.3/11.3)when changing the PAH ionization from low to standard and from standard to high.The resulting ∆ log(11.3/7.7) and ∆ log(3.3/11.3)do not vary significantly for different FUV-optical slopes and when considering low-standard versus standard-high changes.Therefore, we  .Correlations between various features considered in our analysis.The matrix shows the Pearson correlation coefficient between pairs of features as indicated on the left, where empty cells correspond to correlations with p-values equal or larger than 0.01.Each row corresponds to a pair of features we consider.Each column marks the set of pixels considered for the correlation estimation, where the first column shows the correlations using all the 24 007 pixels we considered, and the rest correspond to pixels in the different clusters we identified.High correlation coefficients are marked with brighter colors, and low coefficients with fainter colors.We identify significant and tight correlations between PAH band and optical line ratios, suggesting a strong connection between them.We constructed the HOLMES mixing sequence described by Belfiore et al. (2022), which was used to interpret the optical line ratios seen in the diffuse ionized gas in PHANGS galaxies.The spectra are a combination of a young stellar spectrum (2 Myr), which represents the radiation leaking from HII regions, with an old stellar spectrum (10 Gyr), which represents the radiation of hot and evolved stars.The mixing between these spectra is determined through f ion , which represents the ratio of the ionizing fluxes from the young and old stars.For example, log f ion = 1 represents a case where the ionizing flux of the young stars is 10 times larger than the ionizing flux from the old stars.log f ion = 0 represents the case that the ionizing flux from the young stars equals to the ionizing radiation from the old stars.To construct this sequence, we used the FSPS models with ages 2 Myr and 10 Gyr, assuming solar metallicity, and considering mixing fractions in the range log f ion between -2 and 2.
Finally, we constructed two AGN+SF mixing sequences.For the AGN accretion disk, the SED consists of a combination of an optical-UV continuum emitted by an optically-   thick geometrically-thin accretion disk, and an additional Xray power-law source that extends to 50 keV with a photon index of Γ = 1.9.The normalization of the UV (2500 Å) to X-ray (2 keV) flux is defined by α OX , which we take to be 1.37.We consider two models that differ in the mean energy of an ionizing photon (2.65 and 4.17 Ryd), which correspond to different different black hole masses, accretion rates, and spins (see table A1 in Baron & Netzer 2019).We then construct a mixing sequence with a young stellar spectrum (2 Myr), considering mixing fractions in the range log f ion between -2 and 2.
For each of these SEDs, we estimate the FUV-to-optical luminosity ratio by integrating over the spectra in the ranges 1350-1780 Å (FUV) and 3000-4000 Å (optical).To probe the hardness of the ionizing SED, we estimate νL ν at 353 and 912 Å, which correspond to photon energies that are required to ionize Oxygen twice (probed by the [O III] line) and Hydrogen (probed by the Hα line).Figure C10 shows the luminosity ratio νL ν (O ++ )/νL ν (H + ), which traces the slope of the ionizing radiation, versus the FUV-to-optical luminosity ratio, which traces the hardness of the FUV-optical part of the SED.The figure includes all the models described in this section.The figure shows that harder ionizing radiation is typically associated with softer FUV-optical slope for a wide range of SEDs, including single stellar populations, mixing of young and old stars, and mixing of stellar and AGN SEDs.The slopes of the gray arrow in figures 12 and 13 was estimated by combining the observed slopes in figure C9 and C10.

Figure 1 .
Figure 1.The information content in a single PHANGS galaxy.The diagram uses data obtained for NGC 7496 to illustrate the multi-scale multi-wavelength information available as part of the PHANGS survey.The top row shows images of the galaxy obtained using: HST-WFC3 (F336W filter), VLT-MUSE (stellar continuum emission and Hα surface brightness), JWST-NIRCam (F200W filter), JWST-MIRI (F770W filter), and ALMA.The different images have different spatial resolutions, ranging from ∼ 0.08 ′′ to ∼ 1 ′′ , as indicated at the bottom of each image.Each PHANGS galaxy has 10 5 -10 7 pixels/spaxels.The multi-wavelength information available in each pixel/spaxel is shown in the bottom row, and it includes several UV-optical photometric bands by HST, a full optical spectrum between 4800 Å and 9300 Å by MUSE, eight JWST near and mid-infrared photometric bands, and a spectrum of the CO line reconstructed from millimeter interferometric observations by ALMA.The black solid lines represent observed spectra, blue rectangles represent photometry, and gray dashed lines represent the expected underlying spectrum.For presentation purposes, the flux densities in each wavelength region have been stretched vertically to cover the entire panel.In this work, we use a set of physically-motivated properties measured from these observations.

Figure 2 .
Figure 2. Different properties derived from the ALMA, MUSE, and JWST observations of NGC 1365.The first row shows cold molecular gas properties from ALMA: 12 CO(2 − 1) moment 0 (intensity) derived using the "strict" (lower noise but less complete) and "broad" (noisier and more complete) masks, and the CO effective width.The second and third rows show gas and stellar population properties from MUSE: dust-corrected Hα surface brightness, the line ratios log([N II]/Hα), log([S II]/Hα), and log([O III]/Hβ), Hα velocity dispersion, stellar velocity dispersion, stellar mass surface density, and the age of the stellar population.The fourth row shows dust grain properties from JWST: PAH 3.3 µm and 7.7 µm emission features, and broad-band mid-infrared emission from hot dust at 10 µm and 21 µm.The 21 µm image is saturated in the center of the galaxy and shows noticeable diffraction spikes.All the maps are convolved to a common resolution of 150 pc.

Figure 3 .
Figure3.UMAP dimensionality reduction of the PHANGS multi-wavelength pixels.The left panels show the three galaxies NGC 628, NGC 1365, and NGC 7496, where pixels that are included in the analysis are marked with colors.In these panels, the grayscale background represents the Hα surface brightness, and the black contours represent the CO intensity.The right panel shows our adopted UMAP embedding.Every point corresponds to an object (pixel) from our input dataset, which includes 24 007 objects with 13 features each (features trace 150 pc-size regions).The two-dimensional embedding was obtained using the following hyper-parameters: metric=correlation, n neighbors=10, and min dist=0, though we show in Appendix A that the global structure in the two-dimensional space remains stable when changing the metric and the n neighbors parameter.Each pixel is colored according to the galaxy it belongs to, information that is not included in the input dataset.The embedding shows several over-dense regions that may be interpreted as clusters, connected through filamentary structures of points, suggesting that the features form continuous relations in the complex high-dimensional space.

Figure 4 .
Figure 4. Our adopted two-dimensional UMAP embedding color-coded by the features in the input dataset.

Figure 5 .
Figure5.Different clustering algorithms applied to the two-dimensional embedding of the PHANGS pixels.Each panel shows the application of a different clustering algorithm to our adopted two-dimensional embedding by UMAP.In each panel, the points are color-coded according to their assigned cluster (all points for K-means, Birch, and Hierarchical Clustering), or are marked with light gray if they are not clustered (in the case of OPTICS and DBSCAN).The figure shows that while some regions in the two-dimensional map are always considered as separate and well-defined clusters (e.g., the clusters at the bottommost and at the topmost), others are more ambiguous, with different clustering algorithms dividing the objects into different groups.We adopt the clusters identified using Hierarchical clustering with the average linkage.
Figure6.PHANGS multi-wavelength pixels partitioned into 6 groups using dimensionality reduction and clustering algorithms.The top left panel shows the two-dimensional embedding by UMAP of the input dataset, which includes 24 007 PHANGS pixels with 13 measured features that trace the stellar population, gas, dust, and star formation properties.Each point in the two-dimensional space represents a pixel from one of the galaxies we consider.The objects are divided into groups using the Hierarchical Clustering algorithm with the average linkage, and each point is color-coded according to its assigned group.The other three panels show the spatial distribution of the six groups, which map to large-scale coherent structures within the galaxies.In these panels, the grayscale background represents the Hα surface brightness, and the black contours represent the CO emission.

Figure 7 .
Figure 7. Location of the identified groups in optical line diagnostic diagrams.Each row shows the optical line ratios on standard diagnostic diagrams (e.g., Baldwin et al. 1981; Veilleux & Osterbrock 1987; Kewley et al. 2001): log([O III]/Hβ) versus log([N II]/Hα), log([S II]/Hα), log([O I]/Hα).These diagrams are used to constrain the main source of ionizing radiation.In the left panel, we show the separating criteria byKewley et al. (2001) andKauffmann et al. (2003), that are used to separate ionization by young massive stars from AGN.In the middle and right panels, we show the LINER-Seyfert separating criteria byKewley et al. (2006).The black contours represent the distribution of line ratios in all the pixels we considered in our analysis.The colormaps represent 2D histograms of the line ratios for a given group, in logarithmic scale.The crosses represent the 16th, 50th, and 84th percentiles of the distributions in each of the band ratios.The different groups show different distributions in these diagnostic diagrams, suggesting different sources of ionizing radiation, as described in the text.Details on the classification of each group can be found in section 4.1.

Figure 8 .
Figure 8. Distribution of the identified groups in the PAH band ratios plane.The panels show the distribution of pixels in the PAH 11.3/7.7 µm versus 3.3/11.3µm plane.The top left panel shows the distribution of all the pixels we consider using gray-scale color-coding, and the rest of the panels show the distributions in each individual group.While the gray-scale colormap on top represents the 2D histogram counts in a linear scale, the individual group panels show the counts in a logarithmic scale.The black contours represent the distribution of all the pixels we consider, and they are the same in the different panels.The crosses represent the 16th, 50th, and 84th percentiles of the distributions in each of the band ratios for each group.Details on the classification of each group can be found in section 4.1.
we show the PAH band ratio 11.3/7.7 versus the optical line ratios [O III]/Hβ, [N II]/Hα, [S II]/Hα, and [O I]/Hα, for the PHANGS pixels considered in our analysis.The 11.3/7.7 band ratio shows strong correlations with all of them.In figure 10 we show the PAH band ratio 3.3/11.3versus the optical line ratios.The correlations show a larger scatter than those of the 11.3/7.7 band ratio, but they extend over twice as large a dynamical range.In addition, there is a clear difference in the relation seen for the lower ionization transitions traced by [S II]/Hα and [O I]/Hα compared to the higher ionization transitions [N II]/Hα and [O III]/Hβ, with the former showing stronger correlations with the 3.3/11.3band ratio.
Figure 12.Interpreting the relation between PAH band ratios and log([O III]/Hβ).The panels show the PAH band ratios log(11.3/7.7) and log(3.3/11.3)versus log([O III]/Hβ), where each group is plotted separately.The contours represent the 2D distribution for each group, and the crosses represent the 16th, 50th, and 84th percentiles.The blue and red arrows represent the expected change in PAH band ratios when changing the PAH size and charge distribution (see Appendix C.1).The gray arrows represent the expected change in PAH band and log([O III]/Hβ) ratios assuming that the PAHs and the ionized gas are exposed to different parts of the same varying radiation field.The varying radiation field is characterized by harder ionizing radiation and softer FUV-optical slope, a scenario that is expected under a wide range of assumptions regarding what powers it.The relation on the left panel is consistent with the "common varying radiation field" interpretation, without the need to invoke PAHs with different charge distributions.The dashed and dotted black lines in the left panel represent two polynomial fits to the 11.3/7.7 versus [O III]/Hβ relation.These best fits are then propagated to the expected relations between 3.3/11.3and [O III]/Hβ (right panel), assuming the "common varying radiation field" interpretation and the PAH models by Draine et al. (2021).Clusters that deviate from these lines in the right panel may represent PAHs with different size distributions.The dotted and dashed lines are given by the following equations; Left panel (y = log PAH 11.3/7.7,x = log [O III]/Hβ): y = 0.15x + 0.19, y = −0.087x 2 + 0.084x + 0.18.Right panel (y = log PAH 3.3/11.3,x = log [O III]/Hβ): y = −x − 2.14, y = 0.57x 2 − 0.55 − 2.12.

Figure 15 .
Figure 15.PAH band ratios versus log([S II]/Hα) for individual groups.The figure shows three groups that show significant correlations between their PAH band and optical line ratios: AGN-photoionized gas (left), SF+ISM with smaller PAHs (middle), and SF+ISM with larger PAHs (right).The top panels show the PAH band ratio 11.3/7.7 and the bottom 3.3/11.3.The black lines at the top row represent the best linear fits to the observed relations.The light gray lines represent the best linear fit for all the groups together (figure 13).The dashed lines at the bottom row represent the expected relation between 3.3/11.3and [S II]/Hα, propagated from the best-fit relation between 11.3/7.7 and [S II]/Hα, using the PAH models by Draine et al. (2021), and assuming that the PAHs and gas are exposed to different parts of the same varying radiation.The expected relations are steeper than the observed relations, which may suggest variations in PAH size distributions within each group.

Figure A2 .
Figure A2.Comparison of UMAP two-dimensional embeddings for different n neighbors parameters.The panels show the resulting two-dimensional embedding assuming different metrics and different values of n neighbors.In each panel, the points are colored according to the designated clusters of the pixels in our adopted two-dimensional embedding (top left in figureA1).For a given distance metric (given column), increasing the n neighbors parameter from 10 to 500 (top to bottom) does not change the general structure of the data in the two-dimensional space significantly, suggesting that comparable clusters would have been identified for different choices of hyper-parameters.

Figure B3
Figure B3.Correlations between various features considered in our analysis.The matrix shows the Pearson correlation coefficient between pairs of features as indicated on the left, where empty cells correspond to correlations with p-values equal or larger than 0.01.Each row corresponds to a pair of features we consider.Each column marks the set of pixels considered for the correlation estimation, where the first column shows the correlations using all the 24 007 pixels we considered, and the rest correspond to pixels in the different clusters we identified.High correlation coefficients are marked with brighter colors, and low coefficients with fainter colors.We identify significant and tight correlations between PAH band and optical line ratios, suggesting a strong connection between them.

Figure B6 .Figure B7 .
Figure B6.PAH band 3.3/7.7 µm versus 3.3/11.3µm ratios for the identified clusters.The black contours represent the 2D distribution of all the pixels in the dataset, and the colored colormaps represent the distribution in each cluster.

Figure B8 .
Figure B8.PAH band ratio 3.3/11.3µm versus the Hα/CO ratio.The top left panel shows the distribution for all the PHANGS pixels, and the rest show the distribution for each individual cluster.The black contours represent the distribution of all the pixels in the dataset.

Figure C9 .
Figure C9.Relation between PAH band ratios and the FUV-optical slope for theDraine et al. (2021) models.The panels show the PAH band ratios 11.3/7.7 µm and 3.3/11.3µm versus the FUV-to-optical luminosity ratio of the SEDs the PAHs were exposed to.The figures show different PAH size and charge distributions.The 11.3/7.7 µm band ratio is primarily sensitive to the PAH charge distribution, and increases for more neutral PAHs.It is also sensitive to the FUV-optical slope, and decreases for softer FUV-optical slopes.The ratio is only weakly related to the PAH size distribution, and can change by ∼ 0.03 dex when the PAH sizes change more standard to small, or from large to standard.The 3.3/11.3µm ratio is primarily sensitive to the PAH size distribution, and increases for smaller PAHs.It is also sensitive to the FUV-optical slope, and increases for softer SEDs.It only weakly depends on the PAH size distribution, and can change by ∼0.05 dex when changing the PAH charge distribution from low to standard, or from standard to high.

Table 1 .
PHANGS-2D galaxy properties Number of pixels in the initial ALMA WCS grid of the galaxy.(9)Number of pixels in the grid after downsampling the ALMA grid to have at least 2 pixels per 150 pc.(10) Number of pixels that are not masked out in the pixel mask.The machine learning algorithms are applied to these sets of pixels.

Table 2 .
Selected features Figure14.Different PAH heating scenarios lead to opposite interpretations regarding PAH sizes.The two panels show the variation of the 3.3/11.3PAHbandratio versus the Hα/CO ratio.The left panel shows the measured 3.3/11.3PAHbandratio.Assuming the "Non-varying radiation field" interpretation, the 3.3/11.3canbeused directly to constrain the PAH size distribution.The left panel therefore suggests that regions with higher fractions of ionized gas (traced by larger Hα/CO ratios) are associated with smaller PAHs.This seems to be in tension with the common picture that smaller PAHs are more efficiently destroyed by ionizing radiation than larger PAHs.The right panel shows ∆3.3/11.3,which is the measured 3.3/11.3PAHbandratio after correcting for the expected variation due to changing SED (subtracting the black dotted line in figure13from the measured 3.3/11.3values).Contrary to the left panel, the right panel suggests that regions with higher fractions of ionized gas host larger PAHs, in line with the common picture that larger PAHs survive in harsher environments.The yellow and orange clusters, which predominantly come from NGC 7496, are above this relation and show significantly smaller PAH size distributions (see section 4.2.2 for additional details).