Optimizing NILC Extractions of the Thermal Sunyaev–Zel’dovich Effect with Deep Learning

All-sky maps of the thermal Sunyaev–Zel’dovich effect (SZ) tend to suffer from systematic features arising from the component-separation techniques used to extract the signal. In this work, we investigate one of these methods, known as needlet internal linear combination (NILC), and test its performance on simulated data. We show that NILC estimates are strongly affected by the choice of the spatial localization parameter (Γ), which controls a bias-variance trade-off. Typically, NILC extractions assume a fixed value of Γ over the entire sky, but we show there exists an optimal Γ that depends on the SZ signal strength and local contamination properties. Then we calculate the NILC solutions for multiple values of Γ and feed the results into a neural network to predict the SZ signal. This extraction method, which we call Deep-NILC, is tested against a set of validation data, including recovered radial profiles of resolved systems. Our main result is that Deep-NILC offers significant improvements over choosing fixed values of Γ.


INTRODUCTION
The hot and diffuse gas that pervades the Universe is most easily observed either from X-ray emission or the thermal Sunyaev–Zel'dovich (SZ) effect. The SZ effect is caused by hot electrons giving an energy boost to cosmic microwave background (CMB) photons through inverse Compton scattering. It has a weaker dependence on the gas density (linear vs. quadratic) compared to X-rays, making it an excellent probe of more diffuse gas.
The Planck mission has provided the best data for creating all-sky maps of the SZ effect (known as y-maps). In 2016, Planck Collaboration et al. (2016) released the first publicly available y-maps, nearly half a century after the effect was first theorized (Sunyaev & Zeldovich 1970, 1972). Extracting the SZ signal is no easy feat, as the microwave sky is dominated by non-SZ components. Over the past few decades, various component separation techniques have been developed to separate these from the SZ signal.
These public y-maps were constructed using internal linear combination (ILC) methods (Delabrouille & Cardoso 2007, Leach et al. 2008). An ILC method is a semi-blind extraction that relies on the spectral dependence of the SZ signal and uses second-order statistics to minimize the reconstruction error (Delabrouille et al. 2009). ILCs also rely on the strong assumption that all non-SZ components are uncorrelated with the SZ effect. Such assumptions are not always valid and, in turn, can lead to unreliable results. If the non-SZ components are poorly separated, they manifest as systematic residuals that contaminate the reconstructed SZ signal.
In light of this, modified ILC algorithms were developed. Modified ILC methods improve SZ extractions by using data localization. Data localization introduces additional assumptions about the non-SZ signals before applying ILC. Among the most popular is the algorithm known as needlet internal linear combination (NILC). NILC uses harmonic and spatial localization: it assumes non-SZ components are more similar when grouped by their angular size and location on the sky.
There also exist many other component separation techniques that rely on their own assumptions (many are listed at this website 1 ). For example, Bobin et al. (2008) used the assumption of sparsity to perform a semi-blind extraction of the CMB and SZ signals, known as generalized morphological component analysis (GMCA). While GMCA could outperform a few other extraction techniques, it yielded similar results to NILC. Over the years, refinements have been made to improve its performance (Bobin et al. 2015, Wagner-Carena et al. 2020). One could also perform parametric modeling of the components, which has largely been done with the Commander algorithm (Eriksen et al. 2008, Galloway et al. 2023). While many more component separation techniques exist, most have only been applied to the CMB. Only recently have there been efforts to improve the quality of all-sky SZ data (e.g., McCarthy & Hill 2023), so there is still plenty of room for investigation.
One modern approach to improving SZ extractions is deep learning (DL). DL is a subset of machine learning that uses neural networks to build non-linear models. These models excel at capturing complex relations between variables in a way that simple, traditional models cannot describe. Over the last decade, DL has been used to tackle challenging tasks and has proven to perform well in various facets of astrophysics, including galaxy morphology classification (Zhu et al. 2019), detection of exoplanets (Shallue & Vanderburg 2018), painting baryons onto dark matter halos in cosmological simulations (Tröster et al. 2019), separating the CMB from foreground contamination (Casas et al. 2022), and building catalogs of SZ clusters (Voskresenskaia et al. 2023). To our knowledge, DL has not been used to extract the SZ effect, and this is one of the chief goals of this paper.
In this work, we present the NILC method and specifically focus on its spatial localization parameter, Γ. Traditional implementations of NILC fix the value of Γ; however, we demonstrate that this yields a sub-optimal solution in certain scenarios. We then present a novel method for extracting the SZ signal, which we call Deep-NILC. Deep-NILC combines NILC results from different values of Γ and feeds them into a neural network to predict the true SZ signal. Our results show that we can achieve better estimates of the SZ signal compared to using fixed values of Γ.
The paper is structured in the following way: in section 2 we present the methodology for this study; the main results are presented in section 3; the results are discussed in section 4; and a summary is provided in section 5.

Simulations
The main data used in this study consisted of simulated observations of the microwave sky. First we describe the set of simulated y-maps obtained from Han et al. (2021) used for training and validation. Then we inject additional SZ sources to better represent the nearby "anomalies" that exist in the real Universe. Finally, we generate realistic mock observations as seen by Planck and WMAP.

SZ Data from Han et al. (2021)
This work required a sizable set of simulated y-maps that resembled the statistical properties of our observed Universe. Usually, a simulated y-map would be acquired from a full 3-D cosmological simulation by projecting the electron pressure along a line of sight. Unfortunately, running a single simulation can be very computationally expensive, so only a few are available for public use.
We were able to overcome this obstacle using the work of Han et al. (2021). These authors generated a large set of mock extragalactic signals at millimeter wavelengths based on the cosmological simulations of Sehgal et al. (2010). The simulations from Sehgal et al. (2010) were designed to match the best observational data available at the time, using cosmological parameters inferred from WMAP and gas prescriptions that matched X-ray observations. Han et al. (2021) developed an algorithm, known as MillimeterDL (mmDL), to generate many synthetic y-maps from a single simulation of Sehgal et al. (2010).
The mmDL algorithm used a DL method known as generative adversarial networks (GANs). GANs consist of two components: a generator and a discriminator. The generator produces synthetic data, while the discriminator's role is to distinguish between real and fake data. Through an iterative process, the generator improves its ability to produce realistic data by fooling the discriminator until the mock data are nearly indistinguishable from the input training data. mmDL was designed to match the power spectra of the Sehgal et al. (2010) simulations, but it was also able to capture important non-Gaussian features. More generally, mmDL was built to take any cosmological simulation as input and generate a large set of corresponding realizations. By feeding a simulation into mmDL, Han et al. (2021) created 500 realizations that are publicly available 2 .
The SZ effect was provided at different frequencies, which we converted into y-maps using the known frequency dependence of the SZ signal (see Equation 2 and Equation 3). Only ten realizations of the SZ effect were used for our experiments: seven for supervised training, with three set aside for validation. In the next section, we augment these data with more examples of resolved signals.

Resolved SZ Signals
Most of the strong SZ signals in the Universe are unresolved by Planck and WMAP. There are, however, a few extended objects, and these are amongst the most interesting SZ signals to study. For example, the Virgo cluster is somewhat of a statistical anomaly, as it produces a strong SZ profile that extends many degrees on the sky. In addition, there are nearby low-mass clusters, galaxy groups, and individual galaxies that also produce resolved signals. Such systems have been carefully studied in X-rays and SZ in hopes of closing the baryon budget and understanding various feedback cycles (Bregman et al. 2018, Pratt et al. 2021, Bregman et al. 2022).
When constructing the training data, it appeared the simulations of Han et al. (2021) did not contain enough examples of the unique, resolved SZ sources observed in the real Universe. To overcome this, we created our own simulated set of resolved signals. These signals were generated based on X-ray observations reported in the Meta-Catalog of X-Ray Detected Clusters of Galaxies (MCXC; Piffaretti et al. 2011). We used the reported masses (M_500 3 ) and redshifts (z) and then calculated the angular size of R_500 for each system. Resolved systems were selected as those with R_500 > 30′, which yielded a sample of 25 objects. The next step was to simulate the SZ profiles of the resolved signals. They were generated using the universal pressure profile (P(r)) from the AGN 8.0 simulations of Le Brun et al. (2015). Their SZ surface brightness profiles, y(r), as a function of projected physical radius are given by

$y(r) = \frac{\sigma_T}{m_e c^2} \int_{-\ell_{\max}}^{+\ell_{\max}} P\!\left(\sqrt{r^2 + \ell^2}\right) d\ell, \qquad \ell_{\max} = \sqrt{R_{\max}^2 - r^2},$   (1)

where R_max = 5R_500, σ_T is the Thomson cross-section, m_e is the electron rest mass, and c is the speed of light. These profiles were then superimposed on the simulated y-maps from Han et al. (2021). Herein, we denote the resolved part of the simulated y-maps as the resolved SZ component and that from Han et al. (2021) as background SZ. There were still a few resolved signals in Han et al. (2021), but we use this demarcation to distinguish the two components.
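The line-of-sight projection above can be sketched numerically. In the toy below, a GNFW-shaped pressure profile stands in for the Le Brun et al. (2015) AGN 8.0 profile; its parameters are illustrative placeholders, and the σ_T/(m_e c²) prefactor is absorbed into the normalization, so the units are arbitrary.

```python
import numpy as np

def _trapz(y, x):
    """Trapezoidal rule (avoids NumPy-version issues with np.trapz)."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def y_profile(r, R500, P0=8.4, c500=1.18, alpha=1.05, beta=5.49, gamma=0.31):
    """Project a GNFW-style pressure profile along the line of sight,
    y(r) ~ integral of P(sqrt(r^2 + l^2)) dl out to R_max = 5 R500.
    The GNFW parameters are placeholders, NOT the AGN 8.0 values."""
    def pressure(x):  # x = r / R500
        return P0 / ((c500 * x)**gamma
                     * (1.0 + (c500 * x)**alpha)**((beta - gamma) / alpha))
    R_max = 5.0 * R500
    if r >= R_max:
        return 0.0
    l_max = np.sqrt(R_max**2 - r**2)
    l = np.linspace(0.0, l_max, 2000)
    x = np.sqrt(r**2 + l**2) / R500
    return 2.0 * _trapz(pressure(x), l)  # factor 2: both halves of the chord
```

As expected, the projected profile falls monotonically with radius and vanishes at R_max.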
Then we superimposed the resolved profiles on the SZ background. Before injecting the signals, we added some randomization to help with generalization. Each object was placed at a different distance by randomly simulating its signal at redshifts within ±50% of its MCXC value. Furthermore, the signals were injected into the northern and southern Galactic hemispheres at random positions of moderately high Galactic latitude (|b| > 20°). This was done to augment the sample while also avoiding the highly contaminated regions near the Galactic plane.
The resolved signals used for validation consisted of two data sets. The first was constructed in exactly the same way as the training data but used the three background SZ maps that were originally set aside. These were used for the statistical analyses in section 3. The second set of validation data was made in order to inspect specific signals: a nearby galaxy group of M_500 = 10^13 M_⊙ at 10 Mpc (R_500 = 113′), a signal resembling the Coma cluster (R_500 = 47′), and that of the Virgo cluster (R_500 = 177′). These SZ profiles were also created using the AGN 8.0 pressure profile.
For the Coma- and Virgo-like systems, ten realizations were generated, with five injected into each of the northern and southern Galactic hemispheres. For simplicity, we placed them at Galactic latitudes of b = ±50° and spaced them evenly in Galactic longitude starting at l = 0°. For the galaxy group-like signal, we generated fifty signals and randomly placed them across the sky with a Galactic latitude restriction of |b| > 20°. Separate y-maps were generated for each of the three systems, but they were superimposed on the same background SZ for consistency. These validation data were used to investigate the recovered radial profiles in section 3.

Planck Sky Model
Here we describe the modeling of the non-SZ emissions, which we call contaminants. These were generated using the Planck Sky Model (Delabrouille et al. 2013) version that has been wrapped into Python (PySM; Thorne et al. 2017). The Planck Sky Model was developed for simulating the microwave radiation observed by the Planck satellite, mainly to help researchers test their pipelines and validate cosmological models against observational data.
The components4 used in this work included: Galactic thermal dust emission, synchrotron emission, anomalous microwave emission, free-free emission, the lensed CMB, CO line emission, the cosmic infrared background (CIB), and radio galaxies. For some of these components, only one model was available in PySM; this included the SZ effect. A single realization of the SZ effect was not sufficient for our purposes, which is why we incorporated the data products from Han et al. (2021). A consistent foreground model (the d1-s1-a1-f1-c2-co1-cib1-rg1 components from PySM) was used for the non-SZ components in all ten realizations. Therefore, the only differences between the simulated all-sky maps were the distributions of the SZ signal. The components were then integrated over the Planck and WMAP frequency response functions to simulate realistic observations. We implemented the publicly available transmission curves for both Planck5 and WMAP6 at their respective frequencies. Lastly, all of the frequency maps were produced in HEALPix format (Górski et al. 2005) with a resolution parameter NSIDE = 1410, which yields four pixels per resolution element of 10′. Instrumental noise and beam effects were ignored for simplicity, but they may be worth including in future work.
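Band integration over a transmission curve reduces to a weighted average of the component SED. The sketch below uses a hypothetical top-hat band rather than the actual Planck/WMAP response curves, and it omits the unit conversions and color corrections a real pipeline would apply.

```python
import numpy as np

def _trapz(y, x):
    """Trapezoidal rule (avoids NumPy-version issues with np.trapz)."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def band_average(nu_ghz, transmission, sed):
    """Transmission-weighted band average of a component SED.
    Simplified convention: unit conversions are omitted."""
    nu = np.asarray(nu_ghz, float)
    t = np.asarray(transmission, float)
    s = np.asarray(sed, float)
    return _trapz(t * s, nu) / _trapz(t, nu)

# Hypothetical 100 GHz top-hat band (not a real instrument response)
nu = np.linspace(90.0, 110.0, 201)
tophat = np.ones_like(nu)
```

A flat SED passes through unchanged, while a linear SED averages to its value at the band center.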

SZ Signal Extraction
In this section, we explain the methods used to extract the SZ signal. We first present NILC in detail. Most importantly, we demonstrate the significance of its spatial parameter, Γ, which controls a bias-variance trade-off. Then we use the data products from multiple NILC extractions as input to a neural network called Deep-NILC. In the following, bold symbols indicate matrices or vectors.

Internal Linear Combination (ILC)
The SZ effect boosts the energy of low-energy CMB photons (ν < 217 GHz) to higher energies. This causes a distortion in the CMB spectrum that is well-grounded in theory and is given by

$g(\nu) = x\coth(x/2) - 4,$   (2)

where $x = h\nu / (k_B T_{\rm CMB})$ (k_B is the Boltzmann constant, h is Planck's constant, and T_CMB is the temperature of the CMB). The amplitude of the SZ effect is measured as a dimensionless quantity known as the Compton parameter, y, given by

$y = \frac{1}{g(\nu)} \frac{\Delta T}{T_{\rm CMB}},$   (3)

where g(ν) is the SZ spectral dependence and ΔT/T_CMB is the change in temperature relative to the CMB.
In practice, the SZ signal is observed alongside other cosmological signals and Galactic emissions in the radio/IR bands. The observed intensity, X_ν, for each frequency channel, ν, at pixel, p, can be written as a sum of components,

$X_\nu(p) = g(\nu)\, y(p) + N_\nu(p),$   (4)

where N_ν(p) denotes the signals from all non-SZ components, including instrumental noise as well as contaminating astrophysical signals. One can also rewrite Equation 4 as

$\Delta X_\nu(p) = \Delta g(\nu)\, y(p) + \Delta N_\nu(p),$   (5)

where ∆ represents the difference between two frequency bands. When X_ν is expressed in thermodynamic temperature units, the CMB has the same amplitude in each band and is deprojected from ∆N. This strategy was used by Pratt et al. (2021), where those authors subtracted subsequent frequency maps from the same instruments, and we applied the same method in this work. In the following equations, however, we drop ∆ from our notation since Equation 4 and Equation 5 are mathematically similar. ILC groups all non-SZ signals into a single noise term such that

$\hat{y} = \sum_\nu w_\nu X_\nu = y + \sum_\nu w_\nu N_\nu,$   (6)

where ŷ is the estimated SZ signal, and w_ν are the weights that minimize the total covariance projected along the g(ν) direction while also preserving the signal, i.e., $\sum_\nu g_\nu w_\nu = \mathbf{g}\mathbf{w}^T = 1$. These weights can be solved for with Lagrange multipliers,

$\mathbf{w} = \frac{\mathbf{g}\mathbf{R}^{-1}}{\mathbf{g}\mathbf{R}^{-1}\mathbf{g}^T},$   (7)

and the estimator of the SZ signal is

$\hat{y} = \mathbf{w}\mathbf{X},$   (8)

where X is the matrix of frequency observations, R = XX^T is the covariance matrix, and w is the vector of calculated weights.
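The constrained minimum-variance weights can be computed directly with linear algebra. The sketch below builds a toy multi-frequency sky; the mixing vector g, the noise level, and the channel count are arbitrary illustrative choices, not the values used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

def ilc_weights(R, g):
    """Constrained minimum-variance (ILC) weights:
    w = g R^{-1} / (g R^{-1} g^T), so that w @ g == 1."""
    Rinv_g = np.linalg.solve(R, g)
    return Rinv_g / (g @ Rinv_g)

# Toy sky: m = 4 channels, signal y with mixing vector g, plus noise
m, npix = 4, 5000
g = np.array([1.0, 0.5, -0.2, -1.0])       # illustrative spectral response
y = rng.normal(0.0, 1.0, npix)
noise = rng.normal(0.0, 0.3, (m, npix))
X = np.outer(g, y) + noise

R = X @ X.T / npix                          # empirical covariance
w = ilc_weights(R, g)
y_hat = w @ X                               # reconstructed signal
```

The weights satisfy the signal-preservation constraint by construction, and the reconstruction tracks the input signal closely in this low-noise toy.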
The ILC solution depends completely on the construction of R, which is derived from the data themselves. In the standard (simplest) ILC method, one uses all available data over the entire sky to calculate the weights. This assumes the spectral properties of the non-SZ emissions do not change across the sky. Such an assumption is naive since the contaminants are known to be nonstationary. In turn, the standard ILC method poorly separates the contaminants from the SZ signal.
One can achieve better ILC extractions by localizing data when constructing R. Localization refers to the sub-grouping of data, under certain assumptions, to calculate the ILC weights. When the localization is strong (that is, when the data are finely segmented), the propagation of contamination can be suppressed. This reduces the variance in the reconstructed y-map. Conversely, strong localization yields a biased result when the localized contamination and SZ signal are correlated. While many of the contaminating signals may not be physically correlated with the SZ effect, random empirical correlations still exist.
This bias has been discussed in previous studies (Delabrouille et al. 2009, Remazeilles et al. 2013, McCarthy & Hill 2023), and we present an overview for completeness. The estimated SZ signal can be written as

$\hat{y} = y + \hat{N},$   (9)

where $\hat{N} = \sum_\nu w_\nu N_\nu$ equals the last term in Equation 6 and denotes the total amount of reconstructed noise.
There is a correlation term between the SZ signal and the noise, known as the ILC bias, defined as

$\langle y\hat{N} \rangle = -\frac{m-1}{N_p}\langle y^2 \rangle,$   (10)

where m denotes the number of available frequency maps (or difference maps in this study) and N_p is the number of modes used to construct R (see the derivation by Delabrouille et al. 2009). Note that this term is negative and, thus, reduces the power of the SZ signal. Rearranging Equation 9 we get

$\langle \hat{y}^2 \rangle = \langle y^2 \rangle + 2\langle y\hat{N} \rangle + \langle \hat{N}^2 \rangle.$   (11)

An optimal solution is reached when the sum of the last two terms is zero, or equivalently when

$\langle y^2 \rangle = \frac{N_p \langle \hat{N}^2 \rangle}{2(m-1)}.$   (12)

It is important to note that the reconstruction noise depends on N_p. Empirically, ⟨N̂²⟩ and N_p are positively correlated. Their relation, however, cannot be easily described because ⟨N̂²⟩ depends on the changing contamination properties across the sky. This positive correlation suggests that when N_p⟨N̂²⟩ is large, an optimal balance can be achieved when ⟨y²⟩ is also large. This is a key point: one can afford a larger reconstruction error if the underlying signal is strong enough. On the other hand, if the true signal is weak, then one should consider reducing N_p to achieve a better solution.
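The negative correlation between the signal and the reconstructed noise can be demonstrated numerically. The toy experiment below (unit-variance signal and noise, m = 4 channels, not our pipeline) measures the empirical ⟨yN̂⟩ when the covariance is built from few versus many modes.

```python
import numpy as np

rng = np.random.default_rng(1)

def ilc_bias(n_pix, m=4, n_trials=200):
    """Average empirical <y * N_hat> when the ILC covariance is built
    from n_pix modes. Toy model: unit-variance signal and noise."""
    g = np.ones(m)
    acc = 0.0
    for _ in range(n_trials):
        y = rng.normal(0.0, 1.0, n_pix)
        N = rng.normal(0.0, 1.0, (m, n_pix))
        X = np.outer(g, y) + N
        R = X @ X.T / n_pix
        w = np.linalg.solve(R, g)
        w /= w @ g                 # enforce signal preservation
        N_hat = w @ N
        acc += np.mean(y * N_hat)
    return acc / n_trials

b_small = ilc_bias(20)     # few modes: strong negative bias
b_large = ilc_bias(2000)   # many modes: bias nearly vanishes
```

The bias is strongly negative for small N_p and shrinks toward zero as N_p grows, as the scaling with (m−1)/N_p suggests.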

Needlet Internal Linear Combination (NILC)
Among the most popular localization methods is the NILC algorithm, a modified ILC algorithm that controls N_p, i.e., the strength of data localization. NILC harmonically and spatially localizes data on the sphere (Guilloux et al. 2007) by clustering them based on angular size and location on the sky. In practice, one first transforms the data into a wavelet basis and then applies a spatial weighting scheme when constructing the covariance matrix in Equation 7. Below we describe the NILC method used throughout this work. We refer to Equation 13 as a guide for the workflow:

$\mathbf{X} \Longrightarrow \tilde{\mathbf{X}} \longrightarrow \tilde{\mathbf{R}} \longrightarrow \tilde{y} \Longrightarrow \hat{y}.$   (13)

The double arrows (=⇒) denote transforms to/from the wavelet basis. Single arrows (−→) represent processes performed on individual wavelet scales. We denote some variable, Y, transformed into the wavelet domain as Ỹ = {Ỹ_1, Ỹ_2, ..., Ỹ_K}, where K is the number of window functions. The first step (X =⇒ X̃) was to transform the frequency data into a wavelet basis using harmonic window functions; this is the same as the Analysis step from Delabrouille et al. (2009). These window functions consisted of subtracted Gaussian beams characterized by their full-width at half-maximum (FWHM = F). The wavelet bands were defined as: 10′-15′, 15′-30′, 30′-60′, F^u_{i−1}-2F^u_{i−1}, ..., 960′-1200′. The window functions scaled dyadically using the largest FWHM of the previous window, which we define as F^u_{i−1}, where i indexes the wavelet bands and the superscript u (l) denotes the upper (lower) FWHM of each window. For clarity, we note that F^l_i = 2F^l_{i−1}. However, dyadic scaling was not applied to the first and last windows because we set a maximum resolution of 10′ and the largest scale at 20°. Typically, the standard wavelet functions, ψ_i, are defined such that $\sum_i^K |\psi_i|^2 = 1$. The window functions used in this study, however, altered the sum to $\sum_i^K |\psi_i|^2 = G(F)$, where G(F) represents the transmission of the Gaussian beam.
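A minimal sketch of such subtracted-Gaussian windows follows. The FWHM band edges match the sequence described above; the square-root convention for the band windows (so that the squared windows sum telescopically) is an assumption of this sketch, not necessarily the exact convention of the pipeline.

```python
import numpy as np

def gauss_beam(fwhm_arcmin, ell):
    """Gaussian beam transfer function b_ell for a given FWHM."""
    sigma = np.radians(fwhm_arcmin / 60.0) / np.sqrt(8.0 * np.log(2.0))
    return np.exp(-0.5 * ell * (ell + 1) * sigma**2)

ell = np.arange(2, 4000)
# FWHM edges (arcmin): 10-15, 15-30, then dyadic up to 960-1200
edges = [10, 15, 30, 60, 120, 240, 480, 960, 1200]
# Band windows as differences of Gaussian beam transmissions
windows = [np.sqrt(np.clip(gauss_beam(lo, ell)**2 - gauss_beam(hi, ell)**2,
                           0.0, None))
           for lo, hi in zip(edges[:-1], edges[1:])]
total = sum(w**2 for w in windows)  # telescopes to G(10')^2 - G(1200')^2
```

With these edges there are eight bands, matching the count quoted later, and their summed power telescopes to the difference of the extreme beam transmissions.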
The first half of each window function was applied to the frequency data, transforming them into the wavelet basis, i.e., X̃_i = ⟨X, ψ_i⟩. (The second half will be applied later to satisfy the square-integrable condition.) Only some of the frequency data could be used in certain wavelet bands, based on their corresponding beam sizes in either the Planck or WMAP missions. For example, only the data from the high-frequency instrument on Planck could be used in the first wavelet band because its resolution was better than 10′. The next step (X̃ −→ R̃) was to obtain w by solving Equation 7. As previously mentioned, this depends entirely on the construction of R. NILC spatially localizes data on a pixel-by-pixel basis by constructing R from only nearby pixels. The contribution of each pixel to R was based on an angular-separation weighting scheme in which closer pixels carried higher weights. We implemented a Gaussian weighting scheme (β) as a function of pixel separation, ∆s, such that

$\beta(\Delta s) = \exp\!\left(-\frac{\Delta s^2}{2\sigma^2}\right),$   (14)

where σ controlled the amount of spatial localization. σ was scaled with each window function such that

$\sigma_i = \Gamma\, F^u_i,$   (15)

where Γ is the spatial localization parameter. The next step (R̃ −→ ỹ) solved for the SZ signal in each wavelet band using Equation 7 and Equation 8. Then the second application of the wavelet function was performed to achieve full spatial resolution. The last step (ỹ =⇒ ŷ) was the wavelet reconstruction $\hat{y} = \sum_i^K \tilde{y}_i$, which added the y-maps per wavelet scale; this is the same as the Synthesis step from Delabrouille et al. (2009). In the next section, we skip this last step and, instead, use the y-maps per wavelet scale as input to a neural network.
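The spatial localization amounts to a weighted covariance estimate. The flat-sky toy below applies the Gaussian kernel β(∆s) when building R at a single target pixel; the channel data, separations, and the fixed value of σ are mock inputs (in the pipeline, σ grows with Γ and with the band FWHM).

```python
import numpy as np

rng = np.random.default_rng(5)

def local_covariance(X, sep, sigma):
    """Frequency-frequency covariance at one target pixel, with each
    contributing pixel down-weighted by the Gaussian kernel
    beta = exp(-sep^2 / (2 sigma^2)) in angular separation.
    Flat-sky toy version of the NILC spatial localization."""
    beta = np.exp(-0.5 * (sep / sigma)**2)
    return ((X * beta) @ X.T) / beta.sum()

# Mock data: m = 3 difference maps; separations (arcmin) from the
# target pixel are arbitrary
m, npix = 3, 1000
X = rng.normal(size=(m, npix))
sep = rng.uniform(0.0, 60.0, npix)
R_local = local_covariance(X, sep, sigma=10.0)
```

In the limit of very large σ, every pixel receives unit weight and the estimate reduces to the global covariance, i.e., the standard ILC case.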

Deep-NILC
In this section we explain the pipeline used to produce a y-map with Deep-NILC. There are two main parts: the first is a data pre-processing stage using the NILC algorithm, and the second models these data with a neural network.
The first part of the pipeline was to apply the NILC algorithm to our set of simulated data. Unlike traditional NILC, we did not assume a single value of Γ but rather performed extractions for a range of reasonable values; specifically, ten values of Γ with logarithmic spacing over [1-10] (i.e., 1, 1.292, 1.668, ..., 10). A total of 80 decomposed y-maps were created for a single simulation of the sky, given the eight harmonic bands and ten values of Γ.
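For reference, this Γ grid is generated by a single logarithmic-spacing call, which reproduces the quoted values:

```python
import numpy as np

# Ten logarithmically spaced values of Gamma from 1 to 10
gammas = np.logspace(0.0, 1.0, 10)   # 1.000, 1.292, 1.668, ..., 10.000
n_bands = 8                          # wavelet bands per extraction
n_maps = n_bands * len(gammas)       # 80 decomposed y-maps per simulation
```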
Varying Γ provided information on how the ILC weights reacted to the surrounding contamination. One could imagine scenarios where using one value would be more suitable than others. For example, Figure 1 shows how altering Γ affected the extraction for a pixel near the center of a strong cluster. According to Equation 12, extracting the SZ signal around a massive cluster should be done with a large value of Γ. In this case, the signal was large enough that one becomes more concerned about the bias than about contamination. The residual images shown in the top row of Figure 1 demonstrate that larger values of Γ yielded better estimates near a strong SZ source. On the other hand, Figure 2 shows a case where the SZ signal was weak and coincident with a known radio source from the PySM simulations. In this case, Γ = 1 helped reduce the contaminating signal from the radio source. One can see the true values (red lines) did not agree well with any of the NILC solutions for the first couple of wavelets; however, the solution for Γ = 1 was the closest match. Based on the arguments above, one can achieve better NILC extractions when Γ is allowed to vary.

Now, we seek a model that can take NILC extractions from various Γ values and improve estimations of the SZ signal. In order to tackle this problem, we used a fully connected, feed-forward network known as a multilayer perceptron (MLP). MLPs consist of multiple layers of interconnected nodes, known as neurons. The networks are designed to process information from the input layer, through a set of hidden layers, to the output layer. Each neuron receives signals from the neurons in the previous layer and returns a weighted linear combination of its inputs. The result then undergoes a non-linear activation function and is fed as input to the next layer. The parameters of the network are trained with supervised learning through an iterative procedure known as gradient descent. After each iteration, the trainable parameters are updated to minimize the error between the desired output and the predicted output.
The network performed regression by minimizing the mean absolute error (MAE) loss function,

$\mathrm{MAE} = \langle\, |\hat{y} - y'|\, \rangle,$

where y′ denotes the best possible NILC solution. In other words, y′ was constructed by taking, on each wavelet scale, the Γ that yielded the closest value to the true SZ signal and then summing the scales together. Examples of this are shown as the black circles in Figure 1 and Figure 2. We note the only purpose of y′ was for training the MLP; it is not available for real data. We chose to regress on y′ instead of the true SZ values to keep our model from over-correcting and to keep the results within the confines of the NILC method. Taking Figure 2 as an example, many of the NILC predictions were far from their true values when extracting the SZ signal in a highly contaminated region. If we regressed on the true values, this region would increase the MAE and the model would work hard to accommodate the large residuals. Replacing the true values with y′ mitigates this impact by reducing the range of residuals.

A number of hyperparameters had to be defined, and those of our network are shown in Table 1. The input layer accepted a matrix of shape (K × N_Γ), where N_Γ represents the ten Γ values. The data were then flattened into a vector before passing to the first hidden layer. The MLP contained nine hidden layers with Leaky ReLU activation functions (slope of 0.01 for negative values) after each hidden layer. The output layer used a linear activation and yielded a single number. In addition, normalization was applied to the input and output data to help train the model by scaling and shifting them to have zero mean and unit variance. We tuned our model to find optimal values of the initial learning rate, mini-batch size, number of layers, and number of nodes per layer. We used the Adam optimizer (Kingma & Ba 2014) with an initial learning rate of 0.001 and exponential decay rates of 0.9 and 0.999 for the first and second moments, respectively. The most sensitive hyperparameter was the learning rate. All other hyperparameters, such as the batch size and network architecture, did not require meticulous tuning. Finally, we restricted the set of training pixels to Galactic latitudes |b| > 10° to avoid training in the most highly contaminated regions.
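A forward-pass sketch of this architecture, in plain NumPy, is given below. The hidden-layer width is a placeholder (Table 1 holds the actual values), and training with Adam and the MAE loss is omitted; this only illustrates the flatten-then-propagate structure.

```python
import numpy as np

rng = np.random.default_rng(2)

def leaky_relu(x, slope=0.01):
    """Leaky ReLU with slope 0.01 for negative inputs."""
    return np.where(x > 0, x, slope * x)

class MLP:
    """Forward pass only: (K x N_Gamma) input flattened, nine hidden
    layers with Leaky ReLU, linear scalar output. The width of 64 is
    a placeholder, not the tuned value from Table 1."""
    def __init__(self, n_in, hidden=(64,) * 9):
        sizes = (n_in,) + hidden + (1,)
        self.layers = [(rng.normal(0.0, np.sqrt(2.0 / a), (a, b)), np.zeros(b))
                       for a, b in zip(sizes[:-1], sizes[1:])]
    def __call__(self, x):
        x = np.asarray(x, float).reshape(-1)   # flatten (K x N_Gamma)
        for i, (W, b) in enumerate(self.layers):
            x = x @ W + b
            if i < len(self.layers) - 1:       # linear output layer
                x = leaky_relu(x)
        return float(x[0])

net = MLP(n_in=8 * 10)                 # K = 8 wavelet bands, N_Gamma = 10
y_pred = net(rng.normal(size=(8, 10)))
```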

RESULTS
In this section, we present the main results from the validation data. Specifically, we compare the performance of Deep-NILC to the y-maps generated with fixed values of Γ. We consider three nominal values of Γ to give a sense of how NILC extractions behave: Γ = 1, 4.64, and 10. (These were selected from our set of logarithmically spaced Γ values.) First, we show residual statistics between the estimated y-maps and the true signal, including power spectra, bias estimates, and cross-correlations with the strongest contaminants. Then we present the radial profile extractions for resolved systems.

Residual Power Spectra
Here we consider the power spectra of the residual maps (ŷ − y), where y is the true SZ signal. The residual power, d_ℓ, is defined as $d_\ell = \frac{\ell(2\ell+1)}{2\pi}\Delta C_\ell$, where ∆C_ℓ was computed from the residual map after setting pixels at low Galactic latitude (|b| < 10°) to zero. In Figure 3, we show the residual power for five y-maps. The gray line shows the results for the best possible NILC solution (y′); the black line denotes Deep-NILC; and the colored lines display the results for fixed values of Γ. As expected, the gray line showed the smallest amount of residual power. The blue line exhibited the largest amount of power at ℓ ∼ 200, but less power than the other values of Γ at ℓ ≲ 40. Deep-NILC had less residual power than all fixed values of Γ on all angular scales.

Bias and Scatter
This section presents the residuals as a function of SZ signal strength. For this part, pixel values were first separated into logarithmic bins of size 0.25 dex. Then we calculated two quantities: the empirical bias, B = ⟨ŷ − y⟩, and the scatter, Σ² = ⟨(ŷ − y − B)²⟩. In Figure 4, we plot the average values of the three validation maps. The color coding is the same as in Figure 3.
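These binned statistics can be computed as follows; the bin placement details (edges anchored at the minimum of the data) are our own choice for this sketch.

```python
import numpy as np

def binned_bias_scatter(y_true, y_est, dex=0.25):
    """Empirical bias B = <y_est - y_true> and scatter
    Sigma = sqrt(<(y_est - y_true - B)^2>) in logarithmic bins of
    the true signal, 0.25 dex wide."""
    logy = np.log10(y_true)
    edges = np.arange(logy.min(), logy.max() + dex, dex)
    idx = np.digitize(logy, edges)
    results = {}
    for i in np.unique(idx):
        resid = y_est[idx == i] - y_true[idx == i]
        B = resid.mean()
        results[i] = (B, np.sqrt(np.mean((resid - B)**2)))
    return results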
In the top-left panel, we show the fractional empirical bias. The Γ = 1 extraction yielded a negative fractional bias, especially for moderate to large SZ signals. Larger values of Γ (i.e., 4.64 and 10) showed almost no bias for y > 10^-7 but yielded a positive bias for weaker signals. Finally, the best NILC solution showed a slight negative fractional bias that decreased as the SZ signal increased, albeit only by a few percent. Deep-NILC produced an unbiased solution for strong SZ signals (y ≳ 10^-5); however, it returned a steep negative bias toward weaker signals, reaching a 50% bias by y ∼ 10^-7. We note the empirical bias is not the same as the ILC bias presented in Equation 10. The ILC bias stems from the number of available modes used to construct the covariance matrix in each wavelet band. The empirical bias includes the ILC bias, but it also depends on the contamination properties, which are not stationary.
The top-right and bottom-left panels show the fractional empirical scatter. In the bottom-left panel, dashed and solid lines denote the empirical bias and scatter, respectively. The circle points represent these two quantities added in quadrature, which we call the total fractional error. One can see the total fractional error for Γ = 1 was dominated by the bias for large SZ values, but this changed as the signal dropped below y < 10^-5. For Γ = 10, however, the bias remained subdominant to the scatter for all values. For y′, the total error was controlled by the scatter at all signal strengths. The same trend was also observed for Deep-NILC.
In the bottom-right panel we show the expected signal-to-noise, which we define as

$S/N = \frac{y}{\sqrt{B^2 + \Sigma^2}}.$

The main takeaway from this plot is that the S/N for Deep-NILC was larger than for all fixed values of Γ over most values of the SZ signal. We discuss the implications of these results in section 4 and argue that Deep-NILC offers an improved solution.

Component Correlation
These results show the correlation between the residual y-maps and known sources of contamination. We used the correlation coefficient

$r_\ell = \frac{C_\ell^{a\times b}}{\sqrt{C_\ell^{a\times a}\, C_\ell^{b\times b}}},$

where a and b represent two maps of the sky, and a × b denotes a cross-correlation. We set one of these to be the residual y-map while the other was a map of the contaminating component. Most of the components did not show significant correlations with the residuals. The strongest correlations occurred with the SZ effect itself and with the CIB. The cross-correlation coefficients between the residuals and the SZ signal are shown in the left panel of Figure 5. A strong negative correlation existed for Γ = 1 on nearly all scales. As Γ increased, however, the correlation coefficient decreased; essentially no correlation was seen by Γ = 10. The best NILC solution also exhibited a slight negative correlation and was similar to the results of Γ = 4.64 and Deep-NILC.
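This estimator can be sketched in one dimension. Real all-sky maps would use spherical-harmonic spectra (e.g., healpy.anafast); the flat 1-D Fourier stand-in below forms the same ratio of cross- to auto-power, averaged in bins of mode number.

```python
import numpy as np

def binned_corr(a, b, nbins=8):
    """Correlation coefficient r = C^{a x b} / sqrt(C^{a x a} C^{b x b})
    per bin of Fourier mode. 1-D flat-sky stand-in for the spherical
    estimator applied to maps."""
    fa, fb = np.fft.rfft(a), np.fft.rfft(b)
    cab = (fa * np.conj(fb)).real[1:]    # drop the mean (monopole) mode
    caa = np.abs(fa)[1:]**2
    cbb = np.abs(fb)[1:]**2
    groups = np.array_split(np.arange(cab.size), nbins)
    return np.array([cab[g].sum() / np.sqrt(caa[g].sum() * cbb[g].sum())
                     for g in groups])
```

By the Cauchy-Schwarz inequality each binned coefficient lies in [-1, 1]; identical inputs give +1, sign-flipped inputs give -1, and independent noise gives values near zero.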
The right panel of Figure 5 shows the residual correlation with the CIB. Positive correlations were seen on moderate to large scales for most values of Γ. The most positive correlations were observed for the largest values of Γ. Deep-NILC and y′ produced nearly identical correlations and were quite similar to that of Γ = 4.64.

Radial Profiles of Resolved Systems
Here we explain how radial profiles of the resolved systems were extracted from a given y-map. Before the extractions, the largest SZ values were masked to mitigate the contributions from nearby signals: we masked the pixels in the top 5% of pixel values found within a 20° radius in the background SZ. The group- and Coma-like signals were extracted out to 400′ and 300′, respectively, in 10′ circular bins. Mean values calculated from the outer five bins were then subtracted off as a local background correction for each field. The Virgo-like system was extracted out to 700′, using the last ten annuli for the local background correction. Next, we calculated the stacked signal for each type of system. For the ten Coma- and Virgo-like signals, we took the mean value and the standard deviation across each annulus as a measure of the uncertainty. We did not use the error on the mean, since these signals are unique in the sense that we can only make one observation of them in the real Universe. For the galaxy group-like signal, we performed bootstrap sampling using sample sizes of ten. Galaxy group signals are too weak to be studied individually, so in practice one would have to stack their signals to get a robust measurement. We chose a sample size of ten to mimic the sample of nearby galaxy groups used in Pratt et al. (2021).
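The extraction steps above (annular binning, outer-bin background subtraction, and bootstrap stacking) can be sketched as follows. This is an idealized flat-sky version with hypothetical function names, where pixels stand in for arcminutes:

```python
import numpy as np

def radial_profile(ymap, center, rmax, dr=10):
    """Mean y in circular annuli of width dr, with a local background
    estimated from the mean of the outer five bins subtracted off."""
    ny, nx = ymap.shape
    yy, xx = np.indices((ny, nx))
    r = np.hypot(yy - center[0], xx - center[1])
    edges = np.arange(0, rmax + dr, dr)
    prof = np.array([ymap[(r >= lo) & (r < hi)].mean()
                     for lo, hi in zip(edges[:-1], edges[1:])])
    return prof - prof[-5:].mean()   # local background correction

def bootstrap_stack(profiles, n_boot=1000, sample_size=10, seed=0):
    """Bootstrap the stacked (mean) profile using resampled sets of
    `sample_size` systems, returning the mean stack and its scatter."""
    rng = np.random.default_rng(seed)
    profiles = np.asarray(profiles)
    stacks = [profiles[rng.integers(0, len(profiles), sample_size)].mean(axis=0)
              for _ in range(n_boot)]
    return np.mean(stacks, axis=0), np.std(stacks, axis=0)
```

Here `sample_size=10` mirrors the ten-system samples used for the group-like signal.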
In the left column of Figure 6, we show the stacks of resolved signals. The results from Deep-NILC are plotted as black circles, while those from the three nominal values of Γ are shown as colored stars. We omit the results for Γ = 4.64 in the group- and Virgo-like rows for visualization purposes; the results for Γ = 4.64 generally lay between those from Γ = 10 and Γ = 1. The black line represents the injected profile that we tried to recover; however, it is more important to look at the cyan data. These points represent the true stacked signal, which is the superposition of the background SZ signals and the injected signals. In reality, one cannot perfectly recover the black line, as the injected signal becomes convolved with the background SZ.
For the galaxy group-like signal, the results of Deep-NILC generally fell between those of Γ = 10 and Γ = 1. The statistical errors for Deep-NILC were roughly 60% of those for Γ = 10; however, Deep-NILC yielded a larger bias, as seen in Table 2. There was very little difference between Deep-NILC and Γ = 10 for the Coma-like signal, but Γ = 1 performed poorly, with large bias. Looking at the Virgo-like system, there was a clear negative bias for Deep-NILC, while Γ = 10 returned accurate and acceptable results. For closer inspection, we plot the empirical bias, scatter, and total error for the stacks in the appendix.
In the right panel of Figure 6 we show the cumulative "SZ flux," given as

Y(< θ_j) = 2π Δr Σ_{i ≤ j} y_i θ_i,

where each radial bin is indexed by i, y_i is the mean signal in bin i, θ_i is the central radius of bin i, and Δr = 10′ is the angular width of each bin. This observed quantity is important as it is directly related to the total gas mass enclosed within some radius. We estimated the SZ flux values within 1, 2, and 3 R500 for each system, indicated by the dotted brown lines. We then calculated the percent differences from the true Y for the three Γ values and Deep-NILC, which are presented in Table 2.
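Under this definition, the cumulative flux accumulates over annuli. A minimal sketch, assuming y_i is the mean signal in bin i and θ_i its bin-center radius:

```python
import numpy as np

def cumulative_sz_flux(y_prof, dr=10.0):
    """Cumulative Y(<theta) = 2*pi*dr * sum_i y_i * theta_i over annuli,
    with theta_i the central radius of bin i (same units as dr)."""
    theta = dr * (np.arange(len(y_prof)) + 0.5)   # bin-center radii
    return 2.0 * np.pi * dr * np.cumsum(y_prof * theta)
```

For a uniform profile y = 1, this reduces exactly to Y(<θ) = πθ² at each outer bin edge, a useful sanity check.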
We discuss these results further in section 4.

DISCUSSION
Our results demonstrated that Deep-NILC outperformed NILC algorithms that assumed a fixed value of Γ. Small values of Γ, specifically Γ = 1, produced highly biased results. This was most evident in the top-left panel of Figure 4, where the average bias was ∼10%-20%. It was also manifested in the bottom-left panel of Figure 4, where the bias term provided the largest contribution to the total fractional error for large SZ signals. The same idea was reflected in the left panel of Figure 5, which showed a very strong correlation between the residuals and the SZ signal. This is why Γ = 1 exhibited the largest residual power in Figure 3.
Furthermore, Γ = 1 yielded the worst extraction for Coma- and Virgo-like systems; however, its performance was better for the galaxy group-like signal. For the pixel-by-pixel statistics, Γ = 1 yielded the best results for weaker SZ signals and on the largest scales. Part of the reason could be the lack of correlation seen between the residuals and the CIB. In other words, smaller values of Γ did better at extracting weak and/or large-scale signals because they reduced the effects of contamination. Nevertheless, the large observed bias was prominent and would not be suitable for most SZ studies.
The story was essentially reversed when looking at larger values of Γ. We found that Γ = 10 provided the most unbiased solutions but with the strongest residuals.
In Figure 4 we demonstrated that the bias term was insignificant compared to the empirical scatter. Moreover, there was a modest positive bias for weak SZ signals at large radii, which likely coincided with the positive correlation seen with the CIB in Figure 5.
Deep-NILC outperformed the extractions for both large and small values of Γ by providing a balance between the bias and variance. This balance was seen in the bottom-left panel of Figure 4, where Deep-NILC approached the solution for Γ = 10 for large SZ values and that of Γ = 1 for weak signals. These results mean that Deep-NILC was able to adjust the localization based on the strength of the SZ signal. We also observed Deep-NILC to have the smallest residual power in Figure 3 and more accurate recoveries of the group- and Coma-like profiles in Figure 6 and Table 2.
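Schematically, the key ingredient is a network that maps the vector of per-Γ NILC estimates for a pixel to a single SZ prediction. The toy numpy MLP below illustrates this mapping only; the architecture, sizes, and training details are placeholders, not those of Deep-NILC (Table 1 lists the actual network parameters):

```python
import numpy as np

def train_combiner(X, y, hidden=16, lr=0.05, epochs=1500, seed=0):
    """Toy one-hidden-layer MLP: columns of X hold the per-Gamma NILC
    estimates for each pixel, y is the target SZ signal. Trained with
    full-batch gradient descent on the mean-squared error."""
    rng = np.random.default_rng(seed)
    w1 = rng.normal(0.0, 0.5, (X.shape[1], hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, 0.5, hidden)
    b2 = 0.0
    for _ in range(epochs):
        h = np.tanh(X @ w1 + b1)                 # hidden activations
        pred = h @ w2 + b2
        g = 2.0 * (pred - y) / len(y)            # dMSE/dpred
        gh = np.outer(g, w2) * (1.0 - h ** 2)    # backprop through tanh
        w2 -= lr * (h.T @ g)
        b2 -= lr * g.sum()
        w1 -= lr * (X.T @ gh)
        b1 -= lr * gh.sum(axis=0)
    return lambda Xn: np.tanh(Xn @ w1 + b1) @ w2 + b2
```

The returned closure plays the role of the trained predictor: applied row by row, it combines the different Γ extractions into one estimate per pixel.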

Caveats
Deep-NILC comes with limitations. Most importantly, the algorithm relies heavily on the quality of simulated data. The data used in this work were generated using a set of ten different y-maps but only one realization of the contaminating components. One might argue that this could lead to over-fitting to this specific set of signals. In the future, emphasis should be placed on varying the foreground emission models. However, we note that the NILC estimate was also affected by other nearby SZ signals, as they were contained in the covariance matrix. In this sense, the SZ signal can be thought of as a contaminant itself in NILC. One way around this might be to mask strong SZ sources before constructing R. Nevertheless, the different y-maps we used gave a variety of SZ distributions, and this helped provide generalization to our model.
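The masking idea could be realized along the following lines. This hypothetical sketch (the function `masked_covariance` and the 5% threshold are our own choices) simply excludes the strongest-SZ pixels before forming the empirical frequency-frequency covariance:

```python
import numpy as np

def masked_covariance(freq_maps, sz_template, frac=0.05):
    """Empirical covariance R with the strongest-SZ pixels excluded.

    freq_maps   : (n_freq, n_pix) array of observed maps
    sz_template : (n_pix,) array used to rank pixels by SZ strength
    frac        : fraction of strongest-SZ pixels to exclude
    """
    thresh = np.quantile(sz_template, 1.0 - frac)
    keep = sz_template < thresh
    d = freq_maps[:, keep]
    d = d - d.mean(axis=1, keepdims=True)   # remove per-frequency mean
    return d @ d.T / keep.sum()
```

In a full NILC pipeline this would be applied per wavelet band and per localization domain rather than globally as shown here.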
Also, building a model with supervised learning is sensitive to the diversity of training examples. We had to include additional resolved SZ signals to augment the y-maps from Han et al. (2021). This was imperative for our model to learn about the resolved "anomalies" seen in our real sky, such as the Coma and Virgo clusters. It is likely, however, that there were not enough examples of Virgo-like clusters in the training data. The Virgo cluster is unique in that it is the only prominent SZ signal that extends multiple degrees on the sky. In our methodology, we injected resolved systems based on the MCXC catalog, so only a few instances of Virgo-like systems were included in the training set. In Figure 6, a non-trivial bias was observed for the Virgo-like stack; however, this was not observed for the Coma-like and galaxy group-like signals, which were much smaller in angular size. Furthermore, the majority of SZ sources produced zero signal on large scales. Taken together with the sample imbalance, this probably caused Deep-NILC to under-predict the signal of Virgo.
Another potential drawback of our simulated data set was the omission of spatially correlated components. For example, a background AGN may produce strong emission in both the radio and the IR. If strong enough, such sources may significantly affect the NILC extraction. Also, radio-loud AGNs often sit at the centers of galaxy clusters (e.g., the Virgo cluster). Deep-NILC was not trained to see these spatially correlated components; thus, we caution that Deep-NILC could yield unreliable results in some instances of real data. This may be worth exploring in future work as simulations improve. Finally, Deep-NILC cannot perform better than the Γ values used here: if no combination of the Γ values could yield reliable results, then Deep-NILC would not do any better. Furthermore, we did not investigate the full parameter space of NILC. Even though we considered a grid of spatial parameters, we chose to fix the set of harmonic window functions. Using a more diverse set of harmonic filters could help improve the performance of Deep-NILC, but we leave this to future work.

SUMMARY
This study investigated the systematic effects of the NILC method when extracting the SZ signal. Our methods included injecting synthetically generated y-maps into a simulated foreground of radio/IR emissions. These emissions were then computed as band-averaged signals that mimic the observations made by the Planck and WMAP satellites.
Next, we performed the NILC component-separation technique in an attempt to recover the input SZ signal. The main focus was on the spatial parameter, Γ, which controls the amount of contamination and bias. Mathematically, we demonstrated that an optimal value of Γ exists that minimizes the contribution from both of these effects. It was then argued that it is nearly impossible to analytically calculate the best value of Γ a priori, as it depends on many complicated variables, such as the varying level of nearby contamination.
We went on to show how altering the value of Γ significantly changes the performance of the SZ extraction. Specifically, we found that small values of Γ tend to yield better results when attempting to extract weak SZ features and/or when there are significant contributions from contamination. On the other hand, larger values of Γ yielded unbiased solutions and performed better when extracting large SZ signals.
Rather than using a fixed value of Γ, we developed an algorithm to improve the NILC solution, known as Deep-NILC. Deep-NILC required various NILC extractions for different values of Γ. These were then fed into an MLP to predict the SZ signal. The main takeaways from our experiments are the following:
• Deep-NILC yielded smaller residual power on all angular scales compared to all fixed values of Γ, as shown in Figure 3.
• The correlation between the Deep-NILC residuals and known contaminants, particularly the CIB and the SZ signal itself, was closest to that obtained by the best possible NILC solution (see Figure 5).
• The sum of the empirical bias and scatter terms seen in Figure 4 suggested that Deep-NILC rendered the smallest deviations for nearly all strengths of SZ signal. The one exception was at y ≳ 10⁻⁴, where Γ = 10 performed slightly better, though the difference was marginal.
• When stacking the resolved profiles for the galaxy group-like signal, Deep-NILC yielded the best results when considering the total contributions from the bias and scatter.
• Deep-NILC under-predicted the signal coming from a Virgo-like source, which was likely due to an imbalance in the training data. However, the inferred SZ flux for Deep-NILC was biased only at the 10% level.
There are many other approaches that can, and should, be investigated using machine learning. As the quality of simulated data continues to improve and as future missions bring observational insights, data-driven models may soon supersede analytical techniques. Deep-NILC provides a bridge between analytical and data-driven solutions that could offer improvements to a variety of problems. In the future, one might be able to build a model that learns the SZ signal without any pre-processing by component separation.
We would like to thank the anonymous referee for providing excellent feedback that helped improve this work. We would also like to thank Camille Avestruz, Jennifer Li, and Ashley Villar for useful conversations regarding DL. In addition, we want to thank Charles Antonelli and John Theils for computational support. This project also made use of the Great Lakes computing cluster at the University of Michigan-Ann Arbor. We are grateful for support from NASA grants AWD012791 and 21-ADAP21-0069.

Figure 1.
Figure 1. An example of how different values of Γ affect the NILC extraction of the SZ signal. The images in the top row are 10° × 10° and centered on a strong simulated SZ source. The leftmost panel shows the input SZ signal. The panels to the right show residuals between the input and reconstructed y-maps using three values of Γ. The eight panels below display the different estimates of the central pixel in the above images, shown as a function of wavelet scale and Γ. The best possible NILC solution given an array of Γ values is shown as filled circles; this is the closest estimate to the true value, which is shown as a horizontal red line. Summing the true values (red lines) over each wavelet scale yields the total SZ signal. The main point of this plot is to demonstrate how the choice of Γ impacts the estimate of the SZ signal. It also shows that larger values of Γ are typically preferred when performing extractions where the SZ signal is strong.

Figure 2.
Figure 2. Same as Figure 1 except for the change in location on the sky. This example is centered on a pixel with a weak SZ signal that also coincides with a known radio source from the simulations. Smaller values of Γ were preferred in this scenario, which was generally true for pixels with weak SZ signals and/or strongly contaminated regions.

Figure 3.
Figure 3. Average residual power spectra for the three validation y-maps. The gray line denotes the residual for the best possible NILC solution, and black shows the results from Deep-NILC. The blue, green, and red lines are the residuals for Γ = 1, 4.64, and 10, respectively.

Figure 4.
Figure 4. (top-left) The fractional bias as a function of SZ signal strength. We show the results for the three nominal values of Γ, Deep-NILC, and the best possible NILC solution. (top-right) Same as the top-left panel but for the empirical scatter. (bottom-left) The contributions from each source of error. The dashed lines show the absolute values of the top-left panel, solid lines show those from the top-right panel, and the circles denote the total contribution from both components. (bottom-right) The expected S/N for the pixel values. The y-axis is truncated at S/N = 1, and the plot is truncated where log10 y = −5.

Figure 5.
Figure 5. Cross-correlation coefficients using the residuals from y′, Deep-NILC, and the three nominal values of Γ. (left) Correlation between the residuals and the SZ signal. The best possible NILC solution is in gray and lies just below zero for all ℓ. (right) Same as the left panel but for the CIB. The main takeaway here is that Deep-NILC (black line) closely followed the best NILC solution (gray line), which is desirable.

Figure 6.
Figure 6. The left column shows the stack of ten radial profiles for three resolved signals. The right column shows the cumulative SZ flux. The group-, Coma-, and Virgo-like signals are shown in the top, middle, and bottom panels, respectively. The results from different values of Γ are shown as stars, those from Deep-NILC as black circles, and the true signal as cyan squares. The stars and circles are slightly offset in the horizontal direction to help with visualization. Extractions for Γ = 4.64 are not shown in the group- and Coma-like panels to avoid over-crowding. The black line shows the profile that was injected into the simulated y-map, and the dotted brown lines indicate the positions of 1, 2, and 3 R500. The error bars show the standard deviation across the stack for each annulus. The cyan points do not exactly match the input profile since the injected signal has been convolved with other nearby SZ signals. The error bars on the cyan squares are the statistical fluctuations of the true signal within the stack after some masking. The main point of this plot is to show how well resolved signals are recovered with Deep-NILC and different values of Γ.

Figure 7. Figure 8. Figure 9.
Figure 7. The average bias, scatter, and total error for galaxy group-like signals are shown in the top, middle, and bottom panels, respectively. These quantities are shown in the left column, and as a fraction of the SZ signal in the right column.

Table 1.
Neural Network Parameters

Table 2.
Percent Difference in Y