Detection of Dipole Modulation in CMB Temperature Anisotropy Maps from WMAP and Planck using Artificial Intelligence

Breakdown of rotational invariance of the primordial power spectrum manifests in the statistical anisotropy of the observed Cosmic Microwave Background (CMB) radiation. Hemispherical power asymmetry in the CMB may be caused due to a dipolar modulation, indicating the presence of a preferred direction. Appropriately re-scaled local variance maps of the CMB temperature anisotropy data effectively encapsulate this dipolar pattern. As a first-of-its-kind method, we train Artificial Neural Networks (ANNs) with such local variances as input features to distinguish statistically isotropic CMB maps from dipole modulated ones. Our trained ANNs are able to predict components of the amplitude times the unit vector of the preferred direction for mixed sets of modulated and unmodulated maps, with goodness of fit ($R^2$) scores $>0.97$ for full sky, and $>0.96$ for partial sky coverage. On all observed foreground-cleaned CMB maps, the ANNs detect the dipolar modulation signal with overall consistent values of amplitudes and directions. This detection is significant at $97.21\%-99.38\%$ C.L. for all full sky maps, and at $98.34\%-100\%$ C.L. for all partial sky maps. Robustness of the signal holds across full and partial skies, various foreground cleaning methods, inpainting algorithms, instruments and all the different periods of observation for Planck and WMAP satellites. The significant and robust detection of the signal, in addition to the consistency of values of amplitude and directions, as found independent of any pre-existing methods, further mitigates the criticisms of look-elsewhere effects and a posteriori inferences for the preferred dipole direction in the CMB.


INTRODUCTION
Departures from Statistical Isotropy (SI) of the Cosmic Microwave Background (CMB) temperature field may indicate limitations or errors in measurement of the CMB despite the use of highly precise instruments for observation, if not due to an actual breakdown of the rotational invariance of the primordial power spectrum. However, by means of appropriate statistical methods, systematic effects or foreground residuals may be considerably eliminated as possible causes of deviations from SI.
Several such departures from SI have been studied by authors in existing literature. These include the unusually low cosmic quadrupole (Bennett et al. 2003a;Gaztañaga et al. 2003;de Oliveira-Costa et al. 2004a), and planarity of the cosmic octupole and the quadrupole-octupole alignment as investigated by de Oliveira-Costa et al. (2004a); Tegmark et al. (2003); de Oliveira-Costa et al. (2004b); Schwarz et al. (2004a). The quadrupole-octupole alignment was seen to get strengthened on removal of the frequency dependent kinetic Doppler quadrupole (Notari & Quartin 2015). The low multipole regime was studied by Schwarz et al. (2004b) with the help of multipole vectors and found to be consistently anomalous with respect to multipole aligments. Further, Land & Magueijo (2005a) showed that a mysterious correlation exists between azimuthal phases of the third and fifth multipole moments. A significant power asymmetry between the two hemispheres of the CMB was found by Eriksen et al. (2004) ;Hansen et al. (2004); Eriksen et al. (2007) and further corroborated by Bernui et al. (2014).
A power excess for odd multipoles was studied in the work of Land & Magueijo (2005b). This parity asymmetry in the CMB angular power spectrum (APS) was confirmed by Kim & Naselsky (2010a,b) and the anomaly was seen to disappear without the contribution of the first six low multipoles (Aluri & Jain 2012). Using symmetry-based methods of power and directional entropy statistics, Samal et al. (2008Samal et al. ( , 2009 showed that the departures from SI extend to higher multipoles as well. For scales above 60 • , nearly negligible correlation was seen by Copi et al. (2007Copi et al. ( , 2009Copi et al. ( , 2015 with various CMB data releases. Kim & Naselsky (2011) showed that the occurrence of parity asymmetry in the APS is equivalent to this deficit of large angle correlation. Further, on the basis of behaviours of level clustering and repulsion for uncorrelated and correlated values, Khan & Saha (2022a) showed that only the level correlations between even multipoles is anomalously low. Zhao (2014) studied a directional dependence of the parity asymmetry and suggested a common origin of the low multipole anomalies.
Additionally Larson & Wandelt (2004) showed that the mean values of hot and cold spots of the CMB are unexpectedly low, while Monteserín et al. (2008) found that the variance of the CMB temperature anisotropy field is also anomalously low. The low CMB variance anomaly was seen to vanish when the quadrupole and octupole were excluded from the CMB maps under investigation (Cruz et al. 2011). Using novel statistics to measure the strength and shape of distribution of CMB local extrema, Khan & Saha (2022b) found a strikingly weak non-uniformity in the distribution of hot and cold spots on the CMB, which is due to the low CMB temperature variance and anomalous contributions of the quadrupole and octupole.
It is important to investigate any CMB anomaly from as many perspectives as possible to assess its significance and role in cosmological parameter estimation. For example, the direction associated with CMB parity asymmetry aligns at about 45 • from a best-fit dipole form for various cosmological parameters (Yeung & Chu 2022). Besides, a directional variation of the cosmological parameters on the CMB sky was found to be significantly anisotropic and this finding is corrrelated with the preferred direction for the hemispherical power asymmetry anomaly (Fosalba & Gaztañaga 2021). Since these works report a correlation between these departures from SI of the CMB and the anisotropic directional dependence of cosmological parameters, hence it becomes difficult to disregard the violations of SI as mere statistical fluctuations (Aluri et al. 2022).
These departures from SI were found to be robust against masking of the CMB sky, instruments used for observation, foreground cleaning methods, periods of observation, bands of frequencies at which the CMB is observed, and the like. Further, checks of robustness help reduce the possibility that the significant results can be attributed to look-elsewhere effects. Many such independently conducted findings of deviations from SI also weaken the inference that the consequent signal detection could have happened solely due to the nature of estimators which were designed by hand 'a posteriori' (Bennett et al. 2013;Rassat et al. 2014) to focus on some unusual features.
However, despite the high statistical significance of most of such departures from SI, they are ascertained to be fairly within the underlying probability distribution given by the ΛCDM model. Thus we can have either of two possible conclusions: (a) we may say that we happen to inhabit a rare realisation of the universe given by the ΛCDM standard model, or (b) we inhabit a reasonably probable realisation of a different model. The latter case then warrants contemplation of new physics beyond the Standard Model of Cosmology.
One of such departures from SI which has been robustly observed, is the hemispherical power asymmetry (Eriksen et al. 2004(Eriksen et al. , 2007. It was hypothesised to be engendered by the addition of a dipolar modulation to otherwise statistically isotropic CMB temperature anisotropy fluctuations T 0 (n). Thus, the net temperature anisotropies in this scenario are where, the amplitude of modulation is denoted by A, and the preferred direction is given by the unit vectorλ. In harmonic space, the temperature fluctuations T 0 (n) are decomposed as: These T 0 (n) are expected to be Gaussian random and generated from a rotationally invariant primordial power spectrum. Hence there are no preferred directions in the standard model that may couple modes of these temperature fluctuations in harmonic space. This notion of SI is encapsulated in the relation a m a * m = C δ δ mm .
Thus the spherical harmonic coefficients a m are uncorrelated between different multipoles. However, if the dipole modulated T (n) are similarly decomposed in spherical harmonics, the corresponding a m will contain correlations between multipoles and + 1 (Rath et al. 2015), indicating a violation of SI.
As an entirely novel approach towards understanding the possible presence of a dipolar modulation in CMB temperature anisotropy data, we employ Artificial Neural Networks (ANNs). ANNs are computer based analogs of networks of biological neurons, and constitute an important machinery with decision making and parameter estimation capabilities, that falls under the umbrella of Artificial Intelligence (AI). We use deep learning techniques to train the ANNs on a mixed set containing equal numbers of simulated SI obeying (unmodulated) and SI violating (dipole modulated) CMB maps, which is inclusive of a large number of possibilities of the presence or absence of the signal. Thus our trained ANNs can make a self-guided and robust estimation of the presence of the signal of dipolar modulation, quantified with the value of the amplitude. The rationale behind using the amplitude for this purpose is that CMB maps that obey SI will have zero amplitude for such modulation, whereas those that contain the modulation will have non-zero values of the amplitude. As a realistic approach, we design an ANN with partial sky coverage in addition to one that works for full sky coverage, since we may not always have completely reliable full sky observations. Besides, we are able to compute the directions of the modulation with the help of the trained ANNs. Thus our method serves as an independent investigation to establish or reject the existence of the dipolar modulation signal as seen in existing literature.
Previously, statistics or estimators have been devised to ascertain the amplitude and direction of a possible dipolar modulation in the CMB. Estimators can be constructed in pixel or harmonic space, as per the requirements of the studies that undertake the same. For example, since the amplitude of the modulation has been shown to be dependent on the scale (Hoftuft et al. 2009) and hence the multipole range under consideration, studying estimators in multipole space helps estimate this scale dependence (Marcos-Caballero & Martínez-González 2019; Rath & Jain 2013). Whereas, an analysis in pixel space can be immensely useful so as to avoid subtle biases introduced due to masking of the sky that causes extraneous couplings in multipole space (Hivon et al. 2002), or those caused due to inpainting of partial sky maps (Starck, J.-L. et al. 2013). In this work, we train ANNs with normalised or re-scaled local variance maps (Akrami et al. 2014) in pixel space, which serve as important input features containing direct information of the amplitude and directions of the dipolar modulation in the form of scalar products. This method helps us eschew the complex task of constructing statistics for detection of the signal. The ANNs are designed to work on scales of observation corresponding to the range of multipoles ∈ [2, 256]. We defer a study of the scale dependence of A to future work.
The implementation of ANNs for detecting previously studied features in the CMB could revolutionise perspectives towards understanding CMB anomalies as opposed to classical fitting or regression methods and traditional frequentist approaches. ANN architectures can 'learn' signal detection capabilities by being introduced to a training set of samples. Once trained, the ANN can then be fed observed foreground-cleaned CMB data to predict a possible signal in the same.
A comprehensive review of the preliminary use of ANNs in Astronomy and Astrophysics can be found in the article by Miller (1993) with regard to telescope optics, object classification and filtering of detector events. Further Tagliaferri et al. (2003); Wang et al. (2018); Djorgovski et al. (2022); Chen et al. (2020) describe the growth of ANN based algorithms to perform time series analysis, detection of noise, and data mining in addition to classification and identification of astrophysical objects such as new stars, galaxies or even dark matter.
In Cosmology, use of ANNs has ushered in a new era of numerical frameworks to ease computations and analyses. They were used by Liu (2017) for generating dynamics of inflationary trajectories in a multi-field scenario. Dialektopoulos et al. (2022) used ANNs to reconstruct late-time expansion and large scale structure (LSS) cosmological parameters. Wang et al. (2020) used them to estimate quantities such as the Hubble parameter and luminosity distance as a function of redshift of Type Ia supernovae. Gómez-Vargas et al. (2021) modelled ANNs with Bayesian inference to calculate the likelihood function and reduce computation time for cosmological parameter estimation. Escamilla-Rivera et al. (2020) provided a combined Bayesian and Recurrent neural network approach to ascertain confidence regions for parameters from dark energy models. Besides, ANNs can be designed for estimation of parameters using the 21 cm signal from the epoch of reionization (Shimabukuro & Semelin 2017;Choudhury et al. 2021). A general overview of ANNs and their applications in analysis of cosmological data can be found in the article by de Dios Rojas Olvera et al. (2022).
Recent applications of ANNs specific to CMB data analysis can be found in the following works. Petroff et al. (2020) implemented an appreciable full-sky foreground cleaning of the observed CMB, while Wang et al. (2022) were able to recover CMB signals from foreground contaminated maps using Convolutional neural networks (CNNs). Chanda & Saha (2021) applied ANNs to successfully recover full sky CMB temperature APS from a low resolution masked or partial sky CMB map, while  designed ANNs for such estimation of full sky CMB temperature power spectrum with higher resolution partial sky CMB maps. Using CNNs  reconstructed the full sky power spectra of CMB E and B modes for such high resolution CMB maps, while minimising the leakage between the two modes. Graff et al. (2012) implemented Bayesian inference algorithms to make ANNs learn the likelihood function and estimate cosmological parameters from CMB data. Moss (2020); Hortua et al. (2020) trained ANNs to mimic mixing of Markov chains (MCs) and parameterization of Monte Carlo MC proposals. Spurio Mancini et al. (2022) developed ANN based estimators to compute the matter and CMB power spectra as a replacement of Boltzmann codes suited for both LSS and CMB surveys.
We have organised our paper as follows. In Section 2 we present the underlying formalism behind normalised local variance maps which can be directly used as input features for training a neural network. In Section 3, we briefly describe the internal structures of ANNs and the algorithms with which they function as trainable artificial analogs of biological neural networks. We elucidate our procedure for obtaining mixed sets of unmodulated and modulated CMB maps, and using them for training the ANNs in Section 4. Following this, we discuss the specific structure of our ANNs and regularization methods used to train the same for both full and partial sky maps in Section 5. The analysis of test sets and observed foreground-cleaned CMB maps are presented in Section 6, after application of our trained ANNs to those. In Section 7, we summarise our work, and enumerate the key findings of the paper.

FORMALISM
For the CMB temperature anisotropy field defined on the 2-sphere of observation, we can compute its variances inside different local regions of the sphere. We consider these regions to be discs of equal area spanning the 2-sphere. In this Section, we present the formalism of how such local variances, after appropriate re-scaling are equivalent to the amplitude times a scalar product of the constant unit vector of modulation and the mean direction of the local disc, as first utilised in the work of Akrami et al. (2014).
In HEALPix (Górski et al. 2005) notation, the parameter n side characterises the pixel resolution of a CMB map. For convenience, we denote the n side of a high resolution CMB map by n h and that of a lower resolution CMB map by n l . We consider a disc of radius r h , on the map at resolution n h . We then calculate the local mean and variance of temperature fluctuations within the disc. Thus the mean of modulated disc temperature fluctuations from equation (1) is, where d denotes expectation value over the disc. Let us consider the second term in the equation above where A is a constant. The statistically isotropic Gaussian random fluctuations T 0 are approximately independent of and uncorrelated with the variations ofλ ·n, if the disc is sufficiently small enough so that theλ ·n term is slowly varying. Hence, the expression (4) reads: The local variance within the disc is Expanding out, the expression for the disc variance becomes where σ 2 0d stands for the disc variance in the absence of a modulation. We can replace the term λ ·n d witĥ λ · n d , sinceλ for any particular CMB map is constant. Further, as the average of position vectorsn is over a disc, n d =N , whereN is the centre of the disc. Hence, However, evaluating σ 2 0d from a single Gaussian random realisation may give a biased estimate. Instead, we will compute σ 2 0d e and use that in our expression above. This expectation e is over an ensemble of statistically isotropic realisations. Thus finally we arrive at the following normalised local variance (NLV), which shall be used in all analyses hereafter. Several discs on the n h map are considered and their NLVs computed. These NLV values are then assigned to corresponding pixels of another map at a lower resolution n l . Thus to construct an NLV map, the total number of discs to be considered = 12 × n 2 l . The centre of any particular disc on the n h map is taken to be the same as the position vector of the pixel of the n l resolution map to which the NLV of that disc is assigned. With this information, we can calculate the approximate number of pixels (n pd ) of the n h map inside each disc of given radius r h (in degrees). This is expressed as We can calculate the approximate number of pixels in possibly overlapping regions as follows. For n pl = 12×n 2 l discs, the total number of pixels taken is n pt = n pl ×n pd . Thus if n pt > n ph , then there are n pt − n ph number of pixels which are present in overlapping regions of discs, and vice versa. Here, n ph = 12 × n 2 h .

HOW DOES AN ANN WORK?
An artificial neuron is the building block of an ANN, which is inspired from the concept of the biological neuron (McCulloch & Pitts 1943;Rosenblatt 1958). It combines a weighted sum of several inputs, adds a bias to that sum to give a preliminary output. Since this preliminary output is a linear mapping, no matter how many neurons are interconnected to form a network, the mapping from initial inputs to the last output can always be described as a linear mapping (M. 2016). In reality, the complex relations between inputs and expected outputs may never be reducible to a linear mapping. Hence using activation functions (Dubey et al. 2021) to introduce non-linearity becomes pertinent. Thus a modern day artificial neuron can be represented by the example in Figure 1, which shows how inputs x i are weighted by w i and summed together along with an offset or bias b. An activation function A acts on this sum to give the subsequent output y.
The initial inputs x form a layer of nodes called the 'input layer'. Several such outputs y can be formed from these initial inputs, using different weights and biases. All these y are then a new set of nodes that constitute the first 'hidden layer'. The second hidden layer can be formed by considering y's from the first hidden layer as inputs, and so on. A small ANN may have one or two hidden layers, whereas a larger ANN could comprise several such hidden layers. The last layer of any ANN consists of the final outputs and is called the 'output layer'. The number of nodes in this layer is equal to number of expected outputs that the ANN is being trained for. Due to the presence of hidden layers that are densely connected to their preceding layers, the method of training such ANNs is referred to as 'deep learning' (Emmert-Streib et al. 2020).
The activation function used to obtain the outputs usually differs from the choices in other hidden layers. For example, while the ReLU or LeakyReLU activation functions can be used for hidden layers, for the output layer, the respective activation used could be a sigmoid or sof tmax for binary or multi-class classification problems, or linear for regression problems (Lederer 2021). To illustrate the arrangement of a neural network, we show an example ANN in Figure 2. For all layers apart from the input layer, each node is fully connected to all the nodes in the previous layer.
Beginning with the input layer as the 0 th layer, we can successively number the other layers. Consider the i th node in the (l − 1) th layer, which is connected to the j th node in layer l. The associated weight and bias will be w (l) ij and b (l) j . Thus the value taken by the j th node in layer l is: If l = 1, then y (l) j represents a node in the first hidden layer and y (l−1) i = x i corresponds to that of the input layer. The 2D weight matrix between layers (l − 1) and l is given by [W (l) j . To begin with, the weights and biases for the ANN can be chosen randomly.
In the process of assigning values to nodes in subsequent layers, a forward propagation in the ANN is achieved. To verify if the final outputs are as expected, a loss function is computed for the outputs generated (Goodfellow et al. 2016). Since in this paper, we are dealing with a regression problem, we will consider the loss function as the mse or mean squared error, given as where the summation is over a total number of output values N . Thus mse is the average of squared differences between the predicted outputs (y pred ) from the ANN and the true values (y true ). To train the ANN effectively, we must perform back-propagation (Rumelhart et al. 1986), in which the weights and biases associated with the different layers are updated iteratively so as to minimize the mse loss function. The rate or step-size for updating in this process is known as the learning rate σ. Thus the basic relations which describe the process of back-propagation for a loss function H are, These relations correspond to the algorithm of gradient descent. However, such an algorithm when applied to the whole data-set can be computationally expensive.
Thus the data set is divided into several batches (Goceri & Gooya 2018) randomly, and the above algorithm is applied. A batch represents the number of samples from the data-set which are used during a part of an iteration for updating the parameters of weights and biases. A complete iteration during which the whole data-set is made to undergo the algorithm for optimization is called an epoch. Due to the random subdivision of the training set into batches, some stochasticity is introduced into the loss function. For our problem of regression of the amplitude and directions of any possible dipolar modulation, we have considered the Adam (adaptive moment estimation) optimizer (Kingma & Ba 2014), which incorporates adaptive estimates of the gradients and their squares. In this method, the parameters (weights and biases) are updated as follows.
1. A step-size or learning rate σ is specified.
5. The first moment vector m 0 , second moment vector v 0 and time-step t are initialised to zero.
6. The time step is updated as t → t + 1 .
Here, values of γ 1 = 0.9, γ 2 = 0.999, and = 10 −7 . Superscripts t in γ t 1 , γ t 2 , denote that those are raised to the power of t. The algorithm described above for the Adam optimizer is run for each of the batches within an epoch so that the ANN trains with the entire data set during one epoch itself. Several epochs may be required for the ANN to become fully trained.
At the end of an epoch, the ANN evaluates the final loss function from the set on which it is being trained. To infer whether or not an ANN is fully trained, i.e, if it is able to generalise its knowledge to sets on which it has not been trained, another data-set for validation is simultaneously considered at every epoch. The ANN acts on this set and generates the corresponding loss value. Depending on the nature of the loss function, the optimization of the same may either correspond to that of minimisation or maximisation. We consider the case of mse as the loss function, which must be minimized. Thus, over a considerable number of epochs, if the training loss does not appear to minimize, then the ANN is said to be 'under-fitting' and usually a more complex network architecture can help resolve the issue. On the other hand, if the training loss adequately reaches its minimum, while that of validation does not, then the ANN is said to be 'over-fitting'. This can be seen from the loss curves, where the validation loss curve is always above the training loss, whereas the training loss has converged to an appropriate minimum. The condition of over-fitting indicates that the weights and biases in the ANN are very well suited for the training set, but those are not optimal for the ANN to make appropriate predictions for new data sets that it has not 'seen' before.
Over-fitting can be resolved using regularization methods (Ying 2019) such as those of the L1 or L2 penalty or with the help of a 'dropout'. In designing our ANNs for estimation of dipolar modulation parameters, we have used kernel regularizers with L1 and L2 penalties in addition to a dropout. Kernel regularizers penalise the loss function of the training set by adding to it a strength factor (S) times the penalty P, which is formed from entries in the weight matrices. In case of the L1 kernel regularizer, P is the sum of absolute values of the weights, whereas for the L2 regularizer, it is the sum of the squares of the weights. On the other hand, if a dropout (Hinton et al. 2012) is applied before a layer, a fraction of the inputs from the previous layer are randomly dropped out, i.e, set to zero. This fraction is known as the rate of dropout and its value lies ∈ [0, 1].

METHODOLOGY
We have considered a mixed set of 5 × 10 4 randomly generated CMB realisations of maps at n h = 128. Half of these are statistically isotropic, and the other half are dipole modulated versions of the same.
The statistically isotropic CMB temperature maps at n h = 128 are obtained by choosing the spherical harmonic coefficients a m as Gaussian random variables with zero mean and variance given by the theoretical CMB temperature APS best fit to Planck 2018 data. Thus, we generate 2.5 × 10 4 SI obeying maps with different seed values.
In order to obtain 2.5 × 10 4 dipole modulated counterparts of the SI obeying maps at n h = 128, we utilise equation (1). We pick A from a uniform random distribution in accordance with the order of magnitude of reported values from observed foreground-cleaned data. We have further considered a wide range given by A ∈ [0.03, 0.15] so as to sufficiently accommodate a large number of possible values of amplitude. This further helps minimize the epistemic uncertainty (Hüllermeier & Waegeman 2021) of the ANN.
The direction of the dipole for modulation (λ) is chosen in the following manner. For three different seed values, we generate three random numbers from a normal distribution with a mean of zero and standard deviation equal to one. The numbers are chosen such that the sum of their squares are non-zero. They are then normalised by the square root of the sum of their squares. Thus the three resulting numbers form components of the randomly chosen unit vectorλ, which gives the preferred direction of dipolar modulation for a particular realisation. The rationale behind choosing the components as random normal numbers is to take into account all possible directions on the sphere (Muller 1959), since other choices such as those of random uniform numbers restrict the randomness in directionality of the modulation.
The NLV maps at n l = 16 are constructed using discs of radius = 6 • . Thus inside each disc, the approximate number of pixels at n h = 128 over which the local variances are computed is approximately 540, according to equation (10), for which it can be shown that there are about 1459237 pixels in overlapping regions of discs.
The manually adopted choice of r h = 6 • is an optimal one due to the following reasons. Local variance estimates over smaller disks will have relatively higher contributions from Monte Carlo noise due to lower number of pixels contained by them, which must be avoided. Besides, very small radii such as r h 4 • are subject to non-negligible contributions from the Doppler dipole (Adhikari 2014). However, choosing large radii weakens the assumption of a slow variation ofλ ·n inside the disc (Section 2), and can cause results to concur with statistically isotropic maps (Akrami et al. 2014) for very large r h . Hence, we choose r h = 6 • which is sufficiently small, and reasonably free from contributions of the Doppler dipole and Monte Carlo noise.
In order to construct the NLV maps, we require a mean variance map containing σ 2 0d e values. This mean variance map is obtained using an ensemble of 1 × 10 5 local variance maps at n l = 16, which were extracted from the same number of corresponding SI obeying realisations of maps at n h = 128. Of the total mixed set of 5 × 10 4 NLV maps, 2 × 10 4 maps are used for training the ANNs, 10 4 are used for validation, and the remaining 2 × 10 4 are used for testing the trained ANN.
We consider two cases of sky coverage, i.e, full and partial sky. In both cases of sky coverage and for both simulated and observed foreground-cleaned CMB maps at n h = 128, the multipole range under consideration is [2,256]. This is because the monopole (which corresponds to the uniform temperature of the CMB) and the dipole (containing contributions from the Doppler shift due to solar motion) must be disregarded for cosmological inferences. For the observed foreground-cleaned full sky CMB map (m i ) at resolution n h in harmonic space, we set the spherical harmonic coefficients corresponding to the monopole and dipole to zero. With the new set of spherical harmonic coefficients, we generate the corresponding full sky CMB map (m f ) which is devoid of the monopole and dipole.
Further, since the simulated maps are devoid of any beam smoothing effects, therefore any existing beam smoothing in all the observed foreground-cleaned CMB maps is removed as well. In order to deconvolve beam effects from the observed foreground-cleaned CMB map, we consider the map m i in harmonic space. We divide all the spherical harmonic coefficients (a ( m)i 's) of the map by the beam window function (B ( )i ) of the respective observational instrument. With these new spherical harmonic coefficients (a ( m)f ) we construct the full sky map m f which is devoid of beam smoothing effects. Hence, Here, the initial beam window function B ( )i corresponds to a full-width at half-maximum (FWHM) = 5 for Planck maps, or = 60 for WMAP maps. The final beam window function B ( )f corresponds to an FWHM = 0 for both Planck and WMAP maps. The initial pixel window function P ( )i corresponds to an n side = 2048 for Planck maps or n side = 512 for WMAP maps. The final pixel window function P ( )f corresponds to an n side = 128 for both Planck and WMAP maps. Firstly we consider the case of full sky CMB maps, for which we directly use map m f to compute the NLVs inside the discs of radius r h and assign them to pixels of a map at resolution n l . In this case, an ANN is modelled to be trained with 12 × 16 2 = 3072 input features. The observed foreground-cleaned CMB maps tested in this case are all the available inpainted ones from Planck 2013 (Planck Collaboration et al. 2014) and 2018 (Planck Collaboration et al. 2020a) data releases. The application of the ANN to NLV maps obtained from these inpainted maps makes it unlikely to attribute the findings to any minor residual foreground contamination from the galactic region.
Secondly, we consider partial sky CMB maps, obtained after masking with the U 73 mask from Planck 2013 release (Planck Collaboration et al. 2014), which sufficiently excludes the galactic region in addition to extragalactic point sources. The use of a mask helps minimize contributions from any minor foreground resid-uals. This U 73 mask at n h = 128 is obtained after downgrading the original binary mask and setting all pixels with values ≥ 0.8 to one and the rest to zero. Thus for the case of partial sky coverage, we take a map m f and apply the U 73 mask, for example in Figure  3. We calculate the NLVs only for discs which are not masked beyond 90% of their area, following the strategy of Akrami et al. (2014). We then assign these NLVs to the map at resolution n l . Thus for partial sky coverage, the ANN architecture is designed to work with 2652 input features, which are the remaining pixels in the n l = 16 map, after obeying this criterion for disc rejection. We test the ANN on NLV maps obtained from the observed foreground-cleaned partial sky CMB maps from all releases of Planck (2013-2021) (Planck Collaboration et al. 2014Collaboration et al. , 2016Collaboration et al. , 2020b, and WMAP (1yr-9yr) (Bennett et al. 2003b;Hinshaw et al. 2007Hinshaw et al. , 2009Gold et al. 2011;Bennett et al. 2013).

TRAINING THE NEURAL NETWORKS
We have two ANNs, one for each of the full and partial sky cases. The input features used for training the ANNs are values of the NLV map arrays. Both the ANNs are similar in structure, save the difference in the number of input features and hyper-parameters associated with regularization methods. The ANNs are trained on realisations with the same seed values, but with different sky coverage.  Figure 4. A schematic diagram of the common ANN architecture for detecting the dipolar modulation signal. The differences between full and partial sky cases are mentioned with 'or'. All layers after the input layer are densely connected to their preceding layers. The dropout implemented after the first hidden layer has a rate of 0.01 for full sky or 0.04 for partial sky. Additionally, the kernel regularizers used are L2 in the first hidden layer and both L1 and L2 in the second hidden layer with strengths of 0.007 each for the full sky case and 0.005 each for the partial sky case. Sinceλ is a unit vector, the degrees of freedom in ascertaining the components ofλ are only two. Another degree of freedom in constructing the dipole modulated map is that of A. Thus, there are three degrees of freedom in total. Therefore we utilise the three components of the vector A ×λ as the associated training labels. For SI obeying CMB maps, these three labels are always zero, whereas for SI violating ones, they are non-zero. We choose the training labels in this manner since any other choices of training labels such as those of (A, θ, φ) or (A, λ 1 , λ 2 ) and the like cannot be unambiguously defined for SI obeying maps.
For the training set from simulated data, we compute the mean and standard deviation of the input features (denoted by µ in , σ in ) as well as those of the training labels (denoted by µ out , σ out ). We re-scale both the inputs and labels by subtracting their respective means from the entire set and dividing the resultant by their respective standard deviations. For similarly re-scaling the validation and test sets for simulated data, we use the previously computed means (µ in , µ out ) and standard deviations (σ in , σ out ) from the training set. Further this scaling is appropriately taken care of for the test set from observed foreground-cleaned CMB data.
A schematic flowchart to describe the ANN architecture common to both full and partial sky cases is shown in Figure 4. The differences between the two cases in the input layer and regularization parameters such as the rates of dropout and strengths of penalty for kernel regularizers are mentioned accordingly. We describe the two cases in further detail as follows. In the full sky case, we consider 3072 features in the input layer, which is followed by two hidden layers having 64 and 34 nodes each. The first hidden layer has an L2 kernel regularizer with strength of penalty = 0.007. There is a dropout at a rate of 0.01 after this layer. In the second hidden layer, we have both L1 and L2 kernel regularizers each with strength of penalty values = 0.007. The output layer has three nodes corresponding to the three components of A ×λ. The activation function used in each of the hidden layers is LeakyReLU whereas that in the output layer is linear.

Full Sky ANN
For training the ANN, we use mse as the loss function, while we use Adam for optimization purposes with a learning rate of 10 −4 . We see from Figure 5 that the training is accomplished by the end of approximately 80 epochs, when training with a batch size of 64. The time taken for a complete run of 100 epochs is 263.18 seconds, or ∼ 4.4 minutes on an ordinary CPU.

Partial Sky ANN
Similar to the full sky case, we have two hidden layers of 64 and 34 nodes each. The input layer however takes 2652 features, which are the remnant pixels on the partial sky NLV map. In this case, the dropout rate used after the first hidden layer is 0.04. The kernel regularizers have strengths of penalty of 0.005 for L2 at the first hidden layer and 0.005 for both L1 and L2 at the second hidden layer. The output layer as usual has three nodes. The activation function is LeakyReLU for both hidden layers, while that for the output layer is linear. Again, the mse is used as the loss function, and the optimizer is Adam with a learning rate of 10 −4 . The loss curves for training and validation sets are stabilised by 80 epochs, as seen in Figure 6, when the batch size is 64. For a complete training run of 100 epochs, the time taken is 254.85 seconds or ∼ 4.25 seconds on an ordinary CPU.

ANALYSIS AND RESULTS
First we present the analysis of test samples with the trained ANNs for the two cases of sky coverage. We specify the goodness of fit for the same with the help of R 2 scores for each of the three outputs of Aλ 1 , Aλ 2 , Aλ 3 . Mathematically, R 2 score can be expressed as where the summations are over all the samples of the set for which the outputs are predicted. It ranges between 0 and 1, where R 2 = 1 indicates a perfect fit or the notion that all variations in the predicted data can be explained by the intrinsic dispersion of the actual values. Of the total 2 × 10 4 test samples, half are SI obeying (unmodulated) and the rest are SI violating (modulated) maps. So we can separate them and calculate their respective modulation amplitudes from the predicted outputs as A = (Aλ 1 ) 2 + (Aλ 2 ) 2 + (Aλ 3 ) 2 . We expect the spread of values in amplitude for the unmodulated maps to be very close to zero, and that of modulated maps to closely follow the range [0.03, 0.15] in which we have randomly chosen the amplitude.
In the following subsections, we present the respective probability densities of amplitude for unmodulated and modulated maps. Despite our expectation that A from unmodulated maps must be equal to zero, there is a very small non-zero spread in the values of A. This is attributable to the fact that the goodness of fit can not be achieved to be exactly equal to one, and is due to the underlying aleatoric uncertainty (Weytjens & De Weerdt 2021) of the realisations in question. Hence when we compute A for the observed foreground-cleaned CMB maps, we can say within the confidence defined by this uncertainty, as to whether the predicted value of A from an observed foreground-cleaned CMB map corresponds to a signal of modulation. The significance of detection of the signal is thus quantified with p-values of predicted A for observed foreground-cleaned maps versus the null hypothesis prediction for test samples of unmodulated maps.

Full Sky
We test the trained full sky ANN on the 2 × 10 4 test samples and note the predictions of the ANN for the three components of Aλ. These results for the test set are shown along with their goodness of fit scores in Figure 7. The R 2 scores are > 0.97 indicating that the predicted components of Aλ fit the expected true values quite well. In addition we present the scatter graphs of predicted versus true values of the amplitude (A) for the mixed set of unmodulated and modulated maps in Figure 8. We show amplitudes for both the unmodulated and modulated maps, and the former can be seen on the left corner of the graph with some dispersion. The scatter graphs of the direction given by the three components ofλ are shown in Figure 9, only for modulated maps, since they are undefined for unmodulated maps.
The observed foreground-cleaned CMB maps considered in the full sky case are all the available inpainted maps, namely, NILC from Planck 2013 release, and Commander (COMM), NILC, SMICA and SEVEM from Planck 2018 release. We have evaluated the directions for each observed foreground-cleaned CMB map in the following manner. Firstly we normalise the predicted Aλ vector by its respective A to getλ. We then compute θ = cos −1 (λ 3 ) and the galactic coordinate b = 90 • − θ. A general procedure to obtain Galactic l can be outlined as follows: 1. We must find φ = tan −1 |λ2| |λ1| .
The consistency of the preferred directions given byλ or (l, b) for these five inpainted observed foregroundcleaned CMB maps can be illustrated by plotting the same in a Mollweide map, as shown in Figure 10. Additionally, we present the probability densities of the amplitudes computed from the predicted vector components for the 10 4 unmodulated and modulated maps of the test set in Figure 11, which show that the spread in predicted values closely obeys expectations. However, the ANN does not predict a perfect zero for the amplitude in the case of all the 10 4 unmodulated maps in the test set. Hence, we must gauge the possibility that the ANN predicts a non-zero value for these modulation amplitudes, even if there was no modulation in the observed foreground-cleaned CMB. This is given by a p-value which is computed as the percentage of predicted amplitudes from 10 4 unmodulated maps of the test set that lie above the predicted amplitude for an observed foreground-cleaned CMB map.
In Table 1, we present the results obtained for these inpainted maps, which lists the direct outputs (Aλ components) from the ANN, the derived values of the amplitude and direction which are consistent across maps, Figure 7. Predicted Aλ components for the test set versus their true values, obtained using the full sky ANN. The predicted values present a good fit to their actual counterparts, as given by R 2 > 0.97. Table 1. Predictions for observed foreground-cleaned CMB maps using the full sky ANN. Both amplitudes and directions for all the maps are similarly valued, and the detection of the dipolar modulation signal is statistically significant.

Partial Sky
For the case of partial sky coverage, we apply the corresponding trained ANN on the test set and then on available foreground cleaned CMB maps from all the releases of Planck and WMAP satellites.
In Figure 12, we present the scatter graph of predicted values of the components of the Aλ vector with respect to the true values, along with their respective R 2 scores. Despite the fact that the goodness of fit scores are > 0.96, we notice that they are lower than those in the full sky case. This is due to the increased variations that cannot be explained by a similar dispersion in the true values, obviously caused by the use of a mask in this case. Further, since the U 73 mask primarily conceals galactic sources, it is somewhat symmetric about the z-axis, and hence the R 2 score for Aλ 3 is not as compromised as those of Aλ 1 and Aλ 2 . This is in con-trast with the full sky case, for which the goodness of fit values of all the three components were similar (about 0.975).
Further we show the scatter graphs of the amplitude (A) for the test set in Figure 13. We present amplitudes for both the unmodulated and modulated maps, and the former shows some dispersion for the unmodulated case on the left corner of the figure. Similar to the case of full sky coverage, the three components ofλ are undefined for unmodulated maps. Hence, the scatter graphs for the direction are shown in Figure 14 only for modulated maps.
When the partial sky ANN is applied to NLV maps of the partial sky observed foreground-cleaned CMB, we see a very consistent amplitude and direction of dipolar modulation across all the maps from Planck and WMAP releases ranging from Planck 2013-2021 data and WMAP 1yr-9yr data. The consistency of the preferred directions given byλ or (l, b) for the 18 observed foreground-cleaned CMB maps can be inferred from a Table 2. Predictions for observed foreground-cleaned CMB maps using the partial sky ANN. Overall, both amplitudes and directions across all maps are consistent. Additionally, the detection of the signal of dipolar modulation in all the maps is statistically significant.

Map
Aλ1 plot of the same in a Mollweide map, as presented in Figure 15. Following our approach in the full sky case, we estimate the amplitudes from predicted Aλ's and present their probability densities for the 10 4 unmodulated and modulated maps from the test set in Figure 16. On finding the minimum and maximum values of predicted amplitudes for these two types of maps, we see that there is a very subtle increase in their spreads (of the orders of 10 −4 for unmodulated maps, and 10 −3 for modulated maps), compared to the full sky case. Nonetheless, these histograms are qualitatively similar to those obtained for the case of full sky coverage, and the predicted amplitudes for unmodulated maps are again not exactly zero. Thus we can quantify the significance of detection for the dipolar modulation signal in the observed foreground-cleaned partial sky CMB maps. We represent this significance with the p-value which is computed as the percentage of 10 4 unmodulated maps for which the predicted amplitudes are larger than those from the observed foreground-cleaned CMB maps.
We finally present all the findings from the partial sky analysis in Table 2, which shows the partial sky ANN outputs for the three Aλ components, the computed val- ues of the amplitudes and directions which are consistent across maps, and the p-values which correspond to a significant detection of the signal of dipolar modulation for all the maps.

SUMMARY AND CONCLUSION
The CMB temperature fluctuations are expected to obey statistical isotropy (SI) according to the Standard (ΛCDM ) model of Cosmology. This entails that there must be no preference of a direction in the CMB. However, the hemispherical power asymmetry as seen by many authors in existing literature indicates a departure from the Standard Model. This departure is significant given the reported magnitudes of the p-values, and the sheer volume of such findings obtained with independent methods. An underlying dipolar modulation is suggested as a possible cause of this power asymmetry, the strength of which is known to vary with the scale at which it is estimated.
For the first time, we use deep learning with Artificial Neural Networks (ANNs) to probe the existence of a possible dipolar modulation signal. This provides Figure 11. Probability densities of predicted amplitudes p(A) for unmodulated and modulated maps from the test set using the full sky ANN. The histograms closely follow expected ranges of values for unmodulated and modulated maps. a novel approach towards validating or rejecting evidence for such a signal in previous literature. Employing ANNs for studying features in the CMB may introduce a paradigm shift in interpretation of signals of SI violation, relative to traditional methods of regression or fitting associated with the frequentist approach. This is because ANN architectures can 'learn' how to detect a signal when presented with a set of samples for training. Upon adequate training the ANN develops the artificial intelligence to act on observed foreground-cleaned CMB data and consequently estimate a possible signal in such data.
We consider normalised local variance (NLV) maps which are very useful as input features to train ANNs. We build two ANN architectures namely, for the full and partial sky cases. To obtain partial sky maps, we use the Planck U 73 mask released in 2013. The key findings of this work are as follows.
1. With full sky coverage, (a) generally consistent values of amplitude and direction of the modulation are seen across all available observed foreground-cleaned full sky inpainted maps from all releases of Planck (COMM 2018, NILC 2013, NILC 2018, SMICA 2018, SEVEM 2018. (b) The detected signal is significant (at 97.21%− 99.38% C.L.) for all these 5 maps.
2. With partial sky coverage, (a) we find reasonably consistent values of amplitude and direction of the modulation across all observed foreground-cleaned partial sky maps from all releases of Planck (2013-2021) and WMAP (1yr-9yr).
(b) The detected signal is significant at 99.9% − 99.98% C.L. for all the 13 Planck maps, and at 98.34% − 100.0% C.L. for all the 5 WMAP maps. 3. These results are therefore robust against sky coverage, observational instruments, periods of observation, and foreground cleaning and inpainting methods.
In the following paragraphs, we discuss two criticisms that have been posited against the manner in which any possibly anisotropic signals in the CMB are probed, and address how our method is able to mitigate those effects further.
Firstly the look-elsewhere effect occurs when a signal is detected purely by chance and is attributable to the large sample size for which it becomes more favourable to see some random fluctuations that are statistically significant. It is additionally the result of a constant approach of 'looking elsewhere' to find a significant signal, while disregarding any previously insignificant findings. In this work, we are able to weaken the look-elsewhere effect (a) with the robustness of the detection, and (b) by adding to existing literature an independent method like the one in this work, which also detects a significant signal, thus strengthening the repeatability of the initial findings.
Secondly, the concept of a posteriori statistical inference is based on devising estimators to shift our focus to visually anomalous features. However, (a) since the method of deep learning to distinguish unmodulated maps from modulated ones (quantified with the magnitude of the modulation) is distinct from the process of devising an estimator which focuses on a search for such a signal after looking at the data, and (b) as we use a wide range of amplitudes and directions to train the ANN so that it is not focused at detection of amplitude and directions specific to the observed foregroundcleaned data and can be used to probe unseen data, we are able to alleviate the criticism of an a posteriori choice of statistics.
In conclusion, we can say that our findings agree quite closely with those in existing literature. Further, assuming that no unknown residual systematics are commonly present in all the observed foreground-cleaned CMB maps considered here, this entails that either our universe is a rare realisation of the Standard Model, or that we live in a statistically anisotropic universe which could be a rather common realisation of a different model.  Figure 15. The preferred directions causing a dipolar modulation in the 18 observed foreground-cleaned CMB maps we have investigated with partial sky coverage. Preferred dipole direction in observed foreground-cleaned partial sky CMB maps from Planck releases of 2013, 2015, 2018, and 2021 are indicated with red ×'s, green ×'s, green 's and a red •, respectively. Those for each of WMAP 1yr, 3yr, 5yr maps are shown with a green, blue and yellow •, and directions for WMAP 7yr and 9yr are shown with a red and a blue . The figure shows that the directions are mostly consistent over all these variously procured maps. Figure 16. Probability densities of predicted amplitudes p(A) for unmodulated and modulated maps from the test set for the partial sky ANN. The range of predicted amplitudes appropriately follow zero and non-zero values for unmodulated and modulated maps.