Predicting interloper fraction with graph neural networks

Upcoming emission-line spectroscopic surveys, such as Euclid and the Roman Space Telescope, will be affected by systematic effects due to the presence of interlopers: galaxies whose redshift and distance from us are miscalculated due to line confusion in their emission spectra. Particularly pernicious are interlopers involving the confusion between two lines with close emitted wavelengths, like Hβ emitters confused as [O iii], since those are strongly spatially correlated with the target galaxies. They introduce a particular pattern in the 3D distribution of the observed galaxy catalog that can shift the position of the BAO peak in the galaxy correlation function and bias any cosmological analysis performed with that sample. Here we present a novel method to predict the fraction of interlopers in a galaxy catalog, using Graph Neural Networks (GNNs) to learn the posterior distribution of the interloper fraction while marginalizing over cosmology and galaxy bias. The method is developed using simulations with halos acting as a proxy for galaxies. The GNN can infer the mean and standard deviation of the posterior distribution of interloper fraction using small-scale information that is usually not considered in cosmological analyses. The injection of large-scale information into the graph as a global attribute improves the performance of the GNN when marginalizing over cosmology.


Introduction
Upcoming galaxy redshift surveys like Euclid1 and Roman2 will observe emission-line galaxies up to high-redshift.They will use slitless spectroscopy to obtain galaxy spectra and identify one or more emission lines to measure the redshift of each galaxy.This technology will allow us to map the 3D distribution of galaxies with unprecedented detail, but the noisy spectra will require lots of attention to avoid systematics.One major issue concerns the presence of interlopers: objects for which the redshift has been wrongly estimated and, therefore, the distance has been miscalculated.This phenomenon can happen when lines in the galaxy spectrum are misidentified, which is more likely when spectra are noisy and when the redshift is estimated using a single emission-line.
The main target of the Euclid spectroscopic survey is Hα emission-line galaxies at redshifts z = 0.9 − 1.8, whereas the High Latitude Spectroscopic Survey in Roman will observe Hα emitters at redshifts z ∈ [1,2], and [O iii] emitters at z ∈ [2,3].In Euclid, only the Hα line will be used to calculate the galaxy redshifts, making the procedure prone to line confusion, which happens when the Hα line is not visible and another strong line can be confused with the target.In Roman, redshifts are expected to be measured using at least two lines in each spectrum, a requirement that will discard many spectra where only one line is above the detection limit and will reduce the number of objects in the final catalog.The use of a single-line redshift identification would allow a much higher number density of galaxies and probe the large-scale structure in greater detail; however, such a procedure is more prone to line confusion and the presence of interlopers.
In this paper, we consider the possibility of creating such a large catalog with the Roman Space Telescope, using single-line identification at high redshifts, where [O iii] is the main target.Many lines can be confused as the target, and they can be classified into two main groups depending on the vicinity of the emitted wavelength of the detected line to that of the [O iii] line.If they are far from each other in the spectrum, the interloper's inferred location is very far away from its true position, which is likely outside the redshift range of the main targets.As a consequence, this type of interlopers will only be very weakly correlated with the target population.On the other hand, if the two emitted wavelengths are close to each other in the spectrum, then the difference between the inferred and true distance to the interloper can be small, and the interloper population can be correlated to the target sample.An example of this second case is an Hβ line confused as an [O iii] line, resulting in the interloper distance being miscalculated by about 100 h −1 Mpc.
Both types of interlopers will modify the clustering properties of the full (target+interloper) sample and thus its two-point correlation function.Interlopers that are uncorrelated with the target sample have been studied in many papers (e.g.[1][2][3][4]) and their effect on the correlation function can be easily modeled.On the other hand, interlopers that correlate with the target sample are more pernicious and more difficult to model.[5] studied the impact of [O iii]-Hβ interlopers on the correlation function, and in particular on the baryon acoustic oscillation (BAO) peak in it.They pointed out that the presence of this type of interlopers shifts the position of the BAO peak, introducing a systematic error in the BAO analysis.[6] proposed a BAO fitting function that takes into account the presence of [O iii]-Hβ interlopers in a catalog (hereafter called BAO+f i fit), enabling an unbiased cosmological BAO analysis and a prediction for the fraction of objects in the catalog that are interlopers.Even if very successful, that model assumes that the target and interloper populations have the same galaxy bias, which is not guaranteed to be true for [O iii] emitters and the galaxy population where Hβ is the only detectable line; thus further improvements will be needed to accurately describe the impact of these interlopers in a BAO analysis.While these previous studies focused on the BAO analysis, we should remember that [O iii]-Hβ interlopers will affect other statistics of the large-scale structure beyond the two-point correlation function.
In this paper, we develop a new method to predict the fraction of interlopers in a given catalog that are Hβ emitters and confused as [O iii].Our method does not rely on the measurement and modeling of a summary statistic of the galaxy distribution, such as the two-point correlation function.Instead, it considers the full 3D distribution of galaxies by describing their positions with a mathematical graph and uses graph neural networks (GNNs) [7] to extract the information about the interloper fraction.GNNs are designed to deal with sparse and irregular data, and have been applied in many areas of astrophysics (e.g.[8][9][10][11][12][13]).Since graphs encode the 3D spatial information beyond the two-point function, they allow us to use additional information to the two-point function to predict the fraction of interlopers in a catalog, and especially the information on small scales.We will show that thanks to these features GNNs can efficiently detect a subset of objects whose spatial clustering properties are different from the main sample.Moreover, while we present a GNN application to detect [O iii]-Hβ interlopers that will be a specific issue in Roman, GNNs can be used to recognize also other types of interlopers, such as those that are not correlated to the main sample and that will contaminate the spectroscopic catalog in Euclid.More generally, GNNs are promising tools to detect any subsets of objects exhibiting a spatial pattern different from a main sample.The paper is organized as follows: in Section 2 we discuss the spatial distortion introduced by [O iii]-Hβ interlopers and in Section 3 we describe the data set used, how graphs are built, and the architecture and training procedure of our models.We present our results in Section 4 and discuss the GNN approach used in this work in Section 5, before presenting the conclusions in Section 6.

Interlopers
An interloper is a galaxy located at the true redshift z but considered to be at the wrong redshift z ′ .This mistake can happen when a line in the emission spectrum-where it appears at the observed wavelength λ o -is confused as another line having emitted wavelength λ false e different from the true emitted wavelength λ true e .Subsequently, the redshift is converted into distance assuming a fiducial cosmology, thus a redshift misidentification leads us to infer a wrong distance for the interloper that is incorrect by for a ΛCDM model, where Ω Λ and Ω m are the energy density of cosmological constant and matter in the assumed fiducial cosmology, respectively.This displacement does not carry any cosmological information since it only depends on the fiducial cosmology, and it is equal to about 97 h −1 Mpc in a Planck-like [14] cosmology at z = 1, the redshift that we consider in this paper 3 .The offset is small enough that the true positions of the interlopers are in the same volume of the Universe covered by the target sample and the two populations are strongly correlated.When computing the two-point correlation function of the full sample, the correlation between the target galaxies and the interlopers appears as a peak centered at the scale of the displacement ∆d, which is close to the BAO peak position at redshifts z ∼ 1−3.The redshift and angular distances of the BAO peak depend on the true cosmology, and they are converted to comoving distances based on the fiducial cosmology.But in the BAO case, unlike the interloper offset, they still conveys cosmological information from redshift and angular distances.The match between the BAO position and the interloper offset therefore strictly relies on the choice of the fiducial cosmology, the emission lines considered for the interloper and target galaxies, and the redshift of the target sample.
Figure 1 shows measurements of the monopole (left) and quadrupole (right) of the twopoint correlation function in halo catalogs created with different fractions of interlopers f i at z = 1 (see Section 3.1 for details on how these catalogs are generated).As the interloper fraction increases, the BAO peak in the monopole appears to be more and more enhanced, broadened, and shifted towards smaller separations r.Moreover, the quadrupole exhibits a peak at around r = 97 h 1 Mpc, whose height increases with the interloper fraction, showing that interlopers create an anisotropic pattern in the 3D distribution of the halos.Since the BAO position is used as a standard ruler in cosmology, any source that shifts its position will introduce a systematic bias in the cosmological analysis: the larger the interloper fraction, the larger the shift of the BAO position, and the larger the systematic introduced in the analysis [5].[6] developed the BAO+f i fit, a BAO fitting function that takes into account and models the effect of interlopers, however the 3D distortion introduced by this type of interlopers is going to affect other statistics measured from a contaminated catalog.In this paper, we develop a method to predict the [O iii]-Hβ interloper fraction directly at the field level, and this methodology can be in principle used to detect the fraction of different types of interlopers.We do this by focusing on using the small-scale information in order to separate the measurement of interlopers from the cosmological measurements.Because of the shift in position, the measured small-scale clustering around the interlopers will be very different from that of the unshifted galaxies, but is difficult to model.It is this information that we hope the GNN will be able to extract.

Data sets
In this work, we use dark matter halos as a proxy for galaxies to test our method.Both galaxies and halos are biased tracers of the underlying matter field, with bias schemes that can be tuned by the choices made to build their population.To test the feasibility of our GNN method, we do not build galaxy catalogs with particular emission line distributions that have realistic fractions of interlopers in them.We instead generate a sufficiently large data set with enough variation in the objects' bias, the underlying cosmology, and the fraction of interlopers, so that the training set can be used to train a flexible enough model that can predict the fraction of interlopers effectively marginalizing over the other unknowns.This is sufficient to test the main idea of the paper, that Graph Neural Networks can help us predict the number of interlopers in a catalog.However, a sufficient level of detail in the creation of these catalogs will be necessary to deploy the GNN method to the analysis of real data.In particular, galaxy catalogs will need to be used since they exhibit the Fingers-of-God effect on small scales [15].Moreover, the effect of angular selection, completeness, and other observation systematics will need to be assessed and eventually introduced at the training stage level.
We use halo catalogs from the Quijote simulations [16], a suite of thousands of N-body simulations built to perform Fisher analysis and train deep-learning models.The products we use were obtained in the following way.The initial conditions of the simulations were generated at redshift z = 127 using second-order Lagrangian perturbation theory, and dark matter particles were evolved through the present time using the TreePM code Gadget-III; snapshots at redshifts z = 0, 0.5, 1, 2, 3 were saved and used to identify halos via the Friendsof-Friends (FoF) algorithm [17] with linking length parameter b = 0.2: All halos with more than 20 particles were saved into halo catalogs.All the simulation boxes have a cubic volume equal to 1 h −3 Gpc 3 and contain 512 3 cold dark matter particles.In our analysis we made use of two different sets of simulations: • SET1: 100 simulations at the so-called fiducial cosmology, a flat ΛCDM cosmology with matter density parameter Ω m = 0.3175, baryon density parameter Ω b = 0.049, dimensionless Hubble constant h = 0.6711, spectral index n s = 0.9624, linear matter fluctuation amplitude σ 8 = 0.834, and sum of neutrino masses M ν = 0 eV.With this cosmology, the minimum halo mass present in the catalogs is equal to 1.31×10 13 h −1 M ⊙ .
• SET2: 155 simulations selected among 2,000 simulations whose cosmological parameters are arranged in a Latin Hypercube (LH) configuration.The selected boxes cover a fraction of the full LH volume around its center and have parameters We consider the halo catalogs at redshift z = 1 for all of these simulations and use the ones in SET1 to build a model that predicts the fraction of interlopers at fixed cosmology and fixed/varying halo bias, while we employ the catalogs in SET2 to build a more flexible model that can marginalize over both cosmology and halo bias, having them both varying in the training step.We analyze both sets assuming a fiducial cosmology, corresponding to the cosmology used to run SET1.Because of this, the boxes in SET2 are stretched to account for the difference between the cosmology of that simulation and the assumed one.We consider the line-of-sight to be along the ẑ direction so that the 3D halo comoving positions (x x , x y , x z ) are mapped into the stretched coordinates (x ′ x , x ′ y , x ′ z ) via using the parameters where the subscript "sim" denotes the parameter in the cosmology used to run the simulation and absence of subscript denotes parameters in the assumed fiducial cosmology, and D A is the angular diameter distance.
Due to constraints from the memory of the GPUs, we cannot build a graph using the full halo catalog in a simulation box, thus we crop them into sub-boxes of size that varies depending on the task.Interlopers are introduced in each sub-box by randomly selecting a fraction f i of halos and shifting them along the line-of-sight ẑ by the distance predicted in Equation 2.2.The fraction of interlopers f i assigned to each sub-box is a number randomly sampled within the interval

Graphs
We describe the 3D distribution of the halos using graphs, where halos are represented by nodes that can be connected to each other via edges depending on their separation: only nodes closer than the linking radius r link are connected via edges.The value of the linking radius is a hyperparameter of our GNN model and will be chosen to maximize the performance of the model.
Nodes, edges, and the graph itself can be labeled with attributes carrying information that describes them.For the purpose of this study, each graph will encode knowledge about the halo spatial distribution only; any information about halo properties, such as their mass, concentration, or history, is not used.Moreover, it is important to create deep-learning models that respect the symmetries of the problem they are going to describe [18].The observed Universe exhibits a cylindrical symmetry 4 and we construct graphs that respect it by assigning the spatial information to the edges instead of the nodes: each edge attribute is composed by three scalars that describes the 3D vector connecting two nodes via where d ij = v i − v j is the vector connecting the two nodes i and j at the beginning and end of the edge e ij , and v ⊥ = v × ẑ is the component of v perpendicular to the line of sight.
Initially, an entire simulation box is cropped into smaller subvolumes, along the x and ŷ directions (not cut in the ẑ direction to better capture interloper effects), to have a representative sample of halos but small enough to fit into the GPU memory.We build a graph from the halos in each subvolume using the above procedure.
In some cases, we consider a global attribute to describe the full graph.Depending on the cases, it will be given by the five cosmological parameters or the monopole of the halo power spectrum measured in the sub-box.

GNN architecture
We use the graphs built from the halo catalogs to train a GNN that predicts the fraction of interlopers in the catalog.The GNN consists of multiple blocks that update the node, edge, and global attributes while maintaining the structure of the graph.These blocks are Metalayers [7] that use a message-passing scheme to effectively flow the information from node to node and update attributes.At the end, a global pooling and a multi-layer perceptron (MLP) are added to compress the graph information to the desired size of the output: the prediction for the fraction of interlopers in a graph.
Each GNN block takes as input a graph, updates its edge, node, and global attributes, and outputs the updated graph.Thus, even if the initial graphs do not have node attributes, the GNN blocks will assign and update them via message passing.Each block l is composed of the following elements: • The edge model that updates each input edge attributes e • The node model that updates the node attributes: where ϕ and ψ are MLPs, u is the global attribute (when specified), and is a permutation invariant aggregation operator applied to the edges e ij with j ∈ N i and N i being all the nodes that are neighbours connected via edges to the node n i .We consider three different types of aggregations-summation, maximum value, mean-and concatenate all of them in equation 3.5.In this way each node is updated depending on the values of the node attributes of its neighbours and the attribute of the edges connecting them, allowing for a flow of information.The number of GNN blocks is a hyperparameter chosen to maximize the performance of the model; it determines how many times the message passing is performed and the attributes are updated.
After the GNN blocks, the information contained in the graph G is compressed using one last aggregation operation performed on all nodes.The result is concatenated to the global attribute, if present in the graph, and passed to a final MLP τ to output the vector The MLP ϕ, ψ, and τ are built using two fully connected layers with ReLu activation function.
The number of GNN blocks and neurons per fully connected layer are hyperparameters to optimize.

Training procedure
Our aim is to build a GNN model that can infer the fraction of interlopers f i in a catalog G. Our model predicts the mean and standard deviation of the interloper posterior distribution without making any assumption about its shape.Thus, it outputs the vector being the mean and standard deviation of the marginalized posterior distribution p(f i |G), where θ 1 , ..., θ n are cosmological parameters or parameters describing the galaxy/halo bias scheme.Here σ(G) represents the aleatoric error alone since we do not include the small epistemic error (see Section A for a quantitative analysis).
In order to train such a model, we implement the loss function [19,20] whose minimization is equivalent to solving for the mean and standard deviation of the posterior distribution (see [19,21]).We divide the data into training, validation, and test sets with a 80/10/10 split ratio.In the training stage, we minimize the loss function using the Adam optimizer [22], with values for the learning rate and weight decay that we treat as hyperparameters to be optimized.We train the models for at least 1,000 epochs and choose the model with the best validation loss.The optimization of the hyperparameters (learning rate, weigh decay, number of GNN blocks, number of neurons in ϕ, ψ, and τ , and linking radius r link ) is performed using the Optuna package [23] with at least 100 trials, each of those consisting in the training of a model with a specific choice for the value of hyperparameters.
We select the GNN model with hyperparameters that give the best validation loss after training.Then, we determine the performance of the selected GNN model using the test set.

Accuracy metrics
We quantify the performance of our GNN models using different metrics, all applied to the test sets.We consider with < ... > indicating the mean among the test set, which quantifies the precision of the model-the lower the RM SE the more precise the model is.
• The coefficient of determination that measures the accuracy of the model.It is limited to be R 2 ≤ 1: the closer it is to 1 the more accurate the model is.A value close to 0 indicates that the model is not properly trained and it can only predict the mean of the training set, while a negative value for R 2 denotes that the model is performing even worse than that.
• An estimation for the bias b =< µ − f i > . (3.13) • An estimation for how well the standard deviation of the posterior distributions is determined by the GNN model, A χ 2 close to 1 indicates that the standard deviations are properly predicted.

Fixed cosmology and halo bias
We first consider the simplest case where all catalogs have the same cosmology, the fiducial cosmology of the Quijote simulations, and are built using SET1.Moreover, we maintain all halos in the catalogs, so that the halo bias is also shared among the whole data set.This will allow us to build a GNN model at fixed cosmology and fixed halo bias scheme, where the posterior distribution of the fraction of interlopers p(f i |G) in equation 3.9 is not marginalized over any other parameter.
In this case, we crop sub-boxes of size 150 × 150 × 1000 (h −1 Mpc) 3 along the x, ŷ, and ẑ directions, respectively, and with the line-of-sight assumed to be the ẑ axis.Depending on the sub-box, a different fraction of halos are selected to represent interlopers and are displaced by 97 h −1 Mpc along the ẑ direction, while taking into account the boundary conditions of the simulation box along the ẑ axis.As a result, these sub-boxes contain about 4,500 objects.We consider both the catalogs in real (no velocity added) and redshift space, and perform two different studies on these two types of data sets to determine the best values for the hyperparameters.
In real space, the trained model with the best validation loss has N block = 1, N hid = 42, l r = 1.1 × 10 −4 , w d = 4.9 × 10 −3 and r link = 11.5 h −1 Mpc; in redshift space the equivalent best model has N block = 1, N hid = 35, l r = 3.8 × 10 −4 , w d = 10 −2 and r link = 11.68 h −1 Mpc.The single GNN block and the low values for r link indicate that the GNN models are using small-scale clustering properties to determine the fraction of interlopers.As expected, there is lots of information on small scales, since the number of pairs and their spatial distribution change depending on the number of interlopers in the catalog and graph.
Figure 2 shows the inference performed on the test set using the two models: real space (left) and redshift space (right).The x-axis indicates the true value f i , while the y-axis denotes the prediction of the model.The black line shows the values y = x, indicating when the prediction matches exactly the true value.The blue points denote the predicted mean and the error bars denote the predicted standard deviation in each test catalog.By eyes, we can see that the error bars follow the black diagonal line, and no bias is present, as confirmed by the values of b ∼ 0 reported in the figure.Moreover, χ 2 = 1.6, 1.4 for the real and redshift space respectively, indicating that the standard deviation is slightly under-predicted by the two models, probably because we are not taking into account the epistemic error of the GNN (see A for more details).The models estimate the interloper fraction with a precision of ±0.015, consisting of an error equal to 15% on the mean value f i = 0.1 of the considered interloper fraction range.This precision is obtained using a volume equal to 0.0225 h −3 Gpc 3 , which is a very small fraction of the [O iii] survey in the Roman space telescope and a volume thousands of times smaller than the one considered in [6].We can compute the RM SE for the analysis in [6] using the results reported in their Figure 1 for the same displacement considered in this paper: ∆d = 97 h −1 Mpc.We obtained RM SE = 3.1 × 10 −3 , which rescaled to the volume considered in this section (0.0225 h −3 Mpc 3 ) is equal to 0.65.Therefore, it is 40 times larger than the RM SE we obtain with the GNNs; in other words, the GNN models are 40 times more precise in predicting the interloper fraction than the BAO+f i fit.We can also compare the bias in the two methods: the one from the BAO+f i model fit in [6] is equal to −2.1 × 10 −3 , which is an order of magnitude larger than the bias obtained with GNNs5 .However, we should note that the comparison is not truly fair since the GNN has been trained at fixed cosmology, whereas the BAO+f i fit includes the possibility for a variation in cosmology via the dilation parameters.

Fixed cosmology and varied halo bias
Using the 100 realizations at the fiducial cosmology in SET1, we can build catalogs with different halo biases.In this case, the posterior distribution p(f i |G) of which we aim to learn the mean and the standard deviation is marginalized over the halo bias scheme.We consider two different strategies to implement the bias scheme.The first method involves selecting the N halo most massive halos in a catalog, where N halo is a random number-we will refer to this method as varied-N halo .Small values for N halo correspond to the selection of only very massive halos and the creation of a halo catalog that has high bias.On the other hand, large values of N halo involve the selection of low-mass halos as well and yield a population that is less biased.The second method concerns (1) randomly selecting the minimum halo mass M h,min to include in the catalog and (2) randomly sub-sampling to a chosen number N halo all the halos that are more massive or as massive as the chosen limit.In this way, we can have catalogs with different halo biases but sharing the same number density of objects-we will refer to this method as fixed-N halo .
The implementation of the first method (varied-N halo ) is done by cropping sub-boxes of size 150 × 150 × 1000 (h −1 Mpc) 3 along the x, ŷ, and ẑ directions respectively, and randomly selecting N halo halos, with N halo ranging between 6,000 and 4,200.This corresponds to a variation of the minimum halo mass within the range 1.31 × 10 13 − 5 × 10 13 h −1 M ⊙ and a variation in halo bias equal to a factor of 2. Interlopers are implemented by selecting a different fraction of halos in each catalog that are displayed by 97 h −1 Mpc along the line-ofsight.This procedure is repeated twice for each sub-box to augment the number of data.The model with the best validation loss trained using this data setup has hyperparameters N block = 2, N hid = 66, l r = 9.4 × 10 −5 , w d = 6.46 × 10 −3 and r link = 17.96 h −1 Mpc, and its performance of the test set is shown in the left panel of Figure 3.
To implement the second method (fixed-N halo ) we obtain sub-boxes of size 250 × 250 × 1000 (h −1 Mpc) 3 along the x, ŷ, and ẑ directions respectively.We choose to work with larger volumes because they allow us to have in each sub-box a number of halos larger than N halo = 4, 500, once only those with mass M ≤ M h,min are selected.We sample M h,min within the range 1.31 × 10 13 ≤ M h,min ≤ 1.97 × 10 13 h −1 M ⊙ , which corresponds to a variation in halo bias at the level of 20%.Analogously to the varied-N halo method, interlopers are then introduced in each sub-box by shifting a selected number of objects along the line-of-sight, and the procedure is repeated twice in each sub-box to augment the number of data.In this case, the model with best loss has N block = 1, N hid = 41, l r = 4.9×10 −4 , w d = 1.7×10 −5 and r link = 25.05 h −1 Mpc.Results showing the model performance on the test set are displayed in the right panel of Figure 3 In both cases, the models are unbiased and are able to extract information about the fraction of interlopers from small scales.All metrics are very similar, with χ 2 being close to 2, meaning that the standard deviations are under-predicted.When we compare these results with those in Section 4.1, we can see that allowing the halo bias to vary across different halo catalogs yields worse precision: the fraction of interlopers is determined with a precision of ±0.025 over the full range considered.The value of R 2 is also smaller than in the case of fixed halo bias.A comparison with the BAO+f i fit method in [6] shows that GNN models that marginalize over halo bias are about 15 − 25 times more precise in predicting the interloper fraction than the BAO model, depending on the method used to introduce a variety of halo bias schemes in the training set (they consider different volumes), but they exhibit a similar level of bias in the prediction.In order to apply a GNN to real data, the model needs to predict the mean and standard deviation of the poster distribution marginalized over halo bias and cosmology.We train such a model by building halo catalogs from SET2, the LH of the Quijote suite.We implement the two ways described in the previous section to obtain differently biased populations starting from the same box or sub-box.The only difference with the previous section is that here each simulation box is at a different cosmology, thus each box has been stretched via Equations 3.1 and 3.2 and reshaped into a cube using boundary conditions before being cropped to the desired volume.The value for the interloper shift depends on the assumed fiducial cosmology via Equation 2.2, but it does not depend on the true cosmology of the simulations.Thus it does not carry any meaningful cosmological information and its value is the same in all the simulations of the LH.The performances of the best validation model on the test set are shown in Figure 4 for the varied-N halo method (top left) and fixed-N halo method (top right).The corresponding hyperparameters are N block = 2, N hid = 24, l r = 1.1 × 10 −4 , w d = 1.1 × 10 −4 , r link = 28.5 h −1 Mpc for the first method and N block = 2, N hid = 39, l r = 1.2 × 10 −4 , w d = 1.0 × 10 −6 , r link = 29.9h −1 Mpc for the second.The metrics on the performance of the two models are very similar and show a decrease in precision when allowing for the data set to have different cosmologies.Compared to previous cases at fixed cosmology, the accuracy went down with R 2 drastically decreasing to 0.56 − 0.58, and the precision became worse and equal to ±0.039.However, these results show that GNNs can achieve a precision 10 − 17 times better than the BAO+f i fit analysis in [6], with both methods allowing for the cosmology and halo bias to vary.

Varied cosmology and halo bias
The studies performed to understand the best values for the hyperparameters suggest the need to use larger linking radii, but that is not possible due to the limitation of the GPU memory on the system used.This suggests that there might be a degeneracy between cosmology, halo bias, and the fraction of interlopers on small scales.In order to add information coming from large scales, we consider adding a global attribute to the graphs.
First, we implement as global attribute the monopole of the halo power spectrum measured from each halo catalog up to a maximum wavelength equal to k max = 0.3 h Mpc −1 .We select different halo biases using the fixed-N halo method because its setup allows us to use larger volumes than with the varied-N halo scheme.However, even if the sub-boxes are larger, we do not include higher-order multipoles to the global attributes because their volume is still relatively small and higher-order multipoles are dominated by cosmic variance.The results using graphs with N halo = 4, 500 and a GNN model that used the monopole of the power spectrum as global attribute is shown in the bottom left panel of Figure 4.The best model has N block = 2, N hid = 27, l r = 4.0 × 10 −5 , w d = 1.3 × 10 −4 , r link = 26.4h −1 Mpc.Adding this global attribute improves all metrics used to quantify the performance of the model, however, these improvements are relatively minor.
Second, we consider as global attribute five numbers that are a guess for the values of the five cosmological parameters of the underlying cosmology.We guess them by randomly sampling the marginalized posterior distribution for these five parameters obtained by the Planck collaboration [14], after shifting the posterior so that it is centered at the right value of a given parameter for the simulation considered.We train GNN models with this global attribute and chose the one with best validation loss, having N block = 2, N hid = 27, l r = 1.3 × 10 −4 , w d = 6.7 × 10 −4 , r link = 22.3 h −1 Mpc.Training such a model is possible because we do know the cosmology of each graph, but it is also possible to use it with real survey data by assuming the underlying cosmology of our Universe is the one measured by Planck.The performance of the model is shown in the bottom right panel of Figure 4.The accuracy of the model is significantly improved by the use of this global attribute: the value of the coefficient of determination is R 2 = 0.75, being much closer to 1 than in all the other cases studied with the LH.The precision is also significantly improved and equal to ±0.029.The precision obtained from the BAO+f i fit in [6] yields RM SE = 0.39, once it has been re-scaled to the volume of the sub-boxes considered for this study (and all those implementing the fixed-N halo method).This means that the interloper fraction predicted by this GNN model is 13 times more precise than that of the BAO+f i fit, while allowing for both cosmology and halo bias to vary.
We consider the case where the posterior distributions of the cosmological parameters from which we draw the guessed cosmology are broader.We consider them to be a Gaussian distribution with a standard deviation equal to 2× and 3× the 1σ error reported from Planck (the precedent discussed results used 1× the 1σ error).We trained such models where graphs have these updated global attributes and chose the model with the best validation loss.As expected, having a less precise determination of the cosmology of a graph worsens the performance of the GNN model: the RM SE increases by 3% and 14% when considering 2× and 3× broader posteriors for the cosmological parameters, respectively, while the value of R 2 decreases by 3% and 9% in the two cases.Even if the posterior is broader by 3×, the GNN model performs better than when considering the monopole of the power spectrum as global attribute, and much better than without any global attribute.
5 Testing the framework of the GNN approach

Large scale information
When training a GNN model on the Latin Hypercube (SET2), we observed that the information coming from small scales does not seem to be sufficient to accurately and precisely predict the mean and standard deviation of the posterior distribution of the interloper fraction.In order to understand if including the information from large scales can ameliorate the GNN model, we generate a new data set where a graph is built using the full simulation box of size 1 h −3 Gpc 3 , and the halo bias scheme is implemented using the fixed-N halo method with N halo = 4, 500.The number of halos is the same as the one previously used in 1/16 of the full simulation volume, thus here the sample is less dense, allowing us to explore larger linking radii without running out of GPU memory.We perform a study using this setup applied to the LH and find that all the models with best validation loss have large linking radius, r link ∼ 100 h −1 Mpc.The best model has hyperparametrs N block = 1, N hid = 21, l r = 1.7 × 10 −4 , w d = 2.8 × 10 −7 , r link = 100.2h −1 Mpc.Its performance on the test set is shown in Figure 5, and the performance metrics are comparable to those obtained with graphs with global attribute being a guess for the cosmological parameters drawn from a Gaussian posterior with standard deviation equal to the 1σ reported by the Planck collaboration.This suggests that allowing the inclusion of large scales information is desirable when the training set exhibits variation of cosmology and halo bias scheme.Exploring large values of r link while maintaining the full halo catalog has not been possible in this work because of the limited GPU memory.A promising avenue to include large-scale information while assuring for a small enough number of edges in each graph is represented, for example, by hierarchical GNNs [24].

Spatial information
In this work, we built graphs so that the edge attribute consists of 3 scalars that respect the symmetry of the problem.We want to understand if one or two of them carry most of the information about the amount of interlopers in a catalog, allowing us to obtain a GNN model with similar performance but operating on graphs with smaller (from the memory point of view) edge attributes.To investigate this, we consider four different cases where the edge attributes of each graph are We construct these four different sets of graphs using sub-boxes of size 250×250×1, 000 (Mpc/h) 3 along the x, ŷ, and ẑ directions and obtained from the 100 simulations in SET1, with fixed-N halo halo bias scheme that allows us to have the same halo number density in all sub-boxes.Figure 6 shows the performance of the best models on the test sets.The left panel displays the results for e 1 , where both distances along and perpendicular to the line-of-sight are encoded in the initial edge attribute.In this case, the best model exhibits values for the coefficient of determination, R 2 = 0.83, and the root mean square error, RM SE = 0.029, that are very similar to those reported in the right panel of Figure 3, where the same setup is considered except for the inclusion of cos θ in the edge attribute.This indicates that most of the information about the interloper fraction can be extracted using only distances between connected nodes.The second, third, and fourth panels in Figure 6 show the results when using e 2 , e 3 , and e 4 , respectively.In these cases, the performances of the best models are degraded, as an increase in the RM SE values and a decrease in the coefficients of determination indicate.Moreover, while the best models for e 2 and e 4 have r link ∼ 20 − 30 h −1 Mpc and use 1 or 2 GNN layers, the one for e 3 has a very small linking radius, r link ∼ 7 h −1 Mpc, and uses only 1 GNN layer, indicating that is it taking information from extremely small scales only.
From this analysis, two key messages emerge: eliminating cos θ from the edge attributes might be a good choice to make the graphs less heavy from the memory point of view, and using only one of the three scalars as edge attribute degrades the GNN performance.

Conclusions
GNNs are designed to handle sparse and irregular data and are therefore a compelling approach for various analyses of galaxy or halo catalogs.In this paper, we showed that they can solve problems where one needs to identify a sub-sample of objects whose spatial properties differ from that of the core sample, such as interloper galaxies.We worked with cosmological simulations using halo positions to mimic the positions of galaxies in a real survey.In each halo catalog, we introduced [O iii]-Hβ-like interlopers by displacing randomly selected halos along the line-of-sight, and we created different levels of contamination by randomly choosing the fraction of interloper within the range f i ∈ [0.0 − 0.1].Using the contaminated catalogs, we built graphs to describe their 3D spatial distribution: halos are nodes that are connected via edges only if their distance is smaller or equal to the linking radius r link that is a hyperparameter.The spatial information is encoded in the edge attribute while respecting the symmetries of the problem.We considered the easiest case where GNNs perform likelihood-free inference of the interloper fraction in catalogs sharing the same underlying cosmology and the same halo bias scheme.In this case, using a small volume equal to 0.0225h −3 Gpc 3 , a GNN can accurately predict the mean of the posterior distribution of interloper fraction without any bias and with a precision equal to 0.015, corresponding to the 15% of the mean value considered: f i = 0.1.The analysis also showed that the GNN uses small-scale information, which is typically not used for cosmological constraints, to determine the interloper fraction.
Subsequently, we studied a more complicated task where the training set was built using a shared cosmology but different halo bias schemes.We implemented this scenario in two different ways: the varied-N halo and the fixed-N halo methods, where the number of halos is varied or fixed among the different catalogs.In both cases, the GNN gave unbiased results using small-scale information only, but the accuracy and precision were worse than in the scenario where all catalogs shared also the same halo bias.
Finally, we considered the task that is more similar to the application to real data, but it is also the most difficult: predicting the interloper fraction while marginalizing over cosmology and halo bias, i.e. the training is performed using halo catalogs with varying cosmology and halo bias.In this case, a GNN that uses only small-scale information has limited capacity in estimating the mean and standard deviation of the posterior distribution of the interloper fraction.This might be due to degeneracies between cosmology and halo bias appearing on small scales.We tested this hypothesis by introducing large-scale information in different ways.First, we considered adding the monopole of the power spectrum measured from each catalog as a global attribute of each graph.In this case, the gain in the performance of the GNN was very marginal, probably due to the large cosmic variance in the monopole.Second, we considered adding a guess for the 5 cosmological parameters as the global attribute of each graph.In this case, the performance of the GNN improved considerably, reaching similar accuracy and precision as the scenario where only the halo bias was varied, but the cosmology was kept fixed among the catalogs.Lastly, we considered adding large-scale information by allowing for a large linking radius, r link < 120 h −1 Mpc.In order to deal with such large linking radii while avoiding heavy graphs with many edges, we considered larger volumes but sub-sampled the number of halos.In this case, the performance of the GNN is considerably ameliorated compared to the case where only small-scale information was used, giving a precision and an accuracy that are better than in the scenario where small linking radii and a global attribute with a guess for the cosmology were used.
We also investigated the information contained in the three scalars that make up the edge attributes.We found that each singular scalar is not sufficient to learn the mean and standard deviation of the posterior distribution of interloper fraction as precisely as when using all of them.However, r ∥ and r ⊥ -the size of the edge perpendicular and parallel to the line-of-sight-seem to encode most of the information.Therefore, in an effort to make the graphs less heavy, it might be a good choice to use only these two scalars as edge attributes.
Overall, GNNs have proven to be valuable methods for inferring the interloper fraction in a catalog.They outperformed more standard methods, such as BAO+f i fitting functions, in measuring the interloper fraction with higher precision.On the other hand, when applied to data from galaxy surveys, additional work needs to be done when building the training set, which includes modeling the Finger-of-God effect, the survey geometry, and the systematics of the survey.
It is worth noticing that, although we have worked in comoving coordinates and assumed a fiducial cosmology to perform the analysis, GNNs can in principle work in observational space, i.e. the GNNs can be trained directly in (RA, DEC, z) rather than comoving coordinates.This can be a major advantage, since it does not require assuming a fiducial cosmology; on the other hand, its implementation is not straightforward since it requires a different procedure on building the graphs and in particular on deciding if two nodes are connected via an edge.Moreover, GNNs have shown to be powerful tools to constrain cosmology using information beyond the two-point function [11,12].A further generalization of the method proposed in this paper could consist of the simultaneous inference of interloper fraction and cosmological parameters.

A Epistemic Error
We evaluate the epistemic error-the error of the GNN model itself-by retraining 10 times a model with given hyperparameters; each time, the initial weights of the MLPs are drawn using a different random seed.We perform this test in the case where both cosmology and halo bias scheme are fixed across the training set, i.e. for the case discussed in Section 4.1 and in particular for the redshift-space scenario.Figure 7 displays the inference predicted by these 10 different retrained models on the test set.We can see that the results are very similar, almost indistinguishable, among different retrained models.We quantify the epistemic error

Figure 1 .
Figure 1.Monopole (left) and quadrupole (right) of the two-point correlation function of halos measured from 1,000 Quijote simulations at redshift z = 1 by [6].Different colors and line styles show catalogs with different interloper fractions f i .

. 1 )
where c is the speed of light and H(z) is the Hubble parameter in the fiducial cosmology assumed to analyze the galaxy survey, i.e., to convert angles and redshifts to distances.When considering Hβ emitters (λ Hβ e = λ true e = 486.1nm)that are confused to be [O iii] 7nm), the offset between the true and wrongly inferred interloper position is equal to ∆d ≃ 87.41 1 + z

Figure 2 .
Figure 2. Likelihood-free inference of the fraction of interlopers in halo catalogs sharing the same cosmology and same halo bias.Left panel shows the results in real space and right panel displays the results in redshift space.

Figure 3 .
Figure 3. Likelihood-free inference of the fraction of interlopers in redshift space halo catalogs sharing the same cosmology but having different halo bias.Left panel: Each catalog is obtained from a volume equal to 150 × 150 × 1, 000 (h −1 Mpc) 3 and only N halo more massive halos are considered, where N halo varies between 600 and 4,200.Right panel: the number of halos is fixed to N h = 4, 500, with halos being randomly selected from those with mass larger than M h,min in a volume equal to 250 × 250 × 1, 000 (h −1 Mpc) 3 .M h,min randomly varies in the interval 1.31 − 1.97 × 10 13 h −1 M ⊙ , to obtain different halo biases in each catalog, while maintaining the same number density.

Figure 4 .
Figure 4. Likelihood-free inference of the fraction of interlopers in redshift space halo catalogs with different cosmologies and halo bias.Top left panel: Each catalog is obtained from a volume equal to 150×150×1, 000 (h −1 Mpc) 3 and only N h more massive halos are considered, where N h varies between 600 and 4,200.Top right panel: the number of halos is fixed to N h = 4, 500, with halos being randomly selected from those with mass larger than M min,h in a volume equal to 250 × 250 × 1, 000 (h −1 Mpc) 3 .M min,h randomly varies in the interval [1.31 − 1.97] × 10 13 h −1 M ⊙ , in order to obtain different halo biases in each catalog, while maintaining the same number density.Each simulation box in the selected area of the Latin Hypercube is used three times.Bottom left panel: the number of halos is fixed, and the halo mass cut and volume cut are the same as in the top right panel, and each simulation box in the selected area of the LH is used three times.However, here each graph has a global attribute that is the monopole of the measured power spectrum up to k = 0.3 (Mpc) −1 h.Bottom right panel: the number of halos is fixed, and the halo mass cuts and volume cuts are the same as in the top right panel, but each simulation box in the LH is used three times.Moreover, the graphs have a global attribute with guess values for the cosmology randomly drawn from a normal distribution centered at the values of the true cosmology of the simulation and with variance equal to the 1σ uncertainties from Planck.

Figure 5 .
Figure5.Likelihood-free inference of the fraction of interlopers in redshift space halo catalogs built from the full boxes of the Latin Hypercube (SET2).The halo bias scheme has been implemented using the fixed-N halo method.

Figure 6 .
Figure 6.Likelihood-free inference of interloper fractions in the test sets built using the fiducial cosmology simulations (SET1) with fixed-N halo bias scheme.From left to right, each plot shows the cases where different edge attributes have been implemented in the graphs: e 1 ij = r ∥ , r ⊥ (blue), e 2 ij = r ∥ (orange), e 3 ij = [r ⊥ ] (green), e 4 ij = [cos θ] (red).