Graph Neural Networks for low-energy event classification & reconstruction in IceCube


Abstract: IceCube, a cubic-kilometer array of optical sensors built to detect atmospheric and astrophysical neutrinos between 1 GeV and 1 PeV, is deployed 1.45 km to 2.45 km below the surface of the ice sheet at the South Pole. The classification and reconstruction of events from the in-ice detectors play a central role in the analysis of IceCube data. Reconstructing and classifying events is a challenge due to the irregular detector geometry, the inhomogeneous scattering and absorption of light in the ice and, below 100 GeV, the relatively low number of signal photons produced per event. To address this challenge, it is possible to represent IceCube events as point cloud graphs and use a Graph Neural Network (GNN) as the classification and reconstruction method. The GNN is capable of distinguishing neutrino events from cosmic-ray backgrounds, classifying different neutrino event types, and reconstructing the deposited energy, direction, and interaction vertex. Based on simulation, we provide a comparison in the 1 GeV-100 GeV energy range to the current state-of-the-art maximum likelihood techniques used in IceCube analyses, including the effects of known systematic uncertainties. For neutrino event classification, the GNN increases the signal efficiency by 18% at a fixed background rate, compared to current IceCube methods. Alternatively, the GNN offers a reduction of the background (i.e. false positive) rate by over a factor of 8 at a fixed signal efficiency.

1 Introduction

The IceCube Detector
The IceCube Neutrino Observatory, located at the geographic South Pole, consists of a cubic kilometer of ice instrumented with 5,160 Digital Optical Modules (DOMs) [1] on 86 strings, placed at depths between 1450 m and 2450 m. The main detector array consists of 78 strings arranged in a roughly hexagonal array, with an average horizontal distance of 125 m between neighboring strings [2] (see Fig. 1). Each string supports 60 DOMs separated vertically by 17 m. Each DOM contains a 10" Photomultiplier Tube (PMT) facing downwards. Around the center string, in the deepest part of the array where the optical transparency of the ice is highest, modules on 8 additional strings have been installed. This volume, named "DeepCore" [3], has an increased spatial density of DOMs and features PMTs with an enhanced quantum efficiency (QE) compared to the main array. The IceCube Observatory is constructed to detect neutrino interactions spanning the energy range of a few GeV to several PeV, with the purpose of exploring both the cosmos and fundamental properties of neutrinos [3,4].
Charged particles resulting from neutrino interactions in the ice emit Cherenkov radiation detectable by IceCube. The number of detected photons is many orders of magnitude lower than the number emitted. For the neutrinos detectable by IceCube, the signal in a neutrino event may range from a few to order 100,000 detected photoelectrons. In the energy range this work focuses on, from a few GeV up to around 100 GeV, a typical interaction deposits about 49 GeV of energy and has just 17 detected photoelectrons [5]. This signal, however, is interspersed with noise stemming from, for example, radioactive decays in the DOMs. Event candidates are read out from the detector at a trigger rate of around 2.7 kHz [6]. These recorded events are dominated by triggers due to downgoing atmospheric muons, followed by random triggers caused by noise pulses, and then by atmospheric neutrinos at a rate of less than 10 mHz [3].

The Challenge of IceCube Event Classification and Reconstruction
Event reconstruction can be framed as a problem of parameter inference. Given a set of detector observations, the reconstruction aims to infer the physics properties of a (neutrino) interaction. A reconstruction algorithm defines and implements this process of taking input data and returning parameter estimators.

Parameters of Interest and Categorization of Events
The parameters of interest estimated in the reconstruction vary by application. The deposited energy, the direction of the neutrino candidate, and the interaction vertex are of central relevance to many IceCube physics analyses. Events are also categorized into multiple morphological classes, of which only two are relevant for this work because the vast majority of the sample is below 100 GeV. These two event classes serve as proxies for the underlying neutrino flavor and interaction type. "Track-like" events are proxies for charged-current (CC) νμ interactions from atmospheric and astrophysical neutrinos. These events contain a muon that can travel a long distance inside the detector while emitting Cherenkov radiation, producing a signature that looks like a track going through the ice. However, track-like signatures can also be produced by atmospheric muons from cosmic rays, and by the 17% of ντ CC interactions that produce tau leptons that decay into muons [7], but these are not considered true tracks for the purposes of this work. The other class comprises "cascade-like" events, containing everything that is not described by the track-like class. These events consist of electromagnetic and hadronic particle showers, such as those produced in neutral-current (NC) interactions of all flavors and in νe and ντ CC interactions. For brevity, we will refer to the classification of track (T) and cascade (C) morphologies as "T/C". Another classification task is the discrimination of neutrino events from atmospheric muon events, which will be referred to as "ν/μ".

[Fig. 1: Ice properties as a function of depth are shown on the left. The in-ice dust layer, a layer with dust impurities and therefore reduced optical qualities, is marked in gray. DeepCore is placed under the dust layer and shown in green. High Quantum Efficiency (HQE) DOMs are placed under the dust layer in the DeepCore region [3].]

Input Data

Data from the IceCube PMTs is digitized with waveform digitizers which record the majority of the PMT signals from neutrino interactions. These waveforms undergo an unfolding process [8] that attempts to extract estimated photon arrival times and the corresponding charges of individual photoelectrons, each of which is a so-called "pulse". The pulses form the detector response to an interaction and make up a time series. Each element in this sequence corresponds to the time t at which the PMT readout indicates a measured pulse with charge q. Most pulses in the energy regime considered in this work stem from single photons. These result in pulse charges close to one photoelectron, with variation in the charge due to the stochastic nature of the PMT amplification process. The detection of multiple, nearly instantaneous photons results in higher per-pulse charges. The PMTs themselves are located at fixed DOM positions (x, y, z) and have an empirically determined QE. The DOMs deployed in DeepCore have an efficiency that is roughly a factor of 1.35 times the QE of standard DOMs [3].
These six variables of the time sequence are summarized in Table 1 and form the input data for the reconstruction. The amount of Cherenkov radiation produced in an event depends primarily on the energy of the interacting particle, and less energetic neutrinos often lead to a reduced amount of Cherenkov light. Consequently, the number of pulses for an event, n_pulses, is highly dependent on the event itself, which makes low-energy neutrinos particularly difficult to identify and reconstruct. In this study, we focus on neutrinos with energy less than 1 TeV. The range of 1-30 GeV is of particular importance to the study of atmospheric neutrino oscillations [9,10]. For typical events in this energy range, n_pulses can range from below ten to several hundred after noise hit removal. Because some pulses in an event are due to noise, low-energy neutrino events often suffer from a relatively poor signal-to-noise ratio. In addition, at low energies the track events can be so short that they cannot be easily distinguished from cascade-like events. While the irregular geometry of IceCube helps distinguish events that would otherwise leave identical detector responses in a regular geometry configuration, it provides an additional layer of complexity to the development of reconstruction algorithms. Reconstruction is further complicated by ice properties that vary as a function of position (x, y, z) in the IceCube array [11]. The z-dependence of the ice absorption and effective scattering is shown on the left in Fig. 1.

Traditional Reconstruction Methods
Maximum likelihood estimation is a standard technique for parameter inference. The key ingredients are the likelihood itself and a maximization strategy. The exact reconstruction likelihood for IceCube events is, however, intractable and can only be approximated. In IceCube, event reconstructions range in complexity from simple analytical approximations of likelihoods [12] to highly sophisticated detector response functions based on photon ray tracing [5].
A major challenge in modeling the detector response is including the right amount of detail of an event's Cherenkov light emission profile, as well as the inhomogeneous optical properties of the ice, which are visualized on the left side of Fig. 1. Both affect the expected photon pattern in the PMTs. Analytic approximations, which are fast but inaccurate, are predominantly used as first guesses for both online and offline processing [12]. More sophisticated reconstruction techniques come with a higher computational cost and can only be applied to a subset of the events. For the neutrino energies considered in this work (i.e. low-energy events between 1 GeV and 1000 GeV), such likelihood-based reconstructions were used, for example, in the analyses of Refs. [9,10]. We will use the state-of-the-art IceCube low-energy reconstruction algorithm RetroReco [5] as a benchmark for our work. On average, RetroReco requires around 40 seconds to reconstruct one event on a single CPU core, which is one of the main shortcomings of the method. Another limitation is its use of simplifying approximations, such as the assumed uniformity in the azimuthal response of the IceCube DOMs, as well as the assumption that ice properties change as a function of depth only.
The IceCube Upgrade [13] will augment IceCube's detection capabilities by adding detector strings featuring new DOM types: the "mDOM" and the "DEgg". The mDOMs carry 24 3" PMTs providing almost homogeneous angular coverage, while the DEggs carry two 8" PMTs, one facing up and one down [14]. Adjusting RetroReco to work with the IceCube Upgrade while keeping memory usage and computing time at a reasonable level is difficult, due to the breaking of assumed symmetries and the increased number of module types.

Machine-Learning-Based Reconstruction Methods
An alternative to maximum likelihood estimation is regression with Neural Networks (NNs) [15]. Instead of approximating the likelihood and traversing its parameter space with an optimization algorithm, a regression algorithm returns parameter estimates directly. These regression models are trained by minimizing a defined loss function. Since an optimal regression algorithm may be highly nonlinear, artificial NNs can offer a viable solution. A regression algorithm needs to map the ragged input data of shape [n_pulses, 6], where the number of columns refers to the six node features (x, y, z, t, q, QE) described in Table 1, onto an output of shape [1, d] estimating event-level truth information, where d is the number of parameters we are interested in. The spatio-temporal nature of IceCube data, together with the varying sequence lengths, makes it difficult to map event reconstruction in IceCube to common machine learning techniques. Previously published IceCube machine learning methods embed events into pseudo-images and perform regression using a Convolutional Neural Network (CNN) [16]. This class of algorithms is effective at identifying local features in data, i.e., CNNs efficiently group similar data points in the compressed "latent space" representation of the input images. However, embedding IceCube events into images is a lossy operation that aggregates pulse information into per-DOM summary statistics, a process that severely degrades the information available in low-energy events. For this reason, the CNN reconstruction method is mainly used at energies higher than those considered in our work, although an adaptation for our energy range exists [17, 18]. An alternative NN architecture employing a Graph Neural Network (GNN) [19] approach is used in [20], but focuses solely on ν/μ classification.
We propose a general reconstruction method based on GNNs that can be applied to the entire energy range of IceCube, is compatible with the IceCube Upgrade, and can reconstruct events studied in this work at speeds fast enough to run in real-time online processing at the South Pole. We consider a set of reconstruction attributes: ν/μ classification, T/C classification, deposited energy E, zenith and azimuth angles (θ, φ), and the x, y, and z coordinates of the interaction vertex. We will benchmark our reconstruction method on a simulated low-energy IceCube sample used for atmospheric neutrino oscillation studies.

Graph Neural Networks Applied to IceCube Data
Generally, GNNs are NNs that work on graph representations of data. A graph consists of nodes interconnected by edges. Nodes are associated with data, and the edges specify the relationships among the nodes. By adopting graphs as the input data structure, the idea of convolution generalizes from the application of filters on the rigid structure of grids to abstract mathematical operators that utilize the interconnection of nodes in their computation. This allows such models to naturally incorporate an irregular geometry directly in the edge specification of the graph, without imposing artificial constraints. For this reason, GNNs are a natural choice of machine learning paradigm for problems with irregular geometry, such as IceCube event reconstruction.
We choose to represent each event by a single graph. Each observed pulse is represented by a node in the graph and contains the per-DOM information shown in Table 1. Each node in the graph is connected to its 8 nearest neighbors based on the Euclidean distance, and for this reason we consider the interconnectivity of the nodes in the graphs to be spatial [21].
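The graph construction described above can be sketched in plain NumPy (a minimal illustration; `knn_edges` is a hypothetical helper, not the GraphNeT API):

```python
import numpy as np

def knn_edges(positions: np.ndarray, k: int = 8) -> np.ndarray:
    """Connect each node (pulse) to its k nearest Euclidean neighbours.

    positions: [n_pulses, 3] array of pulse (x, y, z) coordinates.
    Returns a directed edge list of shape [n_edges, 2] as (source, target).
    """
    n = len(positions)
    k = min(k, n - 1)  # small events may have fewer than k other pulses
    # Pairwise squared Euclidean distances between all pulses.
    diff = positions[:, None, :] - positions[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist2, np.inf)  # exclude self-loops
    # Indices of the k nearest neighbours of every node.
    nbrs = np.argsort(dist2, axis=1)[:, :k]
    src = np.repeat(np.arange(n), k)
    return np.stack([src, nbrs.ravel()], axis=1)
```

For the real model the edges are recomputed dynamically in latent space at every convolution block, as described below; this sketch only shows the initial spatial connectivity.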

Preprocessing of IceCube Events
The variables listed in Table 1 come in different units and can span different ranges: for example, the relative positions of triggered DOMs can take values between tens of meters and > 1 km, and the times spanned by pulses may range from tens of nanoseconds to several microseconds. While neural networks can in theory process data in any range of real numbers, the complexity of the loss landscape that a model navigates during training is highly dependent on the relative scale of the input data. In addition, distributions not centered around zero can lead to a slower convergence time [22]. Each input variable x is therefore transformed into x̄ using

x̄ = (x − P50(x)) / (P75(x) − P25(x)),    (2.1)

where Pq(x) is the q-th percentile of the distribution of input feature x. This transformation brings the input variables into roughly similar orders of magnitude, gives them a median of zero, and makes them unitless.
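A transformation with these properties (zero median, unitless, comparable scales) can be sketched as follows; the specific percentiles used for the scale here are an assumption for illustration:

```python
import numpy as np

def percentile_scale(x: np.ndarray) -> np.ndarray:
    """Shift a feature by its median and scale by an interquartile-style
    percentile distance, yielding a unitless feature with median zero.

    Note: the exact percentiles used for the scale are an assumption; the
    essential properties (median zero, roughly unit scale) match the text.
    """
    p25, p50, p75 = np.percentile(x, [25, 50, 75])
    return (x - p50) / (p75 - p25)
```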

Model Architecture
Our GNN implementation, henceforth referred to as dynedge, is a general-purpose reconstruction and classification algorithm that is applied directly to the event pulses from the IceCube detector. The algorithm therefore does not rely on the pulses being aggregated into summary statistics, as in [16]. Instead, dynedge uses graph learning to extract features from the pulses, making the mapping of measurements to prediction a fully learnable task. Additionally, dynedge learns the optimal edges between nodes and can be applied to events with any number of pulses. dynedge is implemented using GraphNeT [23], a software framework for graph neural networks in neutrino telescopes built on PyTorch Geometric [24].
dynedge uses the convolutional operator EdgeConv [25], developed to act on point cloud graphs in computer vision segmentation analysis. For every node i with node features x_i, the operator convolves over the local neighborhood N(i) of i as

x̃_i = Σ_{j ∈ N(i)} MLP(x_i, x_j − x_i),    (2.2)

where x̃_i denotes the convolved node features of i, and the Multilayer Perceptron (MLP) takes as input the unconvolved node features of i and the pairwise difference between the unconvolved node features of i and its j-th neighbor. Thus, the connectivity of the node in question serves as a specification of which neighboring nodes contribute to the convolution operation. A full convolution pass of the input graph consists of repeating Eq. (2.2) for every node in the graph. Compared to convolutions as understood from CNNs, Eq. (2.2) corresponds to a convolutional operator where the kernel size is analogous to the number of neighbors k, and the "pixels" on which the kernel acts are defined by the edges instead of the position of the kernel. The full architecture of dynedge is shown in Fig. 2. First, the following 5 global statistics are calculated from the input graph: the node homophily ratios [26] of the positions xyz and the times t, and the number of pulses in the graph. The homophily ratio is the ratio of connected node pairs that share the same node feature, and thus a number between 0 and 1. The homophily ratio of xyz indicates what fraction of connected pulses originate from the same PMT. For a graph where all nodes are connected to each other, a value of 0.5 indicates that half the pulses in the event are same-PMT pulses. As such, the homophily ratios quantify how many of the pulses originate from the same PMT and how many of the pulses happened at the same time. Second, the input graph is propagated through 4 different EdgeConv blocks such that the output from one flows into the next. As illustrated in the bottom right corner of Fig. 2, each EdgeConv block starts with a k-nearest-neighbors (k-nn) computation that redefines the edges of the input graph.
This step lets dynedge calculate the 8 nearest Euclidean neighbors and means that the first convolution block connects the nodes in true xyz-space, while the subsequent convolution blocks connect the nodes in an increasingly abstract latent space. By letting each subsequent convolutional block interpret the convolved coordinates from the previous block, dynedge is allowed to connect the nodes arbitrarily at each convolutional step, which effectively lets dynedge learn the optimal neighborhoods of the event for a specific task, given the 8-nearest-neighbors constraint. The "Input Graph" in Fig. 2 is a toy illustration of the input, and the subsequent "State Graphs" illustrate how the node positions and connectivity change after convolutions by EdgeConv. An illustration of the inner workings of the EdgeConv block is provided on the right, where h represents an arbitrary number of columns and n represents the number of pulses.
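The neighborhood convolution of Eq. (2.2) can be sketched in NumPy as below (a minimal illustration, not the PyTorch Geometric implementation; the `edge_conv` and `mlp` names, and the sum aggregation over neighbors, are assumptions of this sketch):

```python
import numpy as np

def edge_conv(x: np.ndarray, nbrs: np.ndarray, mlp) -> np.ndarray:
    """One EdgeConv pass: for each node i, aggregate mlp([x_i, x_j - x_i])
    over its neighbours j.

    x:    [n, f] node features
    nbrs: [n, k] indices of the k nearest neighbours of each node
    mlp:  callable mapping an [m, 2f] array to [m, f'] features
    """
    n, k = nbrs.shape
    xi = np.repeat(x, k, axis=0)   # x_i repeated for every (i, j) pair
    xj = x[nbrs.ravel()]           # x_j for every pair
    msgs = mlp(np.concatenate([xi, xj - xi], axis=1))
    # Aggregate the messages over each node's neighbourhood.
    return msgs.reshape(n, k, -1).sum(axis=1)
```

Chaining several such passes, each preceded by a fresh k-nn computation on the convolved features, reproduces the "dynamic edge" behavior described in the text.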
Third, the input graph and each state graph are concatenated into an [n_pulses, 1030]-dimensional array that is passed through a fully connected MLP block containing two layers. This block maps the [n_pulses, 1030]-dimensional array into an [n_pulses, 256]-dimensional array. This array is aggregated node-wise into summary statistics in four parallel ways: mean, min, max, and sum, each corresponding to a many-to-one projection of the form [n_pulses, 256] −→ [1, 256]. The aggregations are concatenated to minimize information loss, producing an array of dimension [1, 4 · 256] that is concatenated with the initially calculated 5 global statistics, giving a [1, 1029]-dimensional input array to the subsequent 2-layer MLP block, which makes the final prediction by mapping the [1, 1029]-dimensional array to a [1, n_outputs]-dimensional output. This node aggregation scheme removes the need for zero-padding of the input data and allows the model to function on any number of pulses.
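The aggregation scheme can be sketched as follows (`readout` is a hypothetical helper for illustration). With 256-dimensional node features and 5 global statistics, the output has the stated 4 · 256 + 5 = 1029 dimensions regardless of the number of pulses:

```python
import numpy as np

def readout(node_feats: np.ndarray, global_stats: np.ndarray) -> np.ndarray:
    """Node-wise aggregation: concatenate mean/min/max/sum over the pulse
    axis, then append the event-level statistics, yielding a fixed-size
    vector independent of n_pulses."""
    aggs = np.concatenate([
        node_feats.mean(axis=0),
        node_feats.min(axis=0),
        node_feats.max(axis=0),
        node_feats.sum(axis=0),
    ])
    return np.concatenate([aggs, global_stats])
```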
While more sophisticated pooling methods were tested, none improved upon this choice.
The architecture shown in Fig. 2 is the result of multiple iterations of architecture tests. We find that increasing the number of convolutional layers leads to identical performance but at increased training time, whereas decreasing the number of convolutional layers leads to noticeably worse performance. In addition, a wide variety of data-driven pooling operations have been tested, but none improved upon the many-to-one projections shown in Fig. 2. Hyperparameters such as the number of nearest neighbors k used in the k-nearest-neighbors computation have also been subject to tuning. We find that for k larger than 8, the convolutions become too coarse to learn local features, whereas for k smaller than 8 the convolutions become too fine to be globally descriptive. The internal dimensions of the data, referenced both in the text and in Fig. 2, have also been subject to hyperparameter optimization. These dimensions effectively set the number of learnable parameters in the model. We find that a significant reduction in learnable parameters yields an under-parameterized model that cannot learn the complicated relationships between data and task, while a significant increase in learnable parameters increases training time and yields performance similar to the current configuration.

Training Configuration and Loss Functions
A network is trained for each of the reconstruction variables: deposited energy E, zenith and azimuth angles (θ, φ), and the interaction vertex (x, y, z), plus ν/μ classification and T/C classification. This totals 6 independently trained models. The models differ in the choice of loss function, number of outputs, and training selection.
Each model is trained with a batch size of 1024 using the ADAM [27] optimizer. The batch size is the number of events in each subsample of the training set on which the gradients of the loss are calculated. A custom piece-wise linear implementation of a One-Cycle [28] learning rate schedule with a warm-up period [29] is used. The scheduler increases the global learning rate linearly from 10^−5 to 10^−3 during the first 50% of iterations in the first epoch, and thereafter decreases the learning rate linearly to 10^−5 during the remaining iterations of the 30-epoch training budget. To counteract overfitting, early stopping [30] is implemented with a patience of 5 epochs.
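A piece-wise linear schedule of this kind can be sketched as below (a simplified stand-in for the custom scheduler; the helper name and step-based parameterization are assumptions):

```python
def one_cycle_lr(step: int, total_steps: int, warmup_steps: int,
                 lr_min: float = 1e-5, lr_max: float = 1e-3) -> float:
    """Piece-wise linear one-cycle schedule: ramp lr_min -> lr_max over the
    warm-up steps, then decay linearly back to lr_min over the remainder."""
    if step < warmup_steps:
        return lr_min + (lr_max - lr_min) * step / warmup_steps
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_max - (lr_max - lr_min) * frac
```

Here `warmup_steps` would be set to half the iterations of the first epoch and `total_steps` to the full 30-epoch budget.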
For all classification tasks, the Binary Cross Entropy [31] loss is used, since we only consider two categories: neutrino or muon events, and tracks or cascades. However, for each regression task, a specific loss function is chosen.
For regression of deposited energy, a LogCosh [32] loss function is applied to the residual Δ = log10(E_reco/GeV) − log10(E_true/GeV). The embedding E_true −→ log10(E_true/GeV) is added to account for the large range that the deposited energy spans, and log10(E_reco/GeV) is the model's prediction of the embedded deposited energy. LogCosh is symmetric around zero and therefore punishes over- and under-estimation equally. Compared to more conventional choices such as mean squared error (MSE), LogCosh offers a steadier gradient around zero and is approximately linear for large residuals, both beneficial to the training process.

For the angles, each true angle ø is embedded as a unit vector x̄_true = (cos ø, sin ø). dynedge is then tasked with predicting this embedded vector together with an uncertainty κ. The d=2 von Mises-Fisher distribution, where d is the dimension of the unit vectors, is given by

p(x̄_reco | x̄_true, κ) = C_2(κ) exp(κ x̄_reco · x̄_true) = C_2(κ) exp(κ cos Δø),

where x̄_reco is the predicted embedded vector; x̄_true is the embedding of the true angle; κ resembles 1/σ² of a normal distribution; and C_2(κ) is a normalization constant written in terms of modified Bessel functions. The d=2 von Mises-Fisher distribution describes a probability distribution on a 1-sphere embedded in R² and bears resemblance to loss functions such as 1 − cos(Δø), but with the added functionality of uncertainty estimation via κ. Note that Δø is chosen to be the angle between the predicted and the true embedding vectors. The loss function is created by taking the negative log of p(x̄_reco | x̄_true, κ):

Loss(ø) = − ln p(x̄_reco | x̄_true, κ) = − ln(κ/4π) + κ + ln(1 − e^(−2κ)) − κ cos(Δø).    (2.4)

After zenith and azimuth angles are regressed individually, the direction is produced by transforming zenith and azimuth into a direction vector r̄_reco ∈ R³. The true interaction vertex is embedded using (x, y, z) −→ (x/|max(x)|, y/|max(y)|, z/|max(z)|) for the same reasons mentioned in Section 2.1.
dynedge then predicts the embedded interaction vertex vector, and the loss function is the Euclidean distance between the true (x̃_true) and reconstructed (x̃_reco) embedding vectors.
A more intuitive approach such as a d=3 von Mises-Fisher, where Δø is chosen to be the angular difference between predicted and true direction vectors in R 3 , was found to lead to suboptimal results because the azimuthal component of the loss is too dominant at lower energies. Substantial improvement in zenith reconstruction is gained by estimating the two angles separately.
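Illustrative NumPy versions of these two loss functions are sketched below (hypothetical helper names; for simplicity the von Mises-Fisher version operates on angles directly rather than on the embedded unit vectors, and is given up to an additive constant):

```python
import numpy as np

def logcosh_loss(log_e_reco: np.ndarray, log_e_true: np.ndarray) -> float:
    """LogCosh loss on the log10-energy residual: symmetric around zero,
    smooth near zero, and approximately linear for large residuals."""
    return float(np.mean(np.log(np.cosh(log_e_reco - log_e_true))))

def vmf_2d_loss(angle_pred: np.ndarray, angle_true: np.ndarray,
                kappa: np.ndarray) -> float:
    """Negative log-likelihood of the von Mises-Fisher form of Eq. (2.4).
    kappa plays the role of 1/sigma^2: large kappa = confident prediction."""
    cos_dpsi = np.cos(angle_pred - angle_true)  # x̄_reco · x̄_true for unit vectors
    nll = (-np.log(kappa / (4 * np.pi)) + kappa
           + np.log(1 - np.exp(-2 * kappa)) - kappa * cos_dpsi)
    return float(np.mean(nll))
```

As expected, the vMF loss is minimized when the predicted angle matches the truth, and the κ term penalizes over-confident (large κ) predictions that turn out to be wrong.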

Performance on Low-Energy Neutrino Events
In this section, we quantify the performance of our proposed algorithm on simulated data in the energy range of 1 GeV-1000 GeV, where the majority of selected events are below 100 GeV (see Fig. 4), and compare it with the state-of-the-art RetroReco algorithm. When possible, we provide comparisons for tracks (νμ CC) and cascades (all other neutrino interactions) separately. At the end of the section, a short examination of the runtime performance is provided.

Selected Datasets
The IceCube simulation used for training and testing the classifications and reconstructions is borrowed from the collaboration's simulation for neutrino oscillation analyses. The dataset and selection process are similar to those described in [10], which is also used in [5]. The neutrino interactions were simulated with GENIE [34], and the simulations of the propagation of secondaries, particle showers, and the propagation of Cherenkov light are the same as used in [10]. The event selection aims to provide a clean neutrino sample in the DeepCore region of IceCube by removing pure noise events and atmospheric muon events, and by applying containment cuts.
RetroReco is run at the late stages of the event selection (level 7 in [10]) because of its high computational cost. We have chosen this late stage of the event selection as our dataset because it allows for a direct comparison with RetroReco. The resulting selection totals approximately 8.3 million simulated neutrino and muon events, and because the event selection removes virtually all pure noise events, these have been omitted from this work. Once weighted to a physical spectrum, the data sample contains less than 5% atmospheric muons [10], and the neutrino sample consists of mostly contained interactions below 100 GeV. Distributions of a few event-level variables of the selected events are shown in Fig. 4. From the 8.3 million event sample, we construct three task-specific datasets, summarized in Table 3.

Table 3: Selected datasets for the reconstruction and classification tasks. Testing sets are defined as the data on which no back-propagation has been performed, and therefore include the validation set.

Because the event selection aims to produce a pure neutrino sample, the raw count of muon events in the 8.3 million sample is less than 200k. In addition, there are significantly more cascade events than tracks (νμ CC), making the full dataset class-imbalanced. This work under-samples [35] the full dataset into three task-specific training sets with corresponding test sets to counteract the imbalance between the three classes: tracks, cascades, and muons.
The T/C training dataset contains an equal number of νμ CC events (considered "tracks") and νe CC + NC events (considered "cascades"); due to the high number of cascade events, the training sample consists of only 731k events. The ντ events are omitted from the training sample (but included in the test sample), since the 17% of ντ events that produce track-like signatures would confuse the model during training by pairing a track-like signature with a cascade-like label. The corresponding test set for T/C comprises the remaining neutrino events, mostly consisting of cascades. For the ν/μ classification task, the training dataset is chosen such that there are equal amounts of neutrino and muon events. Because of the few muon events, this dataset is the smallest, and the neutrinos have been sampled such that there is an equal amount of each flavor.
For reconstruction, a dataset with equal amounts of each neutrino flavor has been chosen. On this dataset, energy, zenith, azimuth, and interaction vertex are regressed. To keep the balance between tracks and cascades, one might most naturally choose the dataset from the T/C task, because it is balanced between the two event classes, but this selection yields significantly lower statistics. It was found that the high-statistics choice provided general improvements compared to reconstructions from dynedge trained purely on the balanced T/C set. The classification results are available in Fig. 3 and the reconstruction results are shown in Fig. 5. The normalized distributions of the regression targets in the test and training sets are plotted in Fig. 4 for each event selection in Table 3. The T/C and Reconstruction selections are similar in target distributions, whereas the ν/μ event selection has additional artifacts due to the presence of muons.
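Class balancing by under-sampling, as used to build the task-specific training sets, can be sketched generically as follows (an illustration, not the collaboration's selection code):

```python
import numpy as np

def undersample(labels: np.ndarray, rng=None) -> np.ndarray:
    """Return indices of a class-balanced subset: every class is randomly
    down-sampled to the size of the rarest class."""
    if rng is None:
        rng = np.random.default_rng(0)
    classes, counts = np.unique(labels, return_counts=True)
    n_min = counts.min()
    keep = [rng.choice(np.where(labels == c)[0], size=n_min, replace=False)
            for c in classes]
    return np.concatenate(keep)
```

Applied to, e.g., track/cascade labels, this reproduces the "equal amounts of each class" property of the training sets at the cost of discarding events from the majority class.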

Event Classification Performance
The performance of the dynedge classifiers and the currently used Boosted Decision Tree (BDT) methods is characterized using Receiver Operating Characteristic (ROC) [36] curves, from which the Area Under the Curve (AUC) [37] is calculated. The ν/μ classification task is used in event selection, and for this reason a threshold on the classification scores, shown in the bottom left plot of Fig. 3, is used to decide whether an event is labeled a neutrino or a muon. A typical threshold in classification score is indicated in red in the left panel of Fig. 3. The intersection points between the red line and the ROC curves represent the corresponding False Positive Rate (FPR) and True Positive Rate (TPR) at this choice of selection. By comparing the TPRs of the intersection points, one can deduce that dynedge increases the signal efficiency by 18% relative to the BDT at a FPR of 0.025. Alternatively, the FPR can be reduced by over a factor of 8, from 2.5% to 0.3%, relative to the BDT at a fixed TPR of 0.6 for the ν/μ classification task. From the ROC curve for the T/C classification task, dynedge offers an increase in AUC score of about 6%. A threshold for the T/C classification task is not shown because the T/C classification scores, shown in the bottom right plot, are typically binned when used in an analysis. Inspecting the T/C classification score distributions in the bottom right of Fig. 3 shows that dynedge offers a better separation of T/C events than the BDT. The large difference in classification performance shown in Fig. 3 between the BDT and dynedge can be explained by the fact that classical BDTs require sequential input data, such as IceCube events, to be aggregated from [n_pulses, n_features]-dimensional arrays into [1, n]-dimensional arrays as a preprocessing step, where n is the number of event-level variables for the BDTs to act on. This preprocessing step reduces the information available to the BDTs.
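Reading off the signal efficiency (TPR) at a fixed false-positive rate from a set of classification scores can be sketched as a simple threshold scan (an illustration, not the analysis code):

```python
import numpy as np

def tpr_at_fpr(scores: np.ndarray, labels: np.ndarray,
               target_fpr: float) -> float:
    """Scan score thresholds and report the best true-positive rate whose
    false-positive rate stays at or below target_fpr.

    scores: classifier outputs, higher = more signal-like.
    labels: 1 for signal (e.g. neutrino), 0 for background (e.g. muon).
    """
    n_bkg = max(1, int(np.sum(labels == 0)))
    n_sig = max(1, int(np.sum(labels == 1)))
    best = 0.0
    for thr in np.unique(scores):
        pred = scores >= thr
        fpr = np.sum(pred & (labels == 0)) / n_bkg
        tpr = np.sum(pred & (labels == 1)) / n_sig
        if fpr <= target_fpr:
            best = max(best, tpr)
    return best
```

Comparing this quantity between two classifiers at the same `target_fpr` is exactly the "signal efficiency at fixed background rate" comparison quoted in the text.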

Event Reconstruction Performance of dynedge
The event reconstruction performance for deposited energy, zenith, direction, and interaction vertex is shown in Fig. 5. The performance metric is the resolution, which we quantify here as the width W of the residual distribution. For the deposited energy and zenith reconstructions, W is defined as

W = (P84 − P16) / 2,

where P16 and P84 correspond to the 16th and 84th percentiles. For a normal distribution, W corresponds to the standard deviation σ, but it is more robust to outliers. For the direction and interaction vertex reconstructions (which have strictly positive residual distributions), the resolution is defined as the 50th percentile, P50. Since reconstruction resolutions are highly dependent on the energy of the neutrino, we provide the resolutions binned in energy. Additionally, we separate the residual distributions into true track (νμ CC) and cascade events to examine the performance of the algorithms in detail. For comparison between dynedge and RetroReco, the relative improvement

R = (W_RetroReco − W_dynedge) / W_RetroReco

is used. Here W_dynedge and W_RetroReco correspond to the resolutions of dynedge and RetroReco for a given reconstruction task. The relative improvement provides a percentage comparison of the resolutions. Residuals are defined as in Table 2, except for the vertex coordinates, which show the relative difference between reconstructed and true coordinate.
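The resolution metric and relative improvement can be written directly from the definitions above (hypothetical helper names):

```python
import numpy as np

def resolution(residuals: np.ndarray) -> float:
    """Robust width W = (P84 - P16) / 2; equals sigma for a normal
    distribution but is less sensitive to outliers."""
    p16, p84 = np.percentile(residuals, [16, 84])
    return (p84 - p16) / 2.0

def relative_improvement(w_baseline: float, w_model: float) -> float:
    """Percentage improvement of the model resolution over the baseline."""
    return 100.0 * (w_baseline - w_model) / w_baseline
```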
The residual distributions for deposited energy (Δ_E), zenith and azimuth angles (Δ_angle), interaction vertex (Δ_xyz), and direction (Δ_dir) are defined in Table 2, and some are shown in Fig. 4. (Note that the residual distribution for energy is changed to Δ_E = (E_reco − E_true)/E_true · 100 for the calculation of resolutions.) From Fig. 4 it can be seen that the GNN tends to over-estimate the deposited energy for very low-energy events. This is partly due to the lower statistics and the relatively poor signal-to-noise ratio below 30 GeV. On average, it is optimal for a machine learning model to estimate a value close to the mean of the true distribution for examples where it has not yet learned a better solution. Such examples can be considered difficult for the method, and by predicting a value close to the mean of the true distribution, the model minimizes the loss. This behavior is evident for the GNN in the reconstructions of zenith and azimuth, where events with a relatively high estimated uncertainty have a preferred solution that minimizes the loss function. For the regression of azimuth, this behavior yields multimodal artifacts due to the cyclic nature of the variable. As seen in Fig. 1, DeepCore has more strings along the north/south axis than along east/west. The events with high estimated azimuthal uncertainty, piled around π and 2π, are generally low-energy track events with a detector response similar to that of a neutrino traveling east-west, but which are in fact traveling north or south. If there is not enough information to pinpoint whether the event is traveling north or south, the loss can be optimized by picking a value close to the observed locations of the pileups, as these correspond to a "mean" between north and south. The timing information of the event can be used to pick the optimal pile. We emphasize that while the distribution of predictions may give the impression that RETRO is closer to the true distribution than the GNN, the correct assessment of the quality of the reconstructions comes from interpretation of the residual distributions. From the bottom panel of Fig. 4 it is evident that the GNN produces reconstructions with narrower residual distributions and therefore superior resolutions.
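Because azimuth is cyclic, residuals must be wrapped onto a half-turn interval before computing resolutions, or a prediction of 0.1 rad against a truth of 2π − 0.1 rad would register as a ~2π error. A small sketch of such a wrapping (the helper name is ours, not from this work):

```python
import numpy as np

def wrap_azimuth_residual(reco, true):
    """Map reco - true into [-pi, pi), so that e.g. a 0.1 rad prediction
    against a 2*pi - 0.1 rad truth counts as a 0.2 rad error, not ~2*pi."""
    return (reco - true + np.pi) % (2 * np.pi) - np.pi

# small error across the 0/2*pi boundary
print(wrap_azimuth_residual(0.1, 2 * np.pi - 0.1))      # ~0.2
# north (pi/2) vs south (3*pi/2): a half-turn apart, the maximal error
print(wrap_azimuth_residual(np.pi / 2, 3 * np.pi / 2))  # -pi
```

The second case illustrates the north/south ambiguity discussed above: the two hypotheses sit a half-turn apart in azimuth, so an uncertain model minimizes its loss by predicting between them.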
As can be seen from Fig. 5, the GNN improves upon the existing RETRO reconstruction in all 6 reconstruction variables for both track and cascade events between 1 GeV and 30 GeV. This energy range is of particular importance to neutrino oscillation analyses, where flavor oscillations of atmospheric neutrinos appear below 25 GeV [10]. The typical reconstruction improvement in this range is at the level of 15%-20%. For the zenith and direction reconstructions, the improvement is constant with energy for cascade-like events, while for track-like events the improvement ends around 100 GeV.
The performance of the GNN relative to RETRO generally decreases at higher energies, which is possibly due to the lower number of available training samples at these energies, as shown in Fig. 4. As mentioned in Section 3.1, the reconstruction event selection is chosen to maximize available statistics with an equal amount of neutrino flavors, rather than to optimize the balance of track and cascade events. This choice improves the overall performance under the given circumstances, but at the same time leads to an underrepresentation of track-like events, which further lowers the available statistics for this event type at higher energies. In all four cases, the performance is compared to RETRO, and the ratio below the plots shows the relative improvement of the GNN w.r.t. RETRO, where the light blue and dark blue curves represent the relative improvement for cascades and tracks, respectively. Positive values indicate an improvement in resolution. Lines represent the reconstruction resolution, and the bands cover the 1σ resolution uncertainty.

Runtime Performance
The inference speed test measures the wall-time required to reconstruct the zenith angle of 1 million neutrino events in batches of 7168 events. The test includes the time required to load the batch into memory, apply re-scaling, convert the batch into graphs, move the data to GPU memory, and pass them through the GNN. A budget of 40 parallel workers is allocated to feed the model with batches. The test does not include the overhead associated with writing the predictions to disk. The test was repeated 5 times on both GPU and CPU. The average speed was found to be 30.6 ± 2.0 kHz on the GPU, and 0.221 ± 0.004 kHz on a single CPU core. The inference speed was tested for the other variables considered in this work and was found to be within the uncertainties of the above values. Since a separate model is trained for each variable, the individual reconstructions can be run in parallel, in which case the reconstruction rate of an event equals the inference speed. If the reconstruction of each variable is not run in parallel, the reconstruction rate of an event is approximately the inference speed divided by the number of desired variables. For a full reconstruction of energy, zenith, azimuth, interaction vertex, ν/μ classification, and T/C classification, the reconstruction rate is approximately 5.1 kHz on the GPU, or 37 Hz on a single CPU core. With a classification and reconstruction speed of nearly double the median IceCube trigger rate (2.7 kHz), the GNN is in principle capable of real-time reconstruction of IceCube events using a single GPU.
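A wall-time throughput measurement of this kind can be sketched as follows; the stand-in "model" below is a trivial numpy reduction, not the GNN, and the measured rate is meaningless except as a demonstration of the bookkeeping:

```python
import time
import numpy as np

def measure_throughput_khz(predict, batches):
    """Wall-time event rate in kHz over all batches; excludes writing
    predictions to disk, mirroring the test described in the text."""
    n_events = sum(len(b) for b in batches)
    t0 = time.perf_counter()
    for batch in batches:
        predict(batch)
    return n_events / (time.perf_counter() - t0) / 1e3

# stand-in "model": a cheap numpy reduction on batches of 7168 events
batches = [np.random.rand(7168, 4) for _ in range(20)]
khz = measure_throughput_khz(lambda b: b.sum(axis=1), batches)
print(f"throughput = {khz:.1f} kHz")
```

Consistent with the text, running six tasks sequentially divides the single-task rate by six: 30.6 kHz / 6 ≈ 5.1 kHz for a full reconstruction.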

Robustness Test
The results presented in Section 3 show that the GNN is suitable for both classification and reconstruction tasks in the low-energy range of IceCube, specifically in the neutrino-oscillation-relevant energy range. However, these results are computed on simulation based on a nominal configuration of the detector. In Section 4.1, we investigate the robustness of the GNN to perturbations of the input variables, i.e. the DOM positions, pulse times, and charges. These variables constitute the node features, and as such the perturbation test probes the robustness of the GNN to systematic variations in the node features themselves. In Section 4.2, we investigate the robustness of the GNN to changes in the systematic uncertainties associated with the detector medium and the angular acceptance of the DOMs. Variations in such assumptions lead not to variations in the node features, but to variations in the topology of the events. This test therefore probes the robustness of the GNN to variations in the connectivity of the graph representations of the events.

Perturbation of Input Variables
There are systematic uncertainties associated with the input variables shown in Table 1. The position of each string is only known to a precision of a few meters horizontally, and the vertical position is known with a precision better than one meter [6]. In this section, we investigate the variation in resolution and AUC from perturbations of the input variables by x → x + ε, where x is an input variable and ε is a random number drawn from a normal distribution with standard deviation σ. For the DOM positions, one ε is drawn for each string, resulting in correlated shifts for all DOMs on one string and uncorrelated shifts between strings. The horizontal and vertical positions of the DOMs are perturbed independently of each other. Time and charge are perturbed pulse-wise, meaning that each pulse is perturbed independently. (The inference speed was tested on a server running Ubuntu 20.04.3 LTS with a 64-core 2.00 GHz AMD EPYC 7662 CPU, a single NVIDIA A100 SXM4 GPU with 40 GB of VRAM, 1 TB of RAM, and 5 TB of NVMe disk space.)
The GNN is not re-trained, i.e. we use the networks trained on nominal detector assumptions and run them on perturbed input datasets. We perturb time, the z-coordinate, the xy-coordinates, and charge separately, for a few choices of σ, and calculate the percentage variation with respect to the nominal resolution and AUC score. Our test addresses two questions. First, it gives an indication of the expected change in reconstruction performance due to a systematic shift in the input variables. Second, it serves as a gauge of feature importance for the different reconstruction tasks. In Fig. 6, the variation in resolution and AUC is shown as a function of the perturbation width σ. A negative value indicates a worsening with respect to the nominal performance: for the resolutions, -5% means a worsened resolution with a width that is 5% larger, and for the AUC scores, -5% means a decrease of the AUC by 5%. As seen in Fig. 6, the zenith reconstruction is the most sensitive to perturbations of the time, as expected, since time perturbations of the pulses may reverse the inferred direction. Perturbations of the vertical and horizontal positions of the strings have little effect on energy and direction, but impact the interaction vertex resolution. (Strings are also not perfectly vertical, which results in a non-uniform uncertainty on the horizontal and vertical positions, but this effect is neglected in this test.)
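The perturbation scheme, with string-wise correlated position shifts and pulse-wise independent time/charge shifts, can be sketched as follows (array layout and helper names are ours, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(42)

def perturb_string_positions(xy, string_id, sigma):
    """Draw one horizontal offset per string: DOMs on the same string shift
    together (correlated), different strings shift independently."""
    offsets = {s: rng.normal(0.0, sigma, size=2) for s in np.unique(string_id)}
    return xy + np.array([offsets[s] for s in string_id])

def perturb_pulsewise(values, sigma):
    """Pulse-wise perturbation: every pulse (time or charge) shifts independently."""
    return values + rng.normal(0.0, sigma, size=values.shape)

# toy geometry: 5 DOMs on 2 strings, nominal positions at the origin
xy = np.zeros((5, 2))
string_id = np.array([1, 1, 1, 2, 2])
xy_pert = perturb_string_positions(xy, string_id, sigma=1.0)

assert np.allclose(xy_pert[0], xy_pert[2])      # same string: identical shift
assert not np.allclose(xy_pert[0], xy_pert[3])  # different strings: independent
print("string-correlated shifts OK")
```

The perturbed arrays would then be fed to the nominally-trained networks, and the change in W and AUC recorded per σ.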

Variations in Ice Properties and Module Acceptance
The South Pole ice, IceCube's detector medium, is a glacier with an intricate structure of layers with varying optical properties, and the refrozen boreholes where the strings were deployed are understood to have different optical properties than the bulk ice. These sources of systematic uncertainty are constrained from fits to calibration data, but only to a finite precision of about ±10% of the nominal value [9]. In the tests presented in this section, we quantify the robustness of our reconstruction to changes in the ice properties allowed by calibration data, using a five-parameter model covering the most important sources of systematic detector uncertainties in IceCube. We consider 20 additional simulation sets with altered ice properties and module acceptance. Specifically, there are 8 perturbed simulation sets varying the scattering and absorption coefficients of the South Pole bulk ice by (-30%, -10%, +10%, +30%) independently, four sets with variations in the overall optical efficiency of the DOMs (-10%, -5%, +5%, +10%), and 9 sets with independent variations in two key parameters, p0 and p1, altering the angular acceptance of the DOMs. These last two parameters are the principal components of a unification of different approaches in IceCube that model changes in the borehole ice via the angular photon acceptance. As seen in Fig. 7, an increase in p0 leads to an increase in directly up-going photon acceptance (face of the PMT), whereas changes in p1 predominantly affect the photon acceptance around the waist of the PMT.
In this subsection, we seek to quantify the robustness of the GNN to these variations by letting the model predict on the systematic sets and comparing the bias, resolution, and AUC to those recorded on the nominal data set used in Section 3. For reference, the variation in the predictions of the GNN is compared with the variation in the predictions from the current method, RETRO. For each individual systematic data set, only the events that are present in both the nominal set and the given systematic set are considered. This constraint to overlapping events is applied because the event selection process is itself sensitive to systematic uncertainties and could otherwise lead to different distributions in parameters such as deposited energy, biasing the comparison.

Classification
To quantify the robustness of the T/C and ν/μ classification tasks, the AUC is computed from the ROC curve on each systematic set for both the GNN and the BDTs. The robustness of the classification is then defined as the relative change in AUC with respect to the nominal AUC. Currently, muon simulation is only available for the sets with variations in optical efficiency, limiting the test of ν/μ AUC variation to these sets. The test of T/C variation extends to all sets, and the results are shown in Fig. 8. As seen in Fig. 8, the error on the AUC variation for ν/μ is much higher than for T/C. This is due to the lower statistics of the data selection shown in Table 3. In general, we observe that the T/C classification has a variation well below ±2% of the nominally expected AUC for both methods, the GNN and the BDT. The ν/μ classification task, however, indicates a higher robustness of our proposed model than the current BDT classifier.

Reconstruction
The robustness of the reconstruction is reported in two main metrics: the bias variation ΔB_sys and the resolution variation ΔW_sys, defined as ΔB_sys = Δ_50th,sys − Δ_50th,nom and ΔW_sys = 1 − W_sys/W_nom, where Δ and W are as defined in Section 3.3. These measures quantify the change relative to the nominal quantities. Since the bias of the direction reconstruction is ambiguous, the bias variations of the zenith and azimuth reconstructions are shown instead. The bias variation remains small for all variables considered, while a slight overall increase in magnitude for the GNN can be observed. The zenith bias is very sensitive to the ice properties, whereas the induced bias in reconstructed energy is strongly affected by the optical efficiency and the scattering/absorption of light, but less sensitive to the angular acceptance of the DOMs. Such an effect is expected, as the deposited energy is correlated with the number of measured pulses. Also, a linear change in DOM efficiency leads to an approximately linear change in energy bias for both methods. As seen in Fig. 10, the overall effect on the resolution of the reconstruction is nearly identical for the GNN and RETRO. While the zenith bias is unaffected by optical efficiency variations, the zenith resolution changes for both the GNN and RETRO as a function of the optical efficiency. A decrease/increase in optical efficiency leads to a decrease/increase in the available information for the methods, which, as expected, leads to a widening/narrowing of the residual distribution.
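The two robustness metrics can be computed directly from the residual samples of a systematic set and the nominal set. A minimal numpy sketch, with toy Gaussian residuals in place of actual reconstruction residuals:

```python
import numpy as np

def resolution_w(residuals):
    """W = (q84 - q16) / 2, as in Section 3.3."""
    q16, q84 = np.percentile(residuals, [16, 84])
    return (q84 - q16) / 2.0

def bias_variation(res_sys, res_nom):
    """Delta_50th,sys - Delta_50th,nom: shift of the median residual."""
    return float(np.median(res_sys) - np.median(res_nom))

def resolution_variation(res_sys, res_nom):
    """1 - W_sys / W_nom: negative means a wider (worse) residual distribution."""
    return 1.0 - resolution_w(res_sys) / resolution_w(res_nom)

# toy residuals: the systematic set is both biased (+0.1) and 10% wider
rng = np.random.default_rng(7)
res_nom = rng.normal(0.0, 1.0, 200_000)
res_sys = rng.normal(0.1, 1.1, 200_000)

print(f"bias variation       = {bias_variation(res_sys, res_nom):+.2f}")
print(f"resolution variation = {resolution_variation(res_sys, res_nom):+.2f}")
```

Note the sign convention matches the perturbation test above: a negative resolution variation indicates a worsening with respect to nominal.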
To summarize the difference in robustness between the GNN and RETRO or the current BDT classifiers, the RMS of the variations shown in Figs. 8 to 10 is presented in Table 4. For the classification tasks, the RMS of the AUC variation is 0.5% for the GNN versus 3% for the BDT on ν/μ, and 0.5% versus 0.8% on T/C. The RMS values for the ν/μ AUC are based only on the variations seen in the 4 sets varying the optical efficiency of the DOMs. When considering the RMS values for the regression tasks, the GNN is generally affected by higher RMS values for both track and cascade events, indicating a higher variation induced by the systematic uncertainties compared to RETRO. The single exception is the resolution of the reconstructed vertex, where the GNN exhibits a clear advantage over RETRO.

Summary and Conclusions
We propose a GNN-based reconstruction algorithm for IceCube events that can be applied to any IceCube event. We have selected simulated low-energy data used for studies of atmospheric neutrino oscillations in IceCube as our dataset, and in this energy range we have benchmarked the GNN against the state-of-the-art reconstruction and classification algorithms as a proof of concept.
The GNN offers substantial improvements to both T/C and ν/μ classification in the entire low-energy range. In the energy range 1 GeV-30 GeV, relevant to standard atmospheric oscillation studies, the GNN exhibits a 15-20% improvement in the reconstruction of energy, zenith, direction, and interaction vertex. Outside the energy range relevant to neutrino oscillations, the improvement in event reconstruction decreases, and at high energies the GNN becomes worse than RETRO. This worsening is ascribed to lower statistics in the training set at these energies and a disproportionate amount of cascades compared to more energetic (i.e. longer) tracks. Future studies will focus on improving performance in this region. The GNN and RETRO are both robust against systematic uncertainties in DOM optical efficiency and angular acceptance, as well as in the scattering and absorption properties of the bulk ice. The GNN is also robust against random perturbations of inputs such as DOM position, timing, and charge.
Based on tests of reconstruction speed, the GNN could reconstruct events online at the South Pole.
The GNN is also flexible enough to be compatible with the planned IceCube Upgrade, featuring new DOM types on new strings, as well as with other neutrino detectors with arbitrary geometries. This feature may make the GNN particularly useful for the next generation of large neutrino detectors, such as KM3NeT [38] and the proposed IceCube-Gen2 [39].