Neutrino Characterisation using Convolutional Neural Networks in CHIPS water Cherenkov detectors

This work presents a novel approach to water Cherenkov neutrino detector event reconstruction and classification. Three forms of a Convolutional Neural Network have been trained to reject cosmic muon events, classify beam events, and estimate neutrino energies, using only a slightly modified version of the raw detector event as input. When evaluated on a realistic selection of simulated CHIPS-5kton prototype detector events, this new approach significantly increases performance over the standard likelihood-based reconstruction and simple neural network classification.


Introduction
In pursuit of answers to some of the open questions in physics, the study of the phenomenon of neutrino oscillations has become paramount, owing to the implication of the neutrinos' small masses in relation to the mass of the Universe and, potentially, the question of the matter anti-matter asymmetry. The next generation of giant neutrino detectors aim to push the size and therefore detector mass to the 100-500 kilo-ton level. Such massive detectors will quickly become limited by systematic uncertainties, whereas most present experiments are still statistics limited.
For detectors to remain practical and affordable into the future, a novel design strategy is highly desirable in order to break the systematic limitation of single large detectors. One approach is to break the financial dependency, which is presently on the order of $20M/kilo-ton, such that several identical detectors can be afforded. When such detectors are then placed at different positions in the neutrino beam, the oscillation parameters could be measured with substantial cancellation of the aforementioned systematic errors.

The CHIPS R&D Project
The Chips R&D project aims to develop such a novel strategy: a 'cheap as chips' water Cherenkov detector has been designed [1] and was built in 2018-2019. Primarily aimed for deployment in long-baseline accelerator beam scenarios, the idea was to lower the cost per kt of sensitive mass to between $200k-$300k. For comparison, the Super-Kamiokande detector cost approximately $4 million per kt to build more than 30 years ago. As physics sensitivity depends on the detector performance in addition to sensitive mass, this comparison is not entirely rigorous; however, it highlights the scale of possible cost savings. This paper focusses on the simulation and reconstruction of events in the 5 kton Chips detector (Chips-5).
The Chips detector module is a cylindrical water Cherenkov detector submerged in a deep body of water on the Earth's surface, such as a lake, reservoir, or as in the first prototype, a flooded mine pit. A graphical rendering of which is shown in Figure 1. The water above the sunken detector provides a modest overburden from cosmic rays, while the surrounding water supports a lightweight structure. By removing the need for underground excavation and expensive structural support, the cost of construction can be dramatically reduced. On the other hand such bodies of water are not available everywhere, and some compromise between baseline distance and off-axis angle must be incorporated to maximize the sensitivity.
Additionally, the common practice of using majority bespoke components is replaced by using modern commercially available components wherever possible. The number of expensive elements, such as photomultiplier tubes, are also reduced by only considering multi-GeV accelerator beam neutrino events, such that full high-density detector instrumentation is not required.
Furthermore, Chips detectors are not only designed to be cheap but practical. Easy to build, quick to deploy, and upgradable once operational, multiple detector modules can be flexibly combined depending on available resources and funding. Compared to DUNE and Hyper-Kamiokande, both of which require a large upfront budget and many years to construct, cheap Chips detector modules can be deployed as needed in under two years by a relatively small team.

Modern Machine Learning for CHIPS
Alongside work to realise the concept outlined above, much effort has been made to explore alternative water Cherenkov neutrino event reconstruction and classification techniques for Chips style detectors. This work is motivated by the need to maximise the achievable physics performance from a 'cheap as chips' detector.
Over the last few years, neutrino experiments have adopted modern machine learning techniques for a range of event analysis tasks [2]. In 2016 the NOvA experiment applied a Convolutional Neural Network (CNN) [3], a type of deep learning [4] neural network, to the task of classifying the interaction type of events within their sampling calorimeter detector [5]. Two views of raw detector data were used as input to train a CNN network [6]. Further NOvA iterations have since been applied to both the classification of individual energy deposit clusters [7] and ν e and e − energy reconstruction [8].
Another example of modern machine learning techniques in neutrino physics is the usage of deep neural networks by the IceCube collaboration, who, similar to Chips make use of a Cherenkov detector [9]. They have employed a classifier to distinguish between the charged current (CC) interactions of cascade causing neutrinos and the tracks of Cherenkov radiationemitting muons as they traverse the ice.
CNNs have also been applied to liquid argon time-projection chambers. The MicroBooNE experiment [10] has shown that in addition to classification tasks and particle identification [11], the spatial localisation of single particles within events is also possible [12]. Furthermore, the DUNE collaboration has designed a network to output both the interaction class and counts of different particle types within an event [13,14], as well as kinematic energy estimation [15]. This approach is called multi-task learning and is discussed in detail in Section 3.3. The DUNE collaboration has also explored raw data denoising, the first step of reconstruction, in ProtoDUNE [16], with Graph Neural Networks also being investigated as an alternative to CNNs.
Submanifold sparse convolutional networks (SSCNs) have also been suggested to account for the poor scalability of naive applications of CNNs in large scale liquid argon time-projection chambers such as DUNE. The SSCNs can be used for image classification and object detection in such detectors and can have over 95% pixel clustering efficiency and purity for Michel electrons [17]. There are also investigations of the efficiencies of different CNN based algorithms such as Residual Networks and Inception Networks algorithms reconstructing shower energies in a LArTPC [18].
Applications to water Cherenkov detectors have also been made by both the Daya Bay experiment [19] and the KM3NeT/ORCA collaboration [20]. The output from a water Cherenkov detector, such as those envisioned by the Chips concept, is a simple image of each event where two pieces of information are known for each photomultiplier tube (PMT): the number of collected photoelectrons and the associated hit times. Therefore, it is natural to use CNNs primarily developed for image-based computer vision tasks for Chips event characterisation.
This work outlines an event analysis methodology for Chips following this natural progression. Three forms of a CNN have been developed: One for cosmic muon rejection, one for beam event classification, and one for neutrino energy estimation, all using only a slightly modified version of the raw detector event as input.
For completeness, we discuss the detector simulation and then detail how CNNs are used for Chips. We then present a comprehensive analysis of the new methodology's performance. For evaluation purposes, only the implementation as applied to the Chips-5 prototype detector module is considered in this work.

Detector Simulation
The detector simulation used throughout this work [21] builds upon the WCSim water Cherenkov simulation package [22] which employs the Geant4 simulation framework [23,24,25]. Developed initially to simulate possible water Cherenkov detectors in the Long Baseline Neutrino Facility beam [26], WCSim is now used more widely in the field.
The simulation builds an n-sided, regular polygonal prism consisting of two endcaps and a barrel, filled with water and lined with a low reflectivity blacksheet. The geometry is separated into regions within both the barrel and endcaps. Each region is filled with a unique base unit of geometry known as the unit cell.
The unit cell defines a pattern of any number of PMTs, including their relative positions and in which direction they face. The final geometry is built by tiling the defined regions with their respective unit cell scaled to match the required regional photocathode coverage. Although exact detector PMT positions are not explicitly defined using this procedure, A given configuration will deterministically generate the same geometry.
In this work, the Chips-5 geometry is generated with 28 sides and regions matching the photocathode coverage of the Chips-5 detector design. A conservative photon attenuation length of 50 m at 405 nm is used alongside negligible scattering [27], with the PMT glass reflectivity set to 24% [28], and the blacksheet reflectivity kept at the WCSim default of 4%. Note that veto PMTs are ignored for simplicity.
Using the expected NuMI flux at Chips-5, the Genie neutrino event generator (version 3.0.6) [29,30] is used to generate beam neutrino events using default neutrino cross-sections on water. All channels of neutrino interaction are considered and can be broken down into five main categories: • Quasi-Elastic scattering (QEL): The dominant channel for energies below 1 GeV, involving the neutrino scattering off the entire nucleon.
• Meson Exchange Current (MEC): Additional contribution channel for energies below 1 GeV, involving two nucleons and producing two protons in the final state.
• Resonant pion production (Res) The dominant channel between 1 and 2 GeV, involving the neutrino exciting the nucleon into a resonant state.
• Coherent pion production (Coh) A mechanism where the neutrino scatters coherently from the entire nucleus.
• Deep Inelastic Scattering (DIS) The dominant channel for neutrino energies above 3 GeV with the additional energy allowing for the neutrino to resolve the individual quark content of the nucleon.
The Cosmic-Ray Shower Library (CRY) [31,32] is used for cosmic ray event generation, assuming a Chips-5 overburden of 50 m and a 2.2 MeV/cm 2 muon energy loss in water as suggested by [33]. Beam event vertices are randomly placed within the inner detector volume, while cosmic vertices are set at 1 m above the detector volume. WCSim then simulates the passage of all particles through the detector materials, with interactions, decays, and Cherenkov emission all considered. Whenever a photon is calculated to have hit the photocathode of a PMT, an angular dependent acceptance efficiency is applied to see if it is recorded [28]. If accepted, all hits within 200 ns windows are grouped to form a single recorded hit, with a smeared first hit time used as the recorded time. The standard WCSim methodology is used to determine the total output charge of the hit, given the number of incident photons. This procedure involves a single photoelectron charge distribution being repeatedly probed for each photon before the combined sum is returned which corresponds to the sum of the simulated responses for each photon [34]. The simulation output, used as the input for the rest of this work, is a collection of hits for each neutrino interaction event, each with an associated number of photoelectrons and hit time.

Convolutional Neural Networks for CHIPS
For the majority of HEP experiments, event analysis entails the separation of signal from background, the identification of particle types, the determination of spatial properties, and the estimation of energies. The same is true for Chips detectors, with the primary aims being the selection of appeared CC ν e signal beam events from a sizeable background (beam and cosmic) and the estimation of associated neutrino energies.
For this purpose, the Chips project has so far relied on a likelihood-based reconstruction algorithm and a simple classification neural network driven by hand-engineered features [35]. Both suffer from only considering what has been implemented in software and consequently what features are explicitly extracted and then modelled from the data. This restriction makes them prone to ignoring the wide range of edge cases not contained within the bulk of neutrino events and unable to use all the underlying informative features of the data.
The methods outlined in this section present a replacement event analysis methodology for Chips. As mentioned previously, the image like nature of the raw detector output is a natural fit for the standard grid-like input to a CNN. Notably, using a CNN on the raw detector event removes the requirement to build hand-engineered features as the CNN learns to extract the most powerful features within the input data itself [36]. Three forms of a baseline CNN have been developed to achieve the primary aims outlined above: • The cosmic rejection form aims to prevent the vast cosmic muon background from contaminating the final selected sample of beam events. Therefore, the primary task is a simple binary classification between beam-like and cosmic-like events.
• The beam classification form aims to separate beam events by their neutrino and interaction type to primarily select a pure and efficient sample of appeared CC ν e events, but also a sample of survived CC ν µ events. Therefore, the principal task is a categorical classification between CC ν e , CC ν µ , and background Neutral Current (NC) events.
• The neutrino energy estimation form aims to accurately estimate the energy of the signal event ν e and ν µ . Therefore, the primary task is a regression on the interactive neutrino energy.
A Python-based software package named chipsnet [37] has been built for this purpose. By using the high-level Application Programming Interfaces (APIs) provided by the Tensorflow framework (version 2.3.0) [38], a complete pipeline including data preparation, network training, and performance evaluation has been implemented. We separate our methodology into four sections: the inputs used, the CNN architecture, the CNN outputs, and finally the training procedure. For each, we detail the differences between the three network forms.

Inputs
The primary difficulty in applying CNNs to Chips (or any cylindrical detector) is determining how to map an event captured on a cylindrical surface to a two-dimensional grid. This must be done in such a way that the underlying Cherenkov emission topology is not distorted, which could inhibit network learning. This work builds upon the ideas outlined in reference [39]. Simply put, an event is mapped onto a two-dimensional grid as though it is viewed from its estimated interaction vertex position. The primary motivation behind this is to remove any detector shape effects and focus on the underlying event topology and Cherenkov emission profiles.
To estimate the interaction vertex position, the PMT hits of an event are firstly sliced in both space and time. Gaps in the time ordering of hits are used to separate the event into time slices. Each of these slices then undergoes basic clustering to remove outlying hits and ensure only the dominant collections of hits are considered. Each cleaned slice is then run through a simple geometric vertex finding algorithm to estimate the interaction vertex position.
A circular Hough transform algorithm, traditionally used for water Cherenkov ring finding, is then applied [40]. As output, the voting-based transformation produces a space within which rings of PMT hits exist as peaks. The interaction vertex position is further refined using the Hough peaks in this space.
Using θ and φ components calculated as viewed from the estimated interaction vertex position facing along the beam axis, hit PMTs are mapped onto a 64 × 64 grid. This procedure is used to generate two event maps. Firstly, a hit-charge map where each grid bin is given by the total collected photoelectrons from all PMTs mapped to that bin. Secondly, a hit-time map where each grid bin is given by the first hit time (in nanoseconds) across all PMTs mapped to that bin. Each hit-time map is further corrected so that the first hit time across all bins lies at zero.
By design, the Hough transform uses the estimated interaction vertex position to generate the transform space. Therefore, by re-binning the transform space to a 64 × 64 grid, a third hough-height map is generated for each event. This event map provides a complementary but different representation of the event where Cherenkov rings are instead represented as peaks, allowing for additional discriminating features to be learnt.
All three event maps: hit-charge, hit-time, and hough-height, are downsampled using 8-bit encoding by converting each 32-bit float value to an integer between 0 and 255. Encoding significantly reduces storage requirements and dramatically increases the speed with which data can be loaded during training (which is the primary training bottleneck). For each map type, a range over which to encode from zero up to a cap-point is chosen to minimise the number of bin values that are capped at the maximum encoded value of 255. The generated event maps for an example CC ν e event are shown in Figure 2. Each network form (cosmic rejection, beam classification, and energy estimation) uses the same event maps as input.

Architecture
An illustrative diagram of the baseline chipsnet CNN architecture is shown in Figure 3. Based on the Visual Geometry Group (VGG) network [41] there are a few key differences from the literature defined network: • Each of the three event maps: hit-charge, hit-time, and hough-height, are initially fed into three separate branches. Each branch contains two VGG blocks with two convolutional (conv) layers each (four convolutional layers in total). The outputs from each branch are merged using a concatenation layer before being fed to the rest of the network. This configuration allows for event map specific features to be learnt independently before combined features are learnt by the rest of the network.
• Batch normalisation [42] is included before the activation (ReLU) function for every convolutional layer.
• Squeeze-and-excitation units [43], are included after the max-pooling operation in all VGG blocks. These units introduce extra parameters to model the interdependencies between output feature maps, allowing the network to learn how to weight each feature map effectively.
• Dropout is included at the end of each VGG block as well as after the final fully-connected layer. Instead of dropping individual kernel elements, the dropout within the VGG blocks drops entire kernels at each training iteration, this is commonly called two-dimensional spatial dropout. The dropout after the fully-connected layers is standard, in that it drops out individual fully-connected neurons.
• Five parameters from the estimation of the interaction vertex position (seed pars) are concatenated with the flattened layer before the fullyconnected layers. Included are the three components of the estimated interaction vertex position (s x , s y , s z ), and the two components of the estimated track direction (s θ , s φ ). These parameters provide the net-  Figure 3: Illustrative diagram of the baseline chipsnet architecture. The three input event maps are separately passed through two VGG blocks each before their outputs are combined and passed through a further three VGG blocks together. The flattened VGG blocks outputs are then concatenated with five seed parameters (seed pars) and passed through two fully-connected (FC) layers of 512 neurons each before the output layer. Both the number of convolutional units (1st value) and kernels (2nd value) is shown for each block. The detailed VGG block structure is shown within the grey box. The circular yellow R and Bn indicate the use of the ReLU activation function and batch normalisation, respectively.
work with spatial context as to where the input event maps have been generated in the detector and the dominant direction of PMT activity.
The chipsnet baseline architecture is implemented using the Keras API built into Tensorflow [44]. Each network form (cosmic rejection, beam classification, and energy estimation) uses the same baseline model architecture.

Outputs
Many CNN applications are found to benefit from learning multiple tasks simultaneously. This is believed to be because training with multiple tasks tends to return a network with an improved generalised representation of the inputs, with features learnt for one task improving the performance of another. Additionally, multiple tasks work to prevent any one output from overfitting. Commonly named multi-task learning, this methodology is used within this work.
To train a network with multiple tasks (outputs), a loss function E tot , must be defined to combine the individual loss functions for each task E i . The simplest way to do this is via a linear weighted sum, such that where N is the number of tasks and w i are the associated weights. In this work this is referred to as the simple multi-task loss. The final network performance can strongly depend on the relative weighting between loss functions, especially when the values returned by each differ by many order of magnitude (common when combining regression and classification tasks). Therefore, finding the optimal w i weights can be both difficult and time-consuming. Another approach outlined in reference [45] remedies this problem by learning the optimal weighting between loss functions. This is done by introducing an additional trainable parameter σ i , for each loss function, such that In this work we refer to this as the learnt multi-task loss. The specific number and nature of outputs for the specific network forms are detailed below. Although physically motivated to some degree, the exact set of tasks and the loss combination technique used is mainly driven by extensive trial-and-error.

Outputs -Cosmic Rejection
Alongside the primary task of classification between beam-like and cosmiclike events, training the network to also separate events where the primary charged lepton escapes the detector volume or not, is found to improve cosmic rejection performance. As a large proportion of cosmic muons are relatively high in energy and, therefore, escape the detector in this fashion, there is motivation as to why this additional task is helpful. Hence the two outputs for this network form are: 1. Cosmic score (1 classification neuron): Returns a score between zero and one corresponding to whether the event is beam or cosmic like. 2. Escapes score (1 classification neuron): Returns a score between zero and one corresponding to whether the charged lepton in an event is contained or escapes the detector. NC beam events without a charged lepton are masked (do not contribute to the loss) during training for this output.
Both outputs are trained using a binary cross-entropy loss function and combined using the simple multi-task loss in Equation 1, each with a weight of 1.

Outputs -Beam Classification
Alongside the primary task of classification between CC ν e , CC ν µ , and background NC events, training the network on additional classification and particle counting tasks is found to improve performance. Note that the particle counting tasks are not used in this work for anything but to increase the primary classification performance. However, future work could exploit any ability to separate exclusive final states, deduced from these particle counts, to improve energy resolution and systematic errors. Hence the nine outputs for this network form are: 1. Combined category (3 classification neurons): Returns a classification probability score between zero and one for each of CC ν e , CC ν µ , and NC (summing to one). probability score between zero and one for each of 0, 1, 2, and 3+ muons in the final state (summing to one). 6. Proton count (4 classification neurons): Returns a classification probability score between zero and one for each of 0, 1, 2, and 3+ protons in the final state (summing to one). 7. π ± count (4 classification neurons): Returns a classification probability score between zero and one for each of 0, 1, 2, and 3+ charged pions in the final state (summing to one). 8. π 0 count (4 classification neurons): Returns a classification probability score between zero and one for each of 0, 1, 2, and 3+ neutral pions in the final state (summing to one). 9. Photon count (4 classification neurons): Returns a classification probability score between zero and one for each of 0, 1, 2, and 3+ photons in the final state (summing to one).
All outputs are trained using a categorical cross-entropy loss function and combined using the simple multi-task loss in Equation 1, each with a weight of 1.

Outputs -Energy estimation
Alongside the primary task of estimating the neutrino energy, training the network to additionally estimate the primary charged lepton energy and the interaction vertex position and time are found to improve performance. Although this improvement is relatively small for neutrino energy estimation, it dramatically improves primary charged lepton energy estimation. With two energy tasks, the network is encouraged to learn how the primary charged lepton and neutrino energies are related. As the interaction vertex position within the detector and hence distance from the wall can impact the number of deposited photoelectrons, there is motivation as to why this additional task is also helpful. Hence the six outputs for this network form are: All outputs are trained using a mean-squared error loss function and combined using the learnt multi-task loss in Equation 2 which includes the additional trainable parameters.

Training
All networks are trained on an 18 core CPU (36 thread) machine equipped with four NVIDIA GeForce RTX 2080 graphics processing units (GPUs). The Tensorflow dataset API is used to create an efficient input data pipeline where data is loaded on-the-fly at training time. This procedure ensures all CPU threads are utilised loading, decoding, and preprocessing data for the primary GPU based network calculations before being needed, maximising computational efficiency.
During preprocessing, all 8-bit input event map values are converted to 32-bit float values bounded between zero and one. A random factor scaling is applied to each map bin, generated from a normal distribution centred on one with a standard deviation of σ r . By fluctuating the bin values the network is forced to focus less on the absolute bin values and more on the underlying event topology. This process provides valuable regularisation to reduce overfitting and makes the trained networks robust to small changes within the input.
A minibatch training strategy of minibatch size of n b , using the Adam optimiser [46] (β 1 = 0.9, β 2 = 0.999, and = 10e − 7) is used. The exact training sample size and composition for each specific network are given in below, but for all network forms a 95% training to 5% validation data split is employed across the full training sample. The learning rate for each epoch η e , is set to decrease throughout training according to where η 0 is the initial learning rate, e is the epoch number (starting at one), and c d is the learning rate decay coefficient. Early stopping is also used to stop training once the network form specific performance metric has plateaued. When training each network, there is a list of tunable hyperparameters, all of which are optimised using the SHERPA hyperparameter tuning framework [47]. To maximise performance, SHERPA uses a random search algorithm to select random configurations of hyperparameters which are then tested by training the network for five epochs on the available training data. Each configuration's performance is assessed by using a metric detailed for each network below. The search space is confined to a specific range or selection of choices for each hyperparameter, with: • the initial learning rate η 0 , in a range from 0.00005 to 0.001; • the learning rate decay coefficient c d , in a range from 0.2 to 0.8; • the dropout probability p d , in a range from 0.0 to 0.5; • the random scaling size σ r , in a range from 0.0 to 0.1; and • the minibatch size n b , choosing from either 32, 64, 128, or 256;

Training -Cosmic Rejection
The cosmic rejection network form is trained on a sample of 3.15 million simulated events produced using the detector simulation and event generation methods outlined in Section 2. Roughly 1/3 rd are ν µ beam events, 1/3 rd ν e beam events, and 1/3 rd cosmic muon events, the counts of which are shown in Figure 4. All beam events (both ν µ and ν e ) are generated using the expected unoscillated Chips-5 ν µ energy spectrum to closely mimic the dominant ν µ beam component and appeared ν e signal.
The network is allowed to train for up to 30 epochs using the SHERPA optimised hyperparameters: η 0 = 0.00005, c d = 0.7, p d = 0.1, σ r = 0.02, and n b = 128. The cosmic score accuracy metric, defined as the fraction of validation sample events that are correctly classified when a cut value of 0.5 on the cosmic score output is used to determine the classification of each event and is used for SHERPA optimisation and early stopping.

Training -Beam Classification
The beam classification network form is trained on a 1.67 million event subset of the training sample used for the cosmic rejection network form, excluding the cosmic muon events. The event counts are again shown in Figure 4.
The network is allowed to train for up to 30 epochs using the SHERPA optimised hyperparameters: η 0 = 0.0002, c d = 0.5, p d = 0.1, σ r = 0.02, and n b = 128. The combined category accuracy metric, defined as the fraction of validation sample events that are correctly classified when the highest scoring combined category output neuron is used to determine the classification of each event and is used for SHERPA optimisation and early stopping.

Training -Energy estimation
Accurate neutrino energy estimation is accomplished using multiple energy estimation networks trained on separate samples of ν e and ν µ events across multiple CC interaction types. It is found that separation such as this results in greater performance compared to if a single energy estimation network form or even separate ν e and ν µ network forms are trained. This is expected as a single set of network weights is unlikely to capture the specific topological features that contribute to the energy for all types of event.
Separate energy estimation network forms are trained for each of CC-QEL (and CC-MEC), CC-Res, and CC-DIS for both ν e and ν µ events (6 in total) using 250000 corresponding simulated events each. All events (both ν µ and ν e ) are produced using the detector simulation and event generation methods outlined in Section 2. Only events for which the primary charged lepton is fully contained within the detector volume are used for training. Note that CC-QEL and CC-MEC energy estimation is combined into a single network as both have incredibly similar final state topologies (a single charged lepton).
Each energy estimation network is allowed to train for up to 30 epochs using the SHERPA optimised hyperparameters: η 0 = 0.0002, c d = 0.5, p d = 0.1, σ r = 0.0, and n b = 128. The neutrino energy mean absolute error metric, defined as the average difference between the true and estimated neutrino energies across all validation sample events is used for SHERPA optimisation and early stopping.

Results
Before discussing the results, a hugely impactful advantage of the CNN approach must be highlighted. Although the time taken to train the CNNs in this work is approximately two days, once trained, the time required to calculate all network outputs (inference time) for a single event is on the order of 2 ms. When combined with event seeding and event map generation, the total time taken to reconstruct and classify a raw event fully is less than 0.1 seconds. Compared to the ∼15 minutes required for each event using the standard reconstruction and classification methods, the difference is stark.
Armed with this incredible speed, the time taken to fully process a large dataset containing millions of events becomes a matter of hours, compared to the many weeks typically required. This change has far-reaching implications for how physics analysis is conducted. By removing the processing bottleneck, larger datasets can be used without worry, new techniques can be tested quickly, and overall analysis turnover increased.

Evaluation sample
An independent sample of events is used to evaluate the combined performance of the trained CNNs. The evaluation sample consists of 400000 beam and 350000 cosmic muon events produced in the same way as the training and validation events, using the detector simulation and event generation methods outlined in Section 2. The beam events include the expected ν µ , ν µ , ν e andν e components as well as events generated to mimic the appeared ν e component. Only the neutrino mode (forward horn current) of NuMI beam operation is considered here. During the evaluation, the beam's intrinsic neutrino and antineutrino components are considered the same for simplicity.
All evaluation events are weighted to match the expected spectrum at the Chips-5 detector using the flux, cross-sections, and oscillation probabilities (derived from the NuFIT oscillation parameters [48], assuming the normal hierarchy, and including matter effects). Additional weighting also scales the sample to match data taking in the NuMI beam for a single year, corresponding to 6 × 10 20 protons on target (POT). Cosmic muon events are weighted according to a 11.8 kHz expected Chips-5 rate over a full year. The final weighted spectrum of evaluation events is shown in Figure 5.

Preselection cuts
Separate from the CNN driven work, a simple preselection is applied to all evaluation sample events. Designed to reject cosmic and NC events while keeping the selection efficiency of CC beam events high, the preselection consists of four simple cuts, shown in Figure 6. Firstly, the total number of collected photoelectrons (charge) across all PMTs in the event must be greater than 250. Secondly, the maximum Hough transform space height must be greater than 250 photoelectrons 1 . Thirdly, the seeding procedure cos(θ) direction must be between ±0.7. Finally, the seeding procedure φ direction must be between ±1.1 radians. The first two cuts reject low energy events, typically NC, while the last two reject events whose activity is not along the beamline, typically cosmic. Beam events are weighted by combining the expected unoscillated flux with cross-sections and standard oscillation probabilities, while cosmic events are weighted using the expected cosmic rate. Shown in blue, green, and olive are the surviving CC ν µ , appearing CC ν e , and intrinsic beam CC ν e spectra respectively, binned in terms of their neutrino energy. Shown in red is the NC event spectra, binned in terms of the energy of the hadronic component (excluding the outgoing neutrino energy) to represent more accurately the energy visible to the detector. Finally, shown in black is the cosmic muon event spectra binned in terms of the muon energy. Seed φ direction (radians) A score close to one signifies a cosmic-like event, while a score close to zero corresponds to a beam-like event. Only preselected events are shown to better highlight the events this output aims to classify.

Cosmic rejection and containment
The cosmic score output from the trained cosmic rejection network form shows an excellent separation between beam (output close to zero) and cosmic (output close to one) events, as can be seen in Figure 7. Notably, 99.2% of beam events are associated with a score < 0.0001. Given this fact, a cosmic score of below 0.0001 is chosen to select beam-like events. Out of the total 350000 cosmic events in the evaluation sample, all are rejected by this cut alongside preselection.
It is crucial for accurate neutrino energy estimation that the activity of an event is fully contained within the volume of the detector. If charged event particles instead leave the detector and emit Cherenkov radiation not captured by PMTs, estimating the resulting missing energy and, hence, neutrino energy can be incredibly difficult. Within the Chips-5 detector, this is particularly important for long track CC ν µ events for which only 44% of the primary charged muons are fully contained within the detector volume. Therefore, the second output from the cosmic rejection network, escapes score is also used to select events. Although this output only considers the primary charged lepton, instead of all event particles, it still acts as a reasonable proxy for event containment. The distribution of escapes score output values for each event category is shown in Figure 8. An escapes score value below 0.33 is chosen to select events for which the primary charged lepton is deemed to be fully contained within the detector. This cut value is chosen to maximise the fraction of CC ν µ events which are correctly classified as having their primary charged lepton contained or not within the detector, leading to 96.8 ± 0.1% of CC ν µ events being classified correctly. As expected, the vast majority (97.2 ± 0.1%) of the short track CC ν e and NC events are selected by this cut.
The total number of expected events per year that pass each successive cut (including preselection) for each event category is shown in Table 1. Both CC ν e categories are selected with an efficiency greater than 92%, relative to the total number of expected events, while CC ν µ events have a 38.9 ± 0.1% selection efficiency, mainly due to the escapes score cut (as expected). NC events are found to be primarily rejected by the preselection, while cosmic events are heavily rejected by both the preselection and cosmic score cuts.
An upper limit of < 6 ± 0.6 cosmic muon events per year passing all the cuts is determined, showing that the cosmic muon rejection works well. Crucially, of all the true cosmic muon events with a cosmic score less than 0.9, none would be classified as signal CC ν e events by the beam classification detailed in Section 4.4. Therefore, the expected cosmic muon contamination  The total number of expected (weighted) events and the number that pass successive selection cuts for the different event categories. The preselection, cosmic score cut, and escapes score cut numbers are shown. The selection efficiency relative to the total number of events after all the cuts have been applied is also shown for each event category. As none out of the 350000 cosmic events in the evaluation sample are selected after the cosmic score cut, an upper limit on the values is instead given.
of both beam selections (CC ν e and CC ν µ ) is expected to be negligible relative to the selected number of signal events for each and ignored for the rest of this evaluation.

CC ν e selection
The distribution of CC ν e scores from the trained beam classification network form for the different event categories are shown in Figure 9. A strong separation between appeared CC ν e signal and both CC ν µ and NC background events is achieved. As no attempt is made to separate the appeared CC ν e signal component from the intrinsic beam CC ν e background, both are clustered with scores close to one as expected.
To assess the classification performance more rigorously, a selection score for each of the output categories, found by maximising a figure-of-merit (FOM), is calculated. All events with a score above this optimised value are then deemed signal. To minimise the expected measurement statistical error, the value of efficiency×purity (proportional to the square of s/ √ s + b) is optimised as the FOM [49].
The efficiency, purity, and their product (the FOM-ν e ) for CC ν e events (both appeared and beam) as a function of selecting events above a certain CC ν e score are shown in Figure 10. The FOM-ν e is optimised by selecting events with a CC ν e score above 0.8, achieving a value of 0.519 ± 0.004. Note that the FOM-ν e is optimised considering both the appeared and beam CC ν e components as signal due to their indistinguishable nature.
The total number of events are shown in Table 2 with event category alongside the corresponding selection efficiencies and appeared CC ν e signal and combined CC ν e purities. The purities are defined as the fraction of events within the selection that are true signal events. The final FOM-ν e selected appeared CC ν e signal purity of 38.3±0.3% may appear low, but this is mainly due to the indistinguishable intrinsic beam CC ν e contamination, note that when both CC ν e components are considered signal, the selection purity is 70.9 ± 0.6%. The FOM-ν e optimised CC ν e selection efficiency, relative to the total number of events as a function of energy for the different event categories, is shown in Figure 11 alongside the signal purity. From low neutrino energies, both CC ν e category selection efficiencies rise to a plateau of approximately 80% beginning at 4 GeV. This is expected as low energy CC ν e events have less well-defined electron Cherenkov rings, leading to their rejection. Problematically, this turn-on behaviour cuts into the true appeared CC ν e distribution, especially around the 1.5 GeV oscillation maximum. Future work should explore whether a CC ν e selection cut that varies with energy can lead to a greater proportion of these low energy events being selected.
Due to the abundance of selected intrinsic beam CC ν e events at higher energies, the appeared CC ν e purity is observed to peak at approximately 2.5 GeV (reasonably close to the oscillation maximum) before declining. Im-    Table showing CC ν e selected event numbers and corresponding efficiencies for the various event categories as well as associated purities. Shown are the total event numbers, those after the preselection, cosmic score cut, and escapes score cut (Cuts), in addition to the numbers after the FOM-ν e optimised selection with the efficiencies relative to the total number of events shown for both. Both the appeared signal CC ν e purity and the joint appeared and beam CC ν e purity are shown for each selection. portantly, within the key signal region from 2 to 4 GeV, the appeared CC ν e purity is > 55%, larger than the 38.3 ± 0.3% across the full evaluation sample. The NC efficiency is seen to slowly increase, approaching 15% for hadronic component energies above 5 GeV; this is likely due to misidentification of high energy pions or protons as electrons. Crucially, however, within the key signal region, NC selection efficiency remains low. The best way to understand the relative performance of the CNN CC ν e classification is by comparison with the standard event selection, outlined in detail in references [50] and [51]. A maximum efficiency × purity of 0.132 ± 0.005 is achieved; only ∼ 25% the value reached by the CNN approach. Both the combined appeared and beam CC ν e efficiency of 34.7±0.8% compared to 73.4±0.2% and purity of 39.3±1.2% compared to 70.9±0.6% are considerably lower than that provided by the new CNN classification.
Furthermore, the new appeared CC ν e signal efficiency of 70.8 ± 0.2% compares well to the 62% and 64% achieved by the NOvA and T2K CC ν e selections, respectively. However, purity is significantly lower at 38.3 ± 0.3% compared to the 78% and 80% reached by NOvA and T2K [52,53]. A large proportion of this difference can be explained by the lower neutrino energies and greater off-axis angles at which these experiments operate. Not only does this increase the proportion of easy to identify CC-QEL events, but it also reduces the indistinguishable beam CC ν e contamination.

CC ν µ selection
The distribution of CC ν µ scores from the trained beam classification network form for the different event categories are shown in Figure 12. Excellent separation between appeared CC ν µ signal and both CC ν e components and NC background is achieved. For high CC ν µ scores (close to one) the difference between signal and background rates is approximately three orders of magnitude.
The efficiency, purity, and their product (the FOM-ν µ ) for CC ν µ events as a function of selecting events above a certain CC ν µ score are shown in Figure 13. FOM-ν µ is optimised by selecting events with a CC ν µ score above 0.315, achieving a value of 0.365 ± 0.002. The total number of events, those selected by the previously mentioned cuts, and those furthermore selected by the FOM-ν µ optimised selection are shown in Table 3 for each event category alongside the corresponding selection efficiencies and CC ν µ signal purity.
The signal efficiency of 37.0 ± 0.1% compares well to the 31% and 36% achieved by the NOvA and T2K CC ν µ selections, respectively [52,53]. This is also the case for the signal purity of 96.0 ± 0.1% compared to the 98.6% and 94% purities of the NOvA and T2K selections. Although the final signal efficiency is low, this is desirable to ensure events are fully contained for energy estimation. When considering just those CC ν µ events for which the primary charged muon is contained within the detector volume at the truth level, an 87.5 ± 0.1% selection efficiency is achieved. The FOM-ν µ optimised CC ν µ selection efficiency, relative to the total number of events as a function of energy for the different event categories, is shown in Figure 14. Survived CC ν µ selection efficiency peaks at just below 2 GeV before slowly declining, this is explained by higher energy events being less likely to have their primary charged muon fully contained within the detector. Of interest is the expected dip in the otherwise very high (> 90%) CC ν µ purity at approximately 1.5 GeV, corresponding to approximately the muon neutrino oscillation maximum. As in the CC ν e selection case, the NC efficiency is seen to rise with energy, again likely due to misidentification of energetic protons and pions.

Interaction type classification
Using the CC category output of the trained beam classification network form, the CC interaction type for both CC ν e and CC ν µ selected events  Table 3: Table showing CC ν µ selected event numbers and corresponding efficiencies for the various event categories as well as the associated signal purity. Shown are the total event numbers, those after the preselection, cosmic score cut, and escapes score cut (Cuts), in addition to the numbers after the FOM-ν µ optimised selection with the efficiencies relative to the total number of events shown for both. The CC ν µ signal purity is also shown for each selection.  can be determined. As in the combined category output case, the highestscoring neuron can be used for classification, resulting in the matrix shown in Figure 15. Note that only events which are selected by either the CC ν e or CC ν µ selection are shown. Reasonable classification accuracy greater than 60% is achieved across the three dominant interaction types, CC-QEL, CC-Res, and CC-DIS. The less common CC-Coh and CC-MEC types are found to be commonly misidentified as CC-Res and CC-QEL respectively, likely due to the imbalanced training dataset and their corresponding topological similarities. Background NC events which pass either CC ν e or CC ν µ selection (commonly high in energy) are found to be typically classified as CC-DIS; this is expected as they commonly contain multiple energetic particles in the final state.

Energy estimation
By using the CC interaction type classification just described, the differences between CC interaction types can be exploited to improve neutrino energy, charged lepton energy, and interaction vertex position and time estimation. For events classified as either CC ν e or CC ν µ with an associated CC category interaction type, the corresponding bespoke energy estimation network form outlined in Section 3.4.3 is used for estimation.
Only three networks for each neutrino type are trained, one for each of the dominant interaction types CC-QEL (and CC-MEC), CC-Res, and CC-DIS. For events not classified by the CC category output as one of these categories, such as CC-Coh or CC-Other, the CC-Res network is used as it is the most topologically similar interaction type.
The distributions of CNN estimated neutrino energy output and true ν e and ν µ neutrino energies for true CC ν e and CC ν µ events respectively that are also selected by their corresponding CC selection are shown in Figure 16. The CNN estimated distributions match the truth well across the full range of neutrino energies expected within Chips-5, except in the peak regions where the truth distribution shape is not fully captured.
To fully understand CNN energy estimation performance, histograms of ratios of fractional differences between CNN estimated (reco) and true neutrino energy for both true CC ν e and CC ν µ beam events that are also selected    Figure 17 and Figure 18.
by their corresponding CC selection are shown in Figure 17. Similar distributions splitting the signal components by interaction type are shown in Figure 18. Furthermore, the Full Width Half Maximum (FWHM) neutrino energy resolutions derived from these plots for both CC ν e and CC ν µ events are shown in Table 4. The interaction type FWHM values follow the expected pattern, with the simple to reconstruct single charged lepton QEL interactions achieving a smaller value than multi-particle DIS events. Furthermore, when the approximate resolution is derived from the FWHM values 2 the resolutions of 10.2±0.2% and 12.5±0.2% for CC ν e and CC ν µ respectively are comparable 2 The approximate resolution, given by the standard deviation σ is found by dividing the FWHM by 2 √ 2 ln 2 ≈ 2.355. Events (arb.) QEL Res DIS Figure 18: Distributions of (reco-true)/true neutrino energies for both selected CC ν e (left) and CC ν µ (right) signal beam events by interaction type. The relative number of events between interaction types has been scaled for clearer comparison.
to the resolutions obtained by NOvA of 10.7% and 9.1% [52]. As in the CC ν e selection case, the best way to understand the relative performance of the energy estimation is by comparison with the standard Chips reconstruction outlined in detail in references [50] and [51]. Although the standard reconstruction does not attempt to estimate the neutrino energy, the energy of the primary charged lepton in each CC event is predicted. The value can be compared to the charged lepton energy output of the energy estimation CNNs. Histograms of ratios of differences between CNN estimated (reco) and true charged lepton energy to true charged lepton energy for both CC ν e and CC ν µ beam QEL events are shown in Figure 19.
A significant improvement is made using the new CNN approach. FWHM lepton energy values 30% and 39% the size of the standard reconstruction values for CC ν e and CC ν µ QEL events respectively is achieved, at 10.0 ± 0.1% and 9.0 ± 0.1%, in their fractional percentage form. When the approximate resolution is found from the CC ν e FWHM value (4.2 ± 0.1%), it compares well to the ∼ 2.5% CC QEL charged lepton energy resolution reached by the Super-Kamiokande fiTQun algorithm [54]. Impressive, given the significant differences in detector design.

Explainability
A common and justified concern with CNNs is their tendency to be used as a black box (inputs in, outputs out) with no understanding of their inner working. For detailed physics analyses, this can have significant confidence implications for the final results. Although difficult quantitatively, qualitative assessments of the trained networks can go a long way to prove they behave as desired.
Using an example ν µ event, visualisations of the output feature maps from the first, second, and third VGG blocks for the trained beam classification network form are shown in Figure 20. Learnt Cherenkov ring features are observed: ring edges, ring holes, outlying hits, Hough peaks, and a myriad of combinations are seen.
Another technique to analyse trained CNNs is t-Distributed Stochastic Neighbour Embedding (t-SNE) [55]. The t-SNE procedure is an unsupervised learning algorithm to visualise the learnt high-dimensional feature-space of a trained network in a lower number of dimensions. It accomplishes this by clustering events with similar features nearby in two-dimensional space and separating events with dissimilar features. Here, the outputs from the last fully connected layer before the output layer (with 512 dimensions) are used as input, as they provide the final representation of the learnt network  Figure 21: Two dimensional probability space of different beam events generated using the t-SNE procedure on the final fully-connected layer of the trained beam classification network. Three events, one highly CC ν e like, one highly CC ν µ like, and one highly NC like are highlighted and shown in Figure 22.

features.
For the beam classification network form, three events, labelled in the t-SNE space of Figure 21, are shown in Figure 22. Each event is highly representative of its class, achieving a high respective combined category score. Both the CC ν e and NC events are typical of that expected. However, the CC ν µ event contains a primary charged lepton that escapes the detector volume, identified by the central peak. This topology suggests that strongly classified CC ν µ events can be identified by this 'escaping' feature rather than the shape of the muon ring. Future work, therefore, should explore using only fully contained events during beam classification training.

Conclusions and future work
This paper presented a neutrino event characterisation methodology for the Chips water Cherenkov neutrino detector R&D project. Chips puts forward a novel water Cherenkov based concept to counter the vast expense, increased complexity, and long construction time expected from future longbaseline neutrino oscillation experiments. As is the case within the world around us, future Chips concept detectors will also make ever-increasing use of modern machine learning techniques. This comes as a direct result of the dramatic improvement in Chips-5 reconstruction and classification performance brought about by the work presented in this paper. Three forms of a baseline Convolutional Neural Network have been trained to reject cosmic muon events, classify beam events, and estimate neutrino energies, all using only the raw detector event as input. This new approach replaces the standard likelihood-based reconstruction and simple neural network classification, whilst greatly increasing generalisability and processing speed.
The new CNN-based approach is found to provide excellent performance in selecting an efficient and pure appeared CC ν e sample for which the neutrino energy can be accurately determined. In some cases, the performance is comparable with similar experiments, which is impressive given the significant differences in detector design. The vast cosmic muon background of Chips-5 is found to be accepted by only a factor of < 2.9 ± 0.3 × 10 −6 (without the help of a veto), equivalent to < 6 ± 0.6 cosmic muon events contaminating the beam sample per year, of which none are expected to be classified as CC ν e events.
Furthermore, the key performance metrics for both the CC ν e and CC ν µ beam selections are summarised in Table 5. Not only are the trained Selection Signal Efficiency Signal Purity ∼ ν Energy Resolution CC ν e 73.4 ± 0.2% 70.9 ± 0.6% 10.2 ± 0.2% CC ν µ 37.0 ± 0.1% 96.0 ± 0.1% 12.5 ± 0.2% Table 5: The key performance metrics for both the CC ν e and CC ν µ beam selections using the new CNN-based approach. The signal efficiency relative to the total number of expected events, the signal purity defined as the fraction of selected events which are signal, and the approximate signal neutrino energy resolution. The values considering both the appeared and beam CC ν e components as signal are given for the CC ν e selection.
CNNs found to provide excellent performance, but some insight into their inner workings was achieved. Cherenkov ring and Hough peak features are extracted from the input images, resulting in a learnt representation of the inputs seen to have strong discriminating power between categories when visualised using the t-SNE technique.
It is sincerely hoped that other water Cherenkov neutrino experiments will take inspiration from and then build upon the work presented in this paper for their own Convolutional Neural Network implementations. Although the results presented in this work are somewhat compelling, there are still clear avenues for exploration and improvement. These are all principally related to the critical performance drivers outlined within this paper.
Firstly, generating the input event maps to focus on the underlying Cherenkov profiles is incredibly important; therefore, any methodology to remove distortions further or more accurately determine the interaction vertex position will be beneficial. Secondly, the distribution of events (in energy or type) used within the training sample heavily impacts performance; thus, a comprehensive study of this behaviour could optimise the sample used. Finally, multi-task learning clearly shows promise, with further trial-and-error or a more generalised approach likely to uncover additional valuable tasks.