Configurable calorimeter simulation for AI applications

A configurable calorimeter simulation for AI (COCOA) applications is presented, based on the Geant4 toolkit and interfaced with the Pythia event generator. This open-source project aims to support the development of machine learning algorithms in high energy physics that rely on realistic particle shower descriptions, such as reconstruction, fast simulation, and low-level analysis. Specifications such as the granularity and material of its nearly hermetic geometry are user-configurable. The tool is supplemented with simple event processing including topological clustering, jet algorithms, and a nearest-neighbors graph construction. Formatting is also provided to visualise events using the Phoenix event display software.


Introduction
Algorithms incorporating machine learning (ML) methods are a new paradigm in the reconstruction, calibration, identification and analysis of High Energy Physics (HEP) experimental data. In recent years, various ML architectures have been deployed to optimize low-level tasks such as clustering, reconstruction, fast simulation, pileup suppression and object identification [1,2]. For example, ML-based fast calorimeter simulation relies on accurate target data to train a fast conditional generative model p(D|T), where T denotes the true set of stable final-state particles produced in the collision and D is the set of resulting detector hits. In particle reconstruction, on the other hand, the inverse process D → T is modelled by predicting a set of particles R(D) that approximates T as accurately as possible. The development of such algorithms requires a realistic, highly granular simulation of the particle detector response, going beyond the parameterized detector models frequently used in studies of particle physics phenomenology, such as DELPHES [3]. In particular, due to the complexity of particle showers in calorimeters, a detailed, microscopic simulation of the interactions between particles and detector material is needed in order to develop low-level ML algorithms exploiting such features.
a e-mail: sanmay.ganguly@cern.ch; b e-mail: patrick.rieck@cern.ch; c e-mail: etienne.dreyer@weizmann.ac.il
Recent research efforts to study calorimeter shower properties using ML [4][5][6] made use of the GEANT4 [7] simulation toolkit for simple detector geometries. However, an open-source simulation with a realistic, cylindrical, hermetic detector geometry has yet to be adopted by the HEP community for ML studies and beyond. Aiming to bridge this gap, the COnfigurable CalOrimeter simulatiOn for Ai (COCOA) was developed, which uses GEANT4 [7] to implement detailed shower simulation for particles in a full-coverage, highly segmented sensitive volume comparable to that of multipurpose detectors at the LHC. The program source code [8] is linked together with technical documentation on the project website. The emphasis of this software package is on realistic calorimeter simulation. No realistic digitization or electronic readout is implemented, and energy loss due to these processes is neglected in this package. For the same reason, only simplified tracking is included in COCOA, to model particle deflection in a magnetic field and energy depositions upstream of the calorimeter. A sophisticated open model suitable for tracking studies based on silicon hits is provided by [9]. Usability for ML-based studies is a core motivation in the design of the COCOA code. Datasets generated by COCOA have featured in two recent applications of ML to particle reconstruction and fast simulation [10,11]. To this end, the main parameters of the calorimeters are largely configurable, including their material, granularity, depth and the amount of readout noise. Similarly, the inclusion of material interactions in the tracking region is optional. For comparisons with benchmark reconstruction approaches, output data from COCOA are conveniently interfaced to standard topological clustering and jet clustering algorithms. The output includes a record of the energy contributions to each cell by truth particles, for supervising cell-level predictions, and edge lists connecting cells and tracks in a graph, to support geometric deep learning models. Finally, the default geometry has been formatted for rendering in the Phoenix event display software [12], along with a script to export event output files for visualization. An example is shown in Fig. 1.
The sophisticated COCOA calorimeter simulation and its data post-processing provide users with easy access to datasets suitable for training models for current collider experiments or for more general algorithm development and benchmarking. In addition, the open-source nature of the package and its visualization support have the potential for use cases in education and science communication in HEP.

Detector design
The major components of COCOA are an inner tracking system (ITS) surrounded by an electromagnetic calorimeter (ECAL) and finally a hadronic calorimeter (HCAL). These subsystems are arranged concentrically and are symmetric in azimuthal angle −π < φ ≤ π, as shown in Fig. 2. No muon spectrometer is considered in this design, and muons are reconstructed as tracks with the ITS. The goal of this design is to accurately model the relevant outputs of a multipurpose detector at the LHC while being simplified by the exclusion of detailed components like readout electronics, cabling, and support structures. The detector design is largely configurable, with its default parameter values chosen to achieve response characteristics comparable to those of the ATLAS detector. A detailed description of each subsystem follows.
The ITS consists of hollow cylinders in the central detector part and disks at both of its ends, each of which is centered around the beamline. Each of these components consists of a silicon layer with a thickness of 150 µm in the case of the disks and the five innermost cylinders, and 320 µm in the case of the four outermost cylinders. Each silicon layer is accompanied by an iron layer of 350 µm thickness in order to simulate support material. The ITS only serves the purpose of simulating the interaction of particles with matter upstream of the calorimeter. The resulting detector hits are not used for tracking purposes. The default value of the magnetic flux density present in the ITS amounts to 3.8 T. Finally, two layers of iron totalling 4.4 cm in depth are added to represent support or cryostat material in front of the calorimeter.
The inner surface of the calorimeter system is a cylinder with a radius of 150 cm and a length of 6387.8 cm immediately enclosing the iron layers and the ITS. The calorimeters are separated into a central barrel region covering the pseudorapidity range |η| < 1.5 and two end-cap regions extending the coverage up to |η| = 3 by default. Both the ECAL and the HCAL are divided into 3 concentric layers, with each layer being further segmented into cells with edges of constant η and φ. The cell granularity for each layer is configurable by setting the number of equal divisions in η and (separately) φ. The depth of the cells in every layer is designed to be nearly constant in η to ensure that the fraction of a particle's energy deposited in each layer does not depend on the incident angle. This design, leading to layer shapes of the form 1/cosh η, provides a uniform calorimeter thickness as a function of pseudorapidity. COCOA will thus have a more uniform response than a purely circular cylindrical shape.
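The segmentation described above can be illustrated with a minimal sketch. It assumes equal divisions in η and φ over a barrel range of |η| < 1.5; the function names and the half-open φ interval are illustrative choices and are not taken from the COCOA source code. The last helper shows why a radial thickness proportional to 1/cosh η gives a constant depth along the particle direction: a straight track at pseudorapidity η crosses a shell of radial thickness t over a path length t·cosh η.

```python
import math

def eta_to_theta(eta):
    """Polar angle in radians, from the definition eta = -ln(tan(theta / 2))."""
    return 2.0 * math.atan(math.exp(-eta))

def cell_index(eta, phi, n_eta, n_phi, eta_min=-1.5, eta_max=1.5):
    """Return (i_eta, i_phi) for equal divisions in eta and phi.

    Returns None if eta lies outside [eta_min, eta_max).
    The binning convention here is an assumption for illustration only.
    """
    if not (eta_min <= eta < eta_max):
        return None
    # Wrap phi into [-pi, pi) so that any input angle maps to a valid bin.
    phi = (phi + math.pi) % (2.0 * math.pi) - math.pi
    i_eta = int((eta - eta_min) / (eta_max - eta_min) * n_eta)
    i_phi = int((phi + math.pi) / (2.0 * math.pi) * n_phi)
    return i_eta, min(i_phi, n_phi - 1)

def radial_thickness(depth_along_track, eta):
    """Radial thickness giving a fixed depth along a straight track at eta."""
    return depth_along_track / math.cosh(eta)
```

For example, `cell_index(0.0, 0.0, 256, 256)` places a particle at the centre of the barrel in the middle bin of both coordinates.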
The COCOA calorimeter material is a compound using an equivalent molecule approximation, mixing an absorber and a scintillator material in a constant proportion. Both the materials and their proportion can be configured for the ECAL and the HCAL individually. By default, the ECAL is made of a mixture of lead and liquid argon, corresponding to the ATLAS ECAL materials. The volume proportion amounts to 1:3.83, resulting in a radiation length of X_0 = 2.5 cm. The ECAL and HCAL are separated by an iron layer with a default thickness of 80 mm. The HCAL is made of a mixture of iron and polyvinyl toluene plastic material with a volume proportion of 1.1:1.0, resulting in a nuclear interaction length of λ_int = 26.6 cm. The integrated radiation [...]
While this calorimeter design represents a homogeneous detector, a spread in the resolution of reconstructed energies in accordance with a sampling calorimeter design is emulated by means of configurable sampling fraction parameters for the ECAL and the HCAL individually. In lieu of a complete simulation of active and passive material, the sampling is emulated by accounting for only a fraction of the GEANT4 energy deposit steps of all particles in the calorimeter showers. The steps to be removed are chosen randomly. The sum of the energy deposited by the remaining steps is computed, and the total energy released is estimated by scaling this sum by the inverse of the corresponding fraction. Noise, as for example from electronics, is simulated by the addition of random amounts of energy following a Gaussian distribution centered around zero. The noise is added independently to each cell. Negative energies are allowed, as is typically the case as a result of pedestal subtraction. If such downward fluctuations are significant in size, those negative energy cells can be clustered into topoclusters. The default choices of materials and smearing parameters provided in Tab. 1 are chosen in order to approximate single-particle responses of the ATLAS calorimeter system [13].
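The sampling emulation and noise addition described above can be sketched for a single cell as follows. This is a minimal illustration, not the COCOA implementation: it assumes that each energy deposit step is kept independently with probability equal to the sampling fraction f, and the function and argument names are hypothetical.

```python
import random

def emulate_sampling(step_energies, sampling_fraction, noise_sigma, rng):
    """Emulate a sampling calorimeter response for one cell.

    Assumption: each step is kept independently with probability
    `sampling_fraction`; the text only states that removed steps are
    chosen randomly.
    """
    # Sum only the randomly retained energy deposit steps.
    kept = sum(e for e in step_energies if rng.random() < sampling_fraction)
    # Estimate the total released energy by inverse scaling with f.
    estimated = kept / sampling_fraction
    # Add zero-centred Gaussian noise; the result may be negative.
    return estimated + rng.gauss(0.0, noise_sigma)
```

Because fewer steps survive for smaller f, the estimated energy fluctuates more from shower to shower, which is how a smaller sampling fraction translates into a larger stochastic resolution term.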

Data processing
Every event is processed according to the workflow presented in Fig. 4. First, primary particles are generated at the interaction point (IP) by means of the PYTHIA8 Monte Carlo event generator [14]. A broad range of primary physics processes is available to the user, ranging from single particles and single jets up to more complicated final states with large multiplicities of jets and leptons.
The set of stable final-state particles is stored in the output file and passed on to the detector simulation described in the previous section, where the propagation of these particles and their interactions with the detector material are simulated in GEANT4 [7]. The model of hadronic interactions is chosen in accordance with the ATLAS and CMS detector simulations. The sum of the energies deposited in each calorimeter cell is stored. Electronic noise is simulated by the addition of random energy offsets to each cell, for which Tab. 1 provides the default values of the standard deviation for each layer.
For the purpose of particle reconstruction, the origin of the energy deposits in each cell is stored via a list of parent particle indices which contributed energy to the cell and a list of weights recording their relative contributions to the total cell energy. Cells which received their dominant contribution from electronic noise are assigned an index of -1.
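The per-cell truth record described above can be sketched as follows. The data layout, the function name, and in particular the criterion used to flag a cell as noise-dominated (absolute noise at least as large as the summed particle energy) are assumptions made for illustration; the text does not specify how COCOA makes this decision.

```python
from collections import defaultdict

NOISE_INDEX = -1  # index assigned to noise-dominated cells, as in the text

def cell_truth_record(deposits, noise_by_cell):
    """Build {cell_id: (parent_indices, weights)}.

    deposits: iterable of (cell_id, particle_index, energy) tuples.
    noise_by_cell: {cell_id: noise energy added to that cell}.
    """
    per_cell = defaultdict(lambda: defaultdict(float))
    for cell_id, particle_index, energy in deposits:
        per_cell[cell_id][particle_index] += energy

    record = {}
    for cell_id in set(per_cell) | set(noise_by_cell):
        contributions = per_cell.get(cell_id, {})
        total = sum(contributions.values())
        noise = abs(noise_by_cell.get(cell_id, 0.0))
        if total <= noise:
            # Assumed criterion: noise outweighs all particle deposits.
            record[cell_id] = ([NOISE_INDEX], [1.0])
        else:
            parents = sorted(contributions)
            weights = [contributions[p] / total for p in parents]
            record[cell_id] = (parents, weights)
    return record
```

Storing parallel lists of indices and weights keeps the record compact for cells with few contributing particles, which is the common case.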

Fig. 4: COCOA workflow. Primary particles generated with the PYTHIA library are introduced to COCOA. Their interactions with the detector material are simulated by means of the GEANT4 toolkit. Calorimeter cells identified by a topological clustering algorithm are stored in the output ROOT file together with true particles, emulated tracks, and particle trajectories extrapolated from the IP through the calorimeter according to the equations of motion. A nearest-neighbors-based graph is constructed and stored via edge lists connecting source and destination nodes amongst the output cells and tracks. Jets made of true particles as well as topoclusters are stored in the output file as well. Events in the output file can be parsed for visualization in PHOENIX.
Topoclusters are seeded by single cells which are required to contain a deposited energy well above the noise level. The threshold on this signal-to-noise ratio (SNR) is 4.6 for COCOA by default, whereas a value of 4.0 is used by the ATLAS experiment. This difference is chosen in order to achieve a better agreement between ATLAS and COCOA in terms of the topocluster multiplicity distribution for single charged and neutral pions as well as pure noise events (Fig. 5). Starting from the seed cells, all neighbouring cells are added to the cluster if their SNR is above a second threshold, set to 2 by default. Finally, all further neighbouring cells above a third threshold, set to 0 by default, are added. Cells with negative energy can be included, based on their absolute value, or excluded entirely (default configuration). Topocluster candidates containing multiple local maxima in ECAL cell energy, each surpassing 400 MeV, are split into separate topoclusters.
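The three-threshold scheme above can be sketched on a single two-dimensional (η, φ) layer. This is a simplified illustration under stated assumptions, not the COCOA algorithm: growth above the second threshold is applied iteratively (as in the usual seed-grow-boundary scheme), a uniform noise level is used, neighbours are the four adjacent cells with φ wrapping around, negative-energy cells are excluded, and the splitting of clusters at local maxima is omitted. All function names are hypothetical.

```python
from collections import deque

def grid_neighbours(cell, shape):
    """Four adjacent cells on a 2D (eta, phi) grid; phi wraps around."""
    i, j = cell
    n_eta, n_phi = shape
    out = [(i, (j - 1) % n_phi), (i, (j + 1) % n_phi)]
    if i > 0:
        out.append((i - 1, j))
    if i < n_eta - 1:
        out.append((i + 1, j))
    return out

def topocluster(energy, noise, t_seed=4.6, t_grow=2.0, t_boundary=0.0):
    """Cluster cells of a 2D energy grid; returns a list of sorted cell lists.

    energy: 2D list of cell energies; noise: noise standard deviation,
    assumed identical for all cells in this sketch.
    """
    shape = (len(energy), len(energy[0]))
    snr = {(i, j): energy[i][j] / noise
           for i in range(shape[0]) for j in range(shape[1])}
    # Seeds are processed in order of decreasing signal-to-noise ratio.
    seeds = sorted((c for c in snr if snr[c] > t_seed), key=snr.get, reverse=True)
    assigned = {}
    clusters = []
    for seed in seeds:
        if seed in assigned:
            continue  # already absorbed by a cluster grown from a stronger seed
        label = len(clusters)
        members = [seed]
        assigned[seed] = label
        queue = deque([seed])
        while queue:
            cell = queue.popleft()
            for nb in grid_neighbours(cell, shape):
                if nb in assigned:
                    continue
                if snr[nb] > t_grow:
                    # Growth cell: joins the cluster and keeps expanding it.
                    assigned[nb] = label
                    members.append(nb)
                    queue.append(nb)
                elif snr[nb] > t_boundary:
                    # Boundary cell: joins the cluster but is not expanded.
                    assigned[nb] = label
                    members.append(nb)
        clusters.append(sorted(members))
    return clusters
```

An isolated cell that passes only the boundary threshold never enters a cluster in this scheme, because it is not adjacent to any seed or growth cell.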
In order to support particle reconstruction studies which include high-energy primary photons, electron-positron pairs from photon conversions taking place in the ITS upstream of its two outermost iron layers are stored in the COCOA output file as well. Tracks emanating from photon conversions and also primary electron tracks are used to construct groups of topoclusters, denoted "superclusters", associated with electron and photon showers. The superclustering procedure in COCOA follows the criteria described in [16], designed to improve electron energy reconstruction by incorporating nearby energy deposits from bremsstrahlung. It also includes criteria for grouping multiple clusters that are related by a pair of nearby tracks to a photon conversion vertex, thus improving the reconstructed photon energy. In the photon conversion shown in Fig. 1, for example, the COCOA output contains a supercluster which combines the three topoclusters shown. Due to the simplified tracking, the criteria on the number of track hits are not applied. The COCOA implementation does not focus on electron and photon identification; rather, superclusters are only formed using tracks linked to primary or conversion electrons.
While the event simulation based on the GEANT4 [7] toolkit determines particle trajectories according to their equations of motion and their interactions with detector material, COCOA implements a particle tracking based only on the equations of motion for the benefit of downstream tasks. These tracks are extrapolated to the entry surface as well as to each layer of the calorimeter, and the resulting η and φ coordinates are stored in the output file.
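An extrapolation based only on the equations of motion can be sketched with the standard helix solution for a charged particle in a uniform solenoidal field. The sketch below is an illustration under strong assumptions that are not taken from the COCOA code: the particle starts on the beamline at the origin, the field is uniform, non-zero and parallel to the z-axis over the whole radial range, and there are no material interactions. The function name and argument conventions (momentum in GeV, lengths in metres, field in tesla, charge in units of e) are hypothetical.

```python
import math

def extrapolate_to_radius(pt, eta, phi0, charge, radius, b_field=3.8):
    """Return (eta, phi) of the position where the track reaches `radius`.

    Returns None if the particle curls up before reaching the radius.
    Assumes a non-zero uniform field along z and a start point at the origin.
    """
    if charge == 0:
        # Neutral particles travel on a straight line from the origin.
        return eta, phi0
    # Radius of curvature in the transverse plane.
    rho = pt / (0.2998 * abs(charge) * b_field)
    if radius > 2.0 * rho:
        return None
    # Azimuthal offset of the crossing point relative to the initial direction.
    offset = math.asin(radius / (2.0 * rho))
    # For a field along +z, positive charges bend towards decreasing phi.
    phi = phi0 - math.copysign(offset, charge * b_field)
    phi = math.atan2(math.sin(phi), math.cos(phi))  # wrap into [-pi, pi]
    # Transverse arc length is 2 * rho * offset; z grows as arc * sinh(eta).
    z = 2.0 * rho * offset * math.sinh(eta)
    return math.asinh(z / radius), phi
```

For a 10 GeV singly charged particle and a radius of 1.5 m, the azimuthal shift is below 0.1 rad, while particles with transverse momentum below roughly 0.85 GeV do not reach that radius in a 3.8 T field under these assumptions.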
For user convenience, an interface to the FastJet [17] library is provided that clusters primary particles as well as topological calorimeter cell clusters into jets. The user can choose the specific jet clustering algorithm, with the anti-k_T algorithm set as default.
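For reference, the anti-k_T algorithm is a sequential recombination algorithm defined by the following distance measures between objects i and j and between object i and the beam B. The radius parameter R is a user choice whose value in COCOA is not stated in this text.

```latex
d_{ij} = \min\!\left(p_{T,i}^{-2},\; p_{T,j}^{-2}\right) \frac{\Delta R_{ij}^{2}}{R^{2}},
\qquad
d_{iB} = p_{T,i}^{-2},
\qquad
\Delta R_{ij}^{2} = (y_i - y_j)^2 + (\phi_i - \phi_j)^2
```

At each step the smallest distance is found: if it is a d_ij, objects i and j are merged, and if it is a d_iB, object i is declared a jet and removed from the list. The negative power of p_T means that soft objects cluster around hard ones first, which yields regular, cone-like jets.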
For each event in the output data, a fixed heterogeneous graph containing cells and tracks is provided by means of two lists storing the indices of the source and destination nodes of each edge. The edges are created based on k nearest neighbors in angular distance, with k being user-configurable per calorimeter layer and edge type. Three edge types are defined: track-to-cell, cell-to-cell within the same calorimeter layer, and cell-to-cell across neighboring calorimeter layers (tracks are not directly connected to each other). For each of these types, the user can configure both the number of edges to construct in a ΔR-ordered neighborhood and a maximum ΔR (where ΔR² = Δη² + Δφ²). The default values are given in Tab. 2.
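The edge construction described above can be sketched as a brute-force k-nearest-neighbors search in the (η, φ) plane with a maximum ΔR cut. The function names and the representation of nodes as (η, φ) tuples are illustrative assumptions; COCOA's actual data structures and search strategy may differ.

```python
import math

def delta_r(a, b):
    """Angular distance between two (eta, phi) points, with phi wrapped."""
    d_eta = a[0] - b[0]
    d_phi = (a[1] - b[1] + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(d_eta, d_phi)

def knn_edges(sources, destinations, k, max_dr, skip_self=False):
    """Return (src, dst) index lists linking each source to its k nearest
    destinations, keeping only edges with delta R <= max_dr.

    Set skip_self=True when sources and destinations are the same node set,
    for example for cell-to-cell edges within one layer.
    """
    src, dst = [], []
    for i, s in enumerate(sources):
        # Order all candidate destinations by angular distance.
        candidates = sorted(
            (delta_r(s, d), j)
            for j, d in enumerate(destinations)
            if not (skip_self and i == j)
        )
        for dr, j in candidates[:k]:
            if dr <= max_dr:
                src.append(i)
                dst.append(j)
    return src, dst
```

Wrapping Δφ matters near φ = ±π: a track at φ = 3.1 and a cell at φ = −3.1 are close neighbors even though their raw φ difference is 6.2.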
The final output file produced by COCOA stores an array of features for each event which are associated with the following sets: cells that participated in topoclusters, tracks, topoclusters, truth particles and decay record, graph edges, and jets. The output file format is ROOT, but it can be converted to HDF5 format using a script provided in the repository.

Detector performance
In the following, the performance of COCOA is investigated by means of single particles which are generated at the IP. For each particle type and momentum under investigation, the event generation is repeated in order to gather a sufficiently large number of events.
The correct reconstruction of particle energies is demonstrated in Fig. 5, which compares the distributions of multiplicities (Fig. 5a) and energy sums (Fig. 5b) of topoclusters for charged pions, photons, electrons and events containing only noise contributions, denoted as empty events. In most of the empty events, the cell energies do not pass the noise threshold of the clustering algorithm. For those events in which this threshold is passed, the average cluster energy sum amounts to 36 MeV, in line with the low noise levels provided in Tab. 1. The photons and electrons mostly result in one cluster, and their energy is reconstructed with only a small variation. In comparison, the charged pion events result in larger variations of the cluster multiplicity and energy sum distributions due to the higher degree of variation in deposited energies for hadronic showers. The average cluster energy sum is below the initial charged pion energy due to the nuclear interactions of the shower particles with the detector material, which are not counted as detectable energy. A hadronic calibration procedure is not performed within COCOA but is left for downstream tasks.
Patterns of energy depositions across the calorimeter are demonstrated in Fig. 6 in terms of the fractions of deposited energy per calorimeter layer for electrons, photons and charged pions. As a consequence of the material budget presented above in Fig. 3, the electrons and photons deposit most of their energy in the electromagnetic calorimeter, in particular in the second calorimeter layer, while the charged pions reach the hadronic calorimeter layers where they deposit most of their energy, in line with energy deposition patterns at collider-detector experiments.
Figure 7 shows distributions of the reconstructed energies for central electrons, photons and charged pions with different initial energies. The relative energy resolution provided by the calorimeter response improves as the initial particle energy increases. The relative spread is larger for charged pions than for electrons and photons, as expected from the larger sampling fluctuations in hadronic showers compared to electromagnetic showers.
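The energy dependence of the resolution can be made concrete with a small sketch. It assumes the common three-term parameterization of calorimeter resolution, with a sampling term a/√E, a noise term b/E and a constant term c added in quadrature, and it uses the best-fit values quoted later in this section for Fig. 9. Energies are assumed to be in GeV; the function and dictionary names are illustrative.

```python
import math

def relative_resolution(energy, a, b, c):
    """sigma(E)/E as the quadrature sum of sampling, noise and constant terms."""
    return math.sqrt((a / math.sqrt(energy)) ** 2 + (b / energy) ** 2 + c ** 2)

# Central values of the fit parameters quoted for Fig. 9 (uncertainties omitted).
PHOTON = dict(a=0.16, b=0.30, c=0.006)
PION = dict(a=0.50, b=0.32, c=0.086)

def resolution_table(energies, params):
    """Relative resolution at each energy for one particle type."""
    return [(e, relative_resolution(e, **params)) for e in energies]
```

Under this assumed form the photon resolution at 100 GeV evaluates to roughly 1.7%, dominated by the sampling term, while the pion resolution approaches its constant term of 8.6% at high energy.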
Figure 8 shows the reconstructed energies of a single electron shot at different initial η. The average reconstructed energy is always lower than the initial particle energy, with the difference growing with the particle η. This is due to the energy depositions in the iron contained in the ITS upstream of the calorimeter, in accordance with the material map presented in Fig. 3.
Fig. 8: Reconstructed energies of electrons with an energy of 30 GeV for different directions η. As the initial electron momentum is directed closer to the beamline, the difference between the reconstructed energy and the initial particle energy increases due to the iron traversed by the electron upstream of the calorimeter (dead material).
Figure 9 quantifies the energy resolution as a function of particle energy, comparing photons with charged pions. For each particle type, the relative energy resolution as a function of the particle energy is fitted using least squares to the following common form of the resolution function:
\[ \frac{\sigma(E)}{E} = \frac{a}{\sqrt{E}} \oplus \frac{b}{E} \oplus c \tag{1} \]
where a, b and c are the coefficients of the sampling, noise and constant terms, respectively, ⊕ denotes addition in quadrature, and the best-fit parameters are provided within the figure. The larger fitted coefficient of the sampling term for hadronic showers compared to electromagnetic showers is related to the smaller value of the sampling fraction f configured for the HCAL compared to the ECAL (0.025 and 0.07, respectively). The values of the parameters appearing in Equation 1 are evaluated for photons as a = 0.16 ± 0.01, b = 0.30 ± 0.02 and c = 0.006 ± 0.003. The corresponding values for charged pions are found to be a = 0.50 ± 0.12, b = 0.32 ± 0.06 and c = 0.086 ± 0.002. The noise term is compatible with the input noise values, and the sampling term is as expected from the sampling emulation.
The performance of the simulated detector has so far been probed using single particles. To illustrate the detector performance in a more realistic event environment, pp → W → e⁺ν events were simulated. The electron is reconstructed using the superclustering algorithm described in Sec. 3, and its energy is calibrated in order to compensate for the loss due to scattering in the ITS and iron layers upstream of the ECAL. The missing transverse momentum (MET) is calculated from the rescaled clusters as the negative of the vector sum over the visible transverse momenta in the whole event. Finally, the transverse W mass, m_T^W, is computed from the reconstructed W four-momentum and compared with the corresponding truth-level distribution in Fig. 10.
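The MET and transverse mass calculation can be sketched as follows. The MET follows the definition in the text, the negative vector sum of visible transverse momenta. For the transverse mass, the sketch uses the standard definition for a massless lepton and neutrino, m_T² = 2 p_T(lepton) MET (1 − cos Δφ); this is an assumption, and the exact computation in COCOA from the reconstructed W four-momentum may differ in detail. Function names and the (p_T, φ) input format are illustrative.

```python
import math

def missing_pt(visible):
    """MET magnitude and azimuth from a list of visible (pt, phi) objects."""
    # Negative vector sum of all visible transverse momenta.
    px = -sum(pt * math.cos(phi) for pt, phi in visible)
    py = -sum(pt * math.sin(phi) for pt, phi in visible)
    return math.hypot(px, py), math.atan2(py, px)

def transverse_mass(lep_pt, lep_phi, met, met_phi):
    """Transverse mass of a lepton plus MET system, neglecting masses."""
    m2 = 2.0 * lep_pt * met * (1.0 - math.cos(lep_phi - met_phi))
    return math.sqrt(max(0.0, m2))  # guard against tiny negative rounding
```

For a W boson produced at rest in the transverse plane, the electron and neutrino are back to back, so an electron with p_T = 40 GeV gives m_T = 80 GeV, near the expected endpoint of the distribution.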

Event display
Visualization of the detector geometry and of example hits for individual events is important for communicating results and for interpreting downstream tasks such as reconstruction and event selection. The default geometry of the COCOA detector was ported into the open-source framework Phoenix, chosen for its versatility and user support. An example event display is shown in Fig. 11.

Conclusion
The growing interest in ML approaches to low-level analysis tasks such as event or jet reconstruction in a realistic detector underscores the importance of leveraging the rich feature space of calorimeter showers for improving these tasks. Providing an open, configurable, and realistic calorimeter simulation, COCOA will facilitate the development of such algorithms and ultimately expand the physics reach of current and next-generation collider experiments. The thorough treatment of particle interactions in GEANT4 and the full-coverage, highly granular design of the COCOA calorimeter system enable an accurate representation of the complex data environment present in the ATLAS and CMS experiments at the LHC. To quantify this resemblance, an investigation of the single-particle response characteristics, in terms of topological clustering performance and energy resolution for electromagnetic and hadronic showers, has been carried out. Finally, additional aids including data post-processing, event visualization, and documentation for COCOA have been provided to further encourage use.

Fig. 10: The transverse W mass m_T^W distribution (vertical axis: event fraction) for leptonically decaying W events. The black curve shows the truth distribution, whereas the red curve is obtained from the vector sum of the reconstructed lepton momentum and the MET in the event. The peak locations of the two distributions are well aligned, demonstrating that the event-level reconstructed MET is trustworthy within the COCOA framework.

Fig. 1: Visualization of a photon (dashed line) with an energy of 50 GeV converting to two electrons (green lines) producing three distinct clusters in the COCOA central electromagnetic calorimeter. The cluster shown in red contains an additional cell in the first layer of the hadronic calorimeter due to a noise fluctuation. Cells are shown with an opacity proportional to the energy-over-noise ratio divided by 4.6, the threshold for topoclustering seeds.

Fig. 2: Positive quadrant scheme of COCOA. We use a right-handed orthogonal coordinate system x-y-z, where the z-axis is the principal axis of the detector and a constant z refers to a circular cross-section of the detector. (a) yz-projection showing the COCOA ITS, subsequent iron layers, and calorimeter system in the barrel and end-cap regions, overlaid on lines marking constant pseudorapidity η. (b) xy-projection showing the barrel region of the same subsystems at z = 0.

Fig. 5: Number (a) and average energy (b) of reconstructed clusters in COCOA for events with a single charged pion, electron, or photon shot at η = 0. Results are also shown for clusters reconstructed in empty events due to electronic noise. The default topological calorimeter cell clustering settings are used.

Fig. 6: Energy deposited by electrons, photons and charged pions in each calorimeter layer. The electron and photon showers are limited to the electromagnetic calorimeter (layers 1 to 3), while the charged pion showers reach deep into the hadronic calorimeter (layers 4 to 6).

Fig. 9: The relative energy resolution σ(E_reco)/E_truth is plotted as a function of E_truth for eight different truth energy values and fitted with the relative-resolution function, for photons and pions, respectively. The average sampling fractions f for the ECAL and HCAL are shown in the legend.

Fig. 11: Phoenix event displays configured using the COCOA detector geometry, showing the charged particle tracks and calorimeter hits generated by (a) pp → tt̄ and (b) pp → W → eν events simulated with PYTHIA8. In (a), a cutaway of the COCOA calorimeter volumes is shown along with the clustered cells, while in (b) only the cells are shown. The electron from the W decay in (b) is indicated by a green line. Both displays are shown in perspective view, such that nearer objects appear larger. Different shades of green and blue represent the different layers of the ECAL and HCAL, respectively, while cell opacity is determined by the cell signal-to-noise ratio.

Table 1: Calorimeter default design values regarding layer depths in terms of radiation lengths X_0 (ECAL) and hadronic interaction lengths λ_int (HCAL), granularity and energy noise levels.

Table 2: Default k (number of nearest neighbors) and maximum ΔR separation used to define edges in the fixed graph creation. Edges between cells are denoted "c-c", while edges between tracks and cells are denoted "t-c".