Highly-parallelized simulation of a pixelated LArTPC on a GPU

The rapid development of general-purpose computing on graphics processing units (GPGPU) is allowing the implementation of highly-parallelized Monte Carlo simulation chains for particle physics experiments. This technique is particularly suitable for the simulation of a pixelated charge readout for time projection chambers, given the large number of channels that this technology employs. Here we present the first implementation of a full microphysical simulator of a liquid argon time projection chamber (LArTPC) equipped with light readout and pixelated charge readout, developed for the DUNE Near Detector. The software is implemented with an end-to-end set of GPU-optimized algorithms. The algorithms have been written in Python and translated into CUDA kernels using Numba, a just-in-time compiler for a subset of Python and NumPy instructions. The GPU implementation achieves a speed up of four orders of magnitude compared with the equivalent CPU version. The simulation of the current induced on 10^3 pixels takes around 1 ms on the GPU, compared with approximately 10 s on the CPU. The results of the simulation are compared against data from a pixel-readout LArTPC prototype.


Introduction
The idea of using a liquid argon time projection chambers (LArTPC) for the detection of neutrino interactions was first proposed in 1977 [1].The detection mechanism is the following: charged particles produced by neutrino interactions ionize the argon, leaving a trail of ionization electrons.
In addition, liquid argon also produces scintillation light, which provides calorimetric information and a fast timing signal (O (10 ns) [2]).A fraction of the ionized electrons recombine immediately with the positive argon ions, while the remaining ones drift towards the anode side of the detector in a homogeneous electric field applied to the argon volume, which is usually O (100 V/cm).Impurities present in the LAr (e.g.O 2 , H 2 O, N 2 ) can attach a portion of the drifting electrons.The amount of drifting electrons declines as a function of the distance from the anode, since the electrons need to travel a longer path.Typically, two or more arrays of sense wires are placed at the anode and assembled into planes.The drifting of negative charges in a constant electric field induces a signal on the wires.Each plane provides a two-dimensional image of the ionization: the position of the wire provides one dimension, and the time of the arrival provides the second one, since the drift velocity of the electrons in the LAr is known and is typically O (1 m/ms).Using multiple wire planes can help estimate the position of the ionization in three dimensions.However, ambiguities arise when drifting electrons are isochronous or parallel to a wire orientation.Unambiguous 3D imaging of LArTPC charge signals is possible using a readout system based on a pixelated array of charged-sensitive pads, which has been demonstrated in ref. [3].Both the typical distance between adjacent wires and the typical pixel pitch are in the order of few millimeters.The position on the anode plane of the involved pads provides two spatial dimensions, and the time of the induced signal provides the third one.This truly three-dimensional readout provides better reconstruction efficiency and purity than the 2D combined wire readout, as demonstrated in ref. [4].
Pixel readout requires the channel count used to be increased by a factor of 10 to 100 with respect to wire planes.Thus, with this increased channel count and granularity simulation burden, the transport of electrons in LAr and the signal induction on the pixel pads represent an ideal use case for highly-parallelized, concurrent simulation algorithms.The development of general-purpose computing on graphics processing units (GPGPU), which went hand in hand with the advances in the machine learning and deep learning fields, has driven the design of the current generation of supercomputers, such as Perlmutter at NERSC [5], which is a heterogeneous system with both GPU-accelerated and CPU-only nodes.Implementing highly-parallelized simulation chains allows full advantage of these new systems to be taken and enables the simulation of future pixelated LArTPCs, which would otherwise not be viable with current resources.
The US-based neutrino physics program relies on present and future experiments using LArTPC technology.The flagship experiment is the Deep Underground Neutrino Experiment (DUNE), which will consist of a high-intensity accelerator neutrino beam, measured by near and far detectors [6].The Far Detector will consist of four 17-kiloton LArTPC modules located deep underground at the Sanford Underground Research Facility in Lead, South Dakota, located 1285 km from the beam source [7].The Near Detector will be located at Fermilab, 574 m from the beam source, and will contain a 67 t modular LArTPC called ND-LAr [8].
There are already some software toolkits for LArTPCs [9,10] and there has been some effort towards parallelizing the reconstruction stage [11].However, the simulation stages have remained mostly sequential and their speed up has been recognized as a priority by the community [12].
In this document we will describe the implementation of a set of highly-parallelized algorithms, organized in a module called larnd-sim [13], that run on GPUs.They simulate the ionized electrons recombination and drifting towards the anode, the generation of electronics signals on the pixelated readout, and the processing of the signal by the front-end electronics.

Technical implementation
Recent rapid developments in the field of machine learning have stimulated the creation of several tool-kits for GPU-accelerated applications.In particular, the NVIDIA® CUDA platform [14] allows to use GPUs for general purpose computing via different programming languages.We opted for Numba [15], which generates CUDA computing kernels using a subset of native Python and NumPy code [16].
A CUDA kernel is a function that is executed  times in parallel by  CUDA threads.The threads can be organized in one-dimensional, two-dimensional, or three-dimensional blocks, which in turn can be organized in one-dimensional, two-dimensional, or three-dimensional grids.Blocks in the same grid contain the same number of threads, can run independently, and can be executed in any order, while threads in the same block can co-operate through shared memory.CUDA kernels typically store the result of the computation in a pre-allocated array passed to the kernel function.The CUDA programming model requires a careful design of the algorithm: the shape and size of the array where the result is stored must be known in advance and the threads must avoid race conditions during execution * , thus the result of the algorithm must not depend on the order of execution of the threads.
The software described in this document contains several CUDA kernel functions, separated into two logical categories: one for the charge simulation, described in section 3, and one for the light simulation, described in section 4. The functions simulate the detector response, including: (1) the recombination of the electrons with the argon ions, (2) the drifting of the electrons towards the anode, (3) the induction of electronic signals on the pixel pads and optical detectors, and (4) the electronics response of the charge and light readout systems.
The simulation of the passage of the initial particles through matter is performed using edep-sim [17], a wrapper around G 4 [18], which is independent from the larnd-sim package described here.The output consists of a set of short particle track segments, in the order of few millimeters, which describe the energy deposition trail of each particle.The length of the segments depends on the derivative of the stopping power /: the portion of a particle trail where the / changes abruptly will be divided in finer segments than the portion where the / is mostly constant.Thus, in a single segment, the energy deposition per unit length is assumed to be constant.This approximation is valid when the segment length is of the same order as the detector resolution -at much shorter lengths, fluctuations from the long tail of the dE/dx distribution fall outside of the applicable region of recombination models, and at much longer lengths, correlations in the dE/dx fluctuations become significant and the dE/dx width will be under-simulated.Within these broad considerations, reducing the minimum size of the segments has not shown a significant impact on the result of the charge simulation.This set   is stored in a bi-dimensional NumPy array containing the energy deposition and the spatial distribution of the segments: where ì   and ì   are four-dimensional vectors containing the spatial and timing coordinates of the segment start and end points, and  is the deposited energy.This array is used as input for our module.In order to minimize the memory transfer between the host and the device (in our case the GPU), we allocate the NumPy array directly on the device memory using CuPy [19], a GPU array backend that implements a subset of the NumPy interface.The output of the larnd-sim simulation is then saved in a HDF5 file [20].The entire simulation workflow is shown in figure 1. 3 Charge simulation

Electron recombination
The first step of our simulation is to calculate the number of electrons that remain after the recombination and start drifting towards the anode.We denote the initial charge ionized by the particle as  0 , while the charge remaining after the recombination is given by   = R •  0 , where R is our recombination factor.Two different models are commonly used to describe this phenomenon: the Birks model [21], which gives spurious values when applied to high-ionization particles [22], and the modified Box model [22], which doesn't suffer from these issues but is inadequate to describe particles at low stopping power (low /).
The recombination factor for the Birks model R Birks can be parametrized as: where   and   are free parameters that usually depend on the detector and  is the product of the electric field with the liquid argon density.The ICARUS collaboration obtained   = 0.800 and   = 0.0486 (kV/MeV) (g/cm 3 ) [23], which are the values used in our simulation.The recombination factor the modified Box model R Box is defined as: where  and  are free parameters which were measured by the ArgoNeuT collaboration to be  = 0.93 and  = 0.207 (kV/MeV)(g/cm 3 ) [22].In both cases, the typical recombination factor for a minimum ionizing particle (MIP) is around 0.7.Our simulation assumes the Birks model by default.
The implementation of the calculation of the recombination factors R Birks or R Box on the GPU is trivial: the -th thread of the  recomb (  , R) CUDA kernel takes as input the -th row of the NumPy array containing the segments  and applies the recombination formula, so eq.(3.1) or eq.(3.2).The result is stored in an appropriate column of the NumPy array.Currently, the simulation treats the recombination factor as a fixed number: although this is an approximation, the level of fluctuations associated with it is significantly lower than the expected intrinsic noise of the detector, which was measured with the prototype described in section 6.
Even if this operation is computationally inexpensive, it's instructive to take a look at the performance comparison between a sequential, interpreted Python for loop, a loop compiled on the CPU using Numba, and the GPU implementation using a CUDA kernel.Figure 2 shows the processing time needed to calculate R Birks of eq.(3.1) as a function of the number of G 4 segments given as input.The CPU-compiled version is obviously faster than the sequential interpreted loop since the function is now translated into machine code.While the CUDA kernel processing time is initially the largest, it doesn't immediately scale with the number of input segments, so it starts being the exponentially faster implementation with more than O (10 4 ) segments, where it starts taking advantage of the massive parallelization achievable by the GPU.The NVIDIA® Tesla® V100 GPU used for this study can run more than 10 5 parallel threads.To give a figure of merit, a typical neutrino beam spill in the ND-LAr corresponds on average to O (10 5 ) segments.) with the GPU implementation using a CUDA kernel (blue), the CPU implementation using Numba (orange), and a sequential Python for loop (green).For reference, a neutrino beam spill corresponds on average to O (10 5 ) energy deposition segments in the ND-LAr.This computation was performed on a node of the NERSC Cori supercomputer, which contains two sockets of 20-core Intel Xeon Gold 6148 (Skylake) at 2.40 GHz and 8 NVIDIA® Tesla® V100 (Volta) GPUs.

Electron transport in liquid argon
The electrons remaining after the recombination travel towards the anode at a constant velocity  drift (assuming perfect uniformity of the electric field), which is typically O (1 mm/µs).The time they take to reach the anode is given by: where the electron drift direction is assumed to be along the  axis and where  anode is the  coordinate of the anode.The impurities in the liquid argon, such as O 2 , N 2 and H 2 O, can attach a portion of the drifting electrons, so electrons farther from the anode will have a higher chance to be attached.This effect is usually parametrized by a negative exponential, so the charge   that effectively reaches the anode, assuming uniform impurities and a perfect electronics response, is: where  drift is the time the charge takes to reach the anode and  is a parameter that depends on the concentration of impurities and it is usually called electron lifetime, which is in the order of milliseconds for concentrations of O 2 at tens of parts per trillion.
Electrons drifting in strong electric fields do not diffuse isotropically, so it is necessary to estimate both the longitudinal and transverse components with respect to the drift direction [24].The diffusion length is given by: where  is the longitudinal or transverse diffusion coefficient, which depends on the electric field and liquid argon temperature.In our simulation we set the longitudinal and transverse diffusion coefficients to   = 4 cm 2 /s and   = 8.8 cm 2 /s, respectively.These values were obtained by a preliminary ProtoDUNE-SP [25] analysis.This information is calculated and stored in appropriate columns of the NumPy array in a way analogous to the one described in section 3.1, where each thread processes a single segment independently.

Field response
The current induced by a point charge on a given pixel within the anode is calculated using the Shockley-Ramo theorem [26]: where  is the weighting field potential, the normalized contribution of a single electrode to the overall field,  is the particle charge, and ì  is the velocity of the particle.Within larnd-sim, the induced current is pre-calculated for a point-like charge and referenced as a  pixel [, , ] look-up table (LUT).The table contains the current induced by an electron placed at discrete (, ) locations on the anode plane at a discrete time , where  = 0 is the time when the electron is at  = 0.5 cm.This value is chosen because of the observed flatness of the electric and weighting fields at this distance.
The  pixel [, , ] values are calculated as follows.In the region very close to the anode, the electric field and the weighting field are calculated numerically.The geometry of a small volume, including a central pixel with a pitch of 4.4 mm and its 8 nearest neighbor pixels, is modeled using CAD software and converted into a 3-dimensional mesh using the Gmsh package [27].The fields are then calculated using the successive over-relaxation method as implemented in the Elmer FEM [28] software package.The electric and weighting fields share the same geometry, but differ in the boundary conditions imposed on the problem.In the electric field calculation, the pixels are grounded, with the backing plane at a small offset voltage, and the field on the cathode-facing side of the volume is set to the nominal field of 500 V/cm.In the calculation of the weighting field, all electrodes other than the central pixel are grounded, the central pixel is set to unit voltage, and the voltage on the remaining conductors is set to zero.The results of these two field calculations are shown in figure 3.  Next, the idealized drift paths (ignoring diffusion and attenuation effects) are integrated using the ICARUS and Walkowiak [29,30] electron transport models evaluated within the calculated electric field.This model allows for the drift velocity to change as a function of the local electric field, as the assumption of a perfectly uniform field is not necessarily true very close to the anode.The drift paths are calculated for a grid of four hundred (, ) positions, 0.5 cm from the anode on the  axis.The resulting granularity on the  and  axes is 0.33 mm.Finally, we compute the time-derivative of the weighting field potential along each path, which yields a charge-normalized current series for a given initial charge position.The sampling in time is 0.1 µs.A select few drift paths are shown in figure 4a, with their corresponding current series in figure 4b.
Farther from the pixel at  > 0.5 cm, the weighting potential is solved by treating the pixel as point-like and using a method-of-image-charges approximation to fix the boundary conditions at the anode and the cathode.The drift velocity is assumed to be uniform.The relative normalization to the near-field calculation is then fixed by enforcing continuity in the current series at the time boundary.Some re-scaling is performed on these current series to ensure they integrate to one electron charge, as error is introduced in the drift path integration, as well as numerical error present in the FEM solutions and their interpolated values.Figure 5 shows the tabulated induced current on a pixel up to 30 µs from the time of arrival of the drift electron.The agreement of the near-field FEM model and the dipole far-field approximation across this surface seen in figure 5 demonstrates that the transition surface at 0.5 cm is sufficiently far from the pixel plane.

Induced current calculation
The electrons drifting towards the anode are diffused in the longitudinal and transverse directions, as described in section 3.2.The ionization electrons corresponding to a G 4 segment will then form a three-dimensional charge cloud.The induced current on a given pixel from this can be calculated by taking the convolution of the pre-calculated pixel response model (see section 3.3.1)and the charge density of the track segment: where (, , ) is the 3-dimensional charge density including diffusion and  pixel ( − /  , , ) is the current response model at the given position and time tick, which is stored in the LUT described in section 3.3.1.The positively charged ions also produce a current on the pixel, however the magnitude of this current is more than 7 orders of magnitude smaller than the signal from the drift electrons due to the small drift velocity of the ions, and so it is neglected in the detector simulation.The calculation of  () is the most computational-intensive step of the detector simulation.In order to reduce its complexity, we apply two approximations to the charge density: first, we discretize the track segment along the track length into  points; and second, we approximate the effect of diffusion summing the contribution from  random 3D perturbations of the sample point.where   is the total charge magnitude of the segment from eq. (3.4), () is the Dirac delta function,  is the number of sampled points along the track segment,  is the number of Monte Carlo iterations per sample point, (  ,   ,   ) is a point uniformly distributed along the track segment, and (  ,   ,   ) are normally distributed in proportion to the diffusion in that dimension.This simplifies the calculation to a sum over the pixel response model at each sampled position In our implementation,  is calculated for each track with a 10 µm sampling interval, with the option of increasing the resolution and the Monte Carlo sampling for higher precision.For reference, at this sampling interval each sample point contributes a charge of roughly 50 electrons for a MIP-like track, and the error introduced by the approximations is at the percent level and much smaller than the noise level.
The calculation of the induced current is particularly well-suited to be implemented on a highly-parallelized architecture, since it is necessary to perform the sum of eq.(3.9) for each time tick in the time window, each pixel, and each G 4 detector segment.The length of the time window depends on the orientation of the segment with respect to the anode, since the arrival time is directly proportional to the  coordinate.In order to minimize the memory usage, the length of the longest signal and the largest number of pixels per segment are calculated before execution and used to allocate the array where the induced signals are going to be stored.
In our CUDA kernel  current (  ,   ,   ), the parallel threads are organized in a three-dimensional grid, where the dimensions are the segment index   , the pixel index   , and the time tick   .
It is possible to compare the performances of this GPU algorithm with an equivalent CPU version, compiled with explicit parallelization of the loops using the Numba prange function.This function runs the loop in parallel threads, one for each core (20 in our case), and tries to merge adjacent loops together, reducing loop overhead [31].The GPU version provides a speed-up of around four orders of magnitude for O (10 3 ) simulated pixels, as shown in figure 6 for different values of the time sampling.Interestingly, in this case the GPU version is significantly faster than the CPU one even for only 10 pixels, since we are parallelizing also in two other dimensions (segment index and time).The processing time for the GPU implementation does not increase with the number of simulated pixels and with different values of the time sampling, which shows that the calculation is effectively being performed in parallel.Compilation times and memory allocation times are not taken into account here.The result of the simulation of a cosmic-ray muon interacting in the TPC active volume is shown in figure 7a.Given the effect of transverse diffusion, signals are induced also on the pixels that do not lie exactly on the  projection of the muon trajectory.The high granularity of the pixel layout shows clear imaging of features such as delta rays.Figure 7b shows the current induced on a pixel both by the cosmic-ray muon and by the subsequent delta ray.
(a) 3D event display of a simulated cosmic-ray muon.

Electronics response
In the LArPix system described in ref. [3] the pixel pads are uniquely instrumented by applicationspecific integrated circuits (ASICs), which provide a charge-sensitive amplifier (CSA) with selftriggered digitization.First, the signal from each pad is input to the CSA.Then, as signal accumulates on the pad, the voltage at the output of the CSA grows until it exceeds the discriminator threshold, which in our simulation is set to 28 mV, equivalent to the signal induced by 7 × 10 3 electron charges.The discriminator, in turn, triggers the output digitization through an 8-bit analog-todigital converter (ADC).The ADC stores the CSA output voltage and then the CSA is reset, discarding the collected charge.The CSA is then ready to collect the subsequent signal.The minimum time between two consecutive ADC counts is 11 clock cycles, which with our 10 MHz clock correspond to 1.1 µs.The time between the threshold crossing and the signal digitization is tunable in the LArPix ASIC.In practice, this is 1.6 µs, which is tuned to the typical signal size.While leakage current is negligible, sub-threshold charge could collect on the channel introducing a bias in the charge measurement.A periodic reset limits the impact of this effect.Three kinds of noise are included in the simulation: (1) a reset noise, which corresponds to a random pedestal shift after every trigger reset, (2) a discriminator noise which affects the discriminator threshold, and (3) an uncorrelated noise.These are set in the simulation 900, 650, and 500 electron charges, respectively.These values have been tuned from the data described in section 6.
In our implementation, after summing the current induced by different tracks on the same pixel   , a CUDA kernel  electronics (   ) simulates the trigger logic and the noise sources described above.The threads are organized in a one-dimensional grid, where each thread processes in parallel the current induced on a single pixel   .The CUDA kernel is exponentially faster than the CPU implementation and is more than three orders of magnitude faster with O (10 3 ) pixels, as shown in figure 8. Processing time for the simulation of the electronics response with the GPU implementation using a CUDA kernel (blue) and with a CPU implementation compiled with Numba with explicit parallelization of the loops (orange).For reference, a neutrino beam spill will induce a signal on around 5 × 10 4 pixels of the ND-LAr.The simulation has been performed on a node of the NERSC Cori supercomputer, which contains two sockets of 20-core Intel Xeon Gold 6148 (Skylake) at 2.40 GHz and 8 NVIDIA® Tesla® V100 (Volta) GPUs.
Figure 9 shows a comparison of the integral of the induced current on each pixel and the corresponding ADC counts for a cosmic-ray muon.When the integral signal on a certain pixel does not reach the discrimination threshold no ADC count is stored, so it is possible to have pixels with a non-null induced current but no ADC count.
The resulting ADC counts and their corresponding timestamps are stored in a bi-dimensional array, where one dimension is the pixel index and the other is the ADC count index.They are then saved in the same HDF5 format used by the LArPix readout system, facilitating comparison between data and simulation.

Light simulation 4.1 Incident light calculation
The scintillation light produced by the excitation of the liquid argon provides additional calorimetric and topological information and a fast timing signal.Within larnd-sim, we include GPU algorithms to model scintillation light production, timing, and the associated electronics response.
Light intensity is calculated using a self-consistent recombination model.Namely, each track segment / is converted to a total number of emitted photons   using the same recombination parameterization used for the /: where  ph is the photon production per deposited energy at zero field, and R is from one of the recombination models described in section 3.2.Light propagation and material quantum efficiencies are pre-tabulated using a dedicated G 4 simulation into a LUT containing an average visibility     ,   ,   and relative time distribution     ,   ,   ,   for each 3D voxel   ,   ,   and each photosensor  in the active volume [32].For each track segment and photosensor, the total number of photoelectrons (p.e.) is calculated as where   is the overall detection efficiency of photosensor  and   is the length of the track segment .To implement this, we use a two-dimensional grid across the track segment index and detector index, such that each thread calculates the number of photoelectrons observed on a light detector from its respective track segment.In testing, we observe a speed-up factor of about 56× by using the CUDA kernel for this task.

Photocurrent simulation
Once the total light signal is determined for each track segment, they are summed into a photocurrent time profile on each photosensor using the time distribution stored within the LUT: where  0 is the segment deposition time,  is the simulation time tick, and Δ LUT is the time bin size of the LUT.The implementation of this calculation first determines the total number of ticks needed to fully simulate the event, with a configurable time pre-and post-buffer, and then pre-allocates the photocurrent time profile  pe, [  ] array on the GPU.A CUDA kernel then iterates over the incident light array  pe, summing the contribution at each time tick.To maintain independence between threads, we use a two-dimensional grid on the dimensions of the output time profile array.Each thread is responsible for calculating the sum over all track segments for a single photodetector and time tick.We observe a factor of 45 improvement by using the GPU kernel for this calculation at scale.However below approximately 1000 channels × 1000 ticks the parallel CPU calculation becomes faster due to the copying of static configuration data to the GPU, see figure 10a.After the time profile has been created, the simulation calculates the smearing effect due to the scintillation light emission.LAr scintillation light consists of two components: a prompt component with a characteristic decay time of ∼7 ns; and a slow component produced with a characteristic decay time of ∼1600 ns [33,34].These values, which are assumed to be constant in our simulation, depend on the concentration of non-argon impurities in the liquid [35,36] and on the electric field [37].A two-component exponential model for the scintillation time profile is used to simulate the broadening of the light signals due to these processes, where   is the fraction of total light arising from the prompt component (∼0.3 [2]), and   and   are the prompt and slow decay times, respectively.The convolution is direct and truncated: where   ,min =   − 5    .The convolution kernel is truncated at 5× the scintillation slow component lifetime.Within this interval, >99% of the total light from an energy deposit will have been emitted, thus minimally impacting the accuracy of the simulation.However, by using a truncated convolution, we improve the scaling of the simulation from O ( 2 ) to O ( ×  trunc ), where  trunc is the number of truncated time ticks.This convolution is implemented in a CUDA kernel and threads are distributed in a two-dimensional grid across the light detector index  and the time  For the light scintillation calculation and the electronics response calculation, the run time does not increase with more track segments.For reference, the drift time window of ND-LAr will be around 3 × 10 3 time ticks.
tick index   , such that each thread calculates the weighted sum across the input array  pe, .The direct convolution of the scintillation light profile along with the electronics response convolution, discussed later, are the most computationally intense stages of the light simulation.Figure 10b and figure 10c show the performance of these kernels with the number of time ticks simulated and the number of light channels.Within the event sizes tested, regardless of the number of time ticks and light channels, a significant speed-up is realized by moving this calculation to the GPU.
Fluctuations in the number of photoelectrons produced by the incident light due to counting statistics is included for each time profile bin.An inverse transform sampling Poisson random number algorithm [38] is used to generate the number of observed photoelectrons per time tick.The average number of steps required by this algorithm scales in proportion to the mean of the distribution.To improve the runtime for large signals, at greater than 30 expected photoelectrons, a normal distribution ∼ N ( scint, [  ], √︁  scint, [  ]) is used.The choice of 30 photoelectrons is selected to limit the error from this approximation to less than 1%.

Electronics response
The electronics response of the light detector readout is simulated assuming a perfectly linear ADC, a user-specified unit-normalized impulse response model, and purely uncorrelated noise: where   [  ] is the simulated noise,  resp [  ] is the unit-normalized response model,  resp is the number of bins in the response model, and   is the individual photosensor gain with units of [ADC • time / PE].Noise is generated individually on each light detector by using a user-specified fast-fourier transform of the noise distribution to generate random phase sinusoids with appropriate amplitudes.Figure 11 demonstrates the simulation sequence on an example muon decay event with non-trivial time structure.
Each time profile is segmented into individual light triggers using a direct threshold on the sum of multiple light detectors.A serial algorithm loops over each light detector group and identifies each threshold crossing, accounting for dead-time between triggers.Because the timing of later triggers depends on the timing of earlier triggers, this algorithm cannot be parallelized.However, the trigger simulation only contributes ∼ 3% to the overall simulation runtime and scales with the number of time ticks simulated, thus the use of a serial algorithm for the trigger generation is not a significant contributor to the overall simulation time at the DUNE Near Detector regime of <100 triggers per spill per light detector group.
The final ADC waveform is then linearly interpolated into the desired sampling frequency and ADC resolution.Trigger information and waveforms are saved into dedicated datasets within the same output HDF5 file as the charge simulation.

Truth propagation
True particle energy deposition information can be propagated through this chain to provide the true track segment id and incident photoelectrons contributing to each ADC sample.However for typical light detectors with O (cm) spatial resolution and relatively broad response times O (µs), most energy deposits originating from an interaction overlap.Thus, the number of truth entries needed scales as ( segments,typ /) × ( samples ) × ( photosensors ) × (visible volume).Using typical numbers of O (500/event m 3 ), O (100), O (100), and O (10 m 3 ), respectively, tracking the true track segment id and intensity at each sample necessitates O (5 × 10 7 ) truth entries, using a minimum of O (400 MB/event) assuming each record consists of a 32-bit integer and 32-bit floating point number.Since the truth information for an event is highly correlated in time, we opt to by default only preserve the true number of photoelectrons and arrival time per track segment per light detector, reducing the required space by a factor of  samples and runtime by a factor of  segments /tick.We however maintain the option to track the full truth information as needed by setting a configuration flag.

Profiling
The larnd-sim code has been profiled using the NVIDIA Nsight ™ Systems and the NVIDIA Nsight ™ Compute tools v2021.3.0 with an input dataset of 250 simulated cosmic rays.The analysis   shows that the algorithm is well optimized for GPUs.The charge simulation is responsible for around 70% of the processing time.Of this 70%, 97.5% of the time is used by the induced current CUDA kernel described in section 3.3.2.This kernel, in turn, spends 99.9% of the processing time for computing operations.The time spent for memory transfers, including also the initial time to allocate the dataset on the GPU memory, can be considered negligible.
A Roofline [39] analysis performed on the induced current CUDA kernel proves that the algorithm is compute-bound, so the time needed to complete its task is determined principally by the computing speed.Figure 12 shows that kernel performance for 10 cosmic-ray events with a total of 5013 active pixels, which is a typical amount of data processed for each GPU iteration, is close to the peak FLOP/s limit (in orange) and far from the memory bandwidth limit (in blue) of the GPU being used.
Overall, for a 0.5 m 3 LArTPC operated with a surface-level cosmic ray rate, the inclusion of the light simulation increases the runtime, the memory usage, and the overall data volume of the output by 30%, 15%, and 20%, respectively.

Simulation of a cosmic-ray sample and data comparison
A single-phase LArTPC called the ArgonCube Module-0 Demonstrator was operated in spring 2021 at the University of Bern as a prototype for the ND-LAr.The detector is divided into two functionally identical TPC drift regions, sharing a central high-voltage cathode that provides the drift electric field.Each anode is equipped with an array of 2 × 4 LArPix tiles of size 31.038× 31.038cm 2 .A total of 4900 pixel pads of size 0.4434 × 0.4434 cm 2 are etched on the side of each tile facing the active volume.Each tile is instrumented by 100 ASICs placed on the opposite side of the board.The LArTPC drift length is 30.27cm.
A dataset of cosmic-ray events was acquired with an applied electric field of 0.50 kV/cm.The detector performances are analyzed in detail in ref. [40].
Here we will compare the cosmic-ray data acquired with this experimental setup to a sample generated by the larnd-sim software.As a first step a dataset of cosmic rays is produced using the CORSIKA cosmic-ray generator v7.7410 [41], using FLUKA 2011 for the low-energy hadronic interaction [42].Then, the passage of the cosmic rays through the Module-0 geometry is simulated using edep-sim, a C++ wrapper around G 4. The output of edep-sim is passed to larnd-sim, which produces a HDF5 sample in the same format of the Module-0 data.
Both the simulation and the data samples are passed to a reconstruction script, which takes as input an array of three-dimensional hits, one for each ADC count.The  coordinates are given by the position of the pixel center on the tile, and the  coordinate is given by the corresponding clock tick, using the inverse formula of eq.(3.3).The hits are clustered together using the DBSCAN algorithm [43].Hits grouped in the same cluster are fed to a principal component analysis (PCA), which returns three-dimensional segments that we define as reconstructed tracks.Figure 13 shows an example of a simulated cosmic ray crossing the cathode plane after the reconstruction stage, with the reconstructed track and the corresponding hits.
To improve the system self-trigger stability, 7.8% of the pixels in Module-0 were disabled due to an understood grounding issue, mostly near the tiles edges.The same pixels were also disabled in the simulation.Thus, reconstructed tracks may show artificial gaps due to the presence of disabled channels.Also, cathode-piercing tracks can be reconstructed as separated tracks, due to the non-null cathode thickness.In order to reconstruct the original cosmic-ray trajectories, we iteratively stitch together pairs of reconstructed tracks, based on their relative directions and positions at the cathode.To set the threshold for these metrics, a subsample of data events are reconstructed a second time after randomly translating and masking hits that fall into the disabled regions.We require that 95% of the gaps produced in this procedure are correctly identified and resolved.
Module-0 was operated principally in two channel threshold regimes, low and high, which correspond to thresholds of around 6000 and 12000  − , respectively.Here we compare the lowthreshold data to the simulated sample, which was produced with a discrimination threshold roughly equivalent to the low regime one.
The amount of charge deposited per unit length (/) is measured both in data and simulation using the reconstructed tracks, which are required to have at least 10 associated hits.In the simulation, the signal amplitude of the CSA is calibrated using a fixed gain of 250 e − /mV.The value of the gain in the data is obtained by performing a template fit of the / distribution with the simulated one, shown in figure 14.Here, the  is the sum of the hits charge associated to the reconstructed track and  is the length of the reconstructed track.The best-fit value of 219 e − /mV is compatible with the value of 221 e − /mV, obtained with a dedicated stand-alone test of the LArPix ASIC CSA.
Reconstructed tracks can be sub-divided into segments of variable length, from 10 to 400 mm.The deposited charge  is the sum of the hit charges belonging to the same segment and  is The hits are grouped in clusters and a fed to a principal component analysis, which returns a reconstructed track.This plot shows, clockwise from top left, the hits for a simulated cosmic ray on the  and  planes, a three-dimensional event display, and the hits charge as a function of time.Figure 15 shows the / data and simulated distributions, as well as their respective fits.The / distributions, normalized to the number of tracks, are in good agreement with data and simulation for all segment lengths.

Conclusions
We show that it is possible to implement the simulation of a pixelated LArTPC using highlyparallelized GPU algorithms.The algorithms are written in Python and they are compiled on the GPU using the Numba just-in-time compiler.
The software provides an end-to-end microphysical simulation for light readout and pixelated charge readout.Table 1 shows a summary of the speed-up achieved by the GPU implementation for various steps of the simulation.In general, trivial algorithms (such as the calculation of the recombination factor) require a large amount of input data before the GPU implementation starts being faster.This is expected, since there are launch and execution overheads associated with CUDA kernels.For more complex algorithms, such as the simulation of the charge readout, the GPU generally starts being may orders of magnitude faster than the CPU with only a few tens of simulated pixels in the detector.In particular, the calculation of the induced current, which is the most computationally expensive task, is around four orders of magnitude faster for O (10 3 ) pixels.Likewise, the simulation of convolution-heavy light signals is faster on all event sizes tested, typically achieving three orders of magnitude improvement.With the acceleration achieved, we are able to simulate the charge (light) signals on  (10 6 ) ( (10 3 )) individual channels with sub-microsecond (nanosecond) resolution and in timescales reasonable for large Monte Carlo simulations.
As an example, the full simulation of 10 4 cosmic rays in a ton-scale LArTPC takes in total approximately 450 s, 100 s for the simulation of the passage of the initial particles through matter with edep-sim, and the remaining 350 s for the detector simulation with larnd-sim.Highlyparallelized simulation algorithms such as the one described in this document could be adapted to speed up the simulation of the DUNE Far Detector as well, which will have O (10 6 ) readout channels [45].
We have also performed the comparison of a simulated sample of cosmic rays with the data acquired by a prototype LArTPC equipped with a large-scale pixelated readout system, called the ArgonCube Module-0 Demonstrator.Charge distributions in data and simulation are in general good agreement.
The Module-0 LArTPC was moved from Bern to Fermilab in October 2021, where it will be tested on the NuMI neutrino beamline with three other identical modules, currently under construction [40], forming the ND-LAr 2x2 prototype.The production of a dataset of accurately simulated neutrino interactions will be of fundamental importance for the analysis of this measurement.larnd-sim will also be used to simulate the detector response of the full ND-LAr detector and has been included into the DUNE Offline Computing Conceptual Design Report [46].The DUNE collaboration is already using larnd-sim on the NERSC Perlmutter supercomputer, which shares a similar GPU node architecture with Cori, to perform ND-LAr 2x2 prototype simulations.A workflow chaining together edep-sim, larnd-sim and a machine learning-based reconstruction stage is currently under development.
The choice of a high-level and largely supported programming language such as Python can enable the implementation of a fully differentiable simulator in the near future, by exploiting libraries commonly used in the machine learning and artificial intelligence fields (TensorFlow [47], JAX [48]).A differentiable model will be able to use a gradient-based optimization, such as gradient descent, to automatically infer the detector simulation input and the detector physics model parameters.This technique has already been actively explored for the simulation of the photon propagation in a LArTPC [49].

Figure 2 :
Figure 2:Processing time for the calculation of the recombination factor R Birks of eq.(3.1) with the GPU implementation using a CUDA kernel (blue), the CPU implementation using Numba (orange), and a sequential Python for loop (green).For reference, a neutrino beam spill corresponds on average to O (10 5 ) energy deposition segments in the ND-LAr.This computation was performed on a node of the NERSC Cori supercomputer, which contains two sockets of 20-core Intel Xeon Gold 6148 (Skylake) at 2.40 GHz and 8 NVIDIA® Tesla® V100 (Volta) GPUs.
(a) Drift electric potential field in the region near the pixel plane.Shown is a slice in  along on a pixel center.The pixel surfaces are set to 0 kV, while the gradient of the far field at the surface of  = 10 mm is set to 0.5 kV/cm.The color scale corresponds to the value of the drift potential.(b) Weighting potential field in the region near a "pixel of interest".The central ( =  = 0 pixel is set to a unit potential, while all other surfaces are set to 0 potential.The resulting unitless field defines the susceptibility to current induction by charges moving nearby.

Figure 3 :
Figure 3: The drift (left) and weighting (right) field potentials obtained by finite-element analysis in the region near a generic pixel.Shown are slices of the potential fields along an  −  plane which crosses through the centerline of a pixel.

Figure 4 :
Figure 4:To determine the near-field pixel response, we simulate the paths and velocities of drift electron within a 3D FEM field model.Here we show the simulated electron drift paths (left), and the corresponding induced current on the pixel (right) for a subset of the starting points used.The color of the trajectory on the left corresponds to the color of the current induced curve on the right.

Figure 5 :
Figure 5: Induced current on a 4.4mm pixel from a drifting electron at a given drift time and ,  position relative to the pixel, in units of electrons per 0.1 µs.The different calculations used for the near-field and far-field regions are described in the text in section 3.3.1 with the near-field boundary highlighted with a dotted line.

Figure 6 :
Figure 6:Processing time for the calculation of the induced current with the GPU implementation using a CUDA kernel (blue) and with a CPU implementation compiled with Numba with explicit parallelization of the loops (orange).For reference, a neutrino beam spill will induce a signal on around 5×10 4 pixels of the ND-LAr.The calculation has been performed with different values of the time sampling (0.05 µs, 0.1 µs, 0.2 µs) on a node of the NERSC Cori supercomputer, which contains two sockets of 20-core Intel Xeon Gold 6148 (Skylake) at 2.40 GHz and 8 NVIDIA® Tesla® V100 (Volta) GPUs.

Figure 7 :
Figure 7: The simulation of the induced current performs the calculation of eq.(3.9) for each active pixel.Here we show a 3D event display of a simulated cosmic-ray muon (left, in red) and the current induced on a pixel by the cosmic-ray muon and a delta ray (right).

Figure 8 :
Figure 8:Processing time for the simulation of the electronics response with the GPU implementation using a CUDA kernel (blue) and with a CPU implementation compiled with Numba with explicit parallelization of the loops (orange).For reference, a neutrino beam spill will induce a signal on around 5 × 10 4 pixels of the ND-LAr.The simulation has been performed on a node of the NERSC Cori supercomputer, which contains two sockets of 20-core Intel Xeon Gold 6148 (Skylake) at 2.40 GHz and 8 NVIDIA® Tesla® V100 (Volta) GPUs.

Figure 9 :
Figure 9: Bi-dimensional projection of a cosmic-ray muon interacting in the detector with the integral of the current induced on each pixel (left) and the sum of the ADC counts after the digitization stage (right).
Comparison of electronics response convolution

Figure 10 :
Figure10: A summary of the runtime of the CUDA-compiled light simulation kernels described in the text using 100 identical track segments in the center of a TPC.The three most computationally intensive steps of the light simulation are compared to an equivalent parallelized but CPU-based calculation.For the light scintillation calculation and the electronics response calculation, the run time does not increase with more track segments.For reference, the drift time window of ND-LAr will be around 3 × 10 3 time ticks.
2D projection of a simulated stopping muon and subsequent decay.

Figure 11 :
Figure 11: Example simulated muon decay and subsequent light simulation sequence.The region highlighted in figure 11a corresponds to the location of the light detector simulated in figure 11b and 11c.

Figure 12 :
Figure 12:Roofline plot for the induced current CUDA kernel (black square) on a GPU node of the NERSC Cori supercomputer, equipped with 8 NVIDIA® Tesla® V100 (Volta) GPUs.The kernel performances are far from the memory-bound region (in light blue) and close to the peak FLOP/s limit (in orange) of the GPU being used.

Figure 13 :
Figure 13:The ADC counts are transformed into three-dimensional hits by a reconstruction script.The hits are grouped in clusters and a fed to a principal component analysis, which returns a reconstructed track.This plot shows, clockwise from top left, the hits for a simulated cosmic ray on the  and  planes, a three-dimensional event display, and the hits charge as a function of time.

Figure 14 :Figure 15 :
Figure 14: / measured in data and simulation for reconstructed tracks with at least 10 associated hits.The distributions have been normalized to unity.

Table 1 :
Summary of speed-up factors achieved by the GPU implementation for various simulation steps.