Differentiable Simulation of a Liquid Argon Time Projection Chamber

Liquid argon time projection chambers (LArTPCs) are widely used in particle detection for their tracking and calorimetric capabilities. The particle physics community actively builds and improves high-quality simulators for such detectors in order to develop physics analyses in a realistic setting. The fidelity of these simulators relative to real, measured data is limited by the modeling of the physical detectors used for data collection. This modeling can be improved by performing dedicated calibration measurements. Conventional approaches calibrate individual detector parameters or processes one at a time. However, the impact of detector processes is entangled, making this a poor description of the underlying physics. We introduce a differentiable simulator that enables a gradient-based optimization, allowing for the first time a simultaneous calibration of all detector parameters. We describe the procedure of making a differentiable simulator, highlighting the challenges of retaining the physics quality of the standard, non-differentiable version while providing meaningful gradient information. We further discuss the advantages and drawbacks of using our differentiable simulator for calibration. Finally, we provide a starting point for extensions to our approach, including applications of the differentiable simulator to physics analysis pipelines.

High quality simulations of physics detectors are a fundamental piece of infrastructure across a diverse set of scientific disciplines, with uses ranging anywhere from physical inference to experimental design.In high energy particle physics measurements, such as those targeted by DUNE [1], detector simulation is particularly crucial to analysis and event reconstruction.However, detector simulations often have discrepancies relative to the real experiments, which can lead to systematic errors and biased results, reducing the sensitivity and accuracy of physics measurements.Dedicated calibration procedures are therefore required to minimize the differences between these two regimes.
In particle physics, a conventional approach for identifying sources of data-simulation difference is to isolate different detector modeling processes using selected control samples [2][3][4][5][6][7][8].These control samples are auxiliary datasets designed to amplify the impact of particular detector effects.Calibration using these control samples may proceed by either directly altering the simulation of the amplified underlying physics process or by performing an ad hoc correction on observables without direct modification of the simulation.However, different detector processes can affect the measured data in similar ways, meaning that an isolated approach may not capture the interplay among entangled detector processes, making it difficult to fully address sources of data-simulation difference.The ad hoc approach [2], by not including explicit correction of the physics models, may fail to capture upstream effects.
A simultaneous correction of all detector processes and physics models would be an ideal new paradigm for detector calibration.However, this has not been previously achieved due to limitations of the existing software tools and calibration frameworks in addressing the potentially high dimensionality of the associated parameter space.
Gradient-based optimization provides a pathway towards improving calibration.Optimizers based on gradients are the cornerstone of modern machine learning, used in the training of many of the most common machine learning methods, notably neural networks [9].Gradientbased methods are powerful and efficient in high dimensional optimization, supporting a simultaneous fitting of arbitrarily many parameters.Equipping existing detector simulations with gradient information enables the use of gradient-based optimization for calibration.Further, as gradients are used to fit a set of physical parameters within the same detector simulation used for physics analysis, application of this calibration is trivial -the fitted parameters are simply used within the same simulation code.Therefore, calibration results can be easily tracked and consistently applied.By adjusting physical model parameters, this calibration also provides a characterization of the detector for direct experimental feedback.
Gradient-based optimization requires efficient calculation of the necessary gradients.A variety of software packages, such as PyTorch [10], TensorFlow [11], and JAX [12], are capable of performing this calculation.These tools are commonly used to support the training of neural networks, in which gradients of a loss function with respect to neural network parameters are used for a minimization.The calculation of these gradients is done using a set of techniques called automatic differentiation [13].In automatic differentiation, computer code is decomposed into a set of fundamental operations with known derivatives.Derivatives of the full program may then be calculated using these fundamental operations and the chain rule.As automatic differentiation is implemented in these machine learning software packages, any code written using these frameworks can use automatic differentiation for calculating gradients, including physics simulation.This type of approach is broadly called differentiable programming.
Methods other than gradient-based optimization using automatic differentiation exist for solving the problem of multi-dimensional parameter fitting.Approximate gradient methods [14], e.g. based on finite difference approximations to the gradient, do not require rewriting code into an automatic differentiation framework.However, such methods become computationally intractable in terms of number of function evaluations as the number of dimensions increases, and they rely on numerical choices such as step size between finite difference function evaluations, which may introduce inaccuracies.
Evolutionary or genetic algorithms [15] are popular for black-box optimization, and require no rewriting of the simulator or any assumptions about differentiability.However such models rely on explicit exploration of the search space, which becomes exponentially large as the number of dimensions increases, leading to slow or poor convergence.Bayesian optimization [16] describes another class of black box algorithms, but requires additional assumptions, e.g. the choice of a prior and a surrogate model.The fit quality and computational complexity of this surrogate (often a Gaussian process) also often scales poorly due to the exponentially large search space as the number of fitting dimensions increases.
Gradient-based optimization using automatic differentiation requires adapting simulation code into an automatic differentiation framework.It further assumes differentiability and may become trapped in local optima in the optimization landscape.However, unlike approximate gradient methods, automatic differentiation is computationally efficient in high dimensions, with a cost incurred only by an increased number of stored intermediate results.The gradients computed by automatic differentiation are exact, avoiding numerical choices such as finite difference step size.Gradient-based optimization is able to achieve convergence to optima in high dimensional spaces; it is used for training even the largest neural networks, which can have billions of parameters, an infeasible task for genetic algorithms or Bayesian optimization.Furthermore, gradient-based optimization aids in direct interpretation of results, as gradients are directly tied to simulation output, and iterations are sequential steps in optimization parameters.
We would like to demonstrate the utility of gradientbased optimization for a detector calibration task.Liquid argon time projection chambers (LArTPCs) are an excellent candidate for this demonstration.LArTPCs are used in a variety of modern particle physics experiments.They have several appealing experimental qualities [17], and are capable of measuring both particle trajectory and energy, known as tracking and calorimetry respectively.DUNE [1] is a developing neutrino experiment which relies heavily on LArTPCs.There is significant community and governmental investment in DUNE, meaning that there is much interest and effort behind high quality LArTPC simulation targeting neutrino physics, as well as significant potential impact in improving the corresponding calibration pipeline.These factors, as well as the conceptual simplicity of the detectors, make LArTPCs a prime target for developing a gradient-based calibration.
We therefore present a differentiable simulator for a DUNE LArTPC prototype.Starting from a snapshot of the simulator presented in Ref. [18], we have made a variety of modifications to allow for the calculation of gradients via automatic differentiation.These gradients explicitly describe the dependence of simulator predictions on simulation parameters, and provide a natural, efficient, and automatic path towards detector calibration.
The physical models used in simulation may be approximate descriptions of detector processes, and there may be additional components not captured by these models.This means that after a calibration of parameters of these models, there may still be discrepancies between simulation and data.The differentiable simulator specifically targets the improvement of parameter-based calibration.Extensions to account for missing or incomplete simulation components are left for future work.
In this paper we describe the steps taken to go from a conventional to a differentiable LArTPC simulation, demonstrate the capabilities of the simulator for the calibration of simulation parameters, and analyze the com-putational bottlenecks and pitfalls to be addressed for broader adoption of this tool.

II. SIMULATOR IMPLEMENTATION
A LArTPC is a detector with an applied electric field across a volume of liquid argon.Deposition of energy can cause ionization in the liquid argon, and the ionization electrons drift under the influence of the electric field to be measured as current in an electronics readout.Our simulator uses a detector configuration corresponding to a module design for a DUNE LArTPC prototype.The rectangular detector module has a size of 60 cm × 60 cm × 120 cm, and contains two back-to-back LArTPCs.The electric field in each LArTPC is generated by an applied voltage difference between each anode and a cathode plane.The cathode plane is shared and located between the two LArTPCs.The two anode planes are parallel to the cathode plane on respective edges of the detector volume.Both anode planes consist of a pixelated charge readout called LArPix [19].We define x and y as the horizontal and vertical axes along the cathode plane, with z corresponding to the drift axis.Resolution of the LArPix readout in each dimension is determined by the pixel pitch and time sampling frequency.The LArPix pixel pitch is 4.4 mm.The spatial resolution in the drift direction is below 2 mm.
The simulator developed for the DUNE LArTPC prototype is called larnd-sim [18].Energy depositions are represented as line segments ("particle segments") defined by 3D start and end positions, (x, y, z), and an associated energy, E. The simulator models the production of ionization electrons due to these energy depositions, as well as the drifting of electrons to the pixel readout at each anode plane.A current in the pixel readout can be induced by charges approaching or being directly collected on a given pixel.Each pixel is an independent readout channel with its own setup for trigger threshold and gain.Currents in the pixel readout are digitized to analog-to-digital converter (ADC) counts before being read out.The location of induced current in the pixel plane provides an x, y measurement.Timing information provides information on z.Magnitude of the induced current (ADC counts) gives information on deposited energy.
We describe the details of the simulator in the following stages corresponding to sequential physics processes: 1) Charge quenching; 2) Electron drifting; 3) Current accumulation; 4) Electronics simulation.We highlight several of the physics models used in the simulator.A summary of information about the simulator is presented in Fig. 1.
Charge quenching.In this stage, the simulator determines the number of ionization electrons produced given the energy deposition per particle segment, dE.This number is given by  where W ion = 23.6 eV is the work function, i.e., the average energy required to produce an ionized electron in liquid argon [20,21].
The ionized electrons may subsequently recombine with nearby argon ions.The electron survival rate, describing the fraction of electrons which do not recombine, depends on the applied electric field and the local charge density.There are two commonly used recombination models in modern LArTPCs: the Birks model [6] and the modified box model [5], which yield similar results for our uses.For this work, we primarily use the Birks model to describe the process of electron recombination, and the electron survival rate takes the form where E is the applied electric field, dE dx is the energy deposition per unit length, which gives the local charge density, and ρ is the liquid argon density.A B and k B are parameters of the Birks model and are typically fit in conventional calibration procedures.
After recombination, the number of electrons is Electron drifting.Energy depositions in the simulator are represented using 3D spatial coordinates, (x, y, z).For simulation of the readout, the position, z, along the drift axis must be translated into a drift time, t drift .The electric field in the simulator is considered to be uniform and perpendicular to the readout planes.Therefore, the relationship between drift time and distance from a given anode along the drift axis is given by ∆z = v drift • t drift .In the simulation, the drift velocity, v drift , is directly associated to the electric field, E, and is given by The electron mobility, µ(E, T ), is a function of electric field, E, and liquid argon temperature, T , and is an empirical model derived from previous measurements [22].The simulated t drift is then calculated as the drift distance divided by v drift .
On their path to the readout, the ionized electrons that have survived recombination can be further consumed by electronegative impurities, such as oxygen and water [23].Assuming the impurities are distributed uniformly in the liquid argon, the survival rate depends on t drift and the impurity type and level, which can be summarized by the electron lifetime τ , the mean drift time of an electron before capture by an electronegative impurity.The number of electrons that arrive to the readout after both recombination and drifting is therefore modeled by During the drifting stage, we also calculate charge diffusion, describing the spread of the ionization electrons in 3D space, which depends on t drift .The corresponding longitudinal and transverse diffusion lengths, σ L and σ T , determine the scale of the charge spread along the drift direction (z) and in the pixel plane (x, y) respectively.The resulting distribution is an important input for the simulation of the readout.
Longitudinal and transverse diffusion lengths are given by where D L and D T are longitudinal and transverse diffusion coefficients in liquid argon.These coefficients are usually extracted via separate, targeted measurements.Current accumulation.In this stage, we model the distribution of charge on the pixel readout for each given particle segment and use that distribution to calculate the generated current on the appropriate pixels.The total charge for a given segment is determined by the number of surviving ionization electrons.This is assumed to be spread along the length of the segment, and the impact of diffusion is modeled via a 3D Gaussian with widths given by the diffusion lengths σ L,T for longitudinal (z) and transverse (x and y) components relative to the drift direction.The charge distribution is convolved with a functional current model, which depends on the electron drift time and the pixel position.Integration is done by summing the contributions from points which are sampled in a cube surrounding each particle segment.
The current integration is one of the most computationally intensive steps of the simulation.See Sec.II C for more details.
Electronics simulation.The final stage of the simulation models the charge readout electronics.It incorporates the calculation of signals on relevant pixels, including the contribution from different particle segments, as well as triggering and digitization of current into ADC counts.The LArPix readout is nominally configured so that each pixel is a separate readout channel.These readout channels have a sampling rate of 10 MHz.Pixels in the readout are "self-triggering", meaning that they continuously take data.If the integrated current passes a given trigger threshold, the corresponding readout channels will integrate the current over a window of 1.8 µs and record the result.The time when the signal first rises above the threshold is registered as the time of the corresponding pixel "hit".Readout channels are reset after this signal integration.
Stochasticity in larnd-sim is introduced to model noise in the LArPix electronics system.The simulator includes two types of noise: (1) a baseline uncorrelated noise, e.g. from thermal fluctuation, and (2) noise that occurs during the readout channel reset.Both sources of noise are additive with respect to the pixel current integration and are drawn from Gaussian distributions, each with mean at 0 and standard deviation set according to LArPix measurements.In this work, results are presented with both noise models turned off.See Sec.III for a more detailed discussion of this choice.

A. Software infrastructure overview
The primary implementation effort for this work has been focused on translating a snapshot of larnd-sim, based on CUDA kernels written using Numba [24], into an automatic differentiation framework [13].The framework chosen for this rewriting is EagerPy [25], designed to be agnostic to a user's choice of automatic differentiation backend (e.g.JAX, TensorFlow, or PyTorch), allowing flexibility in the particular choices of users building applications on top of the differentiable simulation.Usage of EagerPy code does require a choice of framework, and we use PyTorch for the results presented here.
Many of Python's automatic differentiation frameworks are designed efficiently around matrix and tensor operations, with near trivial GPU acceleration of such operations.Explicit CUDA kernels, on the other hand, are phrased in terms of individual operations on given GPU threads.In order to take advantage of the efficiency of these matrix and tensor operations, we have therefore vectorized the larnd-sim snapshot where possible.Writing the simulation in such a way allows for GPU acceleration by merely specifying the device of the simulation inputs, and takes advantage of the optimized tensor operations of these automatic differentiation tools.However, this comes at the cost of higher memory usage due to large tensor sizes.

B. Differentiable relaxations considered
Developing a differentiable LArTPC simulator requires the assumption of differentiability of all operations that depend on the variables of interest.As gradients are calculated using the chain rule, this sequence of operations runs from the introduction of the relevant variables up until the simulator output, but might not include all operations within the simulator.Even if all operations in the sequence are differentiable, cases arise where gradients either vanish or are infinite, meaning that such gradients exist but are not useful for applications such as gradient-based fitting.One way of avoiding such issues is to introduce a set of differentiable relaxations: continuous, often smooth approximations of non-differentiable or poorly conditioned operations.Two scenarios of particular relevance for this work which require such relaxations are (1) discrete integer operations and (2) hard masking operations.
1.In our simulator, several quantities are discrete (e.g.number of electrons, ADC counts).In practice, such quantities result from floating point operations which are discretized, e.g.via truncation.This poses an issue for the gradient calculation in two ways.Recall that a derivative is formulated as In the case of truncation, there is a range of input x values for which the truncated output produces the same result (e.g., for truncation to integer values, trunc(x) = 0 for all x ∈ [0, 1)).This means that the gradient is 0. Additionally, the transition between discrete output values is of a step function nature, with an undefined (infinite) derivative at the transition point (e.g.trunc(1) = 1).All resulting gradients are therefore 0 or infinite, leading to poor performance of a gradient-based optimization.
To remedy this, we apply a differentiable relaxation, in this case by removing the truncation, allowing for a continuous variation of these nominally discrete values.The continuous variation allows for nonzero gradients, providing useful gradient information for downstream use.This relaxation is only necessary during any gradient-based optimization, and truncation may be applied after the optimization is complete to restore the physical meaning of the discrete quantities.
2. Present throughout the simulator as well are a variety of masking operations based on cuts, e.g. for variable x, requiring x > a for some fixed a ∈ R.
If the quantity df (x) dx is desired, this poses the same issue as discussed above, again due to the step function nature of such a cut.Depending on the desired application, derivatives through all such masks may not be required.However, one notable example for our particular case is a truncated exponential, defined as which appears as part of the current model.To aid in the gradient computation, the hard mask of x > 0 is replaced by a sigmoid function, defined as This function smoothly transitions from 0 to 1 as x moves from −∞ to +∞, with the rate of transition controlled by parameter k, approaching a step function as k → ∞.A demonstration of how the approximate truncated exponential model changes with k is shown in Fig. 2. Higher values of k correspond to a closer match with the unsmoothed truncated exponential, and k may be tuned to tradeoff between accuracy and gradient performance.A value of k = 100 is used in the results presented here.In the latest version of larnd-sim, the current model has been replaced by a lookup table, which itself poses some challenges for differentiability.However, such a structure may be incorporated into a differentiable framework via a neural network parametrization, as in Ref. [26].
The pixel readout, represented by coordinates x and y, is itself a discrete system.Measurement of timing (t) is also done as a discrete sampling at fixed intervals.This discreteness is challenging in our simulator, as x, y, and t are represented via matrix and tensor indices, rather than standalone quantities.Derivatives with respect to x, y, and t reflect changes in simulated quantities across pixels and times.For the application considered here, these changes do not have a large impact, and the corresponding derivatives can be safely ignored.For applications beyond this paper, we have implemented a differentiable relaxation of an important use of timing information in the trigger logic, which we describe below.Differentiable relaxations of pixel systems may also be developed, which can be considered for future work.
For LArPix triggering, given some cumulative charge as a function of discrete time sampling indices, we need to determine the time at which the cumulative charge crosses a particular value threshold.As this time is stored via an integer index, we have both the same discrete problem as above, and also a framework problem, as indices must be integers, which breaks the gradient flow.To remedy this, we first recognize that the gradient only depends on the local neighborhood of the threshold intersection.We may therefore (1) find the two discrete points surrounding the intersection, (2) linearly interpolate to find a continuous time of intersection, and then (3) store this continuous value as the relevant timing information.Fig. 3 illustrates this procedure on a single pixel.
The simulated output using all of these relaxations was compared with a reference simulation from our snapshot of larnd-sim.Both simulators produced very similar results, with an average deviation of 0.04 ADC counts per activated pixel, which is two orders of magnitude below the typical noise level for the input dataset defined in Sec.IV.

C. Computational performance
To support the use of our differentiable simulator by analyzers, we provide a detailed assessment of its computational demands, both in isolation and relative to our non-differentiable snapshot of larnd-sim.We assess performance of our differentiable simulator by analyzing two main metrics: computation time and GPU memory usage.All analysis is done using single NVIDIA Tesla A100 GPUs on the Scientific Data Facility at SLAC.When using our differentiable simulator for calibration, there is a cost incurred by both the forward simulation and the gradient computation, which is done via reverse mode automatic differentiation [13], also known as backpropagation.We present metrics relevant to both the forward simulation in isolation and the forward simulation together with backpropagation.Comparison to larnd-sim is done with the forward simulation only.

Computation time
We discuss the computation time of the differentiable simulator in terms of simulation time, which is the cost of the forward simulation in isolation, and fitting duration, which includes both the forward simulation and backpropagation.Simulation time changes between events, depending on a variety of interdependent variables such as number of particle segments in the event, the length of each of those segments, and the number of pixels with current induced.We found that this information may be summarized by the total segment length (ds), defined as the sum of the lengths of a collection of particle segments (e.g. an event or a batch, see Sec.III B).This is demonstrated by the clear correlation between ds for each event and simulation time shown in Fig. 4. The average duration per event ds is 0.07 s cm −1 .
The correlation between ds and simulation time motivates the use of this variable for batching the events during a gradient-based fit (Sec.III), as such a batching balances the computation time required for fitting each batch, which is useful for both large scale computation and in setting the stage for future parallelization (e.g. with multiple GPUs).The distribution of the batch fitting durations is presented in Fig. 5 for a batch size of ds = 100 cm.The resulting distribution is grouped around 25 s.This duration includes both the forward simulation and the gradient computation.
We have also tested the impact of different batch sizes, determined by ds values, on the total computation time of fitting a given dataset with a set number of epochs.While the computation time per fitting iteration is generally correlated to the batch size, the batch size has minimal impact on the computation time of the overall fit.
Because the gradient computation involves the storage of a large number of intermediate results, memory usage quickly becomes infeasible for desirable batch sizes (such as for ds = 100 cm) on single A100 GPUs.Therefore, we take advantage of PyTorch checkpointing, which discards these intermediate results and recomputes them during gradient accumulation.This reduces memory usage at the cost of increased computation time.The impact of checkpointing is included in Fig. 5.
To assess the performance difference between larnd-sim and our differentiable rewrite, we process an identical sample of events through both our snapshot of  larnd-sim and our differentiable simulator and compare the simulation time event-by-event.Fig. 6 shows that the simulation time of the differentiable simulator is roughly 25 times longer than the non-differentiable snapshot.Importantly, this comparison is done without the gradient computation for the differentiable simulator.This is a significant performance gap.However, there are a few concrete areas for further improvement.
The first area of improvement is related to the vectorization mentioned in Sec.II A. This vectorization is necessary for the use of automatic differentiation frameworks such as PyTorch.However, it tends to increase the total number of operations, as all operations are framed as large dense matrices instead of dedicated, element-wise kernels.LArTPC simulation is often a sparse problem, as typically readout signals per event only take place in a small fraction of the detector volume.Therefore, the increase in operations is expected to add significantly to the additional computational overhead.Alternative automatic differentiation tools, such as Enzyme [27] may be able to avoid this vectorized rewrite, and can be considered for future computational development.Furthermore, larnd-sim is just-in-time (JIT) compiled [24].This compilation avoids the overhead of interpreted code, and is expected to significantly decrease computation time.Although PyTorch does have JIT compilation tools, introduction of JIT compilation would require significant additional study, and therefore was not done for this first demonstration.
Because of the deep integration of JIT compilation with Numba's CUDA library, separation of the effects of vectorization and JIT compilation is not achievable without a significant alteration of either the original larnd-sim snapshot or our differentiable version.Therefore, the performance gap between the two versions includes the combined effects of both sparse computation and JIT compilation.

Memory usage
In nearly every large scale computation, there is a trade-off between memory usage and computation time.Understanding memory usage in our application and being able to estimate it in advance allows us to do specific optimizations aimed at efficiently using allotted memory resources, thereby reducing computation time.
Peak memory usage in our simulator occurs during the computation of the signal on each pixel (the Current accumulation stage of the simulation), which is also the most computationally intensive step of our snapshot of larnd-sim [18].In our case, the high memory usage is due to the construction of a very large, dense matrix.Given a particle segment and a pixel, the signal has to be computed by sampling in four dimensions: parallel to the readout planes, along the x (1) and y (2) axes; across time of arrival at the anode (3); across time of evaluation of the current (4).We denote the number of samplings in each of these four dimensions as N x , N y , N T0 , and N T f respectively.The spacing between sampled points in x and y is identical for all pixels, with N x = N y = 30 points per pixel.The time samplings, however, are data dependent, and the following formulas give upper bounds on the required tensor dimensions: (10) and with d as the diagonal length of the pixel pad, σ T as the transverse diffusion distance, ϕ as the angle of the particle segment relative to the drift direction, σ L as the longitudinal diffusion distance, T sampling as the LArPix sampling rate, ∆ z as the z range of the segment, and T padding as the number of recorded samples before (after) the start (end) of signal on the pad.Eq. 10 shows that N T0 depends on angle ϕ.When ϕ approaches 0, cot ϕ goes to infinity.This means that N T0 , and therefore the memory usage, becomes unbounded for particle segments parallel to the drift direction.A simple angular cut can limit the memory usage, and is applied in the calibration data set (see Sec. IV).
The size of the tensor in this peak memory estimation is given by M = N segments ×N pixels ×N T f ×N T0 ×N x ×N y .Two copies of this tensor are used in the most memory intensive computation.The estimated peak memory is therefore 2 times the tensor size multiplied by the memory used per tensor element (32 bits for floating point precision).Fig. 7 shows that the estimated peak memory per batch is very close to the measured value and that all measured peak memory values fall below the estimated upper bound.
Although the current calculation is vectorized, this peak memory stage of the simulation is converted into a loop over particle segments and pixel "chunks", which are batched subsets of the full pixel set.The pixel chunks are sequentially passed through the memory intensive computation.This chunking operation directly trades-off between memory usage and computation time.A larger chunk size leads to higher memory usage, but requires fewer loop iterations, and thus takes less time.
Given the ability to predict the peak memory usage for each event, we can adapt the pixel chunk size to maximize usage of the available memory and reduce the computation time.The speed-up factor from tuning this chunk size relative to using a chunk size of 1 pixel for a typical event is shown in Fig. 8.This speed-up is not linear because of the finite number of computing units for the given GPU, which limits the number of parallel operations.In general, this technique achieves a speed-up of around a factor of 5 for most of the events, and it enables efficient use of GPU memory.

III. PARAMETER FITTING
In the following, we will apply the differentiable simulator described above to the task of calibration.In our context, this means tuning parameters of our simulator to match some given dataset, which can be either simulated or from a real experimental setup.Let f (χ, θ) represent our differentiable simulator, with input particle segments χ and parameters θ.Our focus is to optimize the param-eters θ.
For the optimization of θ, we propose an "analysis-bysynthesis" approach: 1. Choose initial values for the parameters, denoted as θ 0 ; 2. Run the forward simulation with these parameters, f (χ, θ 0 ); 3. Compare the simulation output with target data F target , using a loss function, L(f (χ, θ 0 ), F target ); 4. Update parameter values θ 0 → θ i to minimize the loss, and repeat from step 2 starting from parameters θ i and forward simulation f (χ, θ i ).The differentiable simulator enables a gradient-based update rule for θ i .
The ultimate goal for this parameter fitting approach is an application to real, measured data.For measured data, we cannot access the true particle energy depositions, and the fitting procedure must include inference of particle segments χ from the data.If the simulation f describes the measured data well, we may use an analysisby-synthesis approach to find parameters θ and particle segments χ which best describe a measured F target .We cannot know the "real" parameter values θ target or particle segments χ target which produced the measured F target , but the procedure will produce simulation outputs which closely mimic the data, and fitted parameter values and segments may provide insight on experimental settings.
To demonstrate the capability of our differentiable simulator in optimizing model parameters, we focus on a controlled case where F target is generated using our simulator.Simulated targets F target are constructed as f (χ, θ target ), where particle segments χ are known and may be used directly in the fit.Similarly, θ target are known parameter values producing F target .We can therefore benchmark the performance of our optimization by comparing fitted parameters θ with these known values of θ target .Successful fits recover fitted values θ = θ target .This procedure is therefore known as a closure test.
Because our simulator, f , is differentiable, we can update the parameter values (step 4) using gradientbased optimization algorithms.Defining a differentiable loss function L, we are able to efficiently calculate ∇ θ L(f (χ, θ), F target ).For, e.g., gradient descent, the parameter update then takes takes the form for iteration step i, where η is a learning rate which controls the size of the update.In practice, more sophisticated update rules (e.g.Adam [28]) may provide better convergence.This fitting procedure is illustrated in Fig. 9.For visual simplicity, we restrict the illustration to the 2D parameter phase space of electric field E and lifetime τ .The gold star indicates a set of target parameter values in this 2D phase space.The background color shows the loss landscape, where the loss function is evaluated with respect to the simulated target for a simulation at each sampled point.The white arrows across the loss landscape indicate the negative gradients, corresponding to the direction of a gradient descent step with our differentiable simulator.These arrows point towards the minimum of the loss at the target parameter point.The progression of 5 fits in the parameter space, labeled by color, with different initial parameter values (filled circles) is overlaid on the loss landscape.The colored arrows indicate the parameter update per iteration step.All 5 fits converge to the target parameter values.
For the demonstration of parameter fitting with our differentiable simulator, we select 6 commonly considered detector model parameters which are listed in Table I and highlighted in Fig. 1.A description of these parameters can be found in Sec.II.The nominal values of the parameters are set to the default parameter settings of our snapshot of larnd-sim.We define a relevant physical range for each parameter which covers a range of possible parameter values from a variety of measurements.These ranges are typically larger than the uncertainties on a single, specific measurement.Among these 6 parameters, the Birks model parameters, A B and k B , and the diffusion parameters, D L and D T , are generic model parameters for the general LArTPC system, and therefore are not expected to vary significantly between experiments.In contrast,the electric field, E, and electron lifetime, τ , are operational LArTPC model parameters that depend on particular experimental processes and settings.This means that the range of τ , for example, is relatively wide, as it depends on how well the liquid argon is purified, which has minimal external constraint.With the differentiable simulator, all six parameters of interest are optimized simultaneously.This has not been achieved in conventional calibration due to the limitations of the associated methodology.We briefly describe the conventional approach for setting each of these parameters.
In conventional calibration, the Birks model parameters, A B and k B , are jointly fit using a control sample.This control sample allows for estimation of dE dx as a prior for the fit, and the electric field E is set to a fixed value (Eq.2).
The value of the electric field E is often calculated using the measured voltage on the cathode divided by the maximum detector drift distance (the distance from cathode to anode), assuming the cathode and anode are perfectly parallel planes.Deviations from this assumption, as well as shrinkage of the detector under cryogenic temperatures during operation, can shift E away from the nominal value.Furthermore, the actual electric field across the detector may not be perfectly uniform.However, this is often a sub-leading effect.
The electron lifetime τ is commonly measured with a muon control sample, which has relatively uniform dE dx .In order to fit τ , the recombination model parameters, A B and k B , and the electric field, E, are set to fixed values.
The measurement of the longitudinal diffusion coefficient D L often uses a control sample of muons due to their uniform detector signal.Nominally D L affects the signal extension in terms of drift time (width of the readout waveform).In order to extract D L , we need to separate out the effects from drift velocity and drift time.The electron lifetime and the electric field can both affect the shape of the readout waveform, and therefore need to be well understood.However, they typically are not explicitly treated in D L measurements.The transverse diffusion coefficient, D T , is usually extracted based on D L , assuming a fixed value of the electric field and a given electron mobility model µ(E, T ).
In these conventional approaches, the determination of particular parameters is often done assuming fixed values of the other parameters.In practice, different detector processes can effect the measured data in similar ways, meaning that this assumption may cause incorrect results.This is demonstrated in Fig. 10, which shows that if a bias exists in a model parameter which is fixed in the calibration, other parameters of interest may converge to biased values.In the figure, nominal values of the parameters are denoted by θ nom .∆ down and ∆ up are the distances from θ nom to the lower and upper boundaries of the range shown in Table I.For each parameter respectively, ∆θ is the average of ∆ down and ∆ up .Each panel shows the biases of the fitted parameter values in units of ∆θ resulting from a shift of one selected parameter to its lower (blue) or upper (red) range in the fitting target while fixing that same parameter to its nominal value during the fit optimization.All other parameters are set to their nominal values in the target.In several cases, fitted parameters deviate significantly from their target values.Therefore, optimizing all detector parameters simultaneously is important for avoiding calibration biases and achieving precision physics modeling.

A. Loss function
Optimization of fitting parameters requires the construction of an appropriate loss function, L. The properties of this loss function have a significant impact on the optimization.We describe several options and require-  ments in order to motivate our chosen approach.
The loss function must be differentiable with respect to the parameters of interest.It must be well-behaved when comparing two arbitrary simulation outputs.For the closure test, with no electronics noise, the loss function should give the global minimum when the fitted value θ = θ target , i.e. when two simulation outputs are identical.Similarly, loss values should increase as simulation outputs become more different, which should correspond with θ deviating from θ target .
The output of the simulation is the charge q, in ADC counts, read out on relevant pixels of coordinates x and y at time t.Pixel coordinates and readout times are discrete; it is possible to use these discrete bins to voxelize the entire LArTPC readout space in 3D.However, considering the number of pixels and the readout sampling frequency, a full resolution binning of the LArTPC readout would correspond to O(10 9 ) floating point numbers.Building a loss function, such as a mean-squared error, on these full resolution voxels involves comparing two such dense tensors, which is a very costly operation.
Typically, only a small fraction of the 3D space has recorded charge in the readout.One can, for instance, select rectangular regions of voxel space that only contain the active 3D voxels and their neighbors.This is less expensive, but still may be expected to contain empty voxels.Furthermore, the selected voxels for comparison must contain both the target readout and the output of each optimization step, and may still require a large number of voxels for sufficiently populated events or to contain sufficiently separated particle segments.
It is more efficient to carry out a "sparse" comparison by only calculating a loss where charge readout occurs.Instead of a rectangular voxel grid, we can frame the readout as a sequence of points, {(x i , y i , t i , q i )} n i=1 , where n is the total number of the active readout points.Pixels and times with null readout are excluded from the loss calculation.x, y, z, and q are treated as continuous variables in this construction -voxelization never explicitly enters the calculation.
A natural way to construct the loss function, L({(x, y, t, q)}(θ), {(x target , y target , t target , q target )}(θ target )), is to match readout points between the output of an iteration and the target.Compared to approaches that operate on the entire set or distribution of points in aggregate, this approach has the advantage of high granularity and sensitivity to changes in the readout.However, the point matching faces two major challenges: (1) {(x target , y target , t target , q target )} and {(x, y, t, q)} may not have the same number of points.
(2) There may not be a direct association between points (x target , y target , t target , q target ) i and (x, y, t, q) j , even though they are simulated from the same set of particle segments.
In time series analysis, a dynamic time warping discrepancy (DTW) [29] is used for a very related problem.Due to variations in speed during the sequences, direct comparisons between points at particular times may miss shared patterns between sequences -e.g.two identical sequences offset by 1 second may have large differences at a given set of times.DTW addresses this challenge by computing an optimal alignment between two time series, under a chosen distance metric and a set of algorithmic constraints.The two sequences can have different lengths, and multiple points from one sequence can be matched to a single point from the other sequence.It is therefore very well suited for our task, and we adopt it as our loss function for the results presented here.
DTW requires a choice of ordering for the input sequences of points.For our loss function, we use the default ordering of our simulator outputs, which is a perevent ordering based primarily on pixel x coordinates, with y (t) used to break ties when x (y) coordinates are shared between points.This ordering largely keeps the structure of the input segments and corresponds well to the structure of the detector readout.This is not a unique choice of ordering.As the readout planes are in (x, y), instead choosing y as the primary axis can be expected to yield similar results.Although t is a special axis in LArTPCs, an ordering with time as the primary axis would also retain the particle segment structure, and therefore should produce compatible fitting results.Other options, such as a non-hierarchical combination of x, y, and t coordinates, are also possible, but we expect little improvement relative to the current approach.
Once the points are ordered, there is a further choice of what features to compare between the two sequences.In principle it is possible to compare multi-dimensional features using the DTW loss.However, as geometric information is encoded into the ordering of sequences, and the primary impact of parameter variation for our chosen parameters is on the pattern of measured charge, it is sufficient to use only ADC counts.The inputs to the DTW loss function are therefore sequences of ADC counts, hierarchically ordered by the corresponding values of x, y, and z.
Dynamic time warping is not nicely differentiable by default.We therefore employ Soft-DTW [30], a smoothed version of DTW, using the implementation from Ref. [31,32].More concretely, Soft-DTW uses a special "min" operator with a smoothing parameter γ ≥ 0. As γ approaches 0, this operator converges to the unsmoothed min.We set γ = 1 in our application.An absolute difference is used as the DTW distance metric in this work.

B. Fitting considerations
For this demonstration, we fit 6 parameters using our differentiable simulator: the Birks model parameters A B , k B ; electric field E; lifetime τ ; and longitudinal and transverse diffusion coefficients D L and D T .The physical values of each of these parameters range across several orders of magnitude relative to each other (Table I).This can be challenging for the optimization, as it results in a large range of magnitudes for the corresponding gradients with respect to each parameter.To balance relative scales among the parameters, we normalize the parameters using their nominal values, θ = θ θnom .On each fit iteration, gradient calculation is done with respect to the normalized parameters, resulting in more balanced updates among the parameter set.We therefore avoid needing to carefully tune individual parameter learning rates for simultaneous parameter fits.The physical parameter value is needed for the simulation itself.We therefore undo the normalization before running the forward model simulation at each iteration, f (χ, θi • θ nom ).
We use Adam [28] as our gradient-based optimizer.With the normalization, we can set a single learning rate, 5 × 10 −2 , for all parameters.In addition, an exponential learning rate scheduling with a decay rate of 0.95 is adopted to aid stability of convergence, and the learning rate is updated using the scheduler after each epoch (one pass through the full dataset).We clip the norm of the vector of all normalized parameter gradients at 1 to avoid large iteration steps.
Each fitting iteration is performed using a mini-batch of simulated data, which is a subset of the entire input dataset.The mini-batches are distinct subsets that in total cover all of the input data, and the division of data across mini-batches remains the same across every epoch in a given fit.To construct the mini-batches, the full dataset of particles, with their constituent particle segments in order, are randomly shuffled.We do not mix the particle segments so that we maintain the impact of induced current from neighboring charges (see Current accumulation in Sec.II).As discussed in Sec.II C, total particle segment length ds provides a good measure of mini-batch computational time.In mini-batch construction, segments from the shuffled particles are sequentially added to each batch until the total length of segments in the mini-batch reaches a chosen value of ds.For this study, we choose to make mini-batches of ds = 100 cm.
Losses are computed event-by-event and then averaged across events in each mini-batch.Gradients are calculated based on this mini-batch loss, and parameters are updated correspondingly.Because each iteration is done with a single mini-batch, the parameters of interest can be updated multiple times per epoch, depending on the total number of mini-batches.This aids computational efficiency of the optimization, as the amount of data processing required per iteration is greatly reduced.
Because parameter calibration using our differentiable simulator relies on the correspondence between parameters θ target and the resulting detector readout, we pay special attention to potential degeneracies among parameters -cases where the detector readout is the same, even if parameter values are different.We focus here on degeneracies within the physical models.Degeneracy can also be introduced during other pieces of the simulation (e.g.due to electronics noise).We highlight two potential model degeneracies: 1.For the Birks model in isolation, there is a degeneracy between E and k B due to the term k B /E (Eq.2).This means that simultaneous fits of k B and E will not necessarily converge to their target values -as long as the ratio k B /E converges to its target, the individual parameter values do not matter.In the context of the broader simulation, however, this degeneracy is broken due to the relation v drift = µ(E) • E, which results in impacts from the electric field in parts of the simulation outside of the Birks model.Note that a similar degeneracy exists in the Box model, which is an alternative recombination model.In conventional calibration methods, E is usually considered as "well-measured" and E is therefore not fit, breaking this degeneracy.
2. Another degeneracy exists in the Birks model.Assume that the readout can be described with an electron survival rate α * recomb from the recombination.For a given particle segment dE dx and fixed E and ρ, there is a set of values of A B and k B which will produce the same value of α * recomb .This set is given by the linear relation This degeneracy only holds for a single, fixed dE dx .As illustrated in Fig. 11, changing dE dx changes the slope of this degenerate line in A B and k B .The only parameter values that result in α * recomb for all dE dx will be the "observed" values, the A B , k B point where all lines cross.An input dataset with a finite spread in dE dx is therefore required to break the degeneracy in A B and k B .Further, the impact of this degeneracy will be reduced by fitting on a dataset with large variation in dE dx .The simulated closure test presented here is meant as a first demonstration of calibration fits using our differentiable simulator.We provide a starting point for two paths towards extending this procedure in support of the ultimate goal of application to real, measured data.
With real data, we do not have access to true particle segments or the corresponding true energy depositions.However, suitable input data for an optimization using our framework may be particles which typically produce track-like topology in LArTPCs, within an energy range of interest.With this topology, it possible to model particle segments by fitting lines to particle tracks and breaking them into segments.The dE dx of these segments can be inferred using the readout charge and the segment length.The performance of the parameter optimization will depend on the quality of the input segment estimation, and the segment estimation may be iteratively refined along with the parameters.
The studies presented here are done in the absence of electronic noise.This is because electronic noise introduces stochasticity in the correspondence between simu-lation parameters and the corresponding outputs, meaning that, even with the same input particle segments χ and parameter values θ, we will get a different readout every time we run the forward simulation f (χ, θ).Changes in the value of the loss therefore are not solely from changes in the parameters of interest, and target parameter values may not achieve a global minimum for a given sample of the electronics noise.
In a real experimental setup, electronics noise cannot be easily disentangled from the measured data.It is therefore important to define a procedure for optimizing detector models using noisy data and to characterize the impact of the noise relative to the impact of parameter variation.
In data, a noisy target, F target , is unavoidable.In simulation, however, we have full control over the noise model.We therefore suggest using the simulated forward model, f (χ, θ), with the noise turned off for a fitting procedure with a noisy target.This reduces the overall stochasticity, and may aid the fit.
In Fig. 12, we present studies on the impact of electronics noise with respect to parameter variation using the default noise magnitude included in larnd-sim.Both sub-figures show the estimated impact of variations of individual parameters on the simulation output.Fig. 12a represents the case without electronic noise, while Fig. 12b shows the case with electronic noise in the target.For both cases, we do not include electronic noise in the simulations with varied parameter values, corresponding to the configurations of the closure test and our suggested procedure for fitting with noise respectively.
In Fig. 12a, a set of simulations is run for each parameter in which the parameter value is swept across the range defined in Table I, while all other parameters are kept at their nominal values.These parameter variations are shown as percentages of their corresponding nominal values.A target simulation is run with all parameters at their nominal values.For Fig. 12a, this target does not include electronic noise.Each of these simulations is run across all mini-batches, and mini-batches are the same across all simulations.Loss values are calculated per mini-batch using the unsmoothed DTW described in Sec.III A. These losses are translated into an absolute percentage difference relative to the target simulation using a reference simulation which provides a constant scaling factor per mini-batch.The solid lines show the median value of this parameter impact across all minibatches for each parameter, and the corresponding band shows the interquartile range across mini-batches, expressing batch-to-batch variation.All parameters have sharp and clear minima at their nominal values, where they identically match with the target.For the majority of their respective ranges, D L and τ have relatively small impact, and may be expected to have the least sensitivity in the fit.The impact of τ is very asymmetric with respect to the nominal value.
Fig. 12b shows the same impact of parameter variation on the loss with respect to targets that include electronic noises.For each mini-batch, we make 10 corresponding targets with different samples of electronic noise.For every target, we then produce the same plot as in the no noise case.Translation of loss values to percentage impact is done using the same reference value as in the no noise case so that Fig. 12a and Fig. 12b can be compared directly.Similarly, the same mini-batches are used for both cases.The solid lines and band limits shown in Fig. 12b are the average of the corresponding median lines and their band limits across the 10 noise samples.The batch-to-batch variation expressed by the bands includes the impact of batch-to-batch variation in the noise.
For reference, a noise baseline is included in Fig. 12b, which expresses the difference in readout values between simulations with and without noise with no parameter variation.The line is included across a range of parameters to guide the eye, but it is only calculated once, at the nominal parameter values.Calculation of this baseline is done across 10 noise samples identically as in the varied parameter case.The impacts for a range of variations in D L and τ have flat shapes and poorly defined minima, showing a high degree of overlap with this noise baseline.This suggests that the impact of these parameter variations is not well distinguishable from the impact of noise.
Electronic noise depends on the simulated detector configuration and may be reduced with post-processing steps.Mini-batch size and fitting data sample will also have an effect on the impact of noise.Closure test studies, such as the one presented in this work, demonstrate what is possible in an ideal scenario, and may provide a good baseline for analyzing loss of resolution due to effects such as electronic noise.We therefore suggest such studies as the starting point for particular applications.

IV. SAMPLES USED
Typically, control samples, or samples with well known properties, are used to understand detector modeling.Muons are a good candidate for this purpose.As minimum ionizing particles, their energy depositions in liquid argon ( dE dx ) are narrowly peaked around 1.6 MeV cm −1 .There is, however, sufficient enough spread in dE dx to break the degeneracy discussed in Sec.III B. In an experimental setup, samples of cosmic muons are also relatively easy to obtain.To mimic a muon control sample, we simulate 100 events with about 10 muons per each event, injected into the LArTPC volume from the detector border.Simulation is done using DLPGenerator [33] and edep-sim [34].All muon kinetic energies are set to be 1 GeV, and muon injection angles are sampled randomly from an isotropic distribution.Fig. 13 shows an example of one event, where the lines indicate the muon trajectories (particle tracks).The particle tracks are composed of segments.Each segment is modeled as a straight line of constant dE dx , and the length of the segment can vary based on the rate of change of the energy deposition.See 12.The lines show the absolute impact of each parameter on simulation output with respect to changes in the parameter value.The width of the band indicates variations of the impact across mini-batches.(a) The parameter impact is compared to a noiseless target.For all the parameters, the impact shows sharply defined minima at the nominal parameter values.(b) The parameter impact is compared to noisy targets.Minima are less well defined, and some parameter variation is indistinguishable from noise.
Sec. II for the description of how the segments are used in the simulation.
For detector calibration, O(1000) muons is a very small sample compared to what is commonly used.It would have a negligible impact on the regular data taking to acquire such a small control sample, even if this acquisition was done frequently.However, Sec.V shows that this seemingly small muon sample is sufficient for calibrating detector model parameters.This simulated muon sample is therefore used as the default sample to benchmark the performance of the differentiable simulation.
We have also produced a mixed particle sample that is composed of muons, pions, and protons with kinetic energies uniformly sampled from 0.1 GeV to 2 GeV.This is to test whether we can relax the requirement of muons as calibration data and to see how robust the parameter optimization is to different particle types and energies.
To ensure a stable convergence with a reasonable computational allocation, we apply four selection cuts to the input data.These cuts are not optimized to maximize data usage, and may be stricter than necessary.
We remove tracks that are almost colinear to the drift axis z by requiring abs(cos ϕ)< 0.966, where ϕ is the angle between the track and z.This cut helps to control memory usage, as discussed in Sec.II C.
We only include segments whose centers are within 15 cm of drift distance (z) to the anodes.The maximum drift distance for our detector configuration is 30 cm (Sec.II), so this corresponds to excluding segments that lie in the region near the cathode.We also remove tracks that have any segment whose center is within 2 cm of either of the two anodes.
Muon interaction with the detector volume can trigger radiation photons and produce delta rays and Michel electrons.We require track length to be longer than 2 cm to eliminate most of the delta rays, Michel electrons, and radiation photons (see Fig. 14), which often are topologically associated and even attached to the muon tracks.This aids fitting performance, and prevents computation time from being dominated by many short, low energy particle segments.
It is possible to do a calibration fit without the above selections.However the selections help to constrain computational requirements and to improve the optimization landscape.As an example, we found that including particle segments with longer drift lengths introduced local minima, perhaps due to local degeneracies in readout hit topological patterns across multiple values of the diffusion parameters.These local minima decrease the robustness of the parameter optimization, as fits which encounter a local minimum become "stuck" and do not con- verge to the target parameter values.Such fits are often distinguishable by a higher corresponding loss value and may be addressed by varying initial fit conditions and picking the result or results with the lowest loss.This latter process is robust, but requires additional computation and attention to the fit.We therefore adopt the above selection cuts, which avoid such issues.

V. RESULTS
We present results of simultaneously fitting six detector model parameters using the differentiable simulator: A B , k B , E, τ , D L and D T , as listed in Table I and introduced in Sec.II.All fits are run on single NVIDIA Tesla A100 GPUs.
The results are from a simulation closure test, as described in Sec.III B. To select parameter targets, we draw a uniform random value for each parameter from the ranges shown in Table I.Each target then corresponds to a single point sampled uniformly from the considered 6D phase space.All fits begin with parameters initialized at the nominal values shown in Table I.This setup mimics a realistic experimental procedure, as our best initial guess for each parameter is based on previous measurements.
In Fig. 15, we show closure test results for fits to 10 different parameter targets using the muon sample described in Section IV.Each fit is labeled by a different color, and convergence of the 6D simultaneous fit for each parameter is shown in a separate panel.The targets cover a wide range of phase space.All fits converge well to their corresponding targets.Fig. 16 shows a combined convergence metric across all 6 parameters, with the distribution calculated across the same 10 fits as in Fig. 15.For each fit, the convergence level is defined as the max- imum over all parameters of the relative distance of each parameter to its corresponding target value.The band shows the maximum and the minimum relative distances to the targets among the 10 fits.All fits converge to within 1 % of their target values after 5000 mini-batch iterations.
An additional 5 fits are run using the mixed particle sample described in Sec.IV.These fits follow the same closure test procedure as the muon sample, again taking randomly sampled parameter targets and initial values from Table I.The mixed particle fit results are shown in Fig. 17, where each fit is labeled by a different color, and convergence of the 6D simultaneous fit for each parameter is shown in a separate panel.Results are comparable to Fig. 15.This demonstrates that optimization with the differentiable simulator does not heavily depend on the particle types and energies in the input sample.This Convergence level FIG.16.For each 6 parameter fit, the convergence level is defined as the maximum relative distance to the corresponding target value across all the parameters.The red line shows the average convergence level per iteration across the 10 fits.The band boundaries represent the minimum and maximum convergence level per iteration across all of the fits -all 10 convergence levels fall within the band.
is different from conventional calibration, and suggests that our method may introduce significant flexibility in calibration fits.
The results above demonstrate that a robust calibration can be performed in a multi-dimensional phase space using gradient-based optimization.This novel technique enables us to simultaneously optimize a set of model parameters across the detector simulation.In addition, the number of particles required for this calibration would have a negligible impact on data-taking, and could be collected in a very short amount of time.This would allow for frequent verification of the calibration.Furthermore, this procedure is extensible to larger or different sets of parameters, making it suitable for generic use.The optimized model parameters can be immediately applied within the same simulation used for calibration.This benefits experiments by automatically ensuring consistent application across the simulation and analysis chain.
While we have focused on the calibration aspect of the differentiable simulator, the success of the simulator in this area also provides a validation of the resulting simulator gradients.The ability to calculate gradients through the simulation provides opportunities for a variety of extensions.Highlighting one example, with a standard, non-differentiable simulator, training of neural networks operates on fixed simulation samples -as the network trains using gradients, it cannot be directly linked to the simulation, but rather must learn what the simulation is doing from simulation outputs and inputs.In contrast, with a differentiable simulator, the simulator may be directly linked with the neural network -gradients are able to propagate through the simulation code for learned network parameter updates, providing a di- rect physical constraint on learned models.Application of such a structure to learning an "inverse detector simulation" which maps from detector outputs to dE dx segments is one particular topic for future work.

VI. CONCLUSIONS AND OUTLOOK
We have presented a novel detector calibration technique for LArTPCs using differentiable detector simulation.Our simulator sets a new paradigm for a physicsmodel based, high dimensional, automatic calibration, allowing for direct extraction of physics information from calibration fits, trivial application of calibrated information via the setting of simulation parameters, and full accounting of all relationships between parameters in a pre-viously inaccessible dimensionality of parameter space.This simulator is based on the configuration of a DUNE LArTPC prototype.However, we expect much of this work to be applicable beyond this particular detector configuration.
Our simulator has been demonstrated to robustly fit targets across a wide range of parameter space using multiple physics samples, and therefore provides a strong proof-of-concept demonstration of the utility of differentiable detector simulation for the calibration task.We have identified a variety of areas for future work to bring this idea from proof-of-concept to integral part of the operation of experiments using LArTPCs.This future work can be summarized in two categories: 1. Physics considerations.The presented calibration results are simulation closure tests.In this setup, we use the exact same particle segments when generating the target data and running the calibration fit.With real data, we will have to estimate the particle segments event-by-event.For track-like particle samples, such as muons, pions, and protons, this can be done by fitting the readout hits with lines and breaking those lines into segments.We will then need to reconstruct dE dx using the charge readout and the reconstructed segment length.However, this reconstruction has some dependence on detector parameters, and will likely require iterative updates of dE dx alongside the parameter optimization fit.Furthermore, there will likely be some associated degradation in fit quality due to a mismatch between reconstructed segments and the true particle energy deposition.This procedure needs to be studied and validated.
We have shown the impact of electronics noise on simulation output with respect to a range of parameter values in Fig. 12.This noise may degrade the performance of parameter optimization, and should be characterized for application of our calibration in experiments.
We have applied a variety of cuts to create a curated calibration sample.These cuts are on true quantities known from the simulation record, but will have to be inferred in real data via a reconstruction.
We have presented results for the most commonly calibrated parameters.We expect our method to be easily extensible to larger parameter sets, but this should be explicitly evaluated for specific cases of interest.
A variety of updates have been made to larnd-sim after the snapshot used as a base for our simulator.Incorporating these changes, either via updates to our simulator or a fresh rewrite of larnd-sim, is necessary for broader adoption within DUNE.larnd-sim includes nondifferentiable structures such as lookup tables, and differentiable parametrizations as in Ref. [26] may be necessary.Unifying this LArTPC simulation with work on differentiable photon propagation as in Ref. [26] is an additional area of development.
2. Computational considerations.We have identified two areas of software optimization which may decrease the computational requirements of our simulator, inspired by the computational gap between our simulator and larnd-sim.These include keeping a notion of sparsity in our simulator, as opposed to the current vectorized code using dense tensors, and introducing JIT compilation.JIT compilation tools are available in the supported backends for EagerPy, and exploration of these tools may prove fruitful.Pure comparison of these backends may also offer performance gains relative to our choice of PyTorch.
Though sparse libraries exist for PyTorch, JAX, and TensorFlow, moving beyond libraries designed around tensor operations may be promising.This is enabled by, for example, Enzyme [27] and is baked into the structure of non-Pythonic languages such as Julia [35], which also automatically includes JIT compilation.
In addition to optimizing computational performance, we have pointed out that there are areas of our code which remain non-differentiable, such as the discrete pixel structure.We expect that the parts that are differentiable will be sufficient for most applications.However efforts to incorporate differentiability in the pixel plane (e.g. via relaxation with kernel density estimation [36]) may be interesting for some applications.
The calibration tests presented here are also a test of the validity of parameter gradients through our simulator.This work therefore further sets the stage for the use of differentiable detector simulation within a broader machine learning context, allowing for e.g.explicit feedback of detector simulation on neural network training, a rich area with many applications, including learning to remove detector effects.
In summary, our work is a first step towards a broader differentiable physics program in particle physics.This differentiable physics program has potential for significant impact in the way physics analysis is performed, and there is a broad set of interesting future tasks towards integrating differentiable toolkits within particle physics to expand analysis capability and improve the quality and output of new physics results.

FIG. 1 .
FIG. 1. Flow diagram of the simulator, highlighting inputs and outputs of each stage (blue) as well as commonly calibrated model parameters (red).

FIG. 3 .
FIG.3.Illustration of the interpolated trigger timing index procedure for a single pixel.
FIG. 4. Simulation time of events as a function of ds, the total segment length.The red line shows a linear fit to the data.

FIG. 5 .
FIG. 5. Distribution of the processing time (simulation and backpropagation) for batches of ds = 100 cm.

FIG. 6 .
FIG.6.A comparison of the simulation time per event between the original larnd-sim and the differentiable version.

FIG. 7 .FIG. 8 .
FIG.7.The estimated batch peak memory use in comparison with the measured batch peak memory.The red line shows the equality relation.Points being above it mean that the measured memory is consistently smaller than the estimated one.

FIG. 9 .
FIG. 9. Loss landscape in a 2D parameter space of electric field E and lifetime τ , averaged across batches.The gold star labels the target parameter values.The negative gradient, shown in the white arrows, points towards the loss landscape minimum at the target.Five example fit trajectories, starting from a variety of different initial points (filled circles) are shown respectively in different colored lines.All fits converge to the target parameter values (the gold star).

(FIG. 10 .
FIG. 10.Demonstration of interdependence between parameters.Parameters with hatched fill correspond to those which are shifted to their lower (blue) or upper (red) range limits in the target simulation and fixed to their nominal values in the corresponding fit.The colored bars show the corresponding bias from the fitted result.

FIG. 11 .
FIG. 11.Degeneracy between the Birks model parameters AB and kB if the energy deposition per unit length dE dx is a single value.With a range of dE dx , the degeneracy breaks.
FIG.12.The lines show the absolute impact of each parameter on simulation output with respect to changes in the parameter value.The width of the band indicates variations of the impact across mini-batches.(a) The parameter impact is compared to a noiseless target.For all the parameters, the impact shows sharply defined minima at the nominal parameter values.(b) The parameter impact is compared to noisy targets.Minima are less well defined, and some parameter variation is indistinguishable from noise.

FIG. 13 .
FIG. 13.A display of an event containing ten 1 GeV muons in the detector.The track color indicates the dE dx along the muon trajectories.The grey plane in the middle is the cathode, and the two blue planes are the anodes of the two TPCs.

FIG. 14 .
FIG. 14. Particle trajectory length (sum of particle segment length ds) for simulated muons, electrons, and photons in the input muon sample.Cutting at ds = 2 cm removes a significant fraction of electrons and photons.

FIG. 15 .
FIG.15.Results from a simultaneous fit of six detector model parameters: AB, kB, E, τ , DT , DL using the default muon sample.Each color indicates a sampled 6D target parameter point, and all fits start from the same initial parameter values, mimicking a realistic fitting scenario.The dashed lines label the target values of each parameter for each respective fit.The solid lines show the evolution of fitted parameter values with respect to the iteration number in the fit.

FIG. 17 .
FIG.17.Results from a simultaneous fit of six detector model parameters: AB, kB, E, τ , DT , DL using an alternative sample with mixed muons, pions, and protons.Each color indicates a sampled six dimensional target parameter point, and all fits start from the same initial parameter guess, mimicking a realistic fitting scenario.The dashed lines label the target values of each parameter for each respective fit.The solid lines show the evolution of fitted parameter values with respect to the iteration number in the fit.