Exploration of differentiability in a proton computed tomography simulation framework

Objective. Gradient-based optimization using algorithmic derivatives can be a useful technique to improve engineering designs with respect to a computer-implemented objective function. Likewise, uncertainty quantification through computer simulations can be carried out by means of derivatives of the computer simulation. However, the effectiveness of these techniques depends on how ‘well-linearizable’ the software is. In this study, we assess how promising derivative information of a typical proton computed tomography (pCT) scan computer simulation is for the aforementioned applications. Approach. This study is mainly based on numerical experiments, in which we repeatedly evaluate three representative computational steps with perturbed input values. We support our observations with a review of the algorithmic steps and arithmetic operations performed by the software, using debugging techniques. Main results. The model-based iterative reconstruction (MBIR) subprocedure (at the end of the software pipeline) and the Monte Carlo (MC) simulation (at the beginning) were piecewise differentiable. However, the observed high density and magnitude of jumps was likely to preclude most meaningful uses of the derivatives. Jumps in the MBIR function arose from the discrete computation of the set of voxels intersected by a proton path, and could be reduced in magnitude by a ‘fuzzy voxels’ approach. The investigated jumps in the MC function arose from local changes in the control flow that affected the amount of consumed random numbers. The tracking algorithm solves an inherently non-differentiable problem. Significance. Besides the technical challenges of merely applying AD to existing software projects, the MC and MBIR codes must be adapted to compute smoother functions. For the MBIR code, we presented one possible approach for this while for the MC code, this will be subject to further research. For the tracking subprocedure, further research on surrogate models is necessary.


Introduction
The option of treating cancer using beams of high-energy charged particles (mainly protons) is becoming more and more available on a world-wide scale.The main advantage over conventional x-ray radiotherapy lies in a possibly lower dose deposited outside the tumor, as the energy deposition of protons is concentrated around the so-called Bragg peak.The depth of the Bragg peak depends on the beam energy and the relative stopping power (RSP) of the tissue it traverses.Thus, treatment planning relies on a three-dimensional RSP image of the patient.In the state of the art calibration procedure of x-ray CT for proton therapy, scanner-specific look-up tables are used to convert the information retrieved from x-ray (single-energy or dual-energy) CT acquisitions into an RSP image: This approach comes with an uncertainty of the Bragg peak location of up to 3% of the range (Yang et al., 2012;Paganetti, 2012;Wohlfahrt and Richter, 2020).On the other hand, the direct reconstruction of a proton CT image using a high-energy proton beam and a particle detector has been shown to be intrinsically more accurate (Dedes et al., 2019;Yang et al., 2010).
To this end, the Bergen pCT collaboration (Alme et al., n.d.) is designing and building a high-granularity digital tracking calorimeter (DTC) as a clinical prototype to be used as proton imaging device in existing treatment facilities for proton therapy.Its sensitive hardware consists of two tracking and 41 calorimeter layers of 108 ALPIDE (ALICE pixel detector) chips (Aglieri Rinella, 2017) each.After traversing the patient, energetic protons will activate pixel clusters around their tracks in each layer until they are stopped, as shown in figure 1.In each read-out cycle, the layer-wise binary activation images from hundreds of protons are collected and used to reconstruct the protons' paths and ranges through the detector and thus their residual direction and energy after leaving the patient.Based on this data from various beam positions and directions, a model-based iterative reconstruction (MBIR) algorithm reconstructs the three-dimensional RSP image of the patient.The reconstructions of proton histories and of the RSP image are displayed as two main subprocedures in figure 2a, along with a Monte Carlo simulation subprocedure to generate the detector output instead of a real device for testing and optimization purposes.
Algorithmic differentiation (AD) (Griewank and Walther, 2008;Naumann, 2011) is a set of techniques to efficiently obtain precise derivatives of a mathematical function given by a computer program.Such derivatives have been successfully used for optimization problems in various contexts, such as machine learning (Baydin et al., 2017) and computational fluid dynamics (Albring et al., 2016), and AD is currently also adopted in the fundamental physics community for detector optimization (Baydin et al., 2021;Dorigo et al., 2022).Besides, particular methodologies for uncertainty quantification (UQ) involve derivatives.However, algorithmic derivatives can only be useful for optimization and UQ if the differentiated function, i. e. the pCT software pipeline, is sufficiently smooth.For the work presented in this article, we therefore study the reaction of three representative substeps of the pCT reconstruction process on changes in a single input parameter, keeping the other inputs fixed.
In section 2.1, we summarize all the computational steps of the software pipeline in greater detail.Section 2.2 is a general introduction into the purpose and calculation of derivatives of computer programs.The numerical experiments are outlined in section 2.3 and their results are stated in section 3.In section 4, we analyze the observed discontinuous or non-differentiable behaviour and propose ways to mitigate it, and close with a summary and conclusions in section 4.

Simulation of Proton CT Data Acquisition and Processing
2.1.1.Foundations When energetic protons pass through matter, they slow down in a stochastic way, mainly due to inelastic interactions with the bound electrons.The average rate ∂E ∂s of kinetic energy E lost per travelled length s is called stopping power.We denote it by S(E, x), indicating its dependency on the current energy E of the proton and the local material present at location x.Using the symbol S w (E) for the stopping power of water at the energy E, the RSP is defined as The dependency on E has been dropped in the notation because in the relevant range between 30 MeV and 200 MeV, the RSP is essentially energy independent (Hurley et al., 2012).Separating variables and integrating, one obtains This integral value is called the water-equivalent path length (WEPL).
In list-mode, or single-event, pCT imaging, millions of protons are sent through the patient, and their positions, directions and energies are recorded separately, both before entering and after leaving the patient.In the setup conceived by the Bergen pCT collaboration and sketched in figure 1, the exit measurements are performed by the tracking layers of the DTC, through the processing steps outlined in section 2.1.3.While many prototypes reported in the literature use an additional pair of front trackers (Hurley et al., 2012;Meyer et al., 2020;Esposito et al., 2018;Scaringella et al., 2013;Saraya et al., 2014;Naimuddin et al., 2016), the setup at hand infers the positions and directions of entering protons from the beam delivery monitoring system.The employment of a single pair of tracking layers has been shown to be sufficiently accurate for dose-planning purposes through MC simulation studies (Sølie et al., 2020).(Jan et al., 2004) for simulations in medical imaging and radiotherapy, based on the Geant4 toolkit for the simulation of the passage of particles through matter (Agostinelli et al., 2003;Allison et al., 2006Allison et al., , 2016)).Based on a description of the relevant physical properties of the complete detectorpatient setup, i. e., • the shape and material composition of the detector as well as the proton beam characteristics (detector parameters), • the shape and material composition of the patient (original RSP ), and • physics parameters like models and cross sections that define probability distributions for the relevant interactions between particles and matter, as well as output parameter setup (physics parameters), GATE produces stochastically independent paths of single particles through the described setup.Whenever interactions occur, with a certain probability distribution, the turnout for the proton at hand is decided by a random number from a pseudorandom number generator (RNG).
The epitaxial layers of the ALPIDE chips are modelled as single crystalSD volumes in GATE, so the positions and energy losses of all charged particles passing these volumes are recorded.In the real DTC, the energy deposition in each layer is estimated from the number of pixels activated by electron diffusion around the track, i. e. the magnitude of a pixel cluster (Tambave et al., 2020).To reproduce this diffusion effect for the MC simulation, a random cluster, whose size corresponds to the recorded energy loss, is retrieved from a library containing the observed cluster shapes and their occurrence probability sampled from experimental data (Pettersen et al., 2019).As approximately 100 protons pass through the patient during one read-out cycle of the real detector (depending on the detector and beam parameters), a union of all activated pixels is formed to obtain the final binary image for each read-out cycle.2c displays the computational steps to convert the binary activation images per layer and read-out cycle, either produced by the real detector or by the Monte Carlo procedure in section 2.1.2,back to continuous coordinates and energies of protons.

Proton Track Reconstruction Figure
As mentioned above, neighbouring (vertically and horizontally) activated pixels are grouped into clusters per layer and read-out cycle.The proton's coordinate is given by the cluster's center of mass and its energy deposition is related to the size of the cluster (Pettersen et al., 2017;Tambave et al., 2020).
In the tracking step, a track-following procedure (Strandlie and Frühwirth, 2010) attempts to match clusters in bordering layers likely belonging to the same particle trajectory.The angular change between the extrapolation of a growing track and cluster candidates in a given layer is minimized in a recursive fashion (Pettersen et al., 2019(Pettersen et al., , 2020)).
Based on the cluster coordinates in the two tracking layers of the DTC, the position and direction of the proton exiting the patient can be inferred.A vector x (PD) stores this data together with the position and direction of the beam source.
The proton's residual WEPL before entering the detector can be estimated by a fit of the Bragg-Kleeman equation of Bortfeld (Bortfeld, 1997;Pettersen et al., 2018) to the energy depositions per layer (Pettersen et al., 2019).Its difference to the initial beam's WEPL is stored in x (W) .Failures of the tracking algorithms are usually due to pairwise confusion between tracks from multiple Coulomb scattering (MCS), to the merging of close clusters, or to high-angle scattering.To this end, tracks with an unexpected distribution of energy depositions are attributed to secondary particles or mismatches in the tracking algorithm, and filtered out (Pettersen et al., 2021).
2.1.4.MBIR Subprocedure Model-based iterative reconstruction algorithms repeatedly update the RSP image to make it a better and better fit to the measurements of x (PD) , x (W) , according to the model equations ( 2).
While the paths of the protons in the air gaps between the beam delivery system, the patient and the DTC can be assumed to be straight rays, inside the patient they are stochastic and unknown due to electromagnetic (MCS) and nuclear interactions with atomic nuclei.Using the model of MCS by Lynch and Dahl (Lynch and Dahl, 1991) and Gottschalk (Gottschalk et al., 1993), and given the positions and directions x (PD) from section 2.1.3,the most likely path (MLP) can be analytically approximated in a maximum likelihood formalism (Schulte et al., 2008).The extended MLP formalism (Krah et al., 2018) also takes uncertainties of the positions and directions into account: this is especially important when the front tracker is omitted, since the beam distribution is modeled using this formalism.Alternatively, a weighted cubic spline is a good approximation of the MLP (Collins-Fekete et al., 2017).The WEPL of the proton is the integral of the RSP along the estimated path, and has also been reconstructed (as described in section 2.1.3)as x (W) .This leads to a linear system of equations for the list y of all voxels of the RSP image, (3) The entry (i, j) of the matrix A = A x (PD) stores how much the RSP at voxel j influences the WEPL of proton i; this value is related to the length of intersection of the proton's path and the voxel's volume, and thus depends on x (PD) .As each proton passes through a minor fraction of all voxel volumes, A is typically sparse.Instead of the approximate length of intersection, a constant chordlength, mean chord length or effective mean chord length (Penfold et al., 2009) can be used to determine the matrix elements.As an alternative, we also study a "thick paths" or "fuzzy voxels" approach where points on the path are assumed to influence a wider stencil of surrounding voxels, with a weight that decreases with distance.
The matrix A is not a square matrix as the number of protons is independent of the dimensions of the RSP image.Therefore "solving (3)" is either meant in the leastsquares sense, or additional objectives like noise reduction are taken into account via regularization or superiorization (Penfold et al., 2010).
The two main computational steps of the MBIR subprocedure, generating the matrix A x (PD) and solving the system (3), are displayed in figure 2d.Several implementations of x-ray MBIR algorithms (Biguri et al., 2016;van Aarle et al., 2015) actually do not store the matrix in memory explicitly, because even in a sparse format it would be to large (Biguri et al., 2016).Rather, they implement matrix-free solvers like ART, SIRT, SART, DROP (diagonally relaxed orthogonal projections) (Penfold and Censor, 2015) or LSCG (least squares conjugate gradient).These access the matrix only in specific ways, e. g. via (possibly transposed) matrix-vector products or calculating norms of all rows.The matrix elements can thus be regenerated on-the-fly to perform the specific operation demanded by the solver.In pCT, it is best to structure these operations as policies to be applied for each row of A: Due to the bent paths of protons, it is more efficient to make steps along a path and detect all the voxels it meets, rather than the other way round.Parallel architectures like GPUs can provide a significant speedup for operations whose policies can be executed concurrently for many rows.

Derivatives of Algorithms
2.2.1.Differentiability Any computer program that computes a vector y ∈ R m of output variables based on a vector x ∈ R n of input variables defines a function f : R n → R m , that maps x to y.Typically, f is differentiable for almost all x, because we may think of a computer program as a big composition of elementary functions like +, •, sin, √ , | • |, and apply the chain rule.Possible reasons why f could not be differentiable at a particular x include the following: • An elementary function is evaluated at an argument where it is not differentiable, like the abs function | • | at 0.
• An elementary function is evaluated at an argument where it is not even continuous, like rounding at k + 1 2 for integers k. • An elementary function is evaluated at an argument where it is not even defined, like division by zero.
• A control flow primitive such as if or while makes a comparison like a = b, a > b, a < b, a ≥ b or a ≤ b between two expressions a, b that actually have the same value for the particular input x.
2.2.2.Taylor's Theorem Although differentiability and derivatives are local concepts, they can be used for extrapolation by Taylor's theorem.A precise statement (Anderson, 2021) in first order is that if f is twice continuously differentiable, for each x there are constants C, D ∈ R such that the error of the linearization The constant C depends on how steep f is.The Taylor expansion (4) is frequently used as a heuristic even if the differentiability requirements are violated.This is only meaningful if the "amount" or "density" of the non-differentiable points x listed above is low.To assess the range in which the linearization ( 4) is valid, one can plot f (x) − f (x) against |x − x| and compare to the graph of a proportionality relation.We call it the linearizability range.
The two applications discussed in the next sections 2.2.3 and 2.2.4 rely on (4).

Quantification of Uncertainty
If the value of an input x is approximated by x and the resulting uncertainty of f (x) is sought, f (x) contains the relevant information to measure the amplification of errors: according to the Taylor expansion (4), the deviation of x from x gives rise to an approximate deviation of f (x) from f (x) by f (x) • (x − x).
For small local perturbations, a more specific analysis is possible when we model the uncertainty in the input x using a Gaussian distribution with mean x and covariance matrix Σ x , and replace f with its linear approximation (4).For a linear function f , the output f (x) is a Gaussian distribution with mean f (x) and covariance matrix (5) We are interested in uncertainties of the reconstructed RSP image as a result of the MBIR subprocedure, or the uncertainties of results of further image processing like contour lines (Aehle and Leonhardt, 2021).Possible input variables of known uncertainty are the detector output or reconstructed proton paths.All steps in the pipeline from this input to the output must be differentiated in order to apply (5).Since the above input variables are defined after the Monte Carlo subprocedure (see figure 2a), the differentiation of this step is not required.
Uncertainties in the input should be "unlikely" to go beyond the linearizability range.Otherwise, f is not approximated well by its linearization for a significant amount of possible inputs x and (5) cannot be used, as illustrated in figure 3.
No matter how non-smooth f is, Monte Carlo simulation can be used alternatively to propagate a random variable x through a computer program f and estimate the statistical properties of the random variable f (x).The more samples are used, the better the Monte Carlo estimate becomes, but the more run-time must be spent.

Gradient-Based Optimization
In the case that f has a single output variable (m = 1), optimization seeks to find a value for x that minimizes (or maximizes) this objective function.The idea behind gradient-based optimization is that as the gradient points into the direction of steepest ascent according to (4), shifting x in the opposite direction should make f (x) smaller.Neural network training algorithms like ADAM also  belong to this category, with the training error of the neural network as the objective function.
In order to perform an end-to-end optimization of the detector parameters to minimize an objective function based on the reconstructed RSP, all parts of the pipeline in figure 2a would have to be differentiated, including the Monte Carlo simulation.
Optimization algorithms can only find local minima, which are optimal among "similar" designs but possibly worse than the global minimum.Lack of differentiability and consequently, lack of explicit gradients preclude the guarantee that a gradientbased optimization algorithm can converge to a local minimum.Instead of using the gradient to find a descent direction, one could alternatively iterate through all coordinate directions, use random choices or a surrogate model.Many derivative-free optimization algorithms have been proposed in the literature (Larson et al., 2019).However, few of them are able to deal with 300 unknowns or more (Rios and Sahinidis, 2013).

2.2.5.
Algorithmic Differentiation The classical ways to obtain derivatives of a computer-implemented function f , required for the applications described in sections 2.2.3 and 2.2.4,are analytical (using differentiation rules, possible only for simple programs) or numerical (using difference quotients, inexact).Ideally, algorithmic differentiation combines their respective advantages being exact and easy applicable.AD tools facilitate the application of AD to an existing codebase; specifically, operator overloading type AD tools intercept floating-point arithmetic operators and math functions, and insert AD logic that keeps track of derivatives with respect to input variables (forward mode), or records an arithmetic evaluation tree (reverse mode).As examples for such tools, we may cite ADOL-C (Walther and Griewank, 2012), CoDiPack Figure 4: Three value-wisely good approximations (blue thin line) to a smooth function (black thick line).In (a), the derivative of the approximation is also a good approximation for the derivative of the smooth function.Such a statement can however be wrong, e. g. if there are jumps as in (b) or noise as in (c).(Sagebaum et al., 2019), the autograd tool (Maclaurin et al., 2015) used by PyTorch (Paszke et al., 2019), and the internal AD tool of TensorFlow (Abadi et al., 2015).The machine-code based tool Derivgrind (Aehle et al., 2022a,b) may offer a chance to integrate AD into cross-language and partially closed-source software projects.For an overview of tools and applications, visit https://www.autodiff.org.
While these tools can often be applied "blindly" to any computer program to obtain algorithmic derivatives in an "automatic" fashion, further program-specific adaptations might be necessary.For example, the new datatype of an operator overloading tool might break assumptions on the size or format of the floating-point type that were hard-coded in the original program.Concerning complex simulations, techniques like checkpointing (Dauvergne and Hascoët, 2006;Naumann and Toit, 2018) or reverse accumulation can reduce the memory consumption of the tape in reverse mode but require manual modifications of the primal program.
Another major reason for revisiting the primal code is given by the fact that it is usually only an approximation of the real-world process.Good function approximation does not guarantee that the corresponding derivatives are also well-approximated (Sirkes and Tziperman, 1997).To illustrate this, figure 4 shows three value-wisely good approximations to a smooth function.In figure 4b, the derivative of the approximation is zero everywhere except where the approximation jumps.This kind of behaviour could be the consequence of intermediate rounding steps.In figure 4c, a low-magnitude but high-frequency error adds high-magnitude noise to the derivative.In both cases, the exact AD derivative of the approximation is entirely unrelated to the derivative of the real-world function, and therefore cannot be of any use to the propagation of uncertainties through it, or its optimization.Adaptations of the computer program might be necessary to ensure that it also a good approximation derivative-wise, as in figure 4a.

Numerical Checks of Algorithmic Differentiability
In the work presented in this article, we determined the potential of employing AD for gradient-based optimization or UQ of the pCT reconstruction pipeline.
Our main methodology was based on plots that show the dependency of a single output variable f (x) with respect to a single input variable x for three representative substeps of the pipeline, keeping all the other inputs fixed.More details on the three setups can be found in sections 2.3.1 to 2.3.3.
From a plot of f (x) with a linearly scaled abscissa, we identified whether f had isolated discontinuities ("jumps") or noise.Differentiability (from one side) at a particular x could be verified with a log-log plot of |f (x) − f (x)| with respect to x − x.Inside the linearizability range, it should look like the log-log plot of a proportionality relation, i. e. a straight line with slope 1. Deviations for very small |x − x| can often be attributed to floating-point imprecisions and have no implications on the differentiability.
If we choose several points x without special considerations in mind, and f turns out to be differentiable at all of them, we take this as an indicator that the function is differentiable "almost everywhere".This means that the function may have e.g. jumps or kinks, but those are unlikely to be encountered with generic input.

Setup for GATE
We used GATE v9.1 to simulate a single proton of initial energy x around x = 230 MeV (as well as other arbitrary values 220.123MeV, 229 MeV, 231 MeV, 240 MeV, 256.987MeV) passing through a head phantom (Giacometti et al., 2017) and several layers of the DTC.Four output variables were extracted from the ROOT file produced by GATE: the energy depositions f E,1 (x), f E,2 (x), and a position coordinate f pos,1 (x), f pos,2 (x), in the first and second tracking layer of the DTC, respectively.The seed of the RNG has been kept constant.

Setup for Tracking
We analyzed the percentage of correctly reconstructed tracks of a track-following scheme implemented by Pettersen et al. (2019Pettersen et al. ( , 2020)).During the reconstruction, a threshold determined the maximal accumulated angular deflections allowed for continued reconstruction.This threshold was then identified as the input variable to test for differentiability.A batch of 10 000 tracks was used for this purpose, where correctly reconstructed tracks were identified on the basis of their Bragg peak position relative to the MC truth (±2 cm) (Pettersen et al., 2021).

Setup for MBIR
We considered a setup with about 250 000 proton histories from a GATE simulation of the CTP404 phantom (The, 2006) (an epoxy cylinder of radius 75 mm and height 25 mm, containing cylindric inserts of various other materials), and reconstructed an RSP image with 5 slices (8 mm thick) of 80 × 80 voxels (2 mm × 2 mm) using our own prototypical implementations of DROP and LSCG using C++, CUDA and Python.The setup was small enough to allow for many repetitive evaluations with  modifications in a single input variable within a reasonable computing time.Both DROP (with a relaxation factor of 0.1) and LSCG seem to converge for the over-determined linear system (3).Figures 5a and 5c show the central slice of the reconstructed RSP image after 400 iterations of either solver.
In terms of the root mean square error of the linear system (3), the more noisy LSCG solution is better than the DROP solution by about 7 %.In fact, LSCG seems not to be widespread in CT image reconstruction, but we included it in our study for the purpose of comparison.
The solutions in figures 5b and 5d were reconstructed from about 25 million proton histories; they indicate that the overall bad image quality of figures 5a and 5c is an artifact of the low number of proton histories.It should be irrelevant for the qualitative statement on differentiability.
We separately considered two input variables: the WEPL (i.e. a component of the right-hand side x (W) of ( 3)) and a coordinate of the beam spot position of one particular proton history (i.e. a component of x (PD) , influencing the matrix in ( 3)).
To study the dependency on the WEPL, we applied the DROP algorithm with a relaxation factor of 0.1 as well as the LSCG algorithm, using a mean chord length approach to compute the matrix elements.We observed the reconstructed RSP of a voxel that the input proton history passed through.For DROP, geometric information was used by zeroing voxels outside a cylindrical hull of radius 75 mm after each iteration.
To study the dependency on the position coordinate, we applied the DROP algorithm with a relaxation factor of 0.1, using a mean chord length approach as well as a fuzzy voxels approach to compute the matrix elements.In the fuzzy voxels approach, voxels in a 3 × 3 neighbourhood around points on the path received a weight that decreased exponentially with the distance of the center of the voxel, across all slices.We observed the reconstructed RSP of a voxel that the unmodified input proton history passed through.

Recording of RNG Calls
To understand the cause of jumps observed in the setup of section 2.3.1, we additionally performed the following analysis.After identifying the precise location of the jump via bisection, we used a debugger to output the backtrace of every call to the RNG.After masking floating-point numbers and pointers, we obtained a medium-granular record of the control flow in the program for the particular input used to run it.We produced four of these records in close proximity to one particular jump, two on each side.

Monte Carlo Subprocedure
In the top row of figure 6 for the GATE setup of section 2.3.1, f E,1 , f E,2 , f pos,1 , f pos,2 appear as piecewise differentiable functions with around one jump per 0.1 MeV.
For low perturbations x − x of the beam energy around x = 230 MeV, the log-log plot in figure 7 shows a straight line with slope 1, indicating that the four functions were differentiable at x.After some threshold perturbation given by the distance to the next discontinuity, the approximation error of the energy depositions f E,1 , f E,2 rose suddenly.Analogous observations were made for the other test values of x listed in section 2.3.1.
Zooming in around the jump at 230.106 MeV (bottom row of figure 6), we observe that it is actually a cluster of many discontinuities.We further investigated two discontinuities using masked records of backtraces of RNG calls (section 2.3.4).Choosing four input values close to 230.106 187 089 MeV by appending a digit 4, 5, 6 or 7 to this decimal representation, we obtained two inputs on each side of a discontinuity.The records for inputs on the same side agreed.When inputs from both sides were used, the control flows differed at some point because the physical interaction length (PIL) compared differently to the current physical step size, leading to different processes being selected as PIL "winners".Similarly, the records on both sides of the discontinuity at 230.106 225 175 MeV started to deviate at a comparison between step lengths for "soft" and "hard" scattering in the Wentzel VI model (Fernández-Varea et al., 1993;Urbán, 2006;Ivanchenko et al., 2010).The comparison was sensitive to perturbations of the input because the step lengths happened to take values very close to each other.The

Track Reconstruction
The global behaviour in figure 8a shows that the investigated input parameter offers potential for optimization of the tracking accuracy (see section 2.3.2) by choosing it sufficiently large.
Figure 8b deals with medium-sized modifications.The high number of steps and the noisy behaviour of the plot in figure 8b indicate that the code uses non-differentiable operations very frequently, so probably linearizability ranges of other output variables are very small as well.
Figure 8c displays the effect of very small modifications of the input parameter.As the percentage of correctly reconstructed tracks is an inherently discrete quantity, we expect to see steps here, instead of a gradual transition.

RSP Reconstruction
As shown in figure 9a, the RSP computed by the DROP algorithm depended linearly on the WEPL in the setup of section 2.3.3.This statement is accurate up to floating-point accuracy.The non-linear LSCG algorithm introduced noise, as reported in figure 9b.
When a beam spot coordinate of the first track was modified, changes in the set of voxels traversed by the MLP lead to jumps in the DROP-reconstructed RSP, as shown in figure 10a.Between these discontinuities, the graph is almost linear (figure 10b), and the log-log plot in figure 10c numerically verifies that it was differentiable at x = −63.55.Tangents at this point were almost horizontal in figure 10a, so the reconstructed RSP changed much more via jumps than it did in a differentiable manner.
Figure 11 corresponds to figure 10, but used a fuzzy voxels approach to compute the matrix elements.The graph is still discontinuous wherever the set of voxels traversed by the MLP changes.However, the jumps were much smaller and in between, the functions changed significantly in an almost linear (figure 11b) and differentiable (figure 11c) manner.Figure 9: Dependency of the reconstructed RSP at a particular voxel on the WEPL of a particular track.While a negative WEPL would never be measured, it can serve as an input to both MBIR algorithms without problems.

Discussion: Challenges for Differentiating the Pipeline
As detailed in the end of section 2.2.5 based on figure 4, algorithms designed without an AD option in mind might need manual adaptations to make sure that their derivatives approximate the "true" function's derivatives.Besides, the linearization (4) is only helpful for the quantification of uncertainties and for optimization if the true function is "sufficiently smooth", as discussed in sections 2.2.3 and 2.2.4.
In this section, we discuss these aspects for the software pipeline outlined in section 2.1.

Differentiation of Randomized Code
No meaningful information could be gained by differentiating GATE with respect to its random input, i. e. the seed of the RNG.Regarding the other partial derivatives (w.r. t. detector parameters etc.), the RNG seed was kept constant in this study.
In section 3.1 we found that four particular outputs of GATE were piecewise differentiable with respect to the beam energy, but did also involve jumps.Two of the jumps were further investigated and it was found that the control flow of the program changed at this point because a floating-point comparison flipped, changing the number of calls to the RNG.Such a change severely affects the subsequent computations because the program then receives a shifted sequence of random numbers.
We therefore conjecture that the observed discontinuities are, at least partially, an artifact of how the RNG is used, and not of any physical significance.Restarting the RNG with precomputed random seeds at strategic locations in the code might remove some of this "numerical chaos".
We should recall here that frequent discontinuities do not deteriorate the accuracy of AD for computing derivatives if the derivatives exist for the respective input.However, they diminish the accuracy of the linearization formula (4), and therefore the value of the derivatives for applications in sections 2.2.3 and 2.2.4 regarding stand-alone GATE.For  Figure 10: Dependency of a particular voxel's RSP on a particular track's beam spot coordinate.Elements of the system matrix A were calculated using the mean chord length.
the pipeline as a whole, the issue might be less pronounced, as subsequent computational steps combine simulations of many independent protons, possibly averaging out the chaotic behaviour while keeping systematic dependencies.Further research in this direction may perturb inputs of a GATE simulation as part of a complete pCT software pipeline with a higher number of protons, and observe the reaction of, e. g., the reconstruction error of the RSP image.Instead of applying AD to the particle physics simulator itself, simulation data can be used to fit a surrogate model, which is then differentiated instead of the simulator (Dorigo et al., 2022).In general, we expect such an approach to reduce chaotic  Figure 11: Dependency of a particular voxel's RSP on a particular track's beam spot coordinate.Elements of the system matrix A were calculated using the a "fuzzy voxels" approach.
behaviour, evaluate faster, and reduce the workload needed to apply the AD tool, but it can be less accurate.Surrogate models for calorimeter showers are a very active field of research, see e. g.Dorigo et al. (2022).

Discrete Variables Related to Detector Output
The detector output consists of pixel activations that are either 0 or 1, i. e. take a discrete value, as opposed to continuous coordinates, energies etc.The continuous output of GATE is mapped into discrete values by the charge diffusion model's choice of which pixels to activate.The calculation of cluster centers preceding the track reconstructions maps the discrete pixel data back into a continuous range.
As the local behaviour of any function into a discrete set is either "constant" (not interesting) or "having a step" (not differentiable), we cannot make any use of derivative information here.Expressed differently, any discrete intermediate result comes from rounding of continuous coordinates, which erases all derivative information and is sometimes discontinuous.This is what we observed in section 3.2.Figure 8c shows the jumps of figure 4b, and figure 8b shows the noise of figure 4c.One idea to fix this is to replace the continuous-to-discrete-to-continuous conversion by a surrogate model.In the easiest case, one might just carry over the "ground truth" hit positions and energy depositions from the Monte Carlo subprocedure to the RSP reconstruction, bypassing the charge diffusion model, clustering and tracking subprocedures.

Numerical Noise of the MBIR Solver
Even if the error of a least-squares solution provided by an approximative numerical solver is small, a noisy error as illustrated in figure 4c, and observed for LSCG in figure 9b in section 3.3, has large derivatives.The noise probably results from stopping the iterative solver before it reaches full convergence.DROP performs only linear operations on the right hand side and hence the reconstructed RSP depends on the WEPL in a linear way, making it a better choice for further investigations.

Discrete Variables Related to the Matrix Generation in MBIR Algorithms
While stepping along a path and determining the current voxel, an affine-linear function is applied to the current coordinates and the result is rounded; in the end, a certain path either intersects, or does not intersect, a certain voxel.This discrete choice introduces discontinuities w. r. t. track coordinates, as can be seen in figure 10a in section 3.3, where the matrix element of an intersected voxel was set to a mean chord length.If a perturbation of a track coordinate does not change the set of voxels intersected by the path, this value only changes very little, due to its dependency on the tangent vector of the path.Therefore the reconstructed RSP is nearly a "step function" whose gradients exist by figures 10b and 10c, but are useless for optimization and UQ.
In figure 11 a fuzzy voxels approach was used, which leaves some discontinuities but makes the gradients represent the overall behaviour very well.This however comes at the price of a longer reconstruction time and blurrier result.
Either way, output variables of the full pipeline might be smoother because they combine the RSP values of many voxels.

Conclusions
We presented the algorithmic substeps of the Bergen pCT collaboration's incipient software pipeline, with special focus on linearizability as a prerequisite for gradientbased optimization and UQ.
Both the Monte Carlo and MBIR subprocedure's central steps compute piecewise differentiable functions with discontinuities.For the MBIR subprocedure, we identified the cause of discontinuities and proposed a way to mitigate it.Our study indicates that both codes are ready for the integration of AD.
The proton history reconstruction subprocedure involves many discrete variables, which present a huge obstacle to (algorithmic) differentiability.We investigated the tracking step as an example and found a very noisy behaviour.It is probably the best approach to "bridge" this subprocedure, carrying over the ground truth.

Figure 1 :
Figure 1: Schematic figure of the scanning process.
Computational steps of the MBIR subprocedure.

Figure 2 :
Figure 2: Computational steps in the pCT reconstruction pipeline.
Bad approximation; f (x) is far from being a Gaussian distribution.

Figure 3 :
Figure 3: A function f , shown by the dash-dotted plot in the central box, is applied to a Gaussian random variable x whose probability density function (PDF) is shown in the bottom graph.The orange graph in the left box conveys the true distribution of f (x), whereas the normal distribution with variance according to (5) is shown in blue.Cited from Aehle and Leonhardt (2021).

Figure 6 :Figure 7 :
Figure6: Numerical check whether GATE is well-linearizable: The energy deposition f E,j (x) (left) and a position coordinate f pos,j (x) (right) in the first (j = 1, black) and second (j = 2, gray) tracking layer are plotted against the beam energy x in a wide (top) and narrow (bottom) interval.
Zooming in even further, we can see steps.

Figure 8 :
Figure 8: Dependency of the tracking accuracy in % on the vertical axis w. r. t. a threshold x used by the track-following scheme.Figures 8a to 8c only differ in the range of the horizontal axis.
(a) Global perspective for various numbers of DROP steps.Changes in the set of traversed voxels are marked by vertical lines.The selected voxel is traversed by the selected track precisely if x lies in the interval marked with thick vertical lines.At x = −63.55,f (x) is differentiable as verified in figure10c(in the case of 400 iterations), and the almost horizontal tangents are indicated by dashed lines.Zoom into on of the steps of figure10a, with 400 iterations.10 −14 10 −9 10 −4 10 1 Logarithmic plot of the numerator and denominator of the difference quotient in the same range as figure 10b.
(a) Global perspective for various numbers of DROP steps.At x = −63.55,f (x) is differentiable as verified in figure 11c (in the case of 400 iterations), and the dashed tangent lines at x reflect the global behaviour very well.Logarithmic plot of the ator and denominator of the difference quotient in the same range as figure 11b.