L2LFlows: Generating High-Fidelity 3D Calorimeter Images

We explore the use of normalizing flows to emulate Monte Carlo detector simulations of photon showers in a high-granularity electromagnetic calorimeter prototype for the International Large Detector (ILD). Our proposed method -- which we refer to as "Layer-to-Layer Flows" (L2LFlows) -- is an evolution of the CaloFlow architecture adapted to a higher-dimensional setting (30 layers of $10\times 10$ voxels each). The main innovation of L2LFlows is the introduction of 30 separate normalizing flows, one for each layer of the calorimeter, where each flow is conditioned on the previous five layers in order to learn the layer-to-layer correlations. We compare our results to the BIB-AE, a state-of-the-art generative network trained on the same dataset, and find that our model has significantly improved fidelity.


Introduction
In order to study Nature at the fundamental level and rigorously test the Standard Model (SM) of particle physics, current and future collider experiments need accurate and plentiful simulations of the detector response. The most precise simulation toolkit in high-energy physics is Geant4 [1][2][3]; however, its precision comes at an enormous computational cost. The bulk of this cost is borne by the simulation of individual particle showers in the calorimeter -- so much so that it is currently a major bottleneck at the LHC, and is forecast to overwhelm the available computational resources without further R&D [4,5]. This has motivated, in recent years, a growing interest in using deep generative models as fast and accurate emulators of Geant4 simulations [6][7][8][9][10][11][12][13][14][15][16][17][18][19][20][21][22][23]. For this purpose, a variety of generative architectures have been considered, including generative adversarial networks (GANs) [24], variational autoencoders (VAEs) [25], normalizing flows (NFs) [26], and score-based generative models [27][28][29]. NFs in particular have shown promising fidelity when applied to the simulation of comparatively low-dimensional ($\sim 500$) calorimeter datasets [14,15,22]. NFs are diffeomorphisms between the data space $\mathbb{R}^d$ and a latent space with a tractable distribution, such as a Gaussian in $\mathbb{R}^d$. They are trained by minimizing the negative log-likelihood (NLL), which gives them a more meaningful loss function than the commonly used GANs or VAEs, and tends to result in higher-quality generated samples for calorimeter simulations. However, the drawback of NFs is that they are very memory-intensive, requiring many parameters to encode a sufficiently expressive invertible transformation between the data space and a latent space of the same dimensionality. This has made it challenging to generalize the CaloFlow approach of [14,15,22] to higher-dimensional calorimeter datasets. Going to higher dimensionalities in calorimeter read-outs is desired to improve the accuracy of particle reconstructions and to aid in separating overlapping showers. This motivates future detector concepts such as the International Large Detector (ILD) [31,32], which is one of the proposed detectors at the International Linear Collider (ILC), or the CMS HGCAL at the HL-LHC [33].
This work explores the steps needed to adapt NFs to a higher-dimensional calorimeter dataset, resulting in the new L2LFlows architecture. Although the methods we devise are fairly general and should have many potential future applications (including to datasets 2 and 3 of the CaloChallenge [36], which are higher-dimensional than the dataset considered in this work), for concreteness we focus on photon showers in an electromagnetic calorimeter (ECal) prototype for the ILD as a testbed. These showers are simulated using Geant4 and projected onto a regular grid of 30 × 30 × 30 voxels (there are 30 layers in total, each layer having 30 voxels in the $x$- and $y$-direction, respectively). Unlike previous works based on this dataset [12,16], here we further reduce the transverse dimensionality to 10 × 10 (by retaining only the central voxels in each layer), resulting in a 30 × 10 × 10 dataset shape. This is appropriate, since the particle impinges on the center of the 30 × 30 × 30 cube. The reduction in dimensionality is done to shorten the computation times needed for this first proof-of-concept demonstration of NFs for higher-dimensional calorimeter simulation. As we discuss further in Sec. 5, we expect the generalization to the full 30 × 30 × 30-dimensional dataset to be straightforward.
As in the CaloFlow [14,15,22] approach, we choose a two-step strategy for the architecture, where we generate the total energy depositions and the shower shapes in each layer separately. The first step -- generating the total energy deposition per layer -- is extremely lightweight and essentially unchanged from the original CaloFlow. We will refer to this first step as the energy distribution flow (in [14,15,22] it was called "Flow I"). The second step -- describing the shower shapes in each layer -- is where we have innovated beyond the original CaloFlow algorithm. There, a second NF (called "Flow II") was trained to generate the full shower across all layers, but in our experiments this did not generalize well to the higher-dimensional setting in terms of memory consumption. Instead, we train 30 separate NFs, where each NF generates the shower in one specific calorimeter layer, but is conditioned on the voxel energies in the five previous layers. We refer to this step as the causal flows; it is the key innovation that allows us to generalize to higher-dimensional datasets. By splitting the task into 30 separate NFs, with conditioning from layer to layer, we keep both the memory requirements and the fidelity of the generated showers commensurate with the original CaloFlow.
We will see that this new approach yields superior performance across several performance metrics compared to the state-of-the-art Bounded Information Bottleneck Autoencoder (BIB-AE) [12,16,18] architecture. In addition, this approach generalizes naturally to more irregularly shaped detector voxelizations and allows for parallel training on multiple GPUs.
The structure of this paper is as follows: In Sec. 2, the dataset is introduced in more detail; Sec. 3 describes our architecture; Sec. 4 shows our results; and finally Sec. 5 concludes and gives an outlook for future work.

Dataset
The ILD [31,32] is one of two proposed detector concepts for the ILC. As a modern detector concept, the ILD is specifically optimized for particle flow algorithms (PFAs) [37,38], which aim at the correct reconstruction of every individual particle created in the event. One key requirement for PFAs is a precise and highly granular set of hadronic and electromagnetic calorimeters.
The ILD ECal used as the basis of this study is a sampling calorimeter with 30 alternating layers of passive tungsten absorbers and active silicon sensors. The first 20 absorber layers have a thickness of 2.1 mm, with the subsequent 10 layers being twice as thick. Each silicon layer features individual cells with a size of $5 \times 5\,\text{mm}^2$.
We utilize a dataset containing 950k photon showers simulated in the detailed and realistic detector model of the ILD, implemented in DD4hep [39] in the iLCSoft [40] framework. This dataset was used in past work on generative calorimeter simulation [12], where it is described in more detail. For this work, the 950k showers are split into 760k training, 95k validation and 95k test showers. These showers originate from photons with energies uniformly distributed between 10 and 100 GeV and were simulated using Geant4 version 10.4 (with the QGSP_BERT physics list). All photons hit the calorimeter at the same position and perpendicularly to the calorimeter layers. The coordinate system used in this work defines the $z$-axis to be parallel to the trajectory of the photons, with the $x$-$y$ plane being parallel to the calorimeter layers. For the classifier tests in Sec. 4.2, a further 665k independent test showers are available.
In addition, for some comparison plots, we also use independent test sets containing 4k showers each, with discrete energies between 20 and 100 GeV in 10 GeV steps. These discrete incident energies will be used to study, among other quantities, the linearity and the width.
While previous generative projects on this dataset [12,16] used a data shape of 30 × 30 × 30, this work focuses on the core of the showers, located in a 10 × 10 cell region in the $x$-$y$ plane around the impact point, to reduce the computation times and the memory footprint of the generative models. The cores of the showers still contain 92% of the shower energy. This results in a data shape of 30 × 10 × 10, where the first dimension indicates the depth along the propagation direction of the shower.

L2LFlows
Our approach uses NFs [26] to learn the probability density of showers in the calorimeter conditioned on the incident energy, $p(\text{shower}\,|\,\text{incident energy})$. NFs efficiently learn a change-of-variables transformation
$$p_x(x) = p_z(z) \left| \det \frac{\partial f^{-1}}{\partial z} \right|^{-1}, \qquad (3.1)$$
where $p_x$ and $p_z$ are probability density functions in data and latent space, respectively, $f: \mathbb{R}^d \to \mathbb{R}^d$ is a diffeomorphism between the two spaces, $z = f(x)$, and $\partial f^{-1} / \partial z$ denotes the Jacobian matrix of $f^{-1}$. Since Eq. (3.1) gives us access to the negative log-likelihood (NLL) of data points, NFs can be trained by minimizing the NLL directly.
In our case, $x$ will be the voxelized energy deposits of individual calorimeter showers and $p_z$ a multivariate Gaussian distribution. In order to compute the Jacobian determinant efficiently, we use autoregressive transformations realized as masked autoregressive flows (MAFs) [41], built with rational quadratic splines (RQS) [42,43] and Masked Autoencoder for Distribution Estimation (MADE) blocks [44]. The NFs are implemented with the help of the nflows package [45] in PyTorch [46]. In the following subsections, we will describe the two parts of L2LFlows, the energy distribution flow and the causal flows, which generate the layer energies and the shower shapes, respectively.
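To make this concrete, the following minimal sketch (our illustration; the layer sizes, block counts and bin counts are placeholders, not the values of Tab. 4) shows how a conditional MAF with RQS transforms can be assembled from MADE blocks with nflows and trained by minimizing the NLL:

```python
import torch
from nflows.flows.base import Flow
from nflows.distributions.normal import StandardNormal
from nflows.transforms.base import CompositeTransform
from nflows.transforms.permutations import ReversePermutation
from nflows.transforms.autoregressive import (
    MaskedPiecewiseRationalQuadraticAutoregressiveTransform,
)

def build_maf(features, context_features, num_made_blocks=4,
              hidden_features=128, num_bins=8):
    # Stack MADE-based RQS transforms, with permutations in between so that
    # every dimension eventually depends on every other one.
    transforms = []
    for _ in range(num_made_blocks):
        transforms.append(MaskedPiecewiseRationalQuadraticAutoregressiveTransform(
            features=features,
            hidden_features=hidden_features,
            context_features=context_features,
            num_bins=num_bins,
            tails="linear",
        ))
        transforms.append(ReversePermutation(features=features))
    return Flow(CompositeTransform(transforms), StandardNormal(shape=[features]))

flow = build_maf(features=100, context_features=502)  # e.g. one causal flow
x = torch.rand(32, 100)     # preprocessed voxel energies of one layer
ctx = torch.rand(32, 502)   # conditioning features, cf. Tab. 1
loss = -flow.log_prob(inputs=x, context=ctx).mean()   # NLL training objective
```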

energy distribution flow
The task of the energy distribution flow is to learn the total energy deposition per ECal layer (note that we count the ECal layers from 0 instead of 1), which is described by the following conditional PDF:
$$p(E_0, \dots, E_{29} \,|\, E_\text{inc}), \qquad (3.2)$$
where $E_\text{inc}$ denotes the incident particle energy, and $E_i$ is the energy deposited by the shower in layer $i$, obtained by summing over all voxels in the given layer. Within a sampling calorimeter, it is necessary to apply an energy threshold to account for the fact that calorimeters have inherent electronic noise, so depositions that are too small become unreliable. We therefore apply a cutoff to the individual voxel energies with a threshold of $10^{-4}$ GeV before calculating the layer energies $E_i$. This threshold corresponds to half the energy loss of a minimum-ionizing particle in the ILD ECal [12].
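As a small illustration (tensor shapes assumed from Sec. 2), the layer energies $E_i$ can be obtained from the voxelized showers as follows:

```python
import torch

# Apply the half-MIP cutoff of 1e-4 GeV to the voxel energies, then sum each
# layer to obtain the targets E_0, ..., E_29 of the energy distribution flow.
showers = torch.rand(32, 30, 10, 10)        # (batch, layer, x, y), in GeV
voxels = torch.where(showers >= 1e-4, showers, torch.zeros(()))
layer_energies = voxels.sum(dim=(2, 3))     # shape (32, 30)
```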
The energy distribution flow is lightweight and, as such, does not present a bottleneck in computation times. Therefore, the model closely follows the original CaloFlow approach. The most noteworthy change is the modified preprocessing. Since the ILD ECal is a sampling calorimeter, only a fraction of the energy of a particle is recorded. Thus, it was not necessary to enforce a strict upper limit $\sum_i E_i \leq E_\text{inc}$ through the preprocessing, as was the case in the original CaloFlow. Instead, we choose a simpler preprocessing, outlined in App. B.
The energy distribution flow architecture is shown in Fig. 1, and details of the model and its training can be found in App. A. By construction, generation happens recursively over the dimensions of the MAF. In total, the energy distribution flow has about 200k parameters and was trained for 200 epochs on a single NVIDIA® V100 with 32 GB VRAM, which took less than 8 hours. We subsequently use the validation NLL to select the best checkpoint among the 200 epochs.

causal flows
Next, we turn to the second step of the generation process: generating shower shapes conditioned on the total incident energy and the total deposited energies in each layer. Our overarching goal here, as in the original CaloFlow, is to learn
$$p(\mathcal{I}_0, \dots, \mathcal{I}_{29} \,|\, E_0, \dots, E_{29}, E_\text{inc}), \qquad (3.3)$$
where the ECal voxel energy depositions of layer $i$ are denoted by $\mathcal{I}_i \in \mathbb{R}^{100}$. Unlike in Sec. 3.1, no cutoff is applied to the voxel energy depositions used in the causal flows training. This prevents potential sharp edges in the voxel data, which would be caused by the cutoff, from interfering with the training of the causal flows. (For the energy distribution flow, this issue was already circumvented, as each layer energy is the aggregate of multiple voxels, lessening any potential edges.) The voxel energy depositions are preprocessed similarly to the layer energies used in the energy distribution flow. The precise nature of the preprocessing is outlined in App. B.
In the original CaloFlow, a single NF was trained on all the calorimeter voxels of every layer together, to directly learn (3.3). Since the number of parameters of a single NF scales quadratically with the dimensionality $d$ of the samples, the single-NF approach of the original CaloFlow applied to the ILD dataset (which has $d = 3000$) would lead to a prohibitive number of parameters (> 1B). One can attempt to reduce the number of parameters by decreasing the number of MADE blocks as well as RQS bins, but this leads to a significantly reduced fidelity.
To reduce the number of parameters without sacrificing quality, our key idea is to instead train one NF per ECal layer. Since the evolution of a shower in layer $i$ depends on what happened in the previous layers, NF $i$ has to be conditioned on the voxel energy depositions of the previous layers. In other words, we endeavor to train 30 separate NFs to learn the distributions
$$p_i(\mathcal{I}_i \,|\, \mathcal{I}_0, \dots, \mathcal{I}_{i-1}, E_0, \dots, E_{29}, E_\text{inc}). \qquad (3.4)$$
If each distribution $p_i$ could be learned perfectly, then they could be multiplied together to reconstruct the full joint distribution (3.3). This would be, in effect, its own kind of autoregressive model. However, in later layers there are a lot of conditioning features, and we observed that attempting to model the full conditional likelihood (3.4) resulted in suboptimal performance.
Instead, we found it beneficial to approximate the full conditional distribution (3.4) with
$$p_i(\mathcal{I}_i \,|\, \mathcal{I}_{i-n_\text{cond}}, \dots, \mathcal{I}_{i-1}, E_0, \dots, E_{29}, E_\text{inc}), \qquad (3.5)$$
i.e. to truncate the conditioning at $n_\text{cond}$ previous layers. Due to the computational cost, a complete scan over $n_\text{cond}$ was not possible, but some small trials convinced us that $n_\text{cond} = 5$ gives a reasonably good balance between the number of parameters and the performance. In an effort to further reduce the number of parameters in the NFs, the models are not directly conditioned on the full previous layers. Instead, these layers are passed through an embedding network $g_i$. In an ablation study, we found the performance difference between conditioning on only $E_i$ and $E_\text{inc}$ and conditioning on all of $E_0, \dots, E_{29}$ and $E_\text{inc}$ to be small; hence, for simplicity, we only condition on $E_i$ and $E_\text{inc}$, and our PDF from Eq. (3.5) simplifies to
$$p_i(\mathcal{I}_i \,|\, g_i(\mathcal{I}_{i-n_\text{cond}}, \dots, \mathcal{I}_{i-1}, E_i, E_\text{inc})). \qquad (3.6)$$
The embedding network $g_i$ takes in the context features $\mathcal{I}_{i-n_\text{cond}}, \dots, \mathcal{I}_{i-1}, E_i, E_\text{inc}$ and learns a representation of them that minimizes the NLL loss of the NFs; it is trained jointly with the NF. Table 1 shows the context features for each NF. For NFs 0 to 4, there are fewer than 5 preceding ECal layers, thus they have fewer than 502 context features. (NF 0, which learns the distribution of the voxel energies of layer 0, does not use an embedding network, since it is only conditioned on $E_0$ and $E_\text{inc}$.) Because of this conditioning scheme, generation happens recursively; training, however, can happen in parallel on multiple GPUs, since all required context features are derived from the training data. For example, to generate the voxel energies in layer 2, those of layers 0 and 1 must be generated first, and then NF 2 is conditioned on the voxel energies of the previous layers as well as on $E_2$ and $E_\text{inc}$. In this way, the whole calorimeter can be traversed. The architecture of the causal flows is visualized in Fig. 2. A more detailed description of the model can again be found in App. A.

Table 1: For the conditioning on the previous 5 ECal layers, i.e. $n_\text{cond} = 5$, this table shows the context features each NF receives and their shape before being fed into an embedding network. The leading shape dimension is the batch size used during training or sampling.
During generation, it turns out that the conditioning on the $E_i$ is not sufficient to guarantee that the energies per layer of the sampled showers equal $E_i$. Hence, some postprocessing, such as rescaling to the $E_i$ of the energy distribution flow and thresholding of low-energy voxels, is necessary. We detail our method in App. B and illustrate its effect in Fig. 13.
The causal flows have in total 44.8M parameters, and they were trained on a single NVIDIA® V100 with 32 GB VRAM for about 55 GPU-hours. As for the energy distribution flow, we use the validation NLL to select the best checkpoint among the 200 epochs.
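The recursive sampling described above can be summarized in a short sketch (the names `flows`, `embed` and `N_COND` are ours; preprocessing and postprocessing are omitted):

```python
import torch

N_COND = 5  # number of previous layers used as conditioning

def sample_shower(flows, embed, layer_energies, e_inc):
    """flows[i]: conditional NF for layer i; embed[i]: its embedding network.
    layer_energies: (batch, 30) sampled from the energy distribution flow;
    e_inc: (batch, 1) incident energies."""
    layers = []
    for i in range(30):
        prev = layers[max(0, i - N_COND):i]         # up to 5 generated layers
        raw = torch.cat(prev + [layer_energies[:, i:i + 1], e_inc], dim=1)
        context = raw if i == 0 else embed[i](raw)  # NF 0 has no embedding net
        layers.append(flows[i].sample(1, context=context).squeeze(1))
    return torch.stack(layers, dim=1)               # (batch, 30, 100)
```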

Results
We now evaluate the performance of the L2LFlows approach. We benchmark it against a state-of-the-art shower generation model based on the BIB-AE framework, adapted from Ref. [18] and modified, by retraining, to operate on the photon showers with shape 30 × 10 × 10. The BIB-AE consists of an encoder and a decoder pair, which is trained using a set of adversarial critics. The BIB-AE generation process employs an additional post-processing step and a Kernel-Density-Estimation-based latent sampling, as described in Ref. [18]. The BIB-AE model and the PostProcessor model have a combined total of 9.3M parameters, while the critics used to train them have an additional 3.7M parameters.

Distributions
Figure 3 shows a single test shower from Geant4 as well as one generated shower each from the BIB-AE and L2LFlows. All single showers have an incident energy $E_\text{inc} \approx 50$ GeV. We see that the individual shower from L2LFlows looks reasonable, with a broadly realistic morphology of voxels and energy depositions.
Figure 4 shows the overlay of 95k showers, i.e. the mean of the voxel energies of 95k showers. In order to create two-dimensional plots, the voxel energies are summed over the $x$-, $y$- or $z$-axis. For Geant4, the 95k test showers are used. To highlight potential differences for the BIB-AE and L2LFlows, we show the absolute relative deviation from Geant4 for both generative networks per voxel,
$$\Delta^\text{model}_{jk} = \left| \frac{\langle E^\text{model}_{jk}\rangle - \langle E^\text{Geant4}_{jk}\rangle}{\langle E^\text{Geant4}_{jk}\rangle} \right|,$$
evaluated for the BIB-AE (Eq. 4.1) and for L2LFlows (Eq. 4.2), where $j$ and $k$ denote voxel positions. We observe that, in general, the generative models capture the overlay quite well, with L2LFlows having smaller deviations from Geant4 than the BIB-AE.
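A sketch of this comparison (hypothetical array names, using numpy):

```python
import numpy as np

def overlay_deviation(model, geant4, axis=0):
    """model, geant4: arrays of shape (n_showers, 30, 10, 10). Average over
    showers, project the 3D grid onto a plane by summing along `axis`
    (0: z, 1: x, 2: y here), and return the absolute relative deviation
    from the Geant4 overlay, cf. Eqs. (4.1)/(4.2)."""
    mean_model = model.mean(axis=0).sum(axis=axis)
    mean_g4 = geant4.mean(axis=0).sum(axis=axis)
    return np.abs(mean_model - mean_g4) / mean_g4
```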
To compare the performance of the generative models in more detail, we start by looking at the showers on the voxel level. Figure 5 shows the distributions of the voxel energies as well as the sparsity, i.e. the number of non-zero voxels per shower. One characteristic that repeats itself in several histograms is that the BIB-AE is not capable of capturing the full Geant4 distribution, as can be seen, e.g., in the sparsity plot; L2LFlows is much better in this regard. Further, the energy deposited around the energy of a minimum-ionizing particle (MIP) in the voxel distribution is better modeled by L2LFlows than by the BIB-AE, which slightly overshoots it. While L2LFlows does not learn the Geant4 distribution perfectly, it learns the distributions much better than the BIB-AE.
For $E_\text{inc} \in \{20, 80\}$ GeV, Fig. 6 shows the energy profiles in the $x$-, $y$- and $z$-directions. As can be seen, the larger the incident energy $E_\text{inc}$, the more the maximum of the energy profile shifts to later layers, which both the BIB-AE and L2LFlows are able to learn. Deviations for both simulators mainly exist in a few initial and final layers.
The distributions in Fig. 7 show the total energy depositions ($E_\text{depos} := \sum_i E_i$), both for continuous incident energies uniformly distributed in [10, 100] GeV (left) and for discrete incident energies $E_\text{inc} \in \{20, 50, 80\}$ GeV (right). In both of these distributions, we observe that L2LFlows is much closer to the Geant4 distribution than the BIB-AE.
Figure 8 shows the linearity (and its relative deviation from Geant4) as well as the width (again with its relative deviation). The linearity $\mu_{90}$ is defined as the mean deposited energy over the ECal for discrete $E_\text{inc}$, computed on the 90% subset of the samples that has the smallest range; the width $w_{90}$ is defined as $w_{90} := \sigma_{90}/\mu_{90}$, where $\sigma_{90}$ is the standard deviation of the 90% subset of the energy deposition samples that has the smallest range. Note that these quantities do not correspond to the actual calorimeter linearity or resolution, as the increased thickness of the last 10 ECal layers is not calibrated for [12]; they are, however, still a vital means for determining the performance of the generative approaches. For the linearity, the relative deviation is at most about 1% for the BIB-AE, while for L2LFlows the deviation is everywhere below 0.75%. For the width, the relative deviation for L2LFlows is everywhere below 5%, whereas for the BIB-AE, the maximum deviation is about 15%.
It is also interesting to examine the ratio of $E_\text{depos}$ over $E_\text{inc}$ plotted as a function of $E_\text{inc}$. The upper row of Fig. 9 shows that the functional form of the ratio is not constant for Geant4. While a perfect calorimeter would yield a constant ratio, in practice, because of leakage and the increased thickness of the last ten absorber layers, the curve falls off over the range. The fact that the ratio of the deposited over the incident energy is only $\mathcal{O}(1\%)$ is expected, as the ILD ECal is a sampling calorimeter. As becomes apparent from Fig. 9, L2LFlows learns the functional form much better than the BIB-AE. In particular, the BIB-AE has problems at the edges. At the left edge, i.e. for $E_\text{inc} \approx 10$ GeV, ratios of 2% and more are too populated compared to Geant4, while ratios of around 1.5% and less are too thinly populated. At the right edge, i.e. for $E_\text{inc} \approx 100$ GeV, the functional form falls off too quickly. Further, in the middle row of Fig. 9, we show the sparsity plotted against $E_\text{depos}$. The BIB-AE learns a distribution that is thinner than the one from Geant4, and its core has too many occurrences. For L2LFlows, the agreement with the Geant4 distribution is much better, and differences are barely visible by eye. Finally, the last row of Fig. 9 shows the 2D correlations of the center of gravity in the $z$-direction versus the total deposited energy. It can be seen that the BIB-AE is yet again not capturing the full distribution, as its 2D plot is more compact compared to Geant4. In contrast, L2LFlows exhibits a superb performance.
In addition, Fig. 10 shows correlation matrices for the pairwise Pearson correlation coefficients between several high-level observables for Geant4, together with the difference between Geant4 and the BIB-AE and L2LFlows. The observables are, in order of appearance, the first and second moments along the $x$, $y$, and $z$ directions, the visible energy sum, the incident photon energy, the number of hits, and the energy fractions in the three thirds of the calorimeter along the $z$-direction. More details can be found in Ref. [12]. It can be seen that both generative models correctly describe a large number of the investigated pairwise correlations. Both models do, however, struggle with specific correlations involving the second moments in the $x$- and $y$-directions. Finally, Fig. 11 shows the total energy depositions per layer for four selected layers. In layers 2, 8, 14 and 26, L2LFlows is at least comparable to the BIB-AE, if not better.
Judging from the histograms and plots shown so far, L2LFlows seems to outperform the BIB-AE in almost every single physics quantity; however, it does slightly worse in capturing pairwise correlations.
In order to judge the performance more comprehensively in the full multivariate phase space, various metrics have been suggested in [14,47,48]. For this comparison, we turn to the classifier-based tests described in [14,48] in the following subsection, and leave the exploration of other metrics suggested in [47] as a future research direction.

Classifier Tests
As in [14,15,22], we now turn to a classifier-based metric to evaluate the quality of the generated showers in the full 3000-dimensional phase space. In total, two binary fully-connected classifiers are trained, one on Geant4 vs BIB-AE generated showers, the other on Geant4 vs L2LFlows generated showers. Both classifiers have the same architecture and make use of the same hyperparameters; details can be found in App. C. The idea of the classifier metric is that if the classifier is optimal, then by the Neyman-Pearson lemma it directly computes the likelihood ratio $p_\text{generated}(x)/p_\text{reference}(x)$ in the full phase space. A perfect generative model should have $p_\text{generated} = p_\text{reference}$ and optimal classifier scores that are identically 0.5. (Indeed, we find an AUC of 0.5 when training on Geant4 vs Geant4 samples.) For an imperfect generative model, the optimal classifier should be the most powerful detector of any deviations from $p_\text{generated} = p_\text{reference}$.
Of course, it is never possible, given finite samples and finite model capacity, to learn the truly optimal classifier. Therefore, the classifier metric we evaluate here is at best an approximate measure of model quality. At most, we could expect the classifier AUC score we obtain here to be a lower bound on the true AUC score that would be given by the optimal classifier. However, given identical model architectures and training set sizes, we expect the relative comparison of binary classifier scores between Geant4 vs BIB-AE and Geant4 vs L2LFlows to still be meaningful and informative.
The results of 10 classifier trainings are shown in Tab. 2. As can be observed, the BIB-AE-generated showers allow for almost perfect classification, which is reflected in an AUC close to 1. The L2LFlows-generated showers, on the other hand, are much better able to fool such a classifier. However, we note that there is still some separation power with respect to Geant4-generated showers, as the mean AUC of the classifiers is far away from 0.5.
In Tab. 2, we have also gone beyond previous works to study the dependence of the classifier metric on the training sample size. (We only studied this dependence for L2LFlows, since the BIB-AE AUC is already very close to 1 when the classifier is trained on only 95k showers.) As becomes apparent, the mean AUC for L2LFlows worsens with more showers, which is unsurprising, as with more statistics the classifier can find more differences between the Geant4- and L2LFlows-generated showers. At an even larger number of showers used for classifier training, we would expect the finite size of the generator training set to become an issue, too [49][50][51]. Nevertheless, we observe that for a given number of showers, the BIB-AE showers are more separable from Geant4 than the L2LFlows showers, indicating a better performance of L2LFlows. Also, even though the AUC scores for Geant4 vs. L2LFlows worsen with more training data (and may be asymptoting to 1; there is insufficient training data to say for sure), the fact that they are not immediately close to 1 (as is the case for Geant4 vs. BIB-AE) is a further indication that the L2LFlows showers are of higher quality.
To further test the relative quality of L2LFlows vs. the BIB-AE, we use the new Multi-Model Classifier Metric proposed in [48]. Instead of training separate binary classifiers between each generative model and the reference data, which can be constrained by limited amounts of the latter, we train a classifier (potentially multi-class) between the different generative models. This classifier learns the probability that a shower came from each model. We then evaluate it on Geant4, BIB-AE and L2LFlows showers, and see which model the classifier prefers. Note that while there are some limitations to the use of classifiers as an absolute metric, as discussed earlier, we expect the interpretation in a relative sense (as is done here) to be more straightforward.
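A brief sketch of how this metric is evaluated (classifier and array names hypothetical):

```python
import torch

@torch.no_grad()
def mean_l2lflows_prob(classifier, showers):
    """Mean probability that `showers` come from L2LFlows, according to a
    classifier trained only on BIB-AE vs L2LFlows showers; its two output
    logits are assumed to correspond to [BIB-AE, L2LFlows]."""
    logits = classifier(showers)
    return logits.softmax(dim=1)[:, 1].mean().item()

# Evaluated on BIB-AE, L2LFlows and Geant4 test showers: the closer the
# Geant4 score is to a given model's score, the more Geant4-like that
# model's showers appear to the classifier.
```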
For this test, we use 760k showers sampled from the BIB-AE and from L2LFlows each. Just as for the Geant4 vs BIB-AE/L2LFlows classifiers, we make a 60% : 20% : 20% split to obtain training, validation and test showers for the classifier. The AUC of the classifier on the test dataset (where we evaluate the checkpoint with the highest validation accuracy) is 1.0000, implying that a fully-connected classifier has no trouble distinguishing between BIB-AE and L2LFlows showers. The architecture and hyperparameters of this BIB-AE vs L2LFlows classifier are identical to those of the Geant4 vs BIB-AE/L2LFlows classifiers, with the exception of the number of training epochs; see the details in App. C.
For the evaluation, we consider the test sets of the classifier, containing 152k showers each for the BIB-AE and L2LFlows, and use 152k Geant4 test showers to compare the classifier outputs. The means of the output probabilities $p(\text{L2LFlows}\,|\,x)$ for $x$ coming from the BIB-AE, L2LFlows, and Geant4 are 0.03%, 99.91%, and 98.84%, respectively. This indicates that Geant4- and L2LFlows-generated showers are much closer to each other than Geant4- and BIB-AE-generated showers are.
To further visualize this result, we plot the predictions of the classifier on the test showers in Fig. 12. This also shows that Geant4 showers are, on average, more likely to be identified by the classifier as coming from L2LFlows than from the BIB-AE. All of this strengthens our conclusion that L2LFlows captures the underlying shower distribution of Geant4 much better than the BIB-AE.

Figure 12: Predicted probability that showers come from L2LFlows, for Geant4 (grey), BIB-AE (orange) and L2LFlows (blue) showers as input. The plot is shown for the full spectrum with 152k showers for every simulator.

Shower Generation Timings
Table 3 shows the mean sampling time per shower for Geant4, the BIB-AE and L2LFlows. For L2LFlows, the sampling times of the energy distribution flow are not accounted for, as they are negligibly small compared to the mean shower generation times of the causal flows. For Geant4, the same number as in Ref. [12] is taken, as cropping the dataset from 30 × 30 × 30 to 30 × 10 × 10 was done after the Geant4 showers were simulated in the full ECal prototype. (Simulating the showers directly in a 30 × 10 × 10 volume would be unphysical, since this would not take into account backscattering, for example. Also, we do not expect a large difference between the generation timings for showers simulated in a 30 × 30 × 30 cube and in a 30 × 10 × 10 cuboid, since we focus on the core of the showers, where most energy depositions happen.) We note that, in contrast to Geant4, the shower generation times for L2LFlows and the BIB-AE do not depend on the incident energy. Since the generation time of a MAF scales with the dimensionality $d$ of the input samples, one can expect the sampling times for L2LFlows to worsen by a factor of 9 when going from the 30 × 10 × 10 to the full 30 × 30 × 30 data, while the Geant4 run time would stay the same (for the BIB-AE, the mean sampling times on the full dataset can be found in Ref. [12]). The main bottleneck is not our autoregressive treatment of the ECal layers, but rather the MAFs with which we model every single ECal layer.

In terms of the speedups obtained on the cropped dataset, L2LFlows is up to a factor of 200 slower than the BIB-AE (with a batch size of 1 on the CPU), and in comparison to Geant4, L2LFlows is only a factor of 3 faster on the CPU (with a batch size of 1000). On the GPU, L2LFlows is about 470 times faster than Geant4 (with a batch size of 128000), whereas the BIB-AE obtains a speedup of about 16000 (with a batch size of 2000). Reference [15] also observed mean sampling times that were much slower than those of its GAN baseline network from Refs. [6,7]; to combat this, a MAF-IAF setup using probability density distillation, inspired by Ref. [52], was used. The obtained speedup was a factor of $\mathcal{O}(500)$, with a negligible loss in shower quality. Here, IAF refers to the inverse autoregressive flow [53], an alternative architecture for autoregressive flows that we detail in App. A. Applying the same MAF-IAF concept to this work is an interesting future research direction; if it works, a speedup of $\mathcal{O}(100)$ can be expected. This implies that L2LFlows has the potential to outperform the BIB-AE not just in fidelity, but also in the speed with which the generated showers are obtained.

Table 3: For 25 runs, the mean and the standard deviation of the sampling time per shower, as well as the obtained speedup in comparison with Geant4, are shown for different batch sizes and hardware during sampling, for Geant4, the BIB-AE and L2LFlows. The GPU is an NVIDIA® A100 with 40 GB VRAM. For the CPU, an Intel® Xeon® E5-2640 v4 was chosen, and the value for Geant4 is taken from Ref. [12], where the simulated showers have a shape of 30 × 30 × 30.

Conclusions and Outlook
This work built on Ref. [14] and demonstrated for the first time that NFs can be used to generate high-fidelity showers in a highly granular sampling calorimeter. Showers were generated in a two-step approach, where the energy distribution flow first learned the energy depositions per ECal layer. Then, 30 NFs (one per layer) -- which we dubbed causal flows -- were used to learn the voxel distributions, while being conditioned on the total deposited energy in that layer, the incident energy of the photons, and the voxel energies of the previous 5 layers. The use of fully-connected embedding networks, which distill the conditioning features, cf. Tab. 1, further reduces the number of parameters with no loss in performance. It was found that for all distributions considered in Sec. 4.1, L2LFlows either outperforms the BIB-AE or is as good as it, with the exception of the correlations, where the BIB-AE performs slightly better.
Further, L2LFlows achieves a much better AUC than the state-of-the-art network on this dataset -- the BIB-AE -- in the classifier tests. The classifiers used in this work took as input both the Geant4-simulated and the neural-network-generated showers, as well as the incident energies of the photons: The BIB-AE yields an AUC of 0.9947 ± 0.0025, whereas L2LFlows leads to an AUC of 0.8518 ± 0.0042. We also trained a classifier directly on BIB-AE and L2LFlows showers. As shown, when taking Geant4 showers as input, such a classifier is much more likely to label them as L2LFlows showers instead of BIB-AE showers, further indicating the superior quality of the proposed approach. It was further shown that L2LFlows outperforms the BIB-AE in almost every considered physics distribution.
The two-step approach, which was first introduced in Ref. [14], can also be applied to other generative networks. For example, adding an energy distribution flow to the BIB-AE approach may also improve the fidelity of the generated showers there.
One bottleneck of the developed approach, however, is the required sampling time per shower. The problem is generally that a MAF is slow during generation, as it sequentially calculates each output dimension of a sample. If the MAF-IAF approach from Ref. [15] also succeeds in training the much faster IAF for L2LFlows, then a potential speedup factor of $\mathcal{O}(100)$ could be obtained. This would result in an NF architecture that could be used to sample faster than the BIB-AE.
Although the dataset has a uniform number of voxels in each layer, L2LFlows could straightforwardly generalize to non-uniform cases. In addition, splitting the learning of the shower shape into several NFs has the advantage that the training can be parallelized over several GPUs. However, one pays the price that, even when employing an IAF setup, 30 individual NF evaluations are required. This limits the speedup of the proposed IAF setup to $\mathcal{O}(100)$, as opposed to the $\mathcal{O}(3000)$ speedup that could be achieved by a single-flow approach.
Further, we believe our NF architecture to have applications beyond high-energy physics. L2LFlows could in principle also be studied for image or video generation. For example, Ref. [54] uses an NF to generate high-fidelity images, yet for large images, a batch size of 1 was used during training. To mitigate these memory constraints, it might be possible to use not just a single NF but several of them, where each NF sees only a subset of the pixels, yet is conditioned on the previous pixels. The NFs could learn the full image in a top-to-bottom approach, where the first NF learns the first set of pixels, the second NF learns the second set while being conditioned on the first set, and so on. Image or video generation would then happen sequentially. To the best of our knowledge, such an approach has not been considered in the literature yet, and since each NF can be trained separately, a higher batch size can be chosen. This proposed approach is very similar to autoregressive models such as PixelRNN [55], PixelCNN [56], PixelCNN++ [57] or PixelSNAIL [58], but instead of generating the image pixel by pixel, chunks of pixels would be generated at once.
As a proof of concept that NFs also scale to higher-dimensional datasets, this work cut down the 30 × 30 × 30 projection to a 30 × 10 × 10 projection. An extension of this work to the full projection is believed to be straightforward, as every NF would then have to learn a 900- instead of a 100-dimensional PDF, which should be computationally feasible with L2LFlows. In addition, this work can be extended by studying not just photon showers in the ILD ECal, but also pion showers in an HCal prototype for the ILD, as was done for the BIB-AE in Ref. [18], where the HCal was projected onto a cuboid of size 48 × 25 × 25. With L2LFlows, this would result in 48 NFs, where each NF has to learn a 625-dimensional distribution.
It is also important to perform angular conditioning studies in the future, as the dataset used in this work shot the photons perpendicularly into the ECal. And just as Ref. [18] considered the output of state-of-the-art reconstruction algorithms applied to the neural-network-generated showers, it would be interesting to do the same once L2LFlows has been extended to the full 30 × 30 × 30 dataset of the ILD ECal prototype. Last but not least, L2LFlows can be studied on the three different datasets of the CaloChallenge [36].

A Model Details

A single pass through the MADE block yields the correct transformation parameters for the first coordinate, since these do not depend on any other coordinate. Using these parameters to correctly invert $f_1$, we can get the correct parameters to invert $f_2$ after a second pass through the MADE block. In total, we therefore need $d$ passes through the MADE block to fully construct the inverse transformation. The MAF thus uses the fast single-pass direction to compute the log-likelihood, allowing for fast training at the price of slow sampling. The IAF would allow for faster sampling, but only at the expense of much slower (or even impossible, because of memory constraints) training. More details about MAFs and IAFs and their use for calorimeter simulations can be found in [15].
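For an affine autoregressive transform (a simplification of the RQS case used in this work), the $d$ sequential passes can be sketched as follows; `made` is a hypothetical masked network returning per-coordinate shift and log-scale:

```python
import torch

def maf_sample(made, z):
    """Invert z = (x - mu(x)) / sigma(x) coordinate by coordinate. The MADE
    mask makes (mu_i, sigma_i) depend only on x_0..x_{i-1}, so after pass i,
    coordinates 0..i of x are final."""
    x = torch.zeros_like(z)
    for _ in range(z.shape[-1]):        # d passes through the MADE block
        mu, log_sigma = made(x)
        x = mu + z * log_sigma.exp()
    return x
```

The forward (training) direction, by contrast, needs only a single pass, since all of $x$ is available at once.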
The details of the energy distribution flow are summarized in the left column of Tab. 4. The MADE blocks make use of fully-connected layers; their total number per MADE block is given by the input layer, the number of hidden layers and the output layer. The hidden layers inside the MADE blocks use the ReLU activation function. The minimum bin width and height refer to the minimum values of an RQS bin, and the minimum derivative to the minimum derivative value at the knots of a bin; the permutation hyperparameter refers to the permutation of the input features before they are passed to the MADE block. Since the transformations parameterized by each MADE block are autoregressive in nature, the permutation layers help to increase the expressivity of the normalizing flow. Here we differentiate between "random", which denotes a randomly determined permutation, and "reverse", where the permutation inverts the ordering of the data dimensions. During generation, double-precision parameters are used, as we found that using only single-precision parameters leads to numerical instabilities, which arose when the RQS solved a quadratic equation for the inverse [42,43]. During training, however, there is almost no advantage to using double precision, as we checked in an ablation study. Since up to 50% of the memory can be saved that way, we use single precision during training.
The hyperparameters of the causal flows can be found in the right column of Tab. 4. Just as for the energy distribution flow, an ablation study showed barely any advantage of double-precision parameters for training; hence, we use only single precision (generation still happens with double-precision parameters).

B.1 Preprocessing energy distribution flow
The training input of the energy distribution flow consists of the energies per layer $E_i$. These energies are preprocessed before they are passed to the energy distribution flow. In the first step, the layer energies $E_i$ are smeared using an additive Gaussian noise term with mean $\mu = 1$ keV and standard deviation $\sigma = 0.2$ keV (all negative values are clipped to 0). This helps the energy distribution flow learn the marginal distributions of the $E_i$ and results in a noticeable performance increase compared to training without the added noise.
The smeared energies per layer $E_i$ are then further processed using
$$E_i^\text{proc} := \ell(E_i) \quad \forall i \in \{0, \dots, 29\}, \qquad (B.1)$$
where $\ell$ denotes the logit transformation defined in Eq. (B.2), with its regularization parameter set to $10^{-6}$.

B.2 Preprocessing causal flows

Before they are passed to the causal flows, the voxel energies $\mathcal{I}_i$ are preprocessed. The first step consists of adding noise (sampled uniformly from $[0, 1]^{100}$ keV) to each $\mathcal{I}_i$. The resulting values are then normalized and logit-transformed in accordance with
$$\mathcal{I}_i^\text{logit} := \ell\left(\mathcal{I}_i / \mathcal{I}_\text{max}\right) \quad \forall i, \qquad (B.4)$$
where $\mathcal{I}_\text{max}$ denotes the maximum voxel energy taken over all training data and $\ell$ is the logit transformation defined in Eq. (B.2). As described in Sec. 3.2, NF $i$ is conditioned on the voxel energies of the previous 5 layers, as well as on the energy in its respective layer $E_i$ and the energy of the incident particle $E_\text{inc}$. The preprocessing of $E_\text{inc}$ is identical to what is given by Eq. (B.3). During training, the layer energies $E_i$ are derived from the voxel energies in the $i$-th layer $\mathcal{I}_i$ using
$$E_i := \sum \left(\mathcal{I}_i^\text{cut} + \text{noise}\right), \qquad (B.5)$$
where $\mathcal{I}_i^\text{cut}$ are the voxel energies after the threshold cut, defined by
$$\mathcal{I}_i^\text{cut} := \mathcal{I}_i \cdot \mathbb{1}(\mathcal{I}_i \geq \text{threshold}), \qquad (B.6)$$
and the noise in Eq. (B.5) refers to the Gaussian noise added during training, cf. Tab. 4. The $E_i$ are then further processed according to
$$E_i^\text{proc} := \log_{10}(E_i + s) + 1, \qquad (B.7)$$
for a small constant $s$. During generation, the $E_i$ come from the energy distribution flow, and they are also processed according to Eq. (B.7).
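A sketch of this preprocessing (our reconstruction; the form of $\ell$ and the noise scales are assumptions based on the description above):

```python
import torch

ALPHA = 1e-6  # assumed regularization parameter of the logit transformation

def logit_transform(u):
    # l(u): map values in (0, 1) to the real line, regularized by ALPHA.
    u = ALPHA + (1 - 2 * ALPHA) * u
    return torch.log(u / (1 - u))

def preprocess_voxels(voxels, voxel_max):
    """Eq. (B.4): add uniform noise in [0, 1] keV per voxel (1 keV = 1e-6 GeV),
    normalize by the maximum training-set voxel energy, logit-transform."""
    noisy = voxels + torch.rand_like(voxels) * 1e-6
    return logit_transform(noisy / voxel_max)

def layer_energy_condition(voxels, threshold=1e-4, noise_std=2e-7, s=1e-6):
    """Eqs. (B.5)-(B.7): threshold cut, additive Gaussian noise (scale
    assumed; cf. Tab. 4), sum over voxels, then log10(E + s) + 1."""
    cut = torch.where(voxels >= threshold, voxels, torch.zeros(()))
    e_i = (cut + noise_std * torch.randn_like(cut)).sum(dim=-1)
    e_i = e_i.clamp_min(0.0)        # clip negatives introduced by the noise
    return torch.log10(e_i + s) + 1
```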

B.3 Postprocessing causal flows
The postprocessing used in Ref. [14] ensures, via a renormalization of the $\mathcal{I}_i$, that the energies per layer of the returned showers are approximately those that the energy distribution flow dictates. Usually, one is interested in showers that have an energy threshold applied, since the inherent electronic noise of the detector implies that energies that are too small cannot be converted into a signal that can be read out. This thresholding reduces the deposited energy per layer below the energy dictated by the energy distribution flow. However, the study of Ref. [14] is based on the dataset from Ref. [7], where the energies from both the passive absorber layers and the active detector layers are assumed to be available, and applying an energy threshold to the generated showers barely made a difference in the energies per layer. This work uses a realistic sampling calorimeter, where the energies from the passive absorber layers are unavailable. For this reason, the voxel depositions are much smaller, and the cut makes a non-negligible difference. Hence, a new postprocessing was needed for this work. In a nutshell, the new postprocessing checks how many of the dimmest voxels need to be set to zero such that the renormalized remaining voxels are all above the threshold.
We define $\Sigma(\mathcal{I}_i) := \sum \mathcal{I}_i$, the sum over the voxels $\mathcal{I}_i$ in layer $i$, and $\mathcal{I}_{i, \geq t}$ as the voxels $\mathcal{I}_i$ thresholded by $t$, i.e. all voxels less than $t$ in layer $i$ are set to zero. The postprocessed voxels are then given by
$$\mathcal{I}_i^\text{pp} := E_i \cdot \mathcal{I}_{i, \geq t} \,/\, \Sigma(\mathcal{I}_{i, \geq t}), \qquad (B.8)$$
where $t$ is set by requiring
$$E_i \cdot t \,/\, \Sigma(\mathcal{I}_{i, \geq t}) = \text{desired threshold}. \qquad (B.9)$$
As a result, all generated voxel energies of layer $i$ sum to $E_i$ after the threshold cut is applied. Figure 13 illustrates the effect of the postprocessing. There, we see the distribution of energies in layers 2, 8, 14, and 26 as given by different algorithms: in light green, the distributions as given by the energy distribution flow; in orange, the distributions of raw, generated showers without a threshold cut (which only scatter around the $E_i$ of the energy distribution flow, even though the flows were conditioned on them); in red, the distributions of raw showers after the CaloFlow postprocessing (renormalization and subsequent threshold cut); and in blue, the distributions of the same showers using our postprocessing. We clearly see that a simple application of the threshold cut after the renormalization distorts the distribution towards lower energies. This effect is stronger in the outer layers of the calorimeter, where the overall scale of the energy depositions is smaller.
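In pseudocode, the procedure reads as follows (a minimal sketch of Eqs. (B.8) and (B.9); variable names are illustrative):

```python
import numpy as np

def postprocess_layer(voxels, e_layer, desired_threshold=1e-4):
    """Zero the dimmest voxels and rescale the rest to sum to e_layer, such
    that every surviving voxel ends up above desired_threshold,
    cf. Eqs. (B.8) and (B.9). voxels: 1D array of one generated layer."""
    for t in np.sort(voxels[voxels > 0]):      # candidate cuts, dimmest first
        kept = np.where(voxels >= t, voxels, 0.0)
        total = kept.sum()
        # Eq. (B.9): the smallest kept voxel, rescaled, must reach threshold.
        if total > 0 and e_layer * t / total >= desired_threshold:
            return e_layer * kept / total      # Eq. (B.8)
    return np.zeros_like(voxels)               # no voxel survives
```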

C Classifier Tests: Architectures and Details
Both the Geant4 vs BIB-AE and the Geant4 vs L2LFlows classifiers are fully-connected neural networks with the same architecture and hyperparameters: They consist of four hidden layers with 4096, 512, 64 and 8 nodes, each with the LeakyReLU activation function using a slope of 0.01 for inputs smaller than 0. The output layer has 2 nodes, and we use the cross-entropy loss. Each output node can be interpreted as the likelihood of a given sample belonging to Geant4 or the BIB-AE/L2LFlows; therefore, the likelihood ratio is available in the binary classification setup. In total, every classifier has 14.4M parameters. The classifier is trained on input with double precision. The learning rate is set to $10^{-4}$, the batch size to 256, and both classifiers are trained for a total of 50 epochs. The final model is chosen as the one with the highest validation accuracy. Unlike Ref. [14], which considers the ECal voxel energies, the (log-transformed) deposited energies per ECal layer, as well as the (log-transformed) incident energy as input to the classifiers, we believe the use of the deposited energies per layer to be redundant, as they are already encoded in the ECal voxel energies. Thus, we make use of 3001 input features, where the incident energies are log-transformed as in Eq. (B.3). The voxel energies have the half-MIP cutoff applied.
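A sketch of this classifier in PyTorch (our reconstruction from the layer sizes above; the optimizer choice is an assumption, and the training loop is omitted):

```python
import torch
from torch import nn

# 3001 inputs: 3000 voxel energies (half-MIP cutoff applied) plus the
# log-transformed incident energy; two output logits (Geant4 vs generator).
classifier = nn.Sequential(
    nn.Linear(3001, 4096), nn.LeakyReLU(0.01),
    nn.Linear(4096, 512), nn.LeakyReLU(0.01),
    nn.Linear(512, 64), nn.LeakyReLU(0.01),
    nn.Linear(64, 8), nn.LeakyReLU(0.01),
    nn.Linear(8, 2),
).double()                                    # inputs are double precision

optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-4)  # assumed Adam
loss_fn = nn.CrossEntropyLoss()               # batch size 256, 50 epochs
```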
The only difference between the Geant4 vs BIB-AE/L2LFlows classifiers and the BIB-AE vs L2LFlows classifier is that, for convergence reasons of the validation accuracy, the latter is trained for 100 epochs.

Figure 1 :
Figure 1: Architecture of the energy distribution flow.

Figure 2 :
Figure 2: Architecture of the causal flows. As mentioned in the main text, NF 0 does not make use of an embedding network for the conditioning. The postprocessing is explained in detail in App. B.

Figure 4 :
Figure 4: Overlay of 95k showers for all simulators for the full spectrum, where the voxel energies are summed along the $x$- (top), $y$- (middle) and $z$-axis (bottom). In all plots, the mean over the number of showers is taken. For Geant4, the colormap shown is the energy scale, whereas for the BIB-AE and L2LFlows, the colormap (both generative networks make use of the same one) corresponds to the relative deviations from Geant4, defined in Eqs. 4.1 and 4.2.

Figure 5 :
Figure 5: Distributions comparing Geant4 (grey), the BIB-AE (orange) and L2LFlows (blue). Left: Distribution of voxel energies with shower incident energies uniformly distributed between 10 and 100 GeV, based on 95k showers for every model. Right: Number of voxels above half the MIP cutoff for 4k showers of 20, 50, and 80 GeV photons each for every model.

Figure 6 :
Figure 6: Comparisons of the energy profiles of Geant4 (grey), the BIB-AE (orange) and L2LFlows (blue). The upper row of plots shows the energy profiles in the $x$-direction, the center row shows the profile along the $y$-direction, and the lower row shows the profile along the $z$-direction. The left-hand plots show the profiles for showers caused by 20 GeV photons, and the right-hand plots show the profiles for 80 GeV photons, covering both the high- and low-energy regions of the dataset. At each discrete energy, Geant4, the BIB-AE and L2LFlows are plotted with 4k samples each.

Figure 9 :
Figure 9: 2D histograms comparing correlations between selected sets of variables for Geant4, the BIB-AE and L2LFlows. The upper row of plots shows the ratio of the deposited energy $E_\text{depos}$ to the incident energy $E_\text{inc}$ as a function of $E_\text{inc}$. While a perfect calorimeter would have a constant ratio, in practice, because of leakage, the curve falls off over the range. The center row shows the number of voxels in which energy was deposited versus the total deposited energy. The lower row shows the center of gravity in the $z$-direction versus the total deposited energy. All plots are shown for the full spectrum with 95k showers for every model.

Figure 10 :
Figure 10: Pearson correlation matrices for Geant4 (upper left), L2LFlows (upper right), the difference between the Geant4 and BIB-AE correlations (bottom left) and the difference between the Geant4 and L2LFlows correlations (bottom right). For every simulator, 95k showers are used. A description of the variables can be found in the text.

Figure 11 :
Figure 11: Energy depositions per layer (summed over all ECal voxels) for Geant4 (grey), the BIB-AE (orange) and L2LFlows (blue) in different layers. All plots are shown for the full spectrum with 95k showers for every simulator.

Table 2 :
Classifier results for different numbers of showers, where the left column shows the number of showers per simulator used for the classifier tests (a 60% : 20% : 20% split is made to obtain training, validation and test showers for the classifiers). The middle and right columns show the mean and standard deviation of the AUC over 10 independent runs for the Geant4 vs L2LFlows and Geant4 vs BIB-AE classifiers, respectively. Since the mean AUC for the BIB-AE over 10 independent runs is already very close to 1 for 95k showers, more showers are only used for the Geant4 vs L2LFlows classifiers.