CaloClouds II: ultra-fast geometry-independent highly-granular calorimeter simulation

Fast simulation of the energy depositions in highly granular detectors is needed for future collider experiments at ever-increasing luminosities. Generative machine learning (ML) models have been shown to speed up and augment the traditional simulation chain in physics analysis. However, the majority of previous efforts were limited to models relying on fixed, regular detector readout geometries. A major advancement is the recently introduced CaloClouds model, a geometry-independent diffusion model, which generates calorimeter showers as point clouds for the electromagnetic calorimeter of the envisioned International Large Detector (ILD). In this work, we introduce CaloClouds II, which features a number of key improvements. These include continuous time score-based modelling, which allows for a 25-step sampling with comparable fidelity to CaloClouds while yielding a 6× speed-up over Geant4 on a single CPU (5× over CaloClouds). We further distill the diffusion model into a consistency model, allowing for accurate sampling in a single step and resulting in a 46× speed-up over Geant4 (37× over CaloClouds). This constitutes the first application of consistency distillation for the generation of calorimeter showers.


Introduction
Accurate simulations of particle physics experiments are crucial for comparing theory predictions with experimental results. With the planned high luminosity upgrade to the Large Hadron Collider (LHC) [1] and other envisioned collider experiments like those at the International Linear Collider (ILC) [2], experimental data is going to be taken at ever-increasing rates. The amount of simulated events needs to keep up with these rates, which is difficult to achieve with current Monte Carlo simulations and the projected computing budgets at large experiments [3,4].
Most previous generative calorimeter models rely on a fixed data geometry, representing calorimeter showers as 3-dimensional images with the energy as the "color" channel and each pixel representing a calorimeter sensor. Modern high granularity calorimeters consist of many thousands of sensor cells or more (e.g. 6 million for the planned CMS HGCal [64]), but a given shower often deposits energy in only a small fraction of cells, resulting in very sparse 3d image representations. Hence, it is much more computationally efficient to only simulate the actual energy depositions with a generative model. This can be achieved by describing the shower with only the coordinates and energies deposited, i.e. a point cloud. Such a multidimensional calorimeter point cloud can be represented by four features, the three-dimensional spatial coordinates and the cell energy, with the number of points equivalent to the number of cells containing hits.
In addition to computational efficiency, such point cloud showers have the major advantage that they can represent not only cell energies, but also much more granular Geant4 step information, i.e. simulated energy depositions in the material, not accessible in experiments. Such Geant4 step point clouds are largely independent of the cell structure within a layer of a given calorimeter, effectively allowing the translation-invariant projection of the shower into any part of the calorimeter, regardless of cell type. These projections with Geant4 step point clouds are less likely to produce artifacts due to gaps or cell staggering than cell-level point clouds would, resulting in a largely geometry-independent description of the calorimeter shower. This approach is complementary to a geometry-aware model [65], which is trained with a dataset containing various calorimeter geometries.
Previous point cloud and graph generative models explored in particle physics [55, 58-61, 63, 66] were only used for relatively small numbers of points. However, energetic calorimeter showers in high granularity calorimeters consist of $\mathcal{O}(1000)$ points. To generate such showers, we recently introduced CaloClouds [40], a generative model able to accurately generate photon showers in the form of point clouds with several thousands of points (namely clustered Geant4 steps), in order to achieve geometry-independence. Since then, a specific comparison between a generative model for fixed geometry and a generative model for point cloud structured calorimeter showers on cell-level was performed in Ref. [41].
This CaloClouds architecture consists of multiple sub-models with a diffusion model (see Sec. 3.1 for details) at its core. Most diffusion models, including the one used in CaloClouds, are currently held back by their slow sampling speed, as many evaluation steps have to be performed to generate events. However, recent advances in computer vision achieve very high generative fidelity on natural image data with $\mathcal{O}(10)$ model evaluations using advanced training paradigms and novel ordinary and stochastic differential equation solvers [67][68][69][70]. In this work, we first leverage recent advances in the training and sampling procedure of diffusion models in order to generate samples with the CaloClouds II model using far fewer model evaluations than the original CaloClouds model, by following the diffusion paradigm introduced in Ref. [68].
Another research direction to speed up generative models is the distillation of diffusion models into models which require significantly fewer function evaluations during sampling than the original model [71][72][73][74][75]. Recently, consistency models have been introduced as a novel kind of generative model allowing for single and multi-step data generation [76]. These consistency models can either be trained ab-initio or distilled from an already trained diffusion model. We demonstrate the ability to distill our diffusion model into a consistency model, thereby allowing data generation with a single model evaluation, leading to further speed-ups.
In summary, the proposed CaloClouds II contains the following adjustments:
1. The previously used discrete-time diffusion process is replaced with the continuous-time diffusion paradigm introduced in Ref. [68]. This allows for fewer diffusion iterations during sampling.
2. The common latent space is removed, as we have noticed no advantage for the generative fidelity when generating photon calorimeter showers. This removal yields a simplified model architecture and improved training and sampling speeds.
3. We add a calibration of the energy per calorimeter layer and a calibration of the center of gravity in the $X$- and $Y$-direction of the generated point cloud showers. This replaces the previous total energy calibration and improves the generative fidelity in the longitudinal energy distribution.
4. Further, we apply consistency distillation to distill the diffusion model into a consistency model [76], allowing single step generation and therefore greatly improved sampling speed. We refer to this model as CaloClouds II (CM).
In Sec. 2 we describe the point cloud dataset used for training and evaluation. The diffusion paradigm and model components of the CaloClouds II model are explained in Sec. 3. We compare the generative fidelity of CaloClouds II and its variant to the original CaloClouds model in Sec. 4 and draw our conclusions in Sec. 5.

Data Samples
To compare the performance of our improved CaloClouds II model we use the same dataset as in Ref. [40]. The data describes a calorimeter shower in the form of a point cloud. Each calorimeter shower consists of energy depositions of photons showering in a section of the highly granular electromagnetic calorimeter (ECAL) of the envisioned International Large Detector (ILD) [77]. As a sampling calorimeter, it consists of 30 layers with passive tungsten material and active silicon sensors. All individual silicon layers consist of small 5 × 5 mm² readout cells with a thickness of 0.5 mm. Between the first 20 active layers in the longitudinal direction there are passive layers with a thickness of 2.1 mm, and between the remaining 10 layers the passive layers have a thickness of 4.2 mm. We simulated the dataset with Geant4 Version 10.4 (using the QGSP_BERT physics list) implemented in the iLCSoft framework [78]. The simulated geometric model is implemented in DD4hep [79] and includes realistic gaps between the sensors and position dependent irregularities. More simulation details can be found in Ref. [40].
During the full Geant4 simulation up to 40,000 individual energy depositions originating from secondary particles traversing the active sensor material are registered (depending on the incident photon energy). These energy depositions are commonly referred to as Geant4 steps. All steps that fall into the volume of the same sensor are subsequently summed, resulting in the energy deposited in a cell hit. These cell hits (up to 1,500 at 90 GeV) are then used for downstream analysis, as they carry the same low-level information that is measurable in a real experimental setting.
Ideally, a generative model should produce cell-level hits to make the full Geant4 simulation more computationally efficient. Cell-level information is also generated in all other approaches for fast calorimeter shower simulation with generative machine learning models. However, generating discrete cell hits directly in the form of a point cloud is challenging, as minor imperfections such as generating multiple points in the same calorimeter cell can heavily impact the generative fidelity in various high level observables like the total number of cell hits $N_{\text{hits}}$.
Therefore, it could be advantageous to generate point clouds not on hit-level but on Geant4 step-level, i.e. many simulated very granular energy depositions per cell, resulting in much larger point clouds where points are continuously distributed in space (as opposed to discrete cell hits). Yet, we found generating a point cloud with up to 40,000 steps prohibitively expensive from a computational point of view. Additionally, such a high resolution is not necessary for good generative fidelity. Therefore, in Ref. [40] we introduced a middle ground: we cluster the up to 40,000 Geant4 steps into up to 6,000 points. For this clustering, the steps are grouped into their layer and their energy is binned in an ultra-high granularity grid with 36× higher granularity than the cell resolution, resulting in a square grid size of 0.83 × 0.83 mm². This results in a clustered point cloud of up to 6,000 points, which is sufficiently small to be generated with the CaloClouds model, yet distributed in discrete positions with sufficiently small separation so as to be approximately a continuous point distribution in 3d space.
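The clustering described above can be sketched in a few lines. This is a minimal illustration only, not the procedure of Ref. [40]: the function name `cluster_steps`, the tuple layout of a step, and the choice to place each clustered point at the centre of its grid bin are assumptions made for the example. The grid size follows from the text: a 5 mm cell split 6 × 6 gives 36 sub-cells of about 0.83 mm.

```python
from collections import defaultdict
import math

CELL_SIZE_MM = 5.0                      # readout cell edge length (from the text)
SUBDIVISIONS = 6                        # 6 x 6 = 36 sub-cells per readout cell
GRID_MM = CELL_SIZE_MM / SUBDIVISIONS   # about 0.83 mm


def cluster_steps(steps):
    """Cluster Geant4-like steps into an ultra-high granularity grid.

    steps: iterable of (x_mm, y_mm, layer, energy_mev) tuples (hypothetical layout).
    Returns one (x_mm, y_mm, layer, energy_mev) point per occupied grid bin.
    """
    binned = defaultdict(float)
    for x, y, layer, energy in steps:
        # Steps are grouped by layer and by transverse grid bin; energies are summed.
        key = (math.floor(x / GRID_MM), math.floor(y / GRID_MM), layer)
        binned[key] += energy
    # Assumption: each clustered point sits at the centre of its grid bin.
    return [((ix + 0.5) * GRID_MM, (iy + 0.5) * GRID_MM, layer, energy)
            for (ix, iy, layer), energy in sorted(binned.items())]


example_steps = [(0.1, 0.1, 0, 1.0), (0.2, 0.3, 0, 2.0),
                 (1.0, 0.1, 0, 0.5), (0.1, 0.1, 1, 4.0)]
clustered = cluster_steps(example_steps)
```

In this toy input the first two steps share a grid bin and are merged, so four steps become three points while the total energy is unchanged.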
In addition to a computationally efficient simulation, this makes the generated calorimeter point cloud largely independent of the actual cell layout of the calorimeter, unlike point clouds based on cell-level energy depositions. This ultra-high granularity calorimeter point cloud can be projected into any part of the calorimeter (except changing its depth) without introducing reconstruction artifacts due to, for example, gaps and cell staggering, as successfully shown in Ref. [40].
To produce the training set, a total of 524,000 showers were generated with Geant4, with an incident energy uniformly sampled between 10 and 90 GeV. Additionally, multiple test sets were generated: 40,000 showers uniformly distributed in energy for the figures shown in Sec. 4.1; 2,000 showers for the single energy plots at 10, 50, and 90 GeV; and 500,000 showers for calculating the evaluation metrics and the classifier score in Sec. 4.2 and Sec. 4.3.
Each point of the point cloud has four features: the $X$- and $Y$-position (transverse to the incident particle direction), the $Z$-position (parallel to the incident particle direction), and the energy. As a pre-processing step, the passive material regions are removed such that the point locations in the dataset also become continuous along the longitudinal $Z$-axis. The position features, $X$, $Y$, and $Z$, are each normalized to the range [−1, 1]. The energy feature of the 4d point cloud is given in MeV.
As it is important for downstream analyses to accurately simulate the behaviour of photon showers on the level of the physical geometry, i.e. at cell level, all results shown in Sec. 4.1 to 4.3 are on cell-level. To this end, the calorimeter point clouds (with up to 40,000 points for Geant4 and up to 6,000 points for those generated with CaloClouds or CaloClouds II) are binned to the realistic ILD ECAL layout (including detector irregularities and gaps) with 30 × 30 × 30 calorimeter cells.

Generative Model
The CaloClouds II model is an improved version of the original CaloClouds architecture from Ref. [40]. First, we revisit the main model components of the CaloClouds model; afterwards we outline the improvements made in CaloClouds II.
CaloClouds is a combination of two normalizing flows [80], a VAE-like encoder [81], and a discrete time Denoising Diffusion Probabilistic Model (DDPM) [37]. Specifically, it consists of the Shower Flow, a normalizing flow generating conditioning and calibration features; the EPiC Encoder, an encoder based on Equivariant Point Cloud (EPiC) layers [59] to encode calorimeter showers during training into a latent space for model conditioning; the Latent Flow, a normalizing flow trained to model the encoded latent space during sampling; and a diffusion model, called PointWise Net, which is a DDPM-based diffusion model generating each point independently and identically distributed (i.i.d.) based on a common latent space, incident energy and number of points conditioning. The models are implemented using PyTorch 1.13 [82].
In the following, we outline the differences between CaloClouds and CaloClouds II. The largest conceptual difference is the change of the diffusion paradigm. We move from a discrete time diffusion process (DDPM), in which training and sampling are performed with the same number of diffusion steps, to a continuous time diffusion paradigm based on Ref. [68], sometimes referred to as EDM diffusion or k-diffusion. This EDM diffusion allows for training a continuous time score function, which can be used to denoise any noise level, thereby separating the training and sampling procedure and allowing for sampling with various ordinary differential equation (ODE) and stochastic differential equation (SDE) solvers and different step sizes. Crucially, it allows trading off sampling speed against sampling fidelity without retraining. We find good performance with the 2nd-order Heun ODE solver and the step size parameterisation suggested in Ref. [68]. Additional details on the diffusion paradigm are given in the following Sec. 3.1.
As a second change, CaloClouds II simplifies the original model. We noticed that for the photon calorimeter shower point clouds we are generating in this study, the shared latent space between points is not necessary for high generative fidelity. Therefore the latent dimensionality can be set to zero, so the EPiC Encoder and the Latent Flow are removed. By discarding the latent space we achieve a simpler model as well as improved training and sampling efficiency.
Next, the Shower Flow for generating conditioning and calibration features is expanded to generate the total number of points, the total visible energy, the relative number of points and energy of each calorimeter layer in the $Z$-direction, as well as the center of gravity in the $X$- and $Y$-direction. This flow is conditioned on the incident particle energy only. The total number of points generated per shower is used, together with the incident particle energy, for the conditioning of the PointWise Net diffusion model.
Overall, the Shower Flow is composed of ten blocks, each with seven coupling layers [83, 84] conditioned on the incident particle energy. It is implemented using the Pyro package [85]. The Shower Flow is trained once for 350k iterations and used for all three models (CaloClouds, CaloClouds II, and CaloClouds II (CM)) compared in Sec. 4.
The post-diffusion calibration expands upon the calibration in Ref. [40]: the number of hits per layer is calibrated by ordering all points in the $Z$-coordinate and iteratively setting the first $N_{Z,i=1}$ points to $Z = 1$ (first layer), the second $N_{Z,i=2}$ points to $Z = 2$ (second layer), and so on until the 30th layer. Afterwards, we calibrate the total layer energy by re-weighting each point energy such that the energies in layer $i$ sum up to the predicted layer energy $E_{\text{pred},i}$. Finally, we calculate the center of gravity in the $X$- and $Y$-direction of the point cloud and subtract its difference to the predicted center of gravity from each point's $X$- and $Y$-coordinate to calibrate the overall point cloud center of gravity in these two dimensions. Note that we further set points with negative generated energy to zero.
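The three calibration steps above can be illustrated with a short sketch. It is not the authors' implementation: the function name `calibrate_shower`, the point layout `(x, y, z, energy)`, and the use of absolute (rather than relative) per-layer predictions are assumptions for the example, and the order in which negative energies are clipped is a guess.

```python
def calibrate_shower(points, n_per_layer, e_per_layer, cog_x_pred, cog_y_pred):
    """Apply a post-diffusion calibration to one generated shower.

    points: iterable of (x, y, z, energy); n_per_layer / e_per_layer: predicted
    number of points and energy per layer (hypothetical absolute values).
    Returns a list of [x, y, layer, energy] lists.
    """
    # Points with negative generated energy are set to zero (done first here).
    pts = [[x, y, z, max(e, 0.0)] for x, y, z, e in points]
    if sum(n_per_layer) != len(pts):
        raise ValueError("predicted points per layer must sum to the number of points")

    # 1) Order points in z and assign the first N_1 points to layer 1, and so on.
    pts.sort(key=lambda p: p[2])
    start = 0
    for layer, (n, e_pred) in enumerate(zip(n_per_layer, e_per_layer), start=1):
        block = pts[start:start + n]
        e_sum = sum(p[3] for p in block)
        for p in block:
            p[2] = layer
            # 2) Re-weight energies so that the layer sums to the predicted energy.
            if e_sum > 0.0:
                p[3] *= e_pred / e_sum
        start += n

    # 3) Shift x and y so the energy-weighted centre of gravity matches the prediction.
    e_tot = sum(p[3] for p in pts)
    if e_tot > 0.0:
        shift_x = sum(p[0] * p[3] for p in pts) / e_tot - cog_x_pred
        shift_y = sum(p[1] * p[3] for p in pts) / e_tot - cog_y_pred
        for p in pts:
            p[0] -= shift_x
            p[1] -= shift_y
    return pts


toy_points = [(0.0, 0.0, 0.9, 1.0), (1.0, 0.0, 0.1, 1.0),
              (2.0, 2.0, 0.5, -3.0), (3.0, 1.0, 0.2, 3.0)]
calibrated = calibrate_shower(toy_points, [2, 2], [8.0, 2.0], 0.5, -0.5)
```

Because the shift in step 3 is the same for every point, it changes only the centroid and leaves the shower shape around it untouched.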

[Figure 1: schematic overview of the training and sampling procedure. Recovered labels: Generated Shower, Calibration, diffusion steps.]
During sampling, the number of points predicted by the Shower Flow is calibrated before being used for the conditioning of the Latent Flow and the PointWise Net. The calibrated number of points is given by $N_{\text{cal}} = N_{\text{uncal}} \cdot f_{\text{gen}}(f_{\text{data}}(N_{\text{uncal}}))$, where $f_{\text{data}}$ is a cubic polynomial fit of the ratio of the number of points $N_{\text{uncal, G4}}$ to the number of cell hits $N_{\text{cell, G4}}$ of the Geant4 showers and $f_{\text{gen}}$ is a fit of the ratio of the number of cell hits $N_{\text{cell, gen}}$ to the (uncalibrated) number of points $N_{\text{uncal, gen}}$ of a given model. Hence, this polynomial fit $f_{\text{gen}}$ is performed for each model separately. More details on the model components and the calibrations can be found in Ref. [40]. A schematic overview of the training and sampling procedure is shown in Fig. 1.
In the following Sec. 3.1 we describe the continuous time diffusion paradigm implemented in the CaloClouds II model and in Sec. 3.2 we outline its distillation into a consistency model, referred to as CaloClouds II (CM). Both models use the same model components outlined above. Details on the training and sampling hyperparameters are outlined in Sec. 3.3.

Diffusion Model
The diffusion model [34] used in the CaloClouds model is a Denoising Diffusion Probabilistic Model (DDPM) with the same discrete time steps during model training and sampling [37,86]. Since the introduction of DDPM, subsequent works, e.g. Refs. [38,68,87], have shown that it is advantageous to train a diffusion model with continuous time conditioning. This allows for a more flexible sampling regime for which various SDE and ODE solvers with either a fixed or an adaptive number of solving steps can be applied.
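In the following, we outline the key parts of a diffusion model based on the paradigm outlined in Ref. [68]. The training of a diffusion model starts by diffusing a data distribution $p_{\text{data}}(\mathbf{x})$ with an SDE [38] in the forward direction ("data" → "noise") via

$$\mathrm{d}\mathbf{x}_t = \boldsymbol{\mu}(\mathbf{x}_t, t)\,\mathrm{d}t + \sigma(t)\,\mathrm{d}\mathbf{w}_t, \tag{3.1}$$

where $t$ is a fixed time step defined in the interval $t \in [0, T]$ with $T > 0$ as a hyperparameter. $\boldsymbol{\mu}(\cdot, \cdot)$ and $\sigma(\cdot)$ denote the drift and diffusion coefficients, and $\{\mathbf{w}_t\}_{t \in [0, T]}$ is the standard Brownian motion. The diffused samples are distributed as $\mathbf{x}_t \sim p_t(\mathbf{x}) = p_{\text{data}}(\mathbf{x}) * \mathcal{N}(0, t^2\mathbf{I})$ (with $*$ as the convolution operator), and at time step zero the distribution is identical to the data distribution, $p_0(\mathbf{x}) = p_{\text{data}}(\mathbf{x})$. When reversing this diffusion process ("noise" → "data"), a so-called probability flow ODE emerges with a solution trajectory sampled at time step $t$ given by

$$\mathrm{d}\mathbf{x}_t = \left[\boldsymbol{\mu}(\mathbf{x}_t, t) - \tfrac{1}{2}\,\sigma(t)^2\,\nabla \log p_t(\mathbf{x}_t)\right]\mathrm{d}t, \tag{3.2}$$

with $\nabla \log p_t(\mathbf{x}_t)$ as the score function of $p_t(\mathbf{x})$. As suggested in Ref. [68], we set the coefficients in the SDE in Eq. 3.1 to $\boldsymbol{\mu}(\mathbf{x}, t) = 0$ and $\sigma(t) = \sqrt{2t}$ to ensure that $p_T(\mathbf{x})$ is close to the distribution $\mathcal{N}(0, T^2\mathbf{I})$. Since the exact analytical score function is usually unknown, we train a neural network with weights $\theta$ as a score model $s_\theta(\mathbf{x}, t) \approx \nabla \log p_t(\mathbf{x}_t)$ to get the empirical probability flow ODE:

$$\frac{\mathrm{d}\mathbf{x}_t}{\mathrm{d}t} = -t\, s_\theta(\mathbf{x}_t, t).$$

For the purpose of numerically stable scaling behaviour, we follow Ref. [68] and actually train a separate network $F_\theta$ with $t$-dependent skip connections. It defines a denoiser $D_\theta$ from which $s_\theta$ is derived:

$$D_\theta(\mathbf{x}, t) = c_{\text{skip}}(t)\,\mathbf{x} + c_{\text{out}}(t)\,F_\theta\big(c_{\text{in}}(t)\,\mathbf{x},\, t\big), \qquad s_\theta(\mathbf{x}, t) = \frac{D_\theta(\mathbf{x}, t) - \mathbf{x}}{t^2}.$$

The coefficients are time dependent and control the skip connection via $c_{\text{skip}}(t) = \sigma_{\text{data}}^2 / (\sigma_{\text{data}}^2 + t^2)$, the input scaling via $c_{\text{in}}(t) = 1 / \sqrt{\sigma_{\text{data}}^2 + t^2}$, and the output scaling via $c_{\text{out}}(t) = t \cdot \sigma_{\text{data}} / \sqrt{\sigma_{\text{data}}^2 + t^2}$, as defined in Ref. [68]. The hyperparameter $\sigma_{\text{data}}$ corresponds roughly to the standard deviation of $p_{\text{data}}(\mathbf{x})$ and is set to $\sigma_{\text{data}} = 0.5$. During training a random time step is drawn from the continuous noise distribution $\ln(t) \sim \mathcal{N}(P_{\text{mean}}, P_{\text{std}}^2)$, with $P_{\text{mean}} = -1.2$ and $P_{\text{std}} = 1.2$ (the default parameters chosen in Ref. [68]), and the loss is given by:

$$\mathcal{L} = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}},\; t,\; \mathbf{n} \sim \mathcal{N}(0, t^2\mathbf{I})}\left[\lambda(t)\,\big\lVert D_\theta(\mathbf{x} + \mathbf{n}, t) - \mathbf{x}\big\rVert_2^2\right],$$

where $\lambda(t)$ is the time-dependent loss weighting of Ref. [68]. An illustration of this training process can be found in Fig. 1a.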
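The preconditioning and the noise-level sampling can be written out as a small sketch. It uses the coefficient definitions of Ref. [68] with the values quoted in the text ($\sigma_{\text{data}} = 0.5$, $P_{\text{mean}} = -1.2$, $P_{\text{std}} = 1.2$); the function names, the scalar inputs, and the `network` callable standing in for $F_\theta$ are assumptions for illustration and not the authors' code.

```python
import math
import random

SIGMA_DATA = 0.5           # from the text
P_MEAN, P_STD = -1.2, 1.2  # from the text (defaults of Ref. [68])


def c_skip(t):
    return SIGMA_DATA ** 2 / (SIGMA_DATA ** 2 + t ** 2)


def c_out(t):
    return t * SIGMA_DATA / math.sqrt(SIGMA_DATA ** 2 + t ** 2)


def c_in(t):
    return 1.0 / math.sqrt(SIGMA_DATA ** 2 + t ** 2)


def loss_weight(t):
    # Weighting of Ref. [68]; it equals 1 / c_out(t)^2.
    return (t ** 2 + SIGMA_DATA ** 2) / (t * SIGMA_DATA) ** 2


def sample_noise_level(rng):
    # ln(t) ~ N(P_MEAN, P_STD^2)
    return math.exp(rng.gauss(P_MEAN, P_STD))


def denoiser(network, x, t):
    # network(x_scaled, t) is a hypothetical stand-in for the trained F_theta.
    return c_skip(t) * x + c_out(t) * network(c_in(t) * x, t)


def training_loss(network, x_clean, rng):
    # Single-sample, scalar version of the denoising loss.
    t = sample_noise_level(rng)
    noisy = x_clean + t * rng.gauss(0.0, 1.0)
    return loss_weight(t) * (denoiser(network, noisy, t) - x_clean) ** 2


rng = random.Random(0)
example_loss = training_loss(lambda x, t: 0.0, 0.3, rng)
```

At $t \to 0$ the skip coefficient tends to one and the output scaling to zero, so the denoiser returns its input unchanged regardless of the network output.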
For sampling from the trained score model, one samples from noise at time step $T$ as $\mathbf{x}_T \sim \mathcal{N}(0, T^2\mathbf{I})$ and integrates the probability flow ODE in Eq. 3.2 over discrete time steps backwards in time using a numerical ODE solver. This results in a sample $\hat{\mathbf{x}}_0$ which provides a good approximation of a sample from the data distribution $p_{\text{data}}(\mathbf{x})$. In practice the solver is usually stopped at a small positive value $\epsilon > 0$ to avoid numerical instabilities, resulting in the approximate sample $\hat{\mathbf{x}}_\epsilon \approx \hat{\mathbf{x}}_0$. For our sampling, we use the suggested values and step scheduling from Ref. [68] with $T = 80$ and $\epsilon = 0.002$, and apply the 2nd-order Heun ODE solver.
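A minimal sketch of this sampling loop is given below, assuming the time-step schedule of Ref. [68] with its default exponent $\rho = 7$ (not stated in the text) and a final step to $t = 0$ without the 2nd-order correction. The names `karras_schedule`, `heun_sample`, and the analytic `gaussian_denoiser` (exact for zero-mean Gaussian data with standard deviation 0.5) are stand-ins for illustration; a real sampler would call the trained network instead.

```python
import math


def karras_schedule(n_steps, t_min=0.002, t_max=80.0, rho=7.0):
    """Time steps t_0 = t_max > ... > t_{n-1} = t_min, followed by a final 0."""
    hi, lo = t_max ** (1.0 / rho), t_min ** (1.0 / rho)
    ts = [(hi + i / (n_steps - 1) * (lo - hi)) ** rho for i in range(n_steps)]
    return ts + [0.0]


def heun_sample(denoiser, x, ts):
    """Integrate dx/dt = (x - D(x, t)) / t with Heun's 2nd-order method."""
    n_evals = 0
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        d_cur = (x - denoiser(x, t_cur)) / t_cur
        n_evals += 1
        x_next = x + (t_next - t_cur) * d_cur          # Euler predictor
        if t_next > 0.0:                               # no correction on the last step
            d_next = (x_next - denoiser(x_next, t_next)) / t_next
            n_evals += 1
            x_next = x + (t_next - t_cur) * 0.5 * (d_cur + d_next)
        x = x_next
    return x, n_evals


def gaussian_denoiser(x, t, sigma_data=0.5):
    # Toy stand-in: exact denoiser for data ~ N(0, sigma_data^2).
    return x * sigma_data ** 2 / (sigma_data ** 2 + t ** 2)


time_steps = karras_schedule(13)
x_final, n_function_evals = heun_sample(gaussian_denoiser, 80.0, time_steps)
```

With 13 solver steps the loop evaluates the denoiser twice per step except for the last one, which is how the count of model evaluations stays below twice the number of steps.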

Consistency Model
Consistency Models (CM) [76] are a recently introduced generative architecture. They allow for single-step or multi-step generation with the same model and can be trained standalone or distilled from a diffusion model that has already been trained. A consistency model $f_\Phi$ with weights $\Phi$ is trained to estimate the consistency function $f$ from data. The consistency function is defined as $f : (\mathbf{x}_t, t) \mapsto \mathbf{x}_\epsilon$ and is self-consistent in the sense that all pairs $(\mathbf{x}_t, t)$ belonging to the same probability flow ODE trajectory are mapped to the same output. This means that a function evaluation at any point on this trajectory leads to the same result, i.e. $f(\mathbf{x}_t, t) = f(\mathbf{x}_{t'}, t')$ for all $t, t' \in [\epsilon, T]$. The time interval is bounded by the minimum noise at time step $\epsilon$ and the maximum noise at time $T$.
For sampling from a trained consistency model in a single model pass, one initializes $\mathbf{x}_T \sim \mathcal{N}(0, T^2\mathbf{I})$ and performs one function evaluation to get $\mathbf{x}_\epsilon = f_\Phi(\mathbf{x}_T, T)$. It is also possible to sample with multiple model passes by first evaluating $f_\Phi(\mathbf{x}_T, T)$, and then adding noise again from $\mathcal{N}(0, t^2\mathbf{I})$ at an intermediate noise level $t$ to denoise a second time. This can be done in an alternating fashion for an arbitrary number of steps. Often multi-step generation appears to improve sample fidelity [63,68]; however, we are able to achieve comparable fidelity to the original diffusion model with only a single model evaluation and therefore limit ourselves to this most efficient scenario.
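Single-step and multi-step sampling can be illustrated with a toy consistency function. The sketch below is an assumption-laden example, not the trained CaloClouds II (CM) model: `toy_consistency_fn` is the exact consistency function for zero-mean Gaussian data with standard deviation 0.5 under the probability flow ODE, and the re-noising rule in `sample_multi_step` follows the multi-step algorithm of Ref. [76].

```python
import math
import random

SIGMA_DATA, EPS, T_MAX = 0.5, 0.002, 80.0


def toy_consistency_fn(x, t):
    """Exact map (x_t, t) -> x_eps for data ~ N(0, SIGMA_DATA^2); a toy stand-in."""
    return x * math.sqrt((SIGMA_DATA ** 2 + EPS ** 2) / (SIGMA_DATA ** 2 + t ** 2))


def sample_single_step(f, rng):
    x_T = rng.gauss(0.0, T_MAX)       # x_T ~ N(0, T^2)
    return f(x_T, T_MAX)              # one model evaluation


def sample_multi_step(f, rng, taus):
    x = sample_single_step(f, rng)
    for tau in taus:                  # decreasing intermediate noise levels
        x_tau = x + math.sqrt(tau ** 2 - EPS ** 2) * rng.gauss(0.0, 1.0)
        x = f(x_tau, tau)             # denoise again
    return x


def sample_std(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


rng = random.Random(1)
single = [sample_single_step(toy_consistency_fn, rng) for _ in range(5000)]
multi = [sample_multi_step(toy_consistency_fn, rng, [1.0]) for _ in range(5000)]
```

For this toy case both sampling modes reproduce the spread of the data distribution, and the single-step mode needs exactly one function evaluation per sample.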
In line with Ref. [76], we found improved training fidelity when distilling the consistency model from a diffusion model instead of training it individually. For this purpose we distill the consistency model $f_\Phi(\mathbf{x}, t)$ from the diffusion model $s_\theta(\mathbf{x}, t)$ based on the PointWise Net of CaloClouds II introduced in the previous Sec. 3.1. The distillation is performed by separating the continuous time space $[\epsilon, T]$ into $N - 1$ sub-intervals (we use $N = 18$). The interval boundaries are determined by the same step size parameterisation as in the diffusion model sampling formulation [68]. During training a random boundary time step $t_n$ with $n \in [1, N]$ is chosen to perform the distillation. We refer to the original diffusion model here as the teacher model $s_\theta(\mathbf{x}, t)$ and to the distilled consistency model during distillation as the student model $f_\Phi(\mathbf{x}, t)$. Additionally, we call the final distilled consistency model the target model $f_{\Phi^-}(\mathbf{x}, t)$. We use the self-consistency property of the consistency model for training, since it requires a well trained model to obey $f_\Phi(\mathbf{x}_{t_{n+1}}, t_{n+1}) = f_\Phi(\mathbf{x}_{t_n}, t_n)$. The neighboring points $(\mathbf{x}_{t_{n+1}}, \mathbf{x}_{t_n})$ on the probability flow ODE trajectory are obtained by sampling $\mathbf{x} \sim p_{\text{data}}$, adding noise to it to get $\mathbf{x}_{t_{n+1}} \sim \mathcal{N}(\mathbf{x}, t_{n+1}^2\mathbf{I})$, and performing one ODE solver step with the teacher diffusion model $s_\theta$ from $t_{n+1}$ to $t_n$ to compute $\hat{\mathbf{x}}_{t_n}$. This allows the student consistency model $f_\Phi(\mathbf{x}, t)$ to be trained with the loss:

$$\mathcal{L}(\Phi, \Phi^-) = \mathbb{E}\left[d\big(f_\Phi(\mathbf{x}_{t_{n+1}}, t_{n+1}),\; f_{\Phi^-}(\hat{\mathbf{x}}_{t_n}, t_n)\big)\right],$$

where $d(\cdot, \cdot)$ is a distance metric [76]. The weights $\Phi^-$ of the target model $f_{\Phi^-}(\mathbf{x}, t)$ are updated after every iteration as a running average of the student model weights $\Phi$. An overview of the distillation procedure is illustrated in Fig. 2.
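One distillation iteration can be sketched as follows. This is a scalar toy version under several assumptions: the squared difference stands in for the distance metric, the teacher is an analytic Gaussian denoiser rather than the trained PointWise Net, the student and target are both the exact toy consistency function, and all function names are hypothetical.

```python
import math

SIGMA_DATA, EPS = 0.5, 0.002


def teacher_denoiser(x, t):
    # Toy teacher: exact denoiser for data ~ N(0, SIGMA_DATA^2).
    return x * SIGMA_DATA ** 2 / (SIGMA_DATA ** 2 + t ** 2)


def teacher_heun_step(x, t_cur, t_next):
    """One Heun step of the probability flow ODE from t_cur down to t_next."""
    d_cur = (x - teacher_denoiser(x, t_cur)) / t_cur
    x_euler = x + (t_next - t_cur) * d_cur
    d_next = (x_euler - teacher_denoiser(x_euler, t_next)) / t_next
    return x + (t_next - t_cur) * 0.5 * (d_cur + d_next)


def distillation_pair(x_data, unit_noise, t_hi, t_lo):
    x_hi = x_data + t_hi * unit_noise           # x_{t_{n+1}} ~ N(x, t_{n+1}^2)
    x_lo = teacher_heun_step(x_hi, t_hi, t_lo)  # teacher estimate of x_{t_n}
    return x_hi, x_lo


def consistency_loss(student, target, x_hi, t_hi, x_lo, t_lo):
    # Assumption: squared difference as the distance metric.
    return (student(x_hi, t_hi) - target(x_lo, t_lo)) ** 2


def ema_update(target_weights, student_weights, mu):
    """Running average of the student weights, applied after every iteration."""
    return [mu * w_t + (1.0 - mu) * w_s
            for w_t, w_s in zip(target_weights, student_weights)]


def toy_consistency_fn(x, t):
    return x * math.sqrt((SIGMA_DATA ** 2 + EPS ** 2) / (SIGMA_DATA ** 2 + t ** 2))


x_hi, x_lo = distillation_pair(0.3, 1.0, t_hi=1.0, t_lo=0.8)
pair_loss = consistency_loss(toy_consistency_fn, toy_consistency_fn, x_hi, 1.0, x_lo, 0.8)
new_target = ema_update([1.0, 2.0], [3.0, 4.0], mu=0.5)
```

Since the toy student already satisfies self-consistency, the remaining loss only reflects the discretisation error of the single teacher step.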

Training and Sampling
The diffusion model in CaloClouds II was trained for 2M iterations with a batch size of 128 using the Adam optimizer [88] with a fixed learning rate of $10^{-4}$. As the final model, we use an exponential moving average (EMA) of the model weights. We scan several values for the number of ODE solver steps $N$ and find $N = 13$ optimal: with fewer steps, the generative fidelity of CaloClouds II, as probed by the correct learning of physically relevant shower shapes, deteriorates. This results in $2N - 1 = 25$ diffusion model evaluations, since the last step of the Heun ODE solver does not perform a 2nd-order correction. Compared to CaloClouds with 100 function evaluations, this already hints at a significant computational speed-up.
The diffusion model used in CaloClouds II was distilled into a consistency model for CaloClouds II (CM) by using the Adam optimizer with a fixed learning rate of $10^{-4}$ for 1M iterations with a batch size of 256. Notably, only a single training is necessary for distilling a model which is able to perform single step generation, as opposed to the multiple trainings required for e.g. progressive distillation [42,66,72].

Results
In the following, we compare the original CaloClouds model with the improved CaloClouds II model and its distilled variant CaloClouds II (CM). To achieve a fair comparison between the three models, we use the same training of the Shower Flow and the same calibration procedure for all three models. Hence, the Shower Flow from the CaloClouds II model was also used for generating samples with the CaloClouds model, a slight modification compared to the original CaloClouds model in Ref. [40]. This means that the samples generated with the CaloClouds model also include the energy per layer calibration and the center of gravity calibration in $X$ and $Y$. For the Latent Flow and the PointWise Net of CaloClouds the same model weights as in Ref. [40] were used.
We first show the performance of our generative models based on the same observables as discussed in Ref. [40] in Sec. 4.1. Next, in Sec. 4.2, we quantify the performance of the models with multiple Wasserstein-distance-based scores for the usual set of calorimeter shower observables, and in Sec. 4.3 we use a classifier to distinguish between simulated Geant4 showers and generated showers based on the calculated shower observables. Finally, in Sec. 4.4 we benchmark the computational efficiency of our models and compare them to the baseline simulation timing with Geant4.

[Figure caption fragment: All distributions are calculated with 40,000 events sampled with a uniform distribution of incident particle energies between 10 and 90 GeV. The bottom panel provides the ratio to Geant4. The error band corresponds to the statistical uncertainty in each bin.]

Physics Performance
In this Section, we compare various calorimeter shower distributions from Ref. [40] between the Geant4 test set and datasets generated using CaloClouds, CaloClouds II, and CaloClouds II (CM). First, we compare various cell-level and shower observables calculated from the model generated showers to Geant4 simulations with samples of incident photons with energies uniformly distributed between 10 and 90 GeV (also referred to as full spectrum). In Fig. 3 we investigate three representations of the energy distributed in the calorimeter cells, namely the per-cell energy distribution (left), the radial shower profile (center) and the longitudinal shower profile (right). The per-cell energy distribution contains the energy of the cells of all showers in the test dataset. The peak of the distribution at about 0.2 MeV corresponds to the most probable energy deposition of a minimum ionising particle (MIP) in the silicon sensor. For downstream analyses a cell energy cut at half a MIP is applied, since below this threshold the sensor response is indistinguishable from electronic noise. Hence this cut was applied to all showers when calculating the shower observables and scores in this section. All models describe the cell energy distribution reasonably well. For most of the range the CaloClouds II models perform better than CaloClouds; however, CaloClouds II produces a few outliers with energies that are too high compared to the other two models.
The radial shower profile describes the average radial energy distribution around the central shower axis (in $Z$-direction) of the ECAL. Below a radius of about 180 mm, the distribution is well described by all three models; above 180 mm, the models deviate from Geant4. Overall the CaloClouds II (CM) model represents the Geant4 distribution most closely. Note that this is a distribution that is not directly impacted by any of the post-diffusion calibrations performed and is therefore a good benchmark for the effectiveness of the point cloud diffusion approach alone.
The longitudinal shower profile describes how much energy is deposited on average in each of the 30 calorimeter layers. In the previous iteration of CaloClouds it was not well modelled, but since we now calibrate the energy per layer with the improved Shower Flow for the generated point clouds it is well modelled. However, we observe deviations in the first few layers for all three models. Since they share the same Shower Flow, we expect future improvements in this model to translate to an improved longitudinal profile. Further, a small outlier can be seen for the CaloClouds II model around the 10th layer. The alternating higher and lower energy depositions per layer are due to the fact that, for technical reasons, pairs of silicon sensors surrounding one tungsten absorber layer and facing opposite directions are installed into a tungsten structure with every other absorber layer. This results in the observed pair-wise difference in the sampling fraction between consecutive layers.

In Fig. 4 we show the center of gravity distribution $m_{1,i}$ with $i \in \{X, Y, Z\}$ (the energy weighted shower centroid) in the $X$-, $Y$-, and $Z$-directions. Note that in the $X$- and $Y$-directions these distributions are calibrated for the original point cloud, before the cell-level observables are calculated. While the $m_{1,X}$ distribution is very well modelled by all three generative models, $m_{1,Y}$ is slightly shifted to lower center of gravity values for all models, with the CaloClouds distribution additionally being marginally too narrow. The centers of gravity in $X$ and $Y$ behave slightly differently, as a magnetic field is simulated in the $Y$-direction and the active sensors are staggered in the $Y$-direction while they are all aligned in the $X$-direction. Due to the number of hits and energy per layer calibrations applied, the distribution of $m_{1,Z}$ is very well modelled. Only in the region around 1925 mm is the CaloClouds II model slightly worse than the other two models. Overall, the three models are reasonably close to the Geant4 simulation in all six observables.

Next, we investigate the models' fidelity for single incident photon energies of 10, 50, and 90 GeV. In Fig. 5 we show the distributions of the total visible energy (left) and the distributions of the number of hits (right; cells with deposited energy above the half MIP threshold) for the three single energy datasets of 2k showers each. The total energy is well represented by all three models. The number of hits, on the other hand, is one of the most difficult distributions to represent well with a point cloud generative model. Here high fidelity is still achieved by applying the number of points calibration discussed in Sec. 3. Overall the CaloClouds distributions are slightly too wide, as was observed already in Ref. [40]. In comparison, CaloClouds II represents the shape of the distribution better, yet in particular for 10 and 90 GeV showers the mean is a bit too large for the CaloClouds II (CM) generated events. This can be explained by the nature of the polynomial fit used for the number of points calibration: the fit does not perform very well at the edges of the incident energy space. It is known that extrapolation is difficult for generative models; therefore we conjecture that with a training set including lower and higher energies, the fidelity at 10 and 90 GeV would approach the performance at 50 GeV. Overall the CaloClouds II models perform very well and are comparable in fidelity to the CaloClouds model.

Evaluation Scores
We next investigate the performance of all three CaloClouds models by calculating scores from the high-level calorimeter shower observables considered in the previous Section. This allows us to quantify the fidelity observed in the plots presented in Sec. 4.1 rather than relying only on comparing distributions by eye.
The following observables are considered in order to calculate the one-dimensional scores: the number of hits (cells with energy depositions above the half MIP threshold) $N_\mathrm{hits}$, the sampling fraction (the ratio of the visible energy deposited in the calorimeter to the incident photon energy) $E_\mathrm{vis}/E_\mathrm{inc}$, the cell energy $E_\mathrm{cell}$, the center of gravity in the $X$-, $Y$-, and $Z$-directions $m_{1,i}$, $i \in \{X, Y, Z\}$, and ten observables each for the longitudinal energy $E_{\mathrm{long},i}$, $i \in [1, 10]$, and for the radial energy $E_{\mathrm{radial},i}$, $i \in [1, 10]$. The ten observables for the longitudinal (radial) energy depositions are computed with the energy clustered in consecutive layers (concentric regions) such that on average all ten observables $E_{\mathrm{long},i}$ and $E_{\mathrm{radial},i}$ are computed with the same statistics. Further details on these 20 observables in total can be found in Appendix A.
Table 1 (Geant4 row): 0.7 ± 0.2, 0.8 ± 0.2, 0.9 ± 0.4, 0.7 ± 0.8, 0.7 ± 0.1, 0.9 ± 0.1, 1.1 ± 0.3, 0.9 ± 0.3
To compare the distributions of these observables between Geant4 and the three generative models, we calculate the 1-Wasserstein distance $W_1$, also known as the earth mover's distance, between each pair of distributions. The advantages of the Wasserstein distance are that it is an unbinned estimator, that it is computationally efficient to calculate for one-dimensional distributions, and that no hyperparameter choices have to be made apart from the number of events used for comparison. Following earlier works using Wasserstein distance based model evaluation scores to compare generative models [55,58], we calculate the distance between observables calculated from 50k Geant4 and 50k model generated showers. This is done 10× for independent uniformly distributed samples and we report the mean and standard deviation of the scores in Tab. 1. For this purpose, we simulated 500k Geant4 samples and generated 500k showers with each CaloClouds model. To have all scores in a similar order of magnitude, we standardize each observable before we calculate the $W_1$ score. For the layer energy and radial energy scores, $W_1^{E_\mathrm{long}}$ and $W_1^{E_\mathrm{radial}}$, we report the average Wasserstein distance over all ten bins. The hit energy score $W_1^{E_\mathrm{cell}}$ is calculated for 50k cell hits. In addition to the generative model scores, we also calculate the scores for Geant4 itself, comparing 50k Geant4 showers to a separate set of 50k Geant4 showers.
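The scoring procedure described above can be sketched as follows. This is a minimal standard-library illustration, not the evaluation code used for Tab. 1: the function names are hypothetical, the standardization is assumed to use the mean and standard deviation of the reference (Geant4) sample, and the toy data are Gaussian numbers at a much smaller scale than the 10 × 50k showers of the text. For equal-size samples the 1-Wasserstein distance reduces to the mean absolute difference of the sorted values; in practice a library routine such as `scipy.stats.wasserstein_distance` could be used instead.

```python
import random
import statistics


def wasserstein_1d(a, b):
    """Unbinned 1-Wasserstein distance between two equal-size 1D samples."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    # For equal-size samples, W1 is the mean absolute difference of order statistics.
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def w1_score(reference, generated, n_sub, n_repeats, seed=0):
    """Mean and standard deviation of W1 over disjoint random subsamples."""
    if min(len(reference), len(generated)) < n_sub * n_repeats:
        raise ValueError("not enough showers for disjoint subsamples")
    # Assumption: standardize both samples with the reference mean and std.
    mean = statistics.fmean(reference)
    std = statistics.pstdev(reference)
    ref = [(v - mean) / std for v in reference]
    gen = [(v - mean) / std for v in generated]
    rng = random.Random(seed)
    rng.shuffle(ref)
    rng.shuffle(gen)
    scores = []
    for k in range(n_repeats):
        chunk = slice(k * n_sub, (k + 1) * n_sub)
        scores.append(wasserstein_1d(ref[chunk], gen[chunk]))
    return statistics.fmean(scores), statistics.stdev(scores)


# Toy observables: one "model" matches the reference, the other is shifted.
toy_rng = random.Random(1)
toy_reference = [toy_rng.gauss(0.0, 1.0) for _ in range(2000)]
toy_same = [toy_rng.gauss(0.0, 1.0) for _ in range(2000)]
toy_shifted = [toy_rng.gauss(0.5, 1.0) for _ in range(2000)]
same_mean, same_std = w1_score(toy_reference, toy_same, n_sub=200, n_repeats=10)
shifted_mean, shifted_std = w1_score(toy_reference, toy_shifted, n_sub=200, n_repeats=10)
```

Comparing a sample with a statistically independent sample from the same distribution, as done for the Geant4 reference scores, gives a non-zero value that indicates the noise floor of the score at the chosen sample size.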
As can be seen in Tab. 1, most model scores are quite close together. We observe a few outliers: in the sampling fraction score $W_1^{E_\mathrm{vis}/E_\mathrm{inc}}$ the CaloClouds and CaloClouds II (CM) models are much better than the CaloClouds II model, and in the radial energy score $W_1^{E_\mathrm{radial}}$ the CaloClouds II models outperform CaloClouds, which is in line with the histogram shown in Fig. 3. Overall, CaloClouds II (CM) appears to produce higher fidelity showers than the other two models, since it has the best score in four of the scores and does not exhibit any large outliers compared to the other two models. However, as can also be seen in the histograms in Sec. 4.1, none of the scores, with the exception of one of the center of gravity scores $W_1^{m_{1,i}}$, quite reaches the fidelity of the Geant4 truth itself. Hence we conclude that while all three models generate high fidelity ECAL showers, they should be further improved to match Geant4 exactly in the future.
As a side note, the Wasserstein distance can be heavily impacted by outliers in the distributions. Therefore it does not always correlate well with the distribution shape observed in histograms. However, the scores complement the visual inspection of the histograms and distributions shown in Sec. 4.1 well.
While useful for comparing generative architectures, 1-Wasserstein distances only consider each dimension of the problem individually. Of course, a successful generative model should also accurately describe higher order correlations. We investigate this in the next Section.
We further compare the model generated showers to the Geant4 simulation by training a fully connected high-level classifier on the shower observables discussed in Sec. 4.2 to distinguish between model generated and Geant4 simulated showers. The 25 input shower observables are the ten radial and ten longitudinal energy observables, as well as the three center of gravity variables, the number of hits, and the total visible energy. For the datasets, we use 500k Geant4 showers and 500k showers generated by each generative model. An 80%/10%/10% data split is applied, resulting in a training set of 800k showers and a validation and a test set with 100k showers each.
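The data split can be sketched as below, scaled down from 2 × 500k to 2 × 500 showers. This is only an illustration under stated assumptions: the label convention (0 for Geant4, 1 for model generated), the function name, and the use of a seed to obtain a different split per classifier training are hypothetical choices.

```python
import random


def split_indices(n_total, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle shower indices and split them into train/validation/test sets."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    n_train = round(fractions[0] * n_total)
    n_val = round(fractions[1] * n_total)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


# Scaled-down stand-in for 500k Geant4 and 500k model generated showers.
n_geant4, n_model = 500, 500
labels = [0] * n_geant4 + [1] * n_model  # 0: Geant4, 1: model generated
train_idx, val_idx, test_idx = split_indices(len(labels), seed=0)
# A different seed gives a different split, e.g. one per repeated training.
train_idx_alt, _, _ = split_indices(len(labels), seed=1)
```

With 1M showers in total the same fractions give the 800k/100k/100k split quoted above.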
The classifier is implemented as a fully connected neural network with three layers (containing 32, 16, and 8 nodes, respectively) with LeakyReLU [89] activation functions, and one output node with a Sigmoid activation. It is trained with the Adam optimizer [88] for 10 epochs for each dataset using a binary cross-entropy loss. The final model epoch is chosen based on the lowest validation loss.
To evaluate the classifier we use the area under the receiver operating characteristic curve (AUC) score calculated on the test set. This kind of classifier score is also used in other publications evaluating generative models in high energy physics, such as Refs. [24,27,28,30,33,58,90]. If the classifier can perfectly separate the Geant4 and model generated datasets, it will result in an AUC = 1.0. For a generated dataset that is indistinguishable from the Geant4 simulation, we expect a confused classifier with an AUC = 0.5. Values in between are difficult to interpret in absolute terms, but can give a rough indication of how well the generative models are performing compared to each other. Note that it is already not trivial to implement a generative model that achieves AUC values below 1.0.
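To make the two limiting cases concrete, the following sketch computes the AUC from classifier outputs via the rank-sum (Mann-Whitney) formulation, which is equivalent to the area under the ROC curve. It is a standard-library illustration with a hypothetical function name and toy inputs; an established routine such as scikit-learn's `roc_auc_score` would normally be used.

```python
def roc_auc(labels, scores):
    """AUC from the rank-sum statistic; tied scores receive their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the group of tied scores starting at i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = 0.5 * (i + j) + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Perfect separation of the two classes.
auc_perfect = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
# A fully confused classifier that outputs the same score for every shower.
auc_confused = roc_auc([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5])
```

The AUC can be read as the probability that a randomly chosen model generated shower receives a higher classifier output than a randomly chosen Geant4 shower, which is why 0.5 corresponds to indistinguishable datasets.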
We trained the classifier ten times with a different train/validation/test data split each time. In Tab. 2 we present the mean AUC and standard deviation of these ten classifier trainings. The CaloClouds generated dataset performs the worst, leading to an almost perfect classification with AUC = 0.999. The two CaloClouds II variants both achieve a better score, clearly separated from AUC = 1.0. With AUC = 0.923, the CaloClouds II (CM) model performs slightly better than the CaloClouds II model. Both models thus remain separable from the Geant4 simulated showers for most events, but constitute a clear improvement over the baseline CaloClouds implementation. The better performance of the CaloClouds II variants is likely due to the improved radial energy distribution, as we observed a rather large deviation in the $W_1^{E_\mathrm{radial}}$ score and in the radial distributions in Fig. 6.
In this Section, we benchmark the average time to produce a single calorimeter shower with the three models considered and investigate the speed-up over the baseline Geant4 simulation. The timing results are presented in Tab. 3.

Hardware
On both a single CPU and an NVIDIA® A100 GPU we generated 25 × 2,000 showers with the same uniform energy distribution between 10 and 90 GeV. We report the mean and standard deviation of the time taken to generate these showers. The timing on a single CPU is particularly interesting for current applications of generative models in high energy physics, as CPUs are much more widely available than GPUs and the current computing infrastructure relies on simulations run on CPUs. Further, the single CPU timing facilitates a direct comparison to the Geant4 simulation. Here CaloClouds already yields a speed-up of 1.2×, but with fewer sampling steps CaloClouds II achieves a speed-up of 6.0×. With consistency distillation, the CaloClouds II (CM) model achieves a speed-up of 46×, even surpassing previous generative models on the same kind of dataset, such as the BIB-AE [20], by about a factor of 5.
On an NVIDIA® A100 GPU the CaloClouds model achieves a speed-up of 157×, CaloClouds II achieves 640×, and CaloClouds II (CM) achieves a 1873× speed-up over the baseline Geant4 simulation on a single CPU. Note that Geant4 is currently not compatible with GPUs and that GPUs are significantly more expensive than CPUs.
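The timing procedure and the speed-up arithmetic can be sketched as follows. The `generate_batch` callable is a hypothetical stand-in for a shower generator, and the dummy workload below only exercises the timing code; the speed-up factors are the rounded values quoted in the text.

```python
import statistics
import time


def time_per_shower(generate_batch, n_showers, n_runs):
    """Mean and standard deviation of the wall-clock time per shower over runs."""
    per_shower = []
    for _ in range(n_runs):
        start = time.perf_counter()
        generate_batch(n_showers)
        per_shower.append((time.perf_counter() - start) / n_showers)
    return statistics.fmean(per_shower), statistics.stdev(per_shower)


def relative_speed_up(speed_up_new, speed_up_old):
    """Speed-up of one model over another, given both speed-ups over one baseline."""
    return speed_up_new / speed_up_old


def dummy_generator(n_showers):
    # Placeholder workload; a real benchmark would call the generative model here.
    return sum(i * i for i in range(100 * n_showers))


mean_time, std_time = time_per_shower(dummy_generator, n_showers=200, n_runs=5)

# Single-CPU speed-ups over Geant4 quoted in the text (rounded values).
cc2_over_cc = relative_speed_up(6.0, 1.2)
cm_over_cc = relative_speed_up(46.0, 1.2)
```

Dividing the rounded factors gives about 5× for CaloClouds II over CaloClouds and about 38× for CaloClouds II (CM) over CaloClouds; the abstract quotes 5× and 37×, the small difference presumably arising from the rounding of the quoted speed-ups.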
For reference, the training of the CaloClouds model on similar NVIDIA® A100 GPU hardware took around 80 hours for 800k iterations with a batch size of 128, while training of the CaloClouds II model took around 50 hours for 2 million iterations with the same batch size. The consistency distillation for 1 million iterations with a batch size of 256 took about 100 hours.
The speed-up between CaloClouds and CaloClouds II results from the combination of the improved diffusion paradigm, which requires a reduced number of function evaluations, and the removal of the Latent Flow. The consistency model in CaloClouds II (CM) yields another large factor, since only a single model evaluation is performed. Both models would be slightly slower when applied in conjunction with the Latent Flow of the CaloClouds model, as one evaluation of the Latent Flow is about 50% slower than a single evaluation of the PointWise Net. For a large number of model passes of the PointWise Net in the diffusion framework, the cost of the Latent Flow is negligible. However, when we consider CaloClouds II (CM) with a single model pass, the application of the Latent Flow would have a noticeable impact on computational performance. Therefore, we removed the Latent Flow in favour of model efficiency, as we did not see any improvement in generative fidelity when using it in the CaloClouds II framework.

Conclusions
CaloClouds was the first generative model to achieve high-fidelity highly-granular photon calorimeter showers in the form of point clouds with a number of points of $\mathcal{O}(1000)$. Due to their sparsity, describing calorimeter showers as point clouds is computationally more efficient than describing them with fixed data structures, e.g. 3d images. Additionally, as the point clouds are based on clustered Geant4 steps, they allow for a translation-invariant and geometry-independent shower representation. Such cell-geometry-independent models could be easily adapted for fast simulations of calorimeters with non-square cell geometries, e.g. hexagonal cells as used in the envisioned CMS HGCAL [64].
With CaloClouds II we introduce a more streamlined version of CaloClouds utilizing the advanced diffusion paradigm from Ref. [68]. It allows for sampling with fewer model evaluations and for distillation into a consistency model. Using the consistency model in CaloClouds II (CM), generation with a single model evaluation is possible and results in a greatly improved computational efficiency and a speed-up of 46× over Geant4 on a single CPU. This single event CPU performance is particularly promising for introducing a generative model into existing Geant4-based simulation pipelines. As opposed to other diffusion distillation methods like progressive distillation, consistency distillation only requires a single training to distill the diffusion model in CaloClouds II into a single step generative model, further emphasising the computational advantage of the models presented here. To our knowledge, this constitutes the first application of a consistency model to calorimeter data.
We compare all three point cloud generative models using one-dimensional distributions and a classifier-based measure and find comparable performance with a slight advantage for the CaloClouds II variants. In particular, the CaloClouds II (CM) model exhibits superior performance while being significantly more computationally efficient. It is counter-intuitive that a distilled consistency model outperforms the original diffusion model; however, it is known that ODE solvers might introduce errors in earlier denoising steps that are then propagated to the generated samples [68]. The consistency model avoids this since we use it for single-shot generation. Yet, slight deviations from the Geant4 simulations are still visible in various shower observables. Further improvements could likely be achieved by investigating more complex architectures for the diffusion model such as fast transformer implementations [91], equivariant point cloud (EPiC) layers [59], or cross-attention [92].
During the completion of this manuscript, another EDM diffusion based model with subsequent consistency distillation was shown to achieve good fidelity when generating particle jets in the form of point clouds with up to 150 points [63]. While this is technically a similar approach, in our case the consistency model does not lose generative fidelity compared to the diffusion model and we demonstrate the generation of a factor of 40 more points (6000 vs. 150).
In conclusion, the CaloClouds II model generates high fidelity electromagnetic showers when benchmarked on various shower observables against the baseline Geant4 simulation. In combination with consistency distillation, the CaloClouds II (CM) model yields an accurate simulator which is significantly faster than Geant4 on identical hardware. This constitutes an important step towards the integration of point-cloud based generative models in actual simulation workflows.

A Radial and longitudinal energy observables
To explore the radial and longitudinal energy profiles shown in Fig. 3 further and to calculate the evaluation scores in Sec. 4.2, we define ten radial and ten longitudinal energy observables for the calorimeter showers.
For each profile, the ten observables are defined such that the energy clustered in each observable is based on an equal amount of statistics. Put differently, the energy is binned in ten quantiles with approximately the same number of cell hits in each quantile. The energy bins are defined by the quantiles calculated on the Geant4 test set with 40,000 events. While the bin edges are precisely defined for the radial energy, we round the bin edges of the longitudinal observables to the nearest integer layer number.
Histograms of the radial energy observables $E_{\mathrm{radial},i}$, $i \in [1, 10]$, are shown in Fig. 6 and of the longitudinal energy observables $E_{\mathrm{long},i}$, $i \in [1, 10]$, in Fig. 7. The bin edges for all observables are given in Tab. 4.
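The quantile-based definition of the observables can be illustrated with the following sketch, which derives bin edges from the positions of the cell hits and then sums the energy per bin for one shower. The function names and the toy data are hypothetical, and the simple index-based quantile estimate is an assumption; the actual bin edges are those given in Tab. 4.

```python
import bisect


def quantile_edges(hit_positions, n_bins=10):
    """Bin edges such that each bin holds roughly the same number of cell hits."""
    ordered = sorted(hit_positions)
    n = len(ordered)
    edges = [ordered[0]]
    for k in range(1, n_bins):
        edges.append(ordered[(k * n) // n_bins])
    edges.append(ordered[-1])
    return edges


def binned_energy(hit_positions, hit_energies, edges):
    """Sum the hit energies falling into each bin defined by the edges."""
    sums = [0.0] * (len(edges) - 1)
    for position, energy in zip(hit_positions, hit_energies):
        index = bisect.bisect_right(edges, position) - 1
        # Clamp so that hits on or beyond the outer edges land in the outer bins.
        index = max(0, min(index, len(sums) - 1))
        sums[index] += energy
    return sums


# Toy reference set: 100 hits at radii 0..99 (arbitrary units), unit energy each.
toy_radii = [float(r) for r in range(100)]
toy_energies = [1.0] * 100
radial_edges = quantile_edges(toy_radii, n_bins=10)
radial_observables = binned_energy(toy_radii, toy_energies, radial_edges)

# For longitudinal observables the edges would additionally be rounded to layers.
layer_edges = [round(edge) for edge in quantile_edges([0.2, 4.6, 9.4, 14.7], n_bins=2)]
```

Since the edges are fixed once on the reference set, the same binning is applied to Geant4 and to every generative model, so differences in the ten observables reflect differences in the energy profile rather than in the binning.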

Figure 1: Illustration of the training and sampling procedure of the CaloClouds II model. (a) During training, the model is trained at a random continuous time step $t$, conditioned on the shower energy $E$ and the number of points $N$. The loss, $\mathcal{L}_\mathrm{MSE}$, is approximated by a simple mean squared error (MSE) between the noised data and the denoised output. The scaling functions $c_\mathrm{in}$, $c_\mathrm{out}$, and $c_\mathrm{skip}$ are defined following Eq. 3.4. (b) During sampling, the $E$-conditional Shower Flow generates $N$ as well as shower observables for calibration. After an $N$ calibration, the PointWise Net iteratively denoises zero-mean Gaussian noise into a calorimeter shower. When sampling with CaloClouds II (CM), only one denoising step is performed.

Figure 2: Illustration of the consistency distillation process distilling the diffusion model of CaloClouds II (teacher model) into a consistency model (student and target model). The student model is updated via gradient descent and the target model is updated as an exponential moving average of the student model weights.
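The target-model update mentioned in the caption of Figure 2 can be sketched as a generic exponential moving average (EMA) of the student weights. The weights are represented as plain lists of floats and the decay value is a hypothetical choice for illustration; neither is taken from the CaloClouds II training setup.

```python
def ema_update(target_weights, student_weights, decay):
    """Move the target weights towards the student weights by a factor (1 - decay)."""
    return [
        decay * target + (1.0 - decay) * student
        for target, student in zip(target_weights, student_weights)
    ]


# Toy weights; in training, the student would change via gradient descent each step.
target = [1.0, 0.0]
student = [0.0, 1.0]
target_after_one_step = ema_update(target, student, decay=0.5)

# With a fixed student, repeated updates let the target converge to the student.
converged = target
for _ in range(60):
    converged = ema_update(converged, student, decay=0.5)
```

A decay close to one makes the target model a slowly varying average of recent student states, which is the usual motivation for using an EMA target during distillation.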

Figure 3: Histogram of the cell energies (left), radial shower profile (center), and longitudinal shower profile (right) for Geant4, CaloClouds, CaloClouds II, and CaloClouds II (CM). In the cell energy distribution, the region below 0.1 MeV is grayed out (see main text for details). All distributions are calculated with 40,000 events sampled with a uniform distribution of incident particle energies between 10 and 90 GeV. The bottom panel provides the ratio to Geant4. The error band corresponds to the statistical uncertainty in each bin.

Figure 4: Position of the center of gravity of showers along the $X$ (left), $Y$ (center), and $Z$ (right) directions. All distributions are calculated for 40,000 showers with a uniform distribution of incident particle energies between 10 and 90 GeV. The error band corresponds to the statistical uncertainty in each bin.

Figure 5: Visible energy sum (left) and number of hits (right) distributions for 10, 50, and 90 GeV showers. For each energy and model, 2,000 showers are shown. The error band corresponds to the statistical uncertainty in each bin.

Figure 6: Radial energy observables for 50,000 showers. The error band corresponds to the statistical uncertainty in each bin.

Figure 7: Longitudinal energy observables for 50,000 showers. The error band corresponds to the statistical uncertainty in each bin.

Table 3: Comparison of the computational performance of CaloClouds, CaloClouds II, and CaloClouds II (CM) to the baseline Geant4 simulator on a single core of an Intel® Xeon® CPU E5-2640 v4 (CPU) and on an NVIDIA® A100 with 40 GB of memory (GPU). 2,000 showers were generated with incident energy uniformly distributed between 10 and 90 GeV. Values presented are the means and standard deviations over 10 runs. The number of function evaluations (NFE) indicates the number of diffusion model passes.