End-to-end simulation of particle physics events with Flow Matching and generator Oversampling

The simulation of high-energy physics collision events is a key element for data analysis at present and future particle accelerators. The comparison of simulation predictions to data allows looking for rare deviations that can be due to new phenomena not previously observed. We show that novel machine learning algorithms, specifically Normalizing Flows and Flow Matching, can be used to replicate accurate simulations from traditional approaches with several orders of magnitude of speed-up. The classical simulation chain starts from a physics process of interest, computes energy deposits of particles and electronics response, and finally employs the same reconstruction algorithms used for data. Eventually, the data are reduced to some high-level analysis format. Instead, we propose an end-to-end approach, simulating the final data format directly from physical generator inputs, skipping any intermediate steps. We use particle jet simulation as a benchmark for comparing both discrete and continuous Normalizing Flows models. The models are validated across a variety of metrics to identify the most accurate. We discuss the scaling of performance with the increase in training data, as well as the generalization power of these models on physical processes different from the training one. We investigate sampling multiple times from the same physical generator inputs, a procedure we name oversampling, and we show that it can effectively reduce the statistical uncertainties of a dataset. This class of ML algorithms is found to be capable of learning the expected detector response independently of the physical input process. Their speed and accuracy, coupled with the stability of the training procedure, make them a compelling tool for the needs of current and future experiments.


Introduction
The simulation of high-energy physics (HEP) events is a necessary and complex task involving various steps. Comparing simulated predictions with actual data helps to notice unusual variations that might signal new, previously unseen phenomena, or allows measuring processes known for being extremely rare. At the Large Hadron Collider (LHC), billions of simulated collision events are necessary, and this number is expected to increase with the development of future detectors and accelerators [1]. The production of simulated samples usually starts from the outputs of a physics process generator (in the following referred to as gen) such as pythia [2], describing the list of particles resulting from the physical process at the collision point, and it ends with some analysis-ready data.
After gen, the next step, called simulation (or sim), based on frameworks such as geant4 [3], relies on numerical Monte-Carlo approaches to propagate the particles through the detector and compute the energy deposits in the particle sensors. This step is followed by the conversion of energy deposits into digital electronics readout signals (digi), reproducing the actual lowest-level output of the detector during data taking. Finally, the electronics signals are converted back to physics quantities with the very same reconstruction (reco) algorithms used for real detector data.
This conventional simulation chain solves the problem of sampling from P(reco|gen), where the analytical form of this probability density function (pdf) is unknown, and the various steps parameterize different contributions:

P(reco|gen) = P_sim(sim|gen) × P_dig(digi|sim) × P_rec(reco|digi). (1)

Given that the reconstruction step is common to actual data and simulation, its output is available in both and allows the data-to-simulation comparisons that are the key ingredients for HEP analysis. The data analysis typically starts with various data reduction steps where the physics information is further transformed into higher-level observables, for example combining information from multiple detectors or multiple particles into some derived quantities. The output of these additional steps is usually in a much simpler format than the one at the reconstruction level.
This type of approach serves as the backbone of the majority of simulation frameworks for the various HEP collaborations, and it has been proved to be accurate and adaptable to the various experimental scenarios. However, the time and computational resources taken by this procedure are substantial, and are expected to increase rapidly as we reach new frontiers of energy and luminosity.
Several approaches exist to mitigate this issue through the development of faster simulation frameworks. Examples are Delphes [6], a framework for fast simulation of a generic collider experiment, and various experiment-specific toolkits. Delphes uses parametric smearing on the generator-level information, going end-to-end by directly producing analysis-ready data. The application of ML to such an approach for analysis-specific applications is outlined in [7].
In the present work, we investigate the use of Normalizing Flows (NF), a type of generative machine learning (ML) algorithm, for performing end-to-end simulation, similar to Delphes, but with a higher degree of accuracy. Specifically, we focus on optimizing the choice of the NF model based on physics-aware metrics.

Figure 1: The present work investigates the capabilities of Normalizing Flows models to do end-to-end simulation, from the first step of this chain directly to the last one (figures taken from [4, 5]).

As in the
traditional simulation chain, we use random noise z as input, along with generator outputs (gen). The use of gen in input is called conditioning, as this information conditions the output of our model according to the different physical processes P(reco|gen). The idea is to go from the gen inputs to the final event description directly, by learning a surrogate f_ϕ of the pdf from datasets obtained with conventional simulations: f_ϕ(event|gen) ≈ P_Target(event|gen), where ϕ are the parameters of a Neural Network. In this work, we demonstrate that such an approach can retain a good amount of detail, and it is faster than conventional simulation, especially when accelerated with GPUs. The general problem of HEP simulation and the proposed approach are illustrated in Figure 1.
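The conditioning idea can be illustrated with a deliberately simplified sketch: a surrogate whose output depends on both random noise z and the gen input. The trained network f_ϕ of this work is replaced here by a hand-written affine map with invented constants; only the structure (noise in, gen-dependent output out) mirrors the text.

```python
import numpy as np

def conditional_surrogate(gen, n_samples, rng):
    """Toy stand-in for f_phi(event | gen): an affine map of base noise z
    whose parameters depend on the conditioning input gen.
    The gen-dependent shift and smearing are purely illustrative,
    not the paper's trained model."""
    z = rng.standard_normal(n_samples)        # base distribution B(z)
    scale = 0.1 * np.sqrt(gen["pt"])          # resolution grows with sqrt(pt)
    shift = 0.98 * gen["pt"]                  # small reconstruction bias
    return shift + scale * z                  # event = f(z; gen)

rng = np.random.default_rng(0)
reco_pt = conditional_surrogate({"pt": 100.0}, 10_000, rng)
```

Changing the conditioning input changes the sampled distribution, which is exactly the role played by the gen features in the full model.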
The usage of ML techniques for simulation has already been proposed in HEP. For a comprehensive review, we point the reader to [8]. As examples, the CMS FastSim simulation toolkit [9] is already using ML for both the parametrization of single steps and the refinement of final results [10]. The LHCb ultra-fast simulation framework Lamarr [11] revolves around the use of Generative Adversarial Networks. Generally speaking, ML models are often applied to single steps of the simulation chain, rather than in an end-to-end fashion: an example is the simulation of calorimeter response as done in [12, 13]. Oftentimes, existing approaches make use of more popular generative algorithms, such as Generative Adversarial Networks or Variational Autoencoders, rather than Normalizing Flows. Some convincing applications of NF have nonetheless been presented, for example by the ATLAS Collaboration in the context of photon simulation [14], by researchers studying anomaly detection at the LHC [15], and for usage within physical generators [16]. A general study on the robustness of NF has been presented in [17]. However, the usage of Flows is often limited to analysis- or process-specific applications. The use of newer architectures includes the exploration of Diffusion Models for jet simulation, as discussed in [18].
In this work, we present and discuss multiple NF models on a toy-dataset of particle jets, with a focus on assessing the most accurate class of models. The main contributions of this work are the following.
• We investigate both discrete NF, where the final transformation f_ϕ(event|gen) is made up of multiple, simple discrete transforms; and continuous NF, where the model learns a vector field, which is used in computing the trajectories of an ordinary differential equation (ODE) mapping noise to data. We introduce a set of physically-motivated metrics, and we use them to compare the models.
• We compare speed and accuracy for the better performing models. We investigate whether models trained on a small dataset can be used to produce a bigger dataset without introducing strong biases. We also show the generalization power of these algorithms, testing them on physical processes not seen during training.
• Finally, considering that the speed of these types of algorithms is even greater than that of physical generators, we discuss the possibility of producing multiple simulations starting from the same generator information gen, a procedure we name oversampling. As far as we know, this is the first time that such an approach has been proposed and discussed in the context of HEP simulation. We introduce a statistical procedure for handling histograms with oversampled data.

Related work
This work focuses on the optimization of Flow algorithms and has been performed on a toy-dataset mimicking an HEP experiment. This effort is related to the work within the CMS Collaboration to use these techniques on actual experiment datasets [19]. We expect that the outcome of our optimization will be a useful input to improve the CMS application of Normalizing Flows for end-to-end simulation.

Methods
Normalizing Flows (NF) are a class of generative machine learning models which aim to express the unknown data pdf P(x) starting from a simple base distribution B(z) (usually a Gaussian distribution) through an invertible mapping x = f_ϕ(z), with the data pdf recovered through the change-of-variables formula. We refer to [30] for a comprehensive review of existing NF algorithms.

Continuous Normalizing Flows
Continuous flows are a type of flow which maps the base distribution into the target one through a "continuous" transformation parametrized by a variable t ∈ [0, 1], such that P_{t=0}(z) = B(z) (the base noise distribution) and P_{t=1}(z) = P(x|g) (the unknown data distribution). At each given t, the current flow transformation f_{t,g} is fully specified by a vector field v_{t,g} through the ODE d f_{t,g}(z)/dt = v_{t,g}(f_{t,g}(z)), with f_{0,g}(z) = z, and the samples x = f_{t=1,g}(z) = z + ∫_0^1 v_{t,g}(f_{t,g}(z)) dt are obtained by integrating this ODE. It has been observed [31] that continuous NF have the advantage that the network is less constrained by the specific choice of a transformation and can be more expressive than discrete ones. Additionally, this type of transformation acts simultaneously on all the features, while for discrete flows we need to introduce coupling/autoregressive mechanisms. However, training this class of flows traditionally relied on many passes through the network to solve the ODEs, and it is often simpler to train discrete flows.
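As a minimal illustration of sampling from a continuous flow, the sketch below Euler-integrates dx/dt = v(t, x, g) from t = 0 to t = 1 for a toy vector field with a known exact solution; the learned neural field and the adaptive solvers used later in the text are replaced by a plain callable and a fixed-step loop.

```python
import numpy as np

def sample_continuous_flow(v, z, g, n_steps=100):
    """Map base samples z to data by Euler-integrating dx/dt = v(t, x, g)
    from t=0 to t=1. A stand-in for the ODE solvers (euler, dopri5, ...)
    discussed later; v is any callable vector field."""
    x = z.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * v(i * dt, x, g)
    return x

# Toy field with a known flow: dx/dt = g - x  =>  x(1) = g + (z - g)/e
v = lambda t, x, g: g - x
z = np.array([0.0, 1.0, -1.0])
x1 = sample_continuous_flow(v, z, g=2.0, n_steps=2000)
```

With enough steps the Euler trajectory converges to the exact t = 1 map, mirroring how solver choice trades accuracy for the number of network evaluations.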

Flow matching
A possible alternative to train continuous NF is the use of Flow Matching [32]. This novel approach consists in casting the training into a regression problem, which is simpler to address. The aim is to learn the vector field v_{t,g} as the output of the network. Because it is unknown, another vector field u_t is taken as the regression target, which defines a probability path P_t running from B(z) to P(x|g) as t increases. If P_t does this mapping well, then u_t is close to what we would like to learn, i.e. v_{t,g}. The intuition behind [32] is that u_t and P_t can be constructed on a sample-conditional basis, i.e., depending on the training sample x. The loss for the parameters ϕ of the network then becomes:

L(ϕ) = E_{t, x, z∼P_t(z|x)} ||v_{t,g}(z) − u_t(z|x)||², (2)

which is a simple regression loss. We can construct various types of probability paths and their associated vector fields u_t. In this work, we draw from recent developments in Flow Matching presented in [33], and distinguish between two main classes of paths.

Target conditional flow matching. This is the flow matching originally proposed by [32]. Let x be a single training data sample. It defines the probability path and the associated vector field as:

P_t(z|x) = N(z; t x, (1 − (1 − σ_min) t)² I), u_t(z|x) = (x − (1 − σ_min) z) / (1 − (1 − σ_min) t),

which is a probability path from the standard normal distribution (B(z) = P_0(z|x) = N(z; 0, I)) to a Gaussian distribution centred at x with standard deviation σ_min (P_1(z|x) = N(z; x, σ²_min I)).
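One sample-conditional training draw for the target-conditional path can be sketched as follows. This is a numpy stand-in that follows the σ_min-parametrized construction described above; the network, conditioning and batching details are omitted.

```python
import numpy as np

def target_cfm_pair(x, sigma_min, rng):
    """One training draw for target-conditional flow matching:
    sample t ~ U[0,1] and base noise z, form the point x_t on the
    conditional path P_t(.|x) and the target field u_t at that point.
    Shapes and interface are illustrative."""
    t = rng.uniform()
    z = rng.standard_normal(x.shape)
    # path point: mean t*x, std 1 - (1 - sigma_min)*t
    xt = t * x + (1.0 - (1.0 - sigma_min) * t) * z
    # target vector field evaluated at xt
    ut = (x - (1.0 - sigma_min) * xt) / (1.0 - (1.0 - sigma_min) * t)
    return t, z, xt, ut

def fm_loss(v_pred, ut):
    """The simple regression loss || v - u_t ||^2, averaged."""
    return np.mean((v_pred - ut) ** 2)

rng = np.random.default_rng(1)
x = np.array([1.0, -2.0])
t, z, xt, ut = target_cfm_pair(x, sigma_min=0.0, rng=rng)
loss = fm_loss(ut + 0.1, ut)
```

For σ_min = 0 the target field reduces to x − z, which makes the regression objective easy to check by hand.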
Basic form of conditional flow matching. This is the I-CFM discussed in [33]. In the I-CFM, we identify x with a pair of random variables, a source sample x_0 (the noise) and a target sample x_1 (the training data). We let the paths be Gaussian flows between x_0 and x_1 with standard deviation σ_min, defined by:

P_t(z|x_0, x_1) = N(z; t x_1 + (1 − t) x_0, σ²_min I), u_t(z|x_0, x_1) = x_1 − x_0.

We note that in this formulation there is no assumption that the base distribution B(z) is Gaussian, as was the case for the previous class of paths. This means that we can use I-CFM to create NF starting from other base distributions, such as the Uniform one.
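The I-CFM draw is even simpler, since the conditional field is constant along the straight line between a source and a target sample. The sketch below follows this construction; as in the previous snippet, the interface is invented for illustration.

```python
import numpy as np

def icfm_pair(x0, x1, sigma_min, rng):
    """One training draw for I-CFM: the conditional path is a Gaussian of
    width sigma_min around the straight line from a source sample x0 to a
    target sample x1, and the target field is simply x1 - x0. Note that x0
    may come from any base distribution (e.g. uniform), not only Gaussian."""
    t = rng.uniform()
    mu_t = (1.0 - t) * x0 + t * x1                       # interpolant mean
    xt = mu_t + sigma_min * rng.standard_normal(x1.shape)
    ut = x1 - x0                                          # constant field
    return t, xt, ut

rng = np.random.default_rng(2)
x0 = np.array([0.0, 1.0])   # e.g. a uniform or Gaussian base sample
x1 = np.array([2.0, -1.0])  # a training data sample
t, xt, ut = icfm_pair(x0, x1, sigma_min=0.0, rng=rng)
```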

Particle Jets dataset
We study the simulation of particle jets originating in proton-proton collisions. Note, however, that the approach presented in this work could be adapted to the simulation of any arbitrary physical objects and distributions. By combining the simulation of multiple objects, we can simulate an entire event end-to-end.
The ultimate aim of the approach presented is to learn the response function of detectors as accurately as that simulated with geant4-based toolkits. However, in this paper, we employed a toy-dataset for the comparison of various flow models. The geant4 simulation, which would be the realistic target, is replaced with Delphes-like smearing of generator-level quantities. In order to make the dataset more challenging for the NF models, we introduced several correlations and dependencies that are expected in a real detector, e.g. the dependency of the momentum measurement bias and resolution on the jet energy, or the correlation of such a resolution with the jet flavour. More details on the dataset and on the various processes simulated are given in Appendix A.
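The flavour- and momentum-dependent smearing can be sketched as below. All constants are invented for illustration; the actual toy-dataset introduces further correlations, detailed in Appendix A.

```python
import numpy as np

def smear_jet(gen_pt, flavour, rng):
    """Delphes-like toy response: reco pT = gen pT * (bias + resolution * noise),
    with a relative resolution that shrinks as gen pT grows and a
    flavour-dependent degradation. Constants are illustrative only."""
    res = 0.15 / np.sqrt(gen_pt / 15.0)   # relative resolution vs gen pT
    if flavour == "b":
        res *= 1.2                        # worse resolution for b jets
    return gen_pt * (0.99 + res * rng.standard_normal())

rng = np.random.default_rng(3)
resp_low = np.array([smear_jet(20.0, "q", rng) / 20.0 for _ in range(5000)])
resp_high = np.array([smear_jet(500.0, "q", rng) / 500.0 for _ in range(5000)])
```

The response resolution shrinking with gen pT is precisely the kind of conditional structure the flow must learn to reproduce.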
Eventually, the dataset we built consists of pairs of generator-level (gen) jets and associated reconstructed (reco) jets similar to those obtained after the full simulation chain of particle-matter interaction, digitization and reconstruction algorithms.
We created multiple datasets starting from the pythia generator, in proton-proton collisions, for 4 different physical processes: top quark pair production (pp → t t̄), Z boson production in association with jets (pp → Z+Jets), W boson pair production (pp → W W) and multiple jets production (pp → JJ + X). We clustered stable final-state particles in order to define generator-level jets through the FastJet [34] library (specifically the anti-k_T jet clustering algorithm [35], with R = 0.4). We kept only jets with p_T > 15 GeV (we use the HEP experiment convention, with Lorentz 4-vectors in cylindrical coordinates and the polar angle θ replaced by the pseudorapidity η = −ln tan(θ/2)). This strategy has the advantage of producing a target dataset complex enough to showcase the capabilities of flow-based simulation at a fraction of the complexity and computing cost of running a full simulation process. In the following section, we list the most notable features of our dataset, emphasizing non-trivial distributions and correlations introduced by our toy-physics simulation, which we would like our models to reproduce correctly. Table 1 shows the list of the 6 gen-level features which we give as input to our models (the conditioning of the pdf) and the two different sets of reco target variables: a basic one with 5 variables, and an extended one with 16 variables.
Among them, we emphasize the presence of pseudo b-tagging and c-tagging scores, which have been defined to simulate the performance of generic tagging algorithms. Their value is bounded between 0 and 1, where the b quark jet flavour is associated with higher b-tagging scores, but also with generally higher values of the c-tagging score, as the two quarks produce similar features in the jets, which can fool these types of algorithms. Light quarks and gluons are instead associated with lower scores. We show these distributions in Figure 2. Another correlation we introduced in the dataset is the one between the gen-level input p_T and the reconstructed one. If we consider the p_T response (ratio of the reco to the gen p_T), its resolution shrinks as the gen p_T increases.
Variables are preprocessed before training. Specifically, we standardized both input and target features by subtracting from each one its mean and dividing it by its standard deviation. Integer target variables were smeared with random uniform noise in [-0.5, 0.5], a process known as dequantization. Categorical variables, such as the jet flavour, were one-hot encoded into a series of 0/1 flags.
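The three preprocessing steps above can be sketched with numpy on toy arrays (the feature names and shapes are invented; only the operations follow the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Standardization: zero mean, unit variance per feature
x = rng.normal(50.0, 10.0, size=(1000, 3))
x_std = (x - x.mean(axis=0)) / x.std(axis=0)

# Dequantization: smear integer counts with uniform noise in [-0.5, 0.5]
n_constituents = rng.integers(1, 60, size=1000)
n_dequant = n_constituents + rng.uniform(-0.5, 0.5, size=1000)

# One-hot encoding of a categorical flavour flag
flavours = np.array(["b", "c", "l", "b"])
classes = np.array(["b", "c", "l"])
one_hot = (flavours[:, None] == classes[None, :]).astype(float)
```

Dequantization keeps each smeared value within half a unit of the original integer, so the discrete variable can be recovered by rounding after sampling.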

Validation metrics
We selected 6 different kinds of metrics to evaluate our models. The metrics chosen are the following.
• The 1-d Wasserstein score (WS) [36] and the two-sample Kolmogorov-Smirnov distance (KS), for comparing 1-d distributions between the target and the samples produced by the model. A WS is assigned to each variable.
• The Fréchet distance as a global measure. It is the distance between multivariate Gaussian distributions fitted to the features of interest, which [36] calls the Fréchet Gaussian Distance (FGD). It is generally called the Fréchet Inception Distance (FID) in image generation tasks: FGD = ||μ_t − μ_m||² + Tr(Σ_t + Σ_m − 2(Σ_t Σ_m)^{1/2}), where (μ_t, Σ_t) and (μ_m, Σ_m) are the means and covariances fitted to the target and model samples.
• Covariance matching: another global metric used to measure how well an algorithm is modelling the correlations between the various target features. Given the covariance matrices of the two samples, target and model, we compute the Frobenius norm of their difference, ||Σ_t − Σ_m||_F. Correlations in the model samples are also visually evaluated through the use of dedicated plots.
• As b- and c-tagging are such important tasks in the study of jets, we compute the receiver operating characteristic (ROC) curves for both scores. To quantify the performance of a model, we compute the difference in log-scale between the ROC coming from the model and that from the target distribution. Log-scale is used because the true positive rate (TPR) and false positive rate (FPR) span different orders of magnitude. We call this evaluation metric the Area Between the Curves (ABC).
• Finally, we implement a classifier two-sample test (c2st): we train a classifier to distinguish between training samples and samples coming from our models, giving the gen information as additional input. The output is the fraction P_c2st of samples which were incorrectly classified. For the optimal model, it has a maximum value of 0.5. We thus report our results as 0.5 − P_c2st: in this way, the best model has the lowest c2st value. We use a scikit-learn [37] HistGradientBoostingClassifier with default parameters as our classifier.
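Several of these metrics are one-liners on top of numpy/scipy. The sketch below computes the per-variable WS and KS, the FGD, and the covariance-matching norm on two toy Gaussian samples; it is our reading of the definitions above, not the paper's evaluation code.

```python
import numpy as np
from scipy import linalg, stats

def fgd(a, b):
    """Frechet Gaussian Distance between two (n_samples, n_features) arrays:
    fit a multivariate Gaussian to each, then
    ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^{1/2})."""
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    s_a, s_b = np.cov(a, rowvar=False), np.cov(b, rowvar=False)
    covmean = linalg.sqrtm(s_a @ s_b).real
    return np.sum((mu_a - mu_b) ** 2) + np.trace(s_a + s_b - 2 * covmean)

def covariance_matching(a, b):
    """Frobenius norm of the difference of the two covariance matrices."""
    return np.linalg.norm(np.cov(a, rowvar=False) - np.cov(b, rowvar=False), "fro")

rng = np.random.default_rng(0)
cov = [[1.0, 0.5], [0.5, 1.0]]
target = rng.multivariate_normal([0, 0], cov, 20_000)
model = rng.multivariate_normal([0, 0], cov, 20_000)
ws = stats.wasserstein_distance(target[:, 0], model[:, 0])  # 1-d WS, per variable
ks = stats.ks_2samp(target[:, 0], model[:, 0]).statistic    # 1-d KS, per variable
```

For two samples drawn from the same distribution all four numbers sit near zero, which is the behaviour expected of a well-trained flow.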

Training of models
We use the PyTorch [38] package, and the derived torchcfm [33, 39] package, for creating and training the models. Additionally, the dingo package [31] for analysing gravitational-wave data has been a major source of inspiration when designing the models. For Flow Matching training routines, we typically set σ_min = 10^−4 (see the path definitions above). We train for 1000 epochs. More details about the various model architectures, hyperparameters and naming conventions are available in Appendix B. The code for reproducing the datasets, along with the training and validation of the best model, is released here†.
For the continuous models, we observe no overfitting. As an example, we show in Figure 3 the loss for a continuous and a discrete model, together with the loss smoothed over a window of 15 epochs.

Initial model comparison
We explored different configurations of discrete and continuous NF models. In order to compare them, we computed 55 different metrics across the different target variables. We aggregate the Wasserstein and KS metrics by taking their mean across variables.
The main models compared were:
• Continuous ResNet Target (CRT), a continuous flow where a ResNet architecture learns the vector field, trained in a Target Flow Matching regime;
• Continuous ResNet Basic (CRB), same as before but with Basic Flow Matching;
• Continuous MLP Basic (CMB), same as before but with an MLP architecture;
• Discrete Affine Autoregressive (DAA), a discrete flow with an affine transform and an autoregressive conditioner, see [30];
• Discrete Affine Coupling (DAC), a discrete flow with an affine transform and a coupling conditioner.

† https://github.com/francesco-vaselli/FlowSim
We first performed a scan on the basic dataset, as we had many possible configurations of hyperparameters to test, especially in the discrete case. We used those results to perform a smaller set of informed trainings on the extended dataset. In particular, Figure 4 shows the behaviour of the median over the last 100 epochs of global metrics, such as the ABC and the KS mean, as a function of the number of transformation layers in autoregressive and coupling flows. The Affine Coupling with 5 layers and the Affine Autoregressive with 20 layers emerge as the best discrete architectures and have been retrained on the extended dataset, along with the continuous flow models, for our final comparison.
For the extended case, we computed the metrics for each model every 10 epochs and took their median over the last 10 evaluations. As the differences in performance for some models are close to the epoch-to-epoch oscillations, we computed the significance Δ_metric / √(σ²_diff + σ²_0), where Δ_metric is the difference between the median values of the model and the reference, σ²_diff is the variance of the difference, and σ_0 is the measured value of the metric when comparing two simulations of the same gen data. The CRT model has been selected as the reference, being the better performing one, as seen in Figure 5.
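The significance used for this comparison is a one-line computation; the numbers below are illustrative, not measured values from the paper.

```python
import numpy as np

def significance(model_metric, ref_metric, sigma_diff, sigma_0):
    """Delta_metric / sqrt(sigma_diff^2 + sigma_0^2): how many deviations
    separate a model's metric from the reference, with sigma_0 the
    irreducible floor measured between two simulations of the same gen data."""
    return (model_metric - ref_metric) / np.sqrt(sigma_diff**2 + sigma_0**2)

s = significance(model_metric=1.0, ref_metric=0.5, sigma_diff=0.3, sigma_0=0.4)
```

Including σ_0 in the denominator prevents a metric difference smaller than the simulation-to-simulation spread from being read as a real performance gap.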

Heatmap of Models and Metrics
Figure 5: This plot quantifies how far each model's performance is from the CRT baseline. We compute the distance between the metrics of the model and the baseline, and then divide by the sum in quadrature of the deviations. By averaging results across all metrics, we get the average significance, telling us how many deviations separate two models. Negative values indicate lower performance than the baseline, positive values a better one. The number of trainable parameters for each model is reported in parentheses.
Figure 5 shows the comparison across all the metrics. In particular, the last column shows the average of the metrics, and we can observe that the CRT, CRT bigger and CRB models are the best performing and very close to each other. We also notice that continuous flows consistently achieve better results than discrete ones. Furthermore, they do so using a much smaller number of parameters, at least an order of magnitude lower. The "CMB small" model outperforms all discrete flows, despite having fewer than 20k parameters, compared to almost a million for the latter.
A comparison between different models during training is illustrated in Figure 6, which shows the behaviour of the FGD metric for different models as a function of the epoch. We can see how continuous models converge faster and to smaller values compared to the discrete ones.
Another important factor to take into account is the generation (or sampling) speed of the different types of models. We compare model sampling speeds in Figure 7, sampling with different batch sizes on a single NVIDIA GeForce RTX 4060 GPU. As expected, the DAC models are the fastest by a wide margin, thanks to the use of coupling layers. On the other hand, the CRT performance is in a similar range to that of the DAA models. For continuous models, the sampling speed is strongly influenced by the choice of the ODE solver. To show this, we tested our models with the Dormand-Prince of order 5 (dopri5), Dormand-Prince of order 8 (dopri8) and Euler of order 1 (euler) solvers, with different absolute and relative tolerances, and we found them to give compatible results. For the simplest solver, the CRT model is as fast as the Discrete Autoregressive one. The size of the models also plays a role, as seen in Figure 7, which shows the CRT and CRT bigger sampling speeds side by side to showcase the impact of model size on generation speed. To put the speed of our approach into context, we should consider that a single particle physics event has tens of objects to be simulated, and conventional simulation approaches can take tens of seconds to produce a single event. It is then clear that a multiple-kHz speed on single objects translates into orders of magnitude of speed-up with respect to conventional approaches; see Section 4.4.
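The accuracy/cost trade-off between solvers can be seen even without a neural field: a high-order method reaches far smaller error than Euler for the same number of steps, at the price of more field evaluations per step. The sketch below uses classic RK4 as a fixed-step stand-in for the adaptive dopri solvers mentioned above.

```python
import numpy as np

def euler(v, z, n_steps):
    """Fixed-step Euler: one field evaluation per step."""
    x, dt = z.copy(), 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * v(i * dt, x)
    return x

def rk4(v, z, n_steps):
    """Classic Runge-Kutta 4: four field evaluations per step,
    but fourth-order accuracy instead of first-order."""
    x, dt = z.copy(), 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        k1 = v(t, x)
        k2 = v(t + dt / 2, x + dt * k1 / 2)
        k3 = v(t + dt / 2, x + dt * k2 / 2)
        k4 = v(t + dt, x + dt * k3)
        x = x + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return x

v = lambda t, x: -x                      # exact flow: x(1) = z / e
z = np.array([1.0, 2.0])
err_euler = np.abs(euler(v, z, 20) - z / np.e).max()
err_rk4 = np.abs(rk4(v, z, 20) - z / np.e).max()
```

For a sampling workload, this means the per-sample cost is governed by the number of field (network) evaluations the chosen solver needs to hit the requested tolerance.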

Results
In the following, we show results obtained by sampling from a separate test split of 650k jets for the best model (CRT). We used the dopri5 solver, with absolute and relative tolerances of 10^−5. Figure 8 shows the comparison of 1-d distributions between the target dataset and the flow model results. In particular, we show the b- and c-tagging distributions by input flavour: we can see that the flow is learning to correctly reproduce the different responses of the tagging algorithm according to the input flavour flag. A similar level of convergence is obtained for all other distributions, such as the number of constituents or the p_T, shown in Figure 9. The full results for 1-d distributions are shown in Appendix B, Figure B1. The bin-per-bin ratio between the model and the target across all flavours is reported in the bottom panel. We show in Figure 10 a plot highlighting the correlations among a subset of target variables. We can see that non-trivial correlations, such as those between the b- and c-tagging algorithms, are correctly reproduced. The same is true for the correlation between the generator-level p_T and the reconstructed one, shown in Figure 11. The model correctly reproduces the dependencies of both the response and the resolution on the jet momentum. A crucial figure of merit for assessing how well the model is taking into account the generator-level input is the ROC curve. It can be constructed from both the b- and c-tagging distributions. While the ROC from our toy-dataset is not that of an actual experiment, it is nonetheless important to reproduce it correctly, especially in the low-FPR regime (e.g. ∼10^−2). In Figure 12 we can see the comparison between the target and model ROCs. In order to put it into context, and to guide the eye, we also added a band around the target ROC, representing the typical data vs simulation differences at the LHC. The band shows 10% differences in the FPR at a TPR of ∼50% (see, e.g., the CMS experiment report [40]). We can see that the
Flow ROC consistently falls inside the band, with an ABC smaller than that computed between the Target ROC and one side of the band.

Scaling of performance on training data
We investigated the scaling of results for the same model trained on different amounts of training data. We trained the best model, CRT, on 10k, 50k, 500k, 1M and 10M training samples, with a batch size of 512. We then tested on a separate test split of 1M events. As expected, we observe a general improvement in results as the training split gets bigger. Figure 13 shows how the ABC decreases as we increase the training split size. Similarly, Figure 13 also shows how the c-tagging distribution is better modelled, especially in the tails, when using a bigger training dataset.
We note that increasing the training dataset yields smaller and smaller gains after a certain point: the ABC performance of the model trained on 10 million samples is compatible with, if not worse than, that of the model trained on 1 million. We can see that when the model is trained on a bigger dataset, the performance in the tails is modelled better than when using a smaller one.

Performance on different physical processes
In order to verify how well the present approach generalizes to physical processes different from the training one, we tested the best model (trained on t t̄) on Z+Jets, WW and multi-Jet datasets never seen during training.
The results demonstrate good performance on different processes, as shown in Figure 14, which contains the correlation plot for the WW dataset. Despite tails and correlations different from those of the physics process used in training, they are all reproduced correctly. Similar results are observed for tagging as well, as shown in Figure 15. However, some variables show a slight bias on some datasets. As an example, Figure 16 shows the number-of-constituents distributions, displaying a slight bias visible in the ratio plot. This is likely due to the presence of some generator-level information which is crucially correlated with this observable, but which we do not give as input to our model. For example, some jets originate from hadronic tau lepton decays, and such information is missing from the model inputs, while the fraction of tau jets changes between physical processes. A full comparison of the metrics is presented in Table B4, Appendix B.

Oversampling
We demonstrated that the flow-based simulation provides a fast and accurate detector simulation. Given the high simulation rate, it is interesting to study the case in which the generation step is slower than the end-to-end simulation (see Figure 1). In this case, it is convenient to use the same generator event to simulate multiple different reconstructed events. We name this procedure oversampling. However, we must take into account the fact that reconstructed events sharing the same gen as input are correlated. Usually, in order to perform physical measurements, experimental collaborations construct probability distribution histograms for the quantities of interest in the analysis. Therefore, in our work, we propose a general procedure to build such histograms with proper bin statistical uncertainties, while considering the event correlation due to oversampling.
In particular, given a histogram, the probability associated with the i-th bin and its uncertainty are given by:

p_i = (Σ_{j∈bin i} w_j) / (Σ_j w_j), σ_i = √(Σ_{j∈bin i} w_j²) / (Σ_j w_j),

where the w_j are the statistical weights associated with the events (e.g., originated by a Monte Carlo physics generator). In the oversampling case, there are N events with a common gen event, called folds, and therefore:

p_i = (Σ_j w_j p^fold_{i,j}) / (Σ_j w_j),

where p^fold_{i,j} is the probability associated with each fold, since each event can now enter multiple bins. Assuming that folds entering different bins are uncorrelated, the associated uncertainty becomes:

σ_i = √(Σ_j w_j² (p^fold_{i,j})²) / (Σ_j w_j).

In practice, a histogram is filled for each fold, defining p^fold_{i,j} as the bin content divided by N. The resulting histograms are then combined to obtain the final histogram. We can notice that, in this way, the resulting uncertainty for each bin is larger than the one obtained by (wrongly) considering all the events uncorrelated.
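A minimal numpy sketch of this fold-histogram procedure is given below. The grouping of reco events by their common gen event and the uncorrelated-folds assumption follow the text; the function interface (gen ids, one weight per fold group) is our own.

```python
import numpy as np

def oversampled_hist(values, gen_ids, weights, bins, n_folds):
    """Histogram with oversampled events. values/weights are per reco event;
    gen_ids marks which gen event (fold group) each reco event came from.
    For each gen event j, p_fold[i,j] is the fraction of its n_folds reco
    events landing in bin i; the bin probability sums w_j * p_fold and the
    variance sums w_j^2 * p_fold^2, treating distinct gen events as
    uncorrelated."""
    n_bins = len(bins) - 1
    p, var, wsum = np.zeros(n_bins), np.zeros(n_bins), 0.0
    for gid in np.unique(gen_ids):
        sel = gen_ids == gid
        w = weights[sel][0]                 # one gen weight shared by its folds
        counts, _ = np.histogram(values[sel], bins=bins)
        p_fold = counts / n_folds
        p += w * p_fold
        var += w**2 * p_fold**2
        wsum += w
    return p / wsum, np.sqrt(var) / wsum

# Two gen events, three folds each, unit weights
vals = np.array([0.5, 0.5, 1.5, 0.5, 1.5, 1.5])
ids = np.array([0, 0, 0, 1, 1, 1])
p, err = oversampled_hist(vals, ids, np.ones(6), np.array([0.0, 1.0, 2.0]), 3)
```

Because each fold group contributes its fold probability squared, correlated folds piling into the same bin inflate the uncertainty relative to the naive uncorrelated treatment, as noted in the text.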
We evaluated the effect introduced by oversampling by developing a basic analysis aiming at reconstructing a W boson produced in the t t̄ process. After the production, the top quark (antiquark) decays into a W+ (W−) boson and a b (b̄) quark. The W bosons decay with higher probability into a pair of light (non-b) quarks. A general overview of the decay is shown in Figure 17. The resulting quarks are seen as jets of particles at the detector level.
Figure 17: Feynman diagram of t t̄ production and the subsequent decay products.
The basic analysis works as follows: first, we selected only jets with p_T ≥ 25 GeV. Jets with a b-tagging score greater than 0.5 were labelled as b-jets. Afterwards, we selected the events containing precisely four jets, with two of them identified as b-tagged. Finally, we reconstructed the candidate W boson from the two non-b-tagged jets. Oversampling has been employed to increase the number of simulated events of a 10k target dataset by a factor of 10. The analysis has been performed on three datasets: the original 10k target dataset, the flow-oversampled one, and a separate target dataset of 100k events.
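The event selection above can be written as a short filter. The jet representation (dicts with pt and btag keys) is invented for illustration; the cuts follow the text.

```python
def select_w_candidates(jets):
    """Toy version of the analysis cuts: keep jets with pT >= 25 GeV,
    call a jet b-tagged if its score is > 0.5, require exactly 4 jets with
    exactly 2 b-tags, and return the two non-b jets as the W candidate pair."""
    jets = [j for j in jets if j["pt"] >= 25.0]
    bjets = [j for j in jets if j["btag"] > 0.5]
    light = [j for j in jets if j["btag"] <= 0.5]
    if len(jets) == 4 and len(bjets) == 2:
        return light        # the two jets used to reconstruct the W
    return None             # event fails the selection

evt = [{"pt": 30, "btag": 0.9}, {"pt": 40, "btag": 0.8},
       {"pt": 50, "btag": 0.1}, {"pt": 60, "btag": 0.2},
       {"pt": 20, "btag": 0.7}]          # this jet fails the pT cut
w_jets = select_w_candidates(evt)
```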
In Figure 18, we show the comparison between the 10k and 100k events of the target dataset and the 100k events of the oversampled dataset for the W boson mass and p_T distributions, as well as the ∆R between the two jets used for the boson reconstruction. The ∆R measures the distance between two objects in the detector space η − ϕ, and it is defined as ∆R = √(∆η² + ∆ϕ²).
As expected, oversampling provides a method to increase the statistical power of the dataset by reducing its relative statistical uncertainty. However, the improvement is larger for those distributions whose resolution is strongly dependent on detector effects. As shown in Figure 18, the uncertainty reduction is larger for the invariant mass and p_T distributions, where the performance of the oversampled dataset is compatible with that of its target equivalent. The reduction is smaller for the ∆R distribution, since the angular positions of the jets are measured with higher precision than their energy and momentum. In this case, there is little advantage in using oversampling, since the uncertainty reduction compared to the original 10k sample is moderate. Additionally, we note that there is no significant bias between the two 100k samples.

Application scenarios
The approach presented so far can be used in different scenarios: to simulate datasets with a new detector response from existing generated events, or to create new samples, including the generation step. In the latter case, different scenarios can be imagined depending on the generation speed. In particular, the time needed for generating one event can be as low as a few tens of milliseconds (as in the case of the pythia generator used for the toy datasets in this work) or as high as tens of seconds for generators with the highest theoretical accuracy.
The overall performance in these different tasks can be estimated as a function of the flow sampling speed (see Figure 7) and of the size of the folds when using the oversampling technique described in the previous section. We evaluated three different generation speed scenarios (0.01 s/ev, 1 s/ev, 20 s/ev) and two cases, one with oversampling and one without. For each scenario, and with different hypotheses of flow sampling rate, we evaluated the number of events that can be produced in a day on a typical HPC node with 32 CPU cores and 4 GPUs (see [41]) by scaling the performance observed on a single GPU. In order to extrapolate from the per-object sampling speed to the per-event one, we considered that a typical LHC event has about 20 objects (jets, electrons, muons, etc.) to be simulated. Table 2 shows the results obtained. It is clear that while a high sampling speed (above 10 kHz) could be useful for the scenario where the gen input already exists, a lower rate is sufficient for most of the other applications. Oversampling also plays a major role in these scenarios.
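The scaling behind such estimates can be sketched with a back-of-the-envelope model; the pipeline model, rates, and function name below are illustrative assumptions, not the measured numbers of Table 2:

```python
# Back-of-the-envelope throughput estimate for one HPC node
# (32 CPU cores, 4 GPUs). Generation is assumed CPU-bound and flow
# sampling GPU-bound; the slower stage limits the pipeline.
DAY = 86_400            # seconds in a day
OBJECTS_PER_EVENT = 20  # typical number of objects in an LHC event

def events_per_day(gen_s_per_ev, flow_obj_hz, n_cores=32, n_gpus=4, folds=1):
    # Each gen event is reused `folds` times (oversampling).
    gen_rate = n_cores * folds / gen_s_per_ev             # reco events / s
    flow_rate = n_gpus * flow_obj_hz / OBJECTS_PER_EVENT  # reco events / s
    return DAY * min(gen_rate, flow_rate)

# Slow generator (20 s/ev), 10 kHz per-object flow sampling,
# with and without a 10x oversampling factor:
print(f"{events_per_day(20.0, 1e4, folds=10):.2e}")  # -> 1.38e+06
print(f"{events_per_day(20.0, 1e4):.2e}")            # -> 1.38e+05
```

Even with these rough numbers, the role of oversampling for slow generators is apparent: a tenfold fold count moves the bottleneck ten times further away from the generation stage.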

Conclusion
We introduced Normalizing Flows for end-to-end simulation in High Energy Physics. Continuous Normalizing Flows trained with Flow Matching have been identified as the most accurate class of models, after testing on a series of physically motivated metrics.
We demonstrated how flow-based simulation can produce accurate results for processes different from the training one, if provided with the relevant physical information. Additionally, we proposed the novel oversampling technique, which allows using multiple reconstructed events coming from the same generator-level one. We showed how such a technique can effectively reduce the statistical uncertainties of existing datasets.
Moving away from the toy datasets used in this work to an actual physics dataset, it will be important to use a series of metrics capable of measuring convergence reliably. For this task, novel goodness-of-fit testing techniques, such as the New Physics Learning Machine (NPLM) framework [42], can be employed.
We also aim to investigate the use of these algorithms for the simulation of objects with no generator information (fakes) or for global (scalar) event quantities. Those are the missing pieces needed to perform a comprehensive end-to-end simulation of physical events.
We are excited about the potential of flow-based approaches to aid in HEP simulation, paving the way for new discoveries.

deformed based on a randomly generated number of secondary vertices, Poisson distributed with a mean depending on the jet flavour, and on the jet momentum and rapidity. Finally, the b-tagger is biased if a muon is present in the jet. Quark-gluon discrimination is generated by transforming a uniform distribution with a power law whose coefficient depends on the flavour of the jet. Additional correlations are introduced with the number of jet constituents and with the jet b-tagging discriminator.
• Number and charge of the constituents: the reconstructed number of constituents is obtained by smearing the truth value, and it is propagated with an additional smearing to the two variables representing the number of charged and neutral constituents.
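This two-step smearing could be sketched as follows; the Gaussian widths and the charged fraction are placeholder assumptions of ours, since the exact parameters of the toy simulation are not given here:

```python
import numpy as np

rng = np.random.default_rng(0)

def smear_constituents(n_true, charged_frac=0.6, sigma_n=2.0, sigma_split=1.0):
    # Step 1: reco multiplicity obtained by smearing the truth value.
    n_reco = max(0, round(n_true + rng.normal(0.0, sigma_n)))
    # Step 2: charged count from the reco multiplicity, with an
    # additional smearing; the neutral count is the remainder.
    n_charged = int(np.clip(round(charged_frac * n_reco
                                  + rng.normal(0.0, sigma_split)), 0, n_reco))
    return n_reco, n_charged, n_reco - n_charged

n_reco, n_charged, n_neutral = smear_constituents(30)
```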
Additionally, the generator-level fractions of energy in the charged/neutral and hadronic/electromagnetic components are computed, without any smearing, and they are used as an additional set of targets for the Normalizing Flow. The number of constituents and the energy fractions are also used to simulate a jet ID discriminating variable that takes larger values for jets with more charged constituents or with higher momentum, mimicking the behaviour of the jet identification discriminators used to distinguish signal jets from jets originating from pile-up interactions or noise in actual experiments (no such jets are present in this toy simulation).

Appendix B. Models training and comparisons
and a uniform U[−1, 1] noise distribution, but we found it to have lower performance than the combination of Target flow matching and a Gaussian noise source.

Best model
The detailed hyperparameters for the best model on the bigger dataset are specified in Table B1. For the ODE solver, we use torchdiffeq [43] with the dopri5 solver, atol and rtol both equal to 10⁻⁵, and t = 100 timesteps.

Small dataset results

Table B2 shows the results for models trained on the smaller dataset (6 inputs, 5 targets). We used these results to guide our choice of models to train on the bigger dataset. The two major insights drawn were:
• Continuous models have a better performance than discrete ones, while using a smaller number of parameters. Their performances are quite close to each other, so we decided to retrain most of them on the bigger dataset;
• Among discrete models, the best ones were the DAA 20 and DAC 5 with SiLU/GELU activation functions, which were retrained on the bigger dataset.
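Sampling from a continuous flow amounts to integrating the learned velocity field from noise (t = 0) to data (t = 1). A minimal sketch using SciPy's Dormand-Prince solver (the same RK45/dopri5 family, with the quoted tolerances); the linear velocity field below is a toy stand-in for the trained network:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy velocity field standing in for the trained network: it pulls
# samples towards a fixed target point.
def velocity(t, x):
    target = np.array([2.0, -1.0])
    return target - x

x0 = np.zeros(2)  # one "noise" sample at t = 0
sol = solve_ivp(velocity, t_span=(0.0, 1.0), y0=x0,
                method="RK45", rtol=1e-5, atol=1e-5)
x1 = sol.y[:, -1]  # the generated sample at t = 1
```

For this linear field the exact solution is x(1) = (1 − e⁻¹) · target, so the adaptive solver's output can be checked against a closed form.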

Big dataset results
We show in Table B3 the numerical scores of the various metrics for the different models. Table B3 reports the median of the validation metrics over the last 100 epochs for each model, on the same validation split. For each metric, the ranking is shown in parentheses, where (1) stands for the best model and (10) for the worst one in the given metric. As some continuous flows are close in their results, we also compute the standard deviation for each model in each metric, and we report it under the score.
In Table B4 we show results for the best model (CRT) on different physical processes. We also report in Figure B1 the 1-d plots for all the target distributions of the best model on the tt̄ test dataset.

Table B3: General comparison of models, where C: continuous, D: discrete. We report the median values over the last 100 epochs for a series of validation metrics. Continuous models outperform all the discrete ones. Results are quite close between continuous flows, so we report the standard deviation of each median value under it. The models are ranked across each metric, where (1) is the best and (10) the worst. The alternative simulation was obtained by computing the reported metrics on two target datasets built using the same gen input but different random seeds for the reconstruction.

Figure 1 :
Figure 1: The typical simulation chain for HEP collaborations, consisting of generation, simulation & digitization, reconstruction, and analysis data reduction steps. The present work investigates the capability of Normalizing Flows models to perform end-to-end simulation from the first step of this chain directly to the last one (figures taken from [4, 5]).
the 1-d distributions and correlations between the b-tagging score and c-tagging score based on the jet flavour.

Figure 2 :
Figure 2: The correlations between the two target tagging distributions are shown as contour lines in the bottom-left panel, coloured by the flavour of the jets. The two target distributions have non-trivial correlations for light jets, and the b and c flavours correspond to peaks in the tagging distributions.

Figure 3 :
Figure 3: Model losses for CRT and DAC: continuous models on the left (see CRT, Table B3) and discrete models on the right. We also report the smoothed loss values across a window of 15 epochs. (a) Continuous models: the loss does not show signs of overfitting; sharp drops are due to the reduction of the learning rate on plateaus. (b) Discrete models: the loss shows a widening gap between training and validation, suggesting possible overfitting.

Figure 4 :
Figure 4: Analysis of discrete models on the small dataset, showcasing the performance of Affine Coupling with 5 layers and Affine Autoregressive with 20 layers in both the ABC (left) and KS mean (right) metrics versus the other classes of models, computed as the median over the last 100 epochs. For each model, we report the number of trainable parameters in parentheses.

Figure 7 :
Figure 7: The rate of sampling one batch of a given size for different models. On the right (b), a specific look at continuous model architectures of varying sizes demonstrates the impact of increased parameters on the generation rate.

Figure 8 :
Figure 8: The CRT model's capability in modelling the shapes of the different components of the tagging distributions given the generator-level input of the jet flavour. The bin-per-bin ratio between the model and the target across all flavours is reported in the bottom panel.

Figure 9 :
Figure 9: The number of constituents (left) and the transverse momentum (right) distributions are correctly reproduced. The 1-d Wasserstein score between the Target and Flow distributions is reported as a measure of convergence.

Figure 10 :
Figure 10: The correlations between different target variables are in agreement between Target and Flow. On the diagonal, we show the 1-d histograms for each variable. The off-diagonal elements compare the contour lines of the 2-d distributions for each pair of variables.

Figure 11 :
Figure 11: The plot shows the mean of the distribution (left) and its standard deviation (right) for each bin of the generator jet p_T. The different colours represent the different flavours, while Target vs Flow results are drawn with different line styles. We observe a general agreement on the scaling of these curves, with the subtle differences between b and non-b-flavoured jets being correctly learned by the model.

Figure 12 :
Figure 12: ROC curves for the b (left) and c (right) tagging algorithms. The Area Between Curves (ABC), shown as the red shaded area between the two curves, is consistently smaller than the typical data vs simulation discrepancy at LHC experiments.

Figure 13 :
Figure 13: Results when training on different data amounts. On the left, the ABC as a function of training data, with max and min values over the last 100 epochs. Despite fluctuations, we observe a downward trend. On the right, we compare the two extremes in training data: when the model is trained on the bigger dataset, the tails are modelled better than when using the smaller one.

Figure 14:
Figure 15:
Figure 14: Correlation plot for the WW dataset, illustrating how the Target is reproduced even in physical processes different from the training ones.

Figure 16 :
Figure 16: The number of constituents for the multi-jet (left) and Z+jets (right) datasets.

Figure 18 :
Figure 18: Comparison between 100k and 10k events of the target dataset and 100k events obtained using the oversampling method for (a) the invariant mass distribution of the reconstructed W boson and (b) its transverse momentum. Panel (c) shows the ∆R distance between the two jets selected for the reconstruction of the W boson. Each plot shows the distributions, the ratio between the oversampled dataset and its target equivalent, and the relative statistical uncertainty for each bin of the histogram.

Figure B1 :
Figure B1: All the 1-d distributions obtained for the training dataset. The description of the shown variables can be found in Table 1.

Table 1 :
The two datasets used in this work: one with 6 input generator-level variables and 5 target reco-level variables; an extended one with the same inputs and 16 target reco variables in total.
the CRT, DAA and DAC

Figure 6: A comparison between models: the behaviour of the FGD as a function of training epochs. Continuous models converge to lower values.

Table 2 :
Comparison of the millions of events produced per day on a single 4-GPU computing node in different scenarios, and their ratio to a conventional simulation scenario taking 20 seconds per event.

Table B2 :
The results of our experiments on the small dataset with different models. The continuous ones (starting with C) outperform all the combinations of discrete flows (starting with D).

Table B4 :
Numerical results for the best model on the different physical processes.