LHC hadronic jet generation using convolutional variational autoencoders with normalizing flows

In high energy physics, one of the most important processes for collider data analysis is the comparison of collected and simulated data. Nowadays the state-of-the-art for data generation is in the form of Monte Carlo (MC) generators. However, because of the upcoming high-luminosity upgrade of the Large Hadron Collider (LHC), there will not be enough computational power or time to match the amount of needed simulated data using MC methods. An alternative approach under study is the usage of machine learning generative methods to fulfill that task. Since the most common final-state objects of high-energy proton collisions are hadronic jets, which are collections of particles collimated in a given region of space, this work aims to develop a convolutional variational autoencoder (ConVAE) for the generation of particle-based LHC hadronic jets. Given the ConVAE’s limitations, a normalizing flow (NF) network is coupled to it in a two-step training process, which shows improvements on the results for the generated jets. The ConVAE+NF network is capable of generating a jet in 18.30±0.04μs , making it one of the fastest methods for this task up to now.


Introduction
In the CERN LHC [1,2], approximately one billion proton-proton (pp) collisions occur every second.Those collisions generate outgoing particles that are measured by dedicated, custom-made particle detectors.The main data stream collected by such a detector is close to 1.4 kHz.After the upcoming LHC upgrade to the High-Luminosity LHC (HL-LHC) [3] this number will increase by a factor of 3.5-5.In high energy physics (HEP), an important part of the data analysis is the simulation of pp collisions that are used to test analysis techniques, estimate the particle detectors' efficiencies and resolutions, and make comparisons with collected data.Usually, the number of simulated events is an order of magnitude higher than the number of recorded events in real data, such that the statistical uncertainty of the simulated events doesn't dominate the overall uncertainty.For simulating collision data, Monte Carlo (MC) event generators are used.However, the increased granularity of the detectors and the higher complexity of the collision events after the HL-LHC upgrade pose significant challenges that cannot be solved by increasing the computing resources alone [4].
One of the most common final-state objects in high-energy pp collisions is an energetic cluster of hadrons, known as a hadronic jet [5].Jets can be described by the features of their constituent particles, mainly their four-momentum.The particle-flow (PF) [6] algorithm is used to combine information obtained by several sub-detectors to precisely measure each particle's properties.Jet clustering algorithms [7] can be used to cluster the particle constituents into a jet and obtain its relevant properties for physics analysis such as its invariant mass, transverse momentum (p T ), energy, azimuthal angle ϕ, and pseudorapidity η ‡.
Being able to accelerate the process of jet simulation in a particle detector would be of great advantage to mitigate the needed overhead in computing resources.One of the avenues that shows promising results is the use of generative machine learning (ML) algorithms to perform the jet simulation.The goal of this program is to maintain the accuracy of MC methods while making the process orders of magnitude faster.Several approaches using neural networks (NNs) have been studied and applied for this purpose, using generative adversarial networks (GANs) [8,9], normalising flows [10], and scorebased diffusion models [11,12].Another promising technique is the use of variational autoencoders (VAEs) [13], which have yet to demonstrate high fidelity results.
In this work, a new method for the generation of hadronic jets from noise is presented.It is based on a VAE coupled with a normalizing flow (NF) and trained in a two-step process.The paper is organized as follows.Section 2 contains a literature overview of ML applied to HEP, mainly focusing on generative methods for jets simulation.In Section 3 the dataset and the model are described.Section 4 details ‡ In the LHC experiments, the origin of the right-handed coordinate system is the center of the local pp collision region, the y axis is defined vertically upward, and the z axis is along the proton beam direction.Additional coordinates are the azimuthal angle ϕ and pseudorapidity η = − ln tan(θ/2), where θ is the polar angle.
the custom loss function and the evaluation metrics; results are presented and discussed in Section 5. Finally in Section 6, we summarize the work and give an outlook for using the new method.

Related Work
ML techniques have been applied to hadronic jets generation following several distinct approaches.Some of the first attempts [14][15][16][17][18] were performed using convolutional neural networks (CNNs) because of their impressive results in computer vision and the fact that an image-like representation of a jet arises naturally from the particle detectors' data.Other jet representations such as vectors of energy deposits in calorimeters or high-level jet characteristics have also been studied [19,20].
Recently, particle-based jet datasets have also been extensively used by the ML community in the context of jets generation [8,9,13,21].They are composed of the jets constituent particles' properties and give a detailed description of particles distribution inside the jet, which is known as the jet substructure [22,23].GANs based on graph networks [8,9] have provided inspiring results when aiming to generate particle-based jets, retaining great similarity between generated jets and MC simulated jets while improving the speed of simulation by 4 to 5 orders of magnitude depending on the number of particles inside the generated jet.Some VAEs based on convolutional networks have also been implemented using the same dataset for jets fast simulation [21] and for the jets generation as well [13].However, for the latter, there is great room for improvement when comparing the VAE generated jets with MC simulated jets.
The premise of this work is that the VAE by itself is not enough to capture the full characteristics of a hadronic jet, mainly due to the a priori choice of the latent space distribution.On the other hand, a combined VAE+NF approach, which has already been applied for image generation [24] and time series data prediction [25], among others, has not yet been applied in HEP.In this work, this combined approach is used to improve the generative capacity of the VAE.

Dataset and Model
We use the gluon HLS4ML LHC Jet dataset [26,27] that is composed of ∼177k gluon jets, containing the jet constituent particles' three-momenta (p x , p y , p z ).These were obtained from simulated pp collisions with a center of mass energy of √ s = 13 TeV that produced jets with p T ≈ 1 TeV.A simplified particle detector parameterization, similar to the apparatuses deployed by the large LHC experiments, is used to simulate particles' interactions with the detector material.The hadronic jets are reconstructed using the anti-k T algorithm [28,29] with a distance parameter R = 0.8.The 177k samples are divided into ∼70% for training and ∼15% each for validation and testing.Even though there is no intrinsic ordering to the particles inside a jet, the particles are ordered in a list by decreasing p T and only the first 30 particles are used as input to the ML algorithm.If a jet contains less then 30 particles, the rest of the list was filled with zeros (zero-padding).A feature-wise standardization was performed to bound the values of the three-momentum to be in the range [0.0, 1.0] as input to the network.

Convolutional VAE
To accommodate this representation of the input dataset, the chosen model is a VAE [30][31][32] composed of convolution layers (ConVAE).The VAE is composed of three components: an encoder, which learns to compress the representation of the input data into a lower dimensional representation, a latent space, which stores this lower dimensional representation into values that closely follow a probability distribution defined by the user, and a decoder, which decodes the latent space values to the output.The probability distribution of the latent space values has to be differentiable, easy to implement, and easy to sample from.The most common choice is the standard Gaussian N (0, 1).The goal of the VAE is to produce output data whose distribution is as similar as possible to that of the input data.In addition, it is desirable to be able to generate new data samples resembling the training data by sampling from N (0, 1) and feeding those values into the decoder.
The network was implemented in the PyTorch library [33].In this work, the encoder and decoder structures are mirrored, and only the encoder will be described.It is composed of a given number of consecutive two-dimensional convolution layers where the last convolution layer is flattened and introduced into two linear layers, where the latter is input into the latent space.An activation function is applied after each layer, except for the last linear layer of the encoder.The input and output layers of the network have a fixed size of 3×30.The activation function in between each layer, number of convolutional layers in the encoder and decoder, kernel size, latent dimension size, number of filters, and number of nodes in the linear layers between encoder (decoder) and the latent space are hyperparameters to be optimized.The activation function of the last deconvolution layer of the decoder was chosen to be the sigmoid [34] function, to bound the output values in between 0.0 and 1.0, given the feature-wise standardization.
In the encoder architecture, the number of input nodes of the first linear layer is fixed by the size of the last convolutional layer, while the number of output nodes of the second linear layer is fixed as two times the number of dimensions of the latent space to provide the means and standard deviations of the latent distributions §.In the decoder architecture, the number of input nodes in the first linear layer is fixed by the latent space size, and the number of output nodes in the second linear layer is fixed by the size of the first deconvolution layer.The stride and padding of the convolution layers were kept as 1 and 0, respectively.The first convolutional layer has 1 input filter and N filters output filters, and the rest of the convolutional layers have an input number of filters equal to 2 ℓ−1 × N filters and an output number of filters equal to 2 ℓ N filters , where ℓ is the layer number.
Hyperparameter optimization was performed using the Optuna package [35], which allows for fast convergence thanks to aggressive pruning based on intermediate results and easy parallelization.Each ConVAE was set to go through 300 trials of the optimization, where each trial was executed for 300 epochs of the training.At training time, the batch size was kept fixed at 100 samples and the Adam optimizer [36] was used with a learning rate (LR) that was set by the hyperparameter optimization.During the optimization, the evaluation metric (described in Section 4) was minimized using the value evaluated on the validation dataset after every 5 epochs.The ConVAE was retrained with the best set of hyperparameters for 1500 epochs, and the best model was chosen as the one that exhibited the smallest value of the metric.

ConVAE Loss Function and Evaluation Metric
The VAE loss function is given by the following expression [30]: where x is the vector of data values, z is the vector of latent space values, q ϕ (z|x) the variational posterior (i.e., the encoder), p θ (x|z) the conditional likelihood (i.e., the decoder), and p(z) the prior distribution.The first term on the right-hand side can be seen as the reconstruction loss L rec , measuring the distance between the output produced by the network and the input data.The second term D KL is the Kullback-Leibler divergence [37] that measures the difference between the probability distributions of the encoder output values and the latent space values.The distribution of the latent space values is defined a priori, and the distribution of the encoder output values is one that is easy to work with, but still complex enough to describe its output, where the usual choice is a multivariate Gaussian.The minimization of this loss function ensures that the output will be as close as possible to the input, while the values of the latent space closely follow the chosen distribution.In addition, one can add a hyperparameter β [38] to control the importance of each term in the loss.The loss function can be written as: The reconstruction loss term was customized to the task of particle generation.The loss function that compares output particles with input dataset particles was chosen to be the Chamfer distance [39], also referred to as nearest neighbor distance (NND).This loss function is preferred over the commonly-used mean-squared error (MSE) function because it is permutation invariant, which is desirable since particles in a jet have no intrinsic ordering.The NND loss is expressed as: where i and j are indices of the particles in the input and output samples, respectively, k is the index of a given jet, J k represents the k-th jet in the dataset, hat distinguishes between input (without) and output (with) objects, and D(⃗ p i , ⃗ pj ) is the Euclidean distance between input and output particles, treating each as a vector in (p x , p y , p z ) space.The first term finds the closest output particle to a given input particle, the second term finds the closest input particle to a given output particle, and their distances are summed.The loss L NND alone was not enough to provide good quality of the generated jets.Therefore, physics-inspired constraints were included in the form of distances between the input and output jet properties.The best results were achieved using a combination of the jet mass and p T terms using MSE as a distance function.The additional loss term was added as follows where the parameters γ p T and γ m were used to weight each contribution.Finally, the combined reconstruction loss is given by in which α and γ were used to weight the importance of each contribution to the loss function.The full set of loss parameters (α, β, γ, γ p T , γ m ) was optimized with Optuna.The hyperparameter optimization procedure was carried out in two steps.In the first, only the loss parameters and the optimizer learning rate were optimized while the network parameters were fixed at some reasonable choices established after manual tests.In the second, all the network parameters were optimized given the best set of loss parameters.This was done to ensure convergence of the optimization procedure in a feasible amount of time given the amount of resources available.
The evaluation metric chosen to quantitatively measure the generation capabilities of the distinct models was the earth mover's distance (EMD) [40], or 1-Wasserstein distance, calculated for each histogram of the input and output jets mass, energy, p T , η, and ϕ, and summed to get the EMD sum .The choice of the variables was based on relevant quantities for jet-based searches in LHC experiments [41,42].This metric was used as the hyperparameter optimization objective, the method to choose the best model, and to compare different approaches.

Normalizing Flows
When the ConVAE is optimized for reconstruction, i.e., the value of β in Equation ( 2) is small, a looser constraint is applied to ensure the latent space values follow a given probability distribution.In this case, the probability distribution of those values, p(z), favours optimizing the reconstruction rather than strictly matching the imposed prior.These reconstruction-imposed latent space distributions are generally unknown and one cannot easily sample from them.
A normalizing flow [43,44] is a transformation of a simple probability distribution (e.g., N (0, 1)) into a more complex distribution (and vice versa) through a sequence of invertible and differentiable mappings usually learned by training a NN.Therefore, we can combine the ConVAE approach with a NF to find the transformation from the more complex latent space distribution to a simpler one, such as p(z) where u(x) = N (0, 1).Since these transformations are invertible by construction, it is possible to start from values that follow the simple distribution, pass it through the inverse of the transformation and obtain values that follow the complex and unknown distribution as u(x) The network is trained by maximizing the right-hand side of Equation (6).
where ∂f (z)/∂z is the Jacobian of the transformation.
The NF network we use is based on the RealNVP network∥ [45,46] that makes use of simple analytic expressions for each intermediate transformation using the coupling layers [47], where only a subset of the input data undergoes the transformations while the rest remains unchanged, and the subsets are permuted every epoch to ensure that all input data goes through the flows.The parameters s and t are obtained as the outputs of two multilayer perceptrons (MLPs), which can be as complex as needed.
To combine the ConVAE with the NF network (ConVAE+NF), a two-step training was performed.First, the hyperparameters of the ConVAE were optimized and the resulting best model was trained for 1500 epochs.The model with the smallest EMD sum when comparing the input with reconstructed jets in the validation dataset, calculated every 5 epochs, was saved.The latent space values of the saved network when receiving as input the training and validation jet datasets were extracted to generate new training and validation datasets for the NF.A diagram containing the full training procedure is displayed in Figure 1.
The NF network was implemented using the PyTorch library.The Adam optimizer was chosen to update the network parameters by minimizing the negative of the mean of the expression in Equation ( 6) as the loss function.The learning rate, number of flows and the number of layers and nodes of the two MLPs that determine s and t were hyperparameters to be optimized.The ReLU [48] activation function was used for each layer of the MLPs, and for s the hyperbolic tangent was used after its last layer to constrain its values to be in the range (−1.0, 1.0).The number of input and output nodes of the s and t networks were set to match the number of latent dimensions of the ConVAE.
The metric used to evaluate the combination of both networks as the hyperparameter optimization objective was EMD sum .After every four epochs, N latent ∥ The implementation of the RealNVP network in this work was performed following the GitHub repository https://github.com/ispamm/realnvp-demo-pytorch.values were sampled from N (0, 1) and input into the inverse of the NF network, whose output was given as input to the decoder of the ConVAE tuned for reconstruction, and its output was compared with jets from the validation dataset.The optimization was run for 300 trials with 20 epochs of NF training per trial.The best set of hyperparameters was used to train the final NF network for 100 epochs, and the model that exhibited the smallest metric value, checked after every epoch, was saved.

Hyperparameter optimisation
As described above, the hyperparameter optimization was performed using the Optuna framework and executed in steps to speed up the convergence of the method given the limited computational resources.The ranges considered for each hyperparameter during the optimization for both types of NN are shown below.In the ConVAE case, the set of parameters to be optimized was: the learning rate of the optimizer (Adam), the loss function parameters α, β, γ, γ p T , and γ m , and the architecture parameters as the number of latent dimensions, number of filters, kernel size related to the number of particles to convolve, number of nodes in linear layers, number of convolution layers, and activation function.For the 2D kernel size (k 1 , k 2 ), only k 2 , the number of particles to convolve, was varied, while k 1 , the number of particle features to convolve, was set to 3 for the first (last) convolution (deconvolution) layer of the encoder (decoder) and 1 otherwise.
The ranges of the parameters γ p T and γ m were determined by the dataset because jet p T and mass have different scales, which imply different magnitudes for those loss components.The range for γ p T was set from 20% smaller to 20% larger than unity, while the range for γ m was set from 20% smaller to 20% larger than the ratio of the mean of the jet p T and mass.For the loss function hyperparameter optimization, the network architecture hyperparameters were fixed at the reasonable choices displayed in Table 1.For both ConVAE networks (ConVAE and ConVAE+NF approaches) almost all the ranges of the hyperparameters for the optimization were the same, and are shown in Table 2.The only exception was for the loss hyperparameter β that, due to the difference in the magnitudes of L rec and D KL , needed to be much closer to 1.0 for the ConVAE tuned for generation, where the optimization range was [0.9, 1.0] in log scale, and much closer to 0.0 for the ConVAE tuned for reconstruction, where the optimization range was [0.0, 0.1] in log scale.
N latent Filters Kernel N linear Layers Act.Func.Since the number of hyperparameters of the RealNVP network is much smaller than for the ConVAE, its hyperparameter optimization was performed only in one step.The set of hyperparameters that were optimized was: the learning rate of the optimizer (Adam); and the network architecture hyperparameters as the number of flows, number of linear layers for the s and t MLPs and the number of nodes in each layer of those MLPs.For both the RealNVP network used together with the ConVAE and the one used by itself, the ranges of the hyperparameter optimization are shown in Table 3.

LR
N flows Layers N linear 10 −5 -10 −1 5-100 1-4 50-400 Table 3. Ranges of architecture hyperparameters for the RealNVP optimization.The number of nodes in each linear layer (N linear ) is optimized separately for each layer.

Experiments
We experiment with three approaches: a baseline ConVAE-only model, the ConVAE+NF model, and another baseline NF-only model.The final set of optimized hyperparameters for the former is displayed in Table 4, which result in an EMD sum value of 0.0033.Figure 2 shows the comparisons of the distributions of input and output jet mass, p T , energy, η and ϕ after hyperparameter optimization.Qualitatively, the distributions of jet energy, η, and ϕ are very similar, with differences only at the extremes of the histograms, while for p T there is a slight difference at the peak of the distribution and small discrepancies at both tails.The greatest discrepancies are in the distributions of jet mass, where they are only similar in the approximate range 80 GeV < m jet < 175 GeV.The complete sets of optimized hyperparameters for the ConVAE+NF model are displayed in Tables 5 and 6, which result in an EMD sum of 0.0026.The histograms of the jet variables are displayed in Figure 3.

LR
Based on the EMD sum , there is a large improvement in the generative capabilities of the ConVAE+NF model compared to the ConVAE model.In particular, the agreement in the jet mass distribution improves dramatically, with disagreement only for masses smaller than 25 GeV.However, there remain notable deviations from unity in the η, ϕ, and p T histograms ratio between randomly generated and input jets.For comparison, a RealNVP network was also optimized, in the same way as the NF network for the ConVAE+NF approach, and trained to receive as input a flattened jet ¶, transform each feature distribution into an uniform normal, and invert the transformation back to flattened jets.Table 7 contains the set of best parameters after hyperparameter optimization for this network.The value of EMD sum for this case was 0.0049 and the histograms of jet variables are shown in Figure 4.The agreement between the distributions and the EMD sum are significantly worse than ConVAE+NF, demonstrating the advantage in combining the inference capability of the ConVAE, with the invertibility of the NF network.Table 7. Set of optimal hyperparameters for the NF.

LR
Following the prior work in Refs.[8,49], another set of metrics was calculated and compared to the results obtained using the state-of-the-art message-passing generative adversarial network (MPGAN) on this dataset.The metrics W M 1 , W P 1 and W EFP 1 correspond to the average 1-Wasserstein distance between input and generated jets mass, constituents η rel , ϕ rel and p rel T , and five energy-flow polynomials (EFPs) [50], which are a set of variables that capture n-point particle correlations and jet substructure.The computation of the EFPs was performed using the JetNet package [51], where the EFPs correspond to the default choice defined in the package.The Fréchet ParticleNet Distance (FPND), which is the ParticleNet [52]-adaptation of the Fréchet Inception distance (FID) [53] that is standard in computer vision; and coverage (COV) and minimum matching distance (MMD) which measure the ratio of samples in X (e.g., the generated jets) that have a match in Y (e.g., the input jets) and the average distance between matched samples, respectively, are also measured.Table 8 shows that among the models implemented in this work, the ConVAE+NF approach is the best in almost every metric.However, MPGAN outperforms the VAE-based approaches in every single metric.The advantage of the ConVAE+NF approach is that it is approximately two times faster: the average time to generate a single jet is 18.30±0.04µs for ConVAE+NF versus 35.7 µs for MPGAN measured using an NVIDIA Tesla T4 GPU.
From the comparisons of the low-and high-level jet feature distributions, we observe that the mixed ConVAE+NF approach exhibits an improvement over the ConVAE alone.We attribute this to two reasons: (1) the standard Gaussian distribution, used by the ConVAE alone but altered in the ConVAE+NF approach, is not the optimal probability distribution for the elements in the latent space dimension, and (2) the decompression of those values could be improved by using the same parameters of the ConVAE tuned for reconstruction.Still, there are discrepancies between the ¶ The input jet 3×30 feature matrix representing each particle's p x , p y and p z , was flattened to a 1×90 feature vector containing all particles' p x , then all p y and, finally, all p z .

Model
W M 1 (×10 −3 ) W P 1 (×10   ConVAE+NF generated jets and input jets, mainly related to the substructure variables, which are not present for the MPGAN approach.Other than the structural distinction between GAN and VAE, one of the core differences between the two approaches is that the architecture for the MPGAN is a GNN.The representation of jet constituents as an unordered set of particles is more natural, intrinsically preserves permutation invariance, and most likely contributes to the differences between the ConVAE and MPGAN results.

Summary and Outlook
We presented a novel technique in the context of hadronic jets generation in simulated high-energy pp collisions, based on a machine learning approach that combines a convolutional variational autoencoder (ConVAE) with normalizing flows (NFs).From the values of EMD sum obtained, there was a clear improvement with respect to the previous work [13], not only by the application of hyperparameter optimization to the ConVAE, but also with the usage of the ConVAE+NF technique.The generation of gluon jets was performed using the ConVAE+NF model and compared with baseline ConVAE and NF models alone, as well as the state-of-the-art message-passing generative adversarial network (MPGAN).The ConVAE+NF approach is superior to the first two, but needs further improvements to be comparable to the latter.
There is, however, an improvement in inference time compared to MPGAN as the ConVAE+NF can generate a jet twice as fast.This can already offer advantages in physics applications where large amounts of simulated samples are needed but perfect accuracy is not a strict requirement.It is worth noting that the MPGAN architecture is based on a graph neural network, which has a much more natural representation for a particle-based jet dataset.In future work, a graph-based VAE+NF can be implemented, trained, and tested to check for improvements in the jet generation capabilities.

Figure 2 .
Figure 2. Comparison of input jets variable distributions (red) with randomly generated jets from the ConVAE model (blue).From left to right, top to bottom: mass, energy, p T , η, and ϕ distributions are displayed.In the subpanels, the ratio generated /input is shown.

Figure 3 .
Figure 3.Comparison of input jets variable distributions (red) with randomly generated jets from the ConVAE+NF model (blue).From left to right, top to bottom: mass, energy, p T , η, and ϕ distributions are displayed.In the subpanels, the ratio generated /input is shown.

Figure 4 .
Figure 4. Comparison of input jets variable distributions (red) with randomly generated jets from the NF only model (blue).From left to right, top to bottom: mass, energy, p T , η, and ϕ distributions are displayed.In the subpanels, the ratio generated /input is shown.

Table 1 .
Set of common architecture hyperparameters for ConVAE loss parameters optimization.

Table 2 .
Ranges of common architecture hyperparameters for ConVAE loss parameters optimization.The choice of activation function was also optimized, chosen from one of ReLU, GeLU, LeakyReLU, SELU or ELU.

Table 4 .
α β γ γ p T γ m N latent Filters Kernel N linear Layers Act.Func.Set of optimal hyperparameters for the ConVAE tuned for generation.

Table 5 .
α β γ γ p T γ m N latent Filters Kernel N linear Layers Act.Func.Set of optimal hyperparameters for ConVAE tuned for reconstruction.

Table 6 .
Set of optimal hyperparameters for the NF to be used on the ConVAE+NF approach.

Table 8 .
[8,49] metrics proposed in Refs.[8,49]to compare the performance of distinct ConVAE and NF approaches.In bold are the best values of the metrics when comparing the methods implemented in this work.The last row displays the performance of the MPGAN for the same gluon jet dataset.