Differentiable Earth Mover’s Distance for Data Compression at the High-Luminosity LHC

. The Earth mover’s distance (EMD) is a useful metric for image recognition and classification, but its usual implementations are not differentiable or too slow to be used as a loss function for training other algorithms via gradient descent. In this paper, we train a convolutional neural network (CNN) to learn a differentiable, fast approximation of the EMD and demonstrate that it can be used as a substitute for computing-intensive EMD implementations. We apply this differentiable approximation in the training of an autoencoder-inspired neural network (encoder NN) for data compression at the high-luminosity LHC at CERN. The goal of this encoder NN is to compress the data while preserving the information related to the distribution of energy deposits in particle detectors. We demonstrate that the performance of our encoder NN trained using the differentiable EMD CNN surpasses that of training with loss functions based on mean squared error.


Introduction
The Earth mover's distance (EMD) [1], also known as the 1st Wasserstein distance, has found use in various domains of machine learning (ML) problems.The EMD is a function of optimal transport between two distributions of points p i with weights w p i and q j with weights w q j defined as where f ij is the optimal flow between p i and q j that minimizes the overall cost of transportation and d ij is the ground distance between p i and q j .Roughly speaking, it measures the minimum work required to transport one distribution to another It has been used in image classification [2], visual tracking [3], and medicine [4].In Ref. [5], the authors develop a generalization to the metric known as the energy mover's distance.The energy mover's distance has been used to explore the geometry of particle jets [6], as a basis for defining event isotropy [7], to train variational autoencoders [8], for anomaly detection [9], and to evaluate particle physics simulations [10].
Two issues commonly arise from implementations of the EMD metric: (1) they are not differentiable and (2) they are often computationally intensive.Efficient, differentiable, or linearized approximations to the EMD, have been studied in the literature to make them more suitable for a variety of tasks.Ref. [11] adopts the EMD as a metric to determine image relevance.The authors use a structured fully connected layer with the energy mover's distance inserted as a layer (with its Jacobian computed using the implicit function theorem) to classify dense image representations.However, the implementation in Ref. [11] is slow and does not generalize to large datasets.Another study [12] develops a linearized approximation to the energy mover's distance.Ref. [13] proposes the use of graph neural networks (GNNs) to approximate the energy mover's distance for use in particle reconstruction.The Sinkhorn distance [14] has also been proposed as faster, differentiable (upper bound) approximation to EMD, and has been used in particle physics applications [15].However, in some applications, including the one we study, this approximation can be inaccurate.Recently, the authors of Ref. [16] developed a way to approximate the energy mover's distance in a differentiable way using a Lipschitz network [17] that enforces an exact upper bound on the Lipschitz constant of the model by constraining its weights.Ref. [18] introduced a new onedimensional encoding of collider events, which allows for a closed-form expression of the 1st Wasserstein distance dubbed the spectral energy mover's distance.Ref. [19] uses a "sliced" Wasserstein distance [20] as a loss function for training an attention-based GNN for pile-up mitigation.
In this paper, we propose the use of convolutional neural networks (CNNs) to learn a differentiable approximation of the EMD metric.We then use this metric to optimize an algorithm for front-end data compression at the high-luminosity LHC (HL-LHC).In particular, we train a convolutional autoencoder to directly minimize our differentiable approximation to the EMD via gradient-based optimizers.We demonstrate that this technique leads to an autoencoder that better preserves the relevant physical information stored in the shape of the energy deposits created by electrons and other high energy particles.The rest of this paper is organized as follows.Section 2 introduces the high-granularity calorimeter (HGCAL), and the energy concentrator trigger (ECON-T), which are the setting for data compression example.In Section 3, we introduce the EMD CNN and its optimization.Section 4 demonstrates the use of the EMD as a loss function for the ECON-T autoencoder, and in Section 5 we show an improved trigger performance.We summarize our results in Section 6.

HGCAL Autoencoder
The next generation of big data science experiments, including the HL-LHC at CERN, presents challenges because of the massive data rates generated during data taking.One solution is to immediately compress data before transmitting it off the detector in front-end electronics.A deep learning technique, based on training an autoencoder, has recently been proposed [21] to compress data from a new HGCAL [22] that is part of the upgraded CMS detector.The HGCAL includes over 6 million readout channels that must be compressed before being sent to the level-1 trigger (L1T), a real-time filter system implemented on field-programmable gate arrays (FPGAs) [23].In high energy physics experiments, the L1T [24,23,25,26] is the first stage of the system that decides which collision events are interesting enough to record, such as events potentially containing a Higgs boson.In order to provide input to the L1T, the HGCAL transmits a stream of trigger data at a frequency of 40 MHz.Application-specific integrated circuits (ASICs) are used to digitize and encode trigger data before transmission to back-end FPGAs for further processing.The goal of the algorithm is to compress the HGCAL data on an ASIC while preserving the information related to the distribution of energy deposits in silicon particle detector sensors.
The ASIC autoencoder (composed of an encoder and decoder neural network) is trained to reproduce the input data while producing an optimum latent representation.The encoder is a reconfigurable CNN trained with QKeras, a quantization-aware training library [27].While the architecture of this encoder will remain fixed, the weights will be reprogrammable.Ref. [21] details the hardware implementation of the encoder onto the ASIC.Groups of 2×2 or 3×3 HGCAL readout channels are summed together into trigger cells (TCs).Each HGCAL silicon sensor consists of 48 such TCs, with each TC transmitting 7 bits of data at 40 MHz.These 48 TCs are arranged in the hexagonal geometry as shown in Figure 2. The task of the ECON-T ASIC is to aggregate the 48 7-bit signals and compress them into a representation that allows for the best decision making by the L1 trigger.While the range of bits in the output varies per expected cell occupancy, this study is conducted at the maximum expected cell occupancy: 9 bits for each encoded value.We read a 7 bit input for each of the 48 trigger cells, and compress it to a latent representation of 16×3 bits.The decoder mirrors the encoder to produce the original 48 input image.

Baseline Loss Functions
As a proxy for the ultimate physics performance, the EMD between the HGCAL input and the decoded autoencoder output can be used as a metric to evaluate a given model.As mentioned in Section 1, this EMD metric as implemented in the Python Optimal Transport [28] library cannot directly be optimized using for gradient-based methods.Therefore, alternative loss functions have been used in the past for the optimization of the autoencoder.We compare the performance of the autoencoder when trained with three different loss functions: the EMD CNN approximation we introduce in section 3, and two additional baselines described here.The first baseline for comparison is a simple weighted mean-squared error (MSE) loss function where N 1×1 = 48 is the number of cells, x i is the set of charge fractions input to the encoder, and y i is the set of charge fractions output by the decoder.The square of the differences is weighted by the larger of the two charge fractions, with the aim of emphasizing the accuracy of trigger cells containing the most charge.The second, improved baseline is an ad-hoc loss function proposed in Ref. [21] known as the telescope MSE loss function.This loss function is designed to better capture differences in the shape of the spatial distribution of charge deposits.The telescope MSE loss function computes a weighted MSE for multiple groupings of trigger cells, at different "telescoping" granularities: individual cells (denoted 1×1), 2×2 sums of trigger cells, and 4×4 sums of trigger cells.
Illustration of the telescope loss.The red lines correspond to one example of a 2×2 trigger cell grouping in the L 2×2 loss term, while the blue lines correspond to the three 4×4 trigger cell groupings in the L 4×4 loss term for telescope loss.
The component of the loss from the individual cells L 1×1 is equal to L Weighted MSE from Eq. (2).For calculating the contribution from 2×2 sums of trigger cells, a weighted MSE is found from charge fractions of neighboring 2×2 groupings of trigger cells.To account for the fact that cells along the edge of the wafer will not be included in as many 2×2 sums, an additional factor is applied to each weight based on the number of times its constituent trigger cells get counted into sums.The component of the telescope MSE function from the 2×2 sums, L 2×2 , is defined as where of the output of the decoder, and w i is the weight corresponding to the ith sum.To account for the fact that cells along the edge of the wafer will not be included in as many 2×2 sums, an additional factor is applied to each weight based on the number of times its constituent trigger cells are included in sums.For example, a 2×2 sum can be formed from q 0 , q 1 , q 8 and q 9 (see Fig. 3 for the numbering scheme).Based on their locations on the edge of the detector, q 0 is only included in one sum, q 1 and q 8 are included in 2 sums, and q 9 is included in 4 separate sums.The sum made up of these four trigger cells has a weight of Finally, three 4×4 sums are produced.The trigger cells that get combined into the 4×4 sums are shown in blue in Fig. 2. Because trigger cells are included in one and only one 4×4 sum, no additional weight term is necessary here.Thus, the component of the telescope MSE from the 4×4 sums, L 4×4 , is given by The total value of the telescope MSE loss is then a weighted sum of the three components Weighting the individual loss terms in this fashion corresponds more closely to EMD.The chosen weights are inversely related to the number of trigger cells in each grouping.

Hexagonal Geometry Encoding
The HGCAL wafers have a unique hexagonal geometry because hexagonal wafers have an efficient spatial use of the silicon sensors [22].Conventional CNNs require regular/rectangular grids of pixels, possibly arranged in different channels, e.g., corresponding to red/blue/green in natural images.There is no ideal remapping of these hexagonal wafers into rectangular grids that preserves all the nearest neighbors of all cells.To account for this unique geometry, we use two remapping strategies.Encoding (A) shown in Fig. 3 (left) remaps the trigger cells into an 8×8 image with 1 channel, corresponding to the charge fraction in that cell.This encoding is used for the autoencoder as it results in a smaller input size, which allows for a more efficient hardware implementation.Encoding (B) shown in Fig. 3 (right) remaps the trigger cells into a 4×4 image with three separate channels.The three channels correspond to the charge fractions in three different areas of the hexagonal wafer.It was chosen for the EMD CNN training to maintain the best performance as the same hardware constraints do not apply.Once we develop a neural network loss with encoding (B), we remap to encoding (A) when we train the autoencoder.

Dataset
To train our neural networks we use simulated electron-positron events in HL-LHC conditions, corresponding to up to 200 additional proton-proton interactions within the same or nearby bunch crossings (pileup).Pileup can contribute additional energy depositions in the HGCAL.As the main goal of ECON-T is to identify and trigger on low-momentum electrons or positrons, we select events containing an electron (positron) with a maximum transverse momentum (p T ) of 35 GeV.The dataset in ROOT format before filtering and preprocessing is published on the Zenodo platform [29,30].q 32 q 33 q 34 q 35 q 36 q 37 q 38 q 39 q 40 q 41 q 42 q 43 q 44 q 45 q 46 q 47 q 4 q 5 q 6 q 7 q 12 q 13 q 14 q 15 q 20 q 21 q 22 q 23 q 28 q 29 q 30 q 31 q 0 q 1 q 2 q 3 q 8 q 9 q 10 q 11 q 16 q 18 q 18 q 19 q 24 q 25 q 26 q 27 q 0 q 1 q 2 q 3 q 4 q 5 q 6 q 7 q 8 q 9 q 10 q 11 q 12 q 13 q 14 q 15 q 16 q 17 q 18 q 19 q 20 q 21 q 22 q 23 q 24 q 25 q 26 q 27 q 28 q 29 q 30 q 31 q 32 q 33 q 34 q 35 0 0 0 0 q 36 q 37 q 38 q 39 0 0 0 0 q 40 q 41 q 42 q 43 0 0 0 0 q 44 q 45 q 46 q 47 0 0 0 0

Encoding (A)
Encoding (B) The detector response is modeled using a detailed description of the upgraded CMS detector based on Geant4 [31].The detector is described using a Cartesian coordinate system with the z axis oriented along the beam axis and x and y axes defining the transverse plane.The azimuthal angle ϕ is computed from the x axis.The polar angle θ is used to compute the pseudorapidity η = − log(tan(θ/2)).
The target EMD for training is computed using the ot.emd2 function from the Python Optimal Transport library [28].Generating a dataset for training the EMD CNN requires pairs of samples.The total number of sample pairs in the dataset is 340 000.The dataset is split into 70% for training and 30% for validation.We ensure that there is no overlap in the samples between the different datasets.

Architecture and Hyperparameter Optimization
We implement the neural network using the Keras [32] API of TensorFlow [33] with multiple convolutional and dense layers on the remapped 4×4×3 tensor input.We average the output over both input orders f (x 1 , x 2 ) = 1 2 (g(x 1 , x 2 ) + g(x 2 , x 1 )) to enforce the symmetry in the EMD metric.The model's hyperparameters are optimized with a grid search.The evaluation metric we use to compare the performance of different models is the standard deviation of the distribution of the relative difference between the predicted and true EMDs σ EMD rel.diff.evaluated on the validation dataset.All trainings were performed on a NVIDIA GeForce GTX-1080 Ti GPU.
In the grid search, we varied multiple hyperparameters related to the number and sizes of the layers.The number of 2D convolutional layers ranged from 1 to 4. For these layers we tested kernel sizes of 1, 3, and 5, and 32, 64, 128, and 256 filters.Each 2D convolutional layer is followed by a batch normalization layer [34] and rectified linear unit (ReLU) activation function [35,36].We tested three different loss functions: standard MSE, mean-squared logarithmic error (MSLE), and Huber [37].However, we found that MSE loss function performed the best.For each hyperparameter choice, the training was performed using the Adam optimizer [38] with a learning rate of 1×10 −3 for 120 epochs.We found that one fully-connected layer was sufficient, but increasing the number of 2D convolutional layers improved our performance.The best hyperparameters and corresponding σ EMD rel.diff.are detailed in Appendix B.
The optimized model, shown in Fig. 3.2 has four 2D convolutional layers, each with 32 filters, a kernel size of 5, and a stride of 1.There is one fully-connected layer with 256 neurons.
Figure 5 shows the performance of this model in predicting the EMD.On the left, the distribution of the relative difference between the predicted and true EMD validation dataset is shown.There is a relatively small bias (µ EMD rel.diff.= −1.3%)and the standard deviation σ EMD rel.diff.= 5.3% shows good resolution.With this optimized EMD CNN model, we can now define a differentiable EMD loss function for use in training the autoencoder.

EMD neural network
ReLU ReLU Figure 4. Architecture of our EMD CNN.We remap hexagonal HGCAL wafers to a more regular 4×4×3 tensor, and take in pairs of these remapped wafers as our input.
The neural network has multiple 2D convolutional layers, each followed by a batch normalization layer and ReLU activation.The network then feeds into a single fullyconnected layer, followed by a batch normalization layer and ReLU activation.We then average over both input orders to enforce the symmetry in the EMD metric.

Differentiable EMD Loss
The ASIC encoder for the HGCAL is a CNN trained in QKeras with quantization aware training.The autoencoder is trained with the aforementioned trigger cell encoding (A).The convolutional layer has 8 nodes which encode the input into a 16 by 9 bit latent space.This layer has a kernel size of 3 and a stride of 2. The decoder mirrors the encoder.
We quantify the performance of this autoencoder by computing the EMD after training We now train this autoencoder with our EMD CNN loss.We freeze the weights of an optimized EMD CNN, and use that as a loss function for back propagation.Comparing our results to weighted MSE and telescope MSE loss in Fig. 6 demonstrates an improved performance of the autoencoder with the EMD CNN loss.For instance for 1.8 ≤ η ≤ 2.0, the median EMD improves by 35% with respect to the weighted MSE and by 28% with respect to the telesecope MSE.

Physics Performance
While the task of the autoencoder is to reproduce the input wafers, our ultimate goal is to reconstruct electrons (and other particles) in the HGCAL in the L1T in real time.With access to back-end FPGA resources, we can decode trigger cells, and cluster them to reconstruct trigger observables.We evaluate the performance of the autoencoder by decoding and clustering single electron events for the three different tested loss functions.To evaluate the resolution, we first bin the clusters by the pseudorapidity η of the generated electron.In this bin, we calibrate the energy of the cluster, by a multiplicative factor: to account for the L1T reading only half the layers, and an additive factor: to correct for pileup.We then compute the effective RMS (RMS eff ) as the half-width of the shortest interval containing 68% of the electron events.We scale to the mean of the predicted p T to obtain the resolution.An ideal implementation of the ECON-T has zero RMS eff .The RMS eff is shown as a function of η in Fig. 7.By  using the EMD CNN instead of the other loss functions, we find we achieve a better physics resolution for all values of η.In particular, the effective resolution decreases by 13% at η = 2.0 when using the EMD CNN relative to the telescope MSE.

Summary
In this paper, we developed a convolutional neural network (CNN) to approximate the Earth mover's distance (EMD), a nondifferentiable and computationally intensive distance metric, for two-dimensional images.To demonstrate the utility of this approximation, we showed how it can be used to improve the performance of an ondetector data compression algorithm for the high-luminosity upgrade of the LHC (HL-LHC).Specifically, we used the EMD CNN as a custom loss function for an on-detector encoder in the CMS experiment.The optimized version of this neural network approximated the EMD with a resolution of 5%.This neural network is a fast-tocompute and natively differentiable approximation.We used the EMD CNN as a loss function for the energy concentrator trigger ASIC autoencoder.By training the autoencoder with the EMD CNN loss we improved the median EMD between input and output by 25% with respect to the previous best optimization procedure introduced in prior work [21].Finally, we evaluated the physics performance of the autoencoder in terms of the resolution of the transverse momentum of electrons by simulating the HL-LHC trigger path.The EMD CNN loss improves the effective resolution by 13% at a pseudorapidity η = 2.0.We define the performance as RMS eff of p L1T T /p gen T , where RMS eff is one standard deviation away from the mean.We divide the RMS eff by the mean of distribution in that bin.
Beyond this particular application, the technique of approximating optimal transport functions using neural networks is applicable to many scientific tasks where it is important to minimize nondifferentiable or computationally intensive metrics.For example, this technique could be used to train an autoencoder to detect jet anomalies [13], cluster jets [16], or mitigate the effects of pileup [19].In addition, a differentiable EMD approximation can be used to construct observables in searches for new physics [39] that can then be directly optimized for maximum sensitivity via backpropagation.Finally, image-or point-cloud-based generative models often use EMD to evaluate the fidelity of generated data [40,41,42,43], and using a differentiable approximation would enable direct minimization during training [44].By directly optimizing EMD, better and more computationally efficient physics simulators can be trained, potentially enabling end-to-end detector optimization via differentiable programming [45].

Figure 1 .
Figure 1.Conceptual overview of the HGCAL trigger path for the autoencoder.We read a 7 bit input for each of the 48 trigger cells, and compress it to a latent representation of 16×3 bits.The decoder mirrors the encoder to produce the original 48 input image.
25/4, with the extra factor of 1/4 included to keep the scale the 2×2 component of the loss function consistent with the individual cells because each charge fraction can be included in up to 4 sums.Table A1 in Appendix A shows the N 2×2 = 36 2×2 sums and corresponding weights.

Figure 5 .
Figure 5.Performance of the EMD CNN with optimized hyperparameters.Distribution of the relative difference between the EMD CNN prediction and the true EMD (left) Predicted EMD as a function of true EMD (right).The optimized hyperparameters correspond to 32 convolutional filters, kernel size of 5, 4 2D convolutional layers, 1 fully-connected layer with 256 neurons, and MSE loss function.

Figure 7 .
Figure 7. Performance relative to weighted MSE as a function of η.We define the performance as RMS eff of p L1TT /p gen T , where RMS eff is one standard deviation away from the mean.We divide the RMS eff by the mean of distribution in that bin.
Figure 6.Median EMD for decoded HGCAL images versus the charge occupancy in the wafer (left) and wafer position (right).The vertical lines denote 68% EMD intervals.Ideal autoencoder trainings have zero EMD between the input and output.Occupancy is defined as the number of TCs with signals exceeding one minimum ionizing particle divided by cosh η.

Table A1 .
Arrays of 2×2 trigger cells summed for L 2×2 , and corresponding weight factors to account for the number of times constituent cells show up in 2×2 sums.

Table B1 .
The eight best hyperparameter configurations found when optimizing the EMD CNN.