Point Cloud Transformers applied to Collider Physics

Methods for processing point cloud information have seen great success in collider physics applications. One recent breakthrough in machine learning is the use of Transformer networks to learn semantic relationships between sequences in language processing. In this work, we apply a modified Transformer network called Point Cloud Transformer as a method to bring the advantages of the Transformer architecture to an unordered set of particles resulting from collision events. To compare the performance with other strategies, we study jet-tagging applications for highly-boosted particles.


Introduction
The interactions between elementary particles are described by the Standard Model (SM) of particle physics. Particle colliders are used to study these interactions by comparing experimental signatures to SM predictions. At every collision, hundreds of particles can be created, detected, and reconstructed by particle detectors. Extracting relevant physics quantities from this space is a challenging task that is often accomplished through physics-motivated summary statistics that reduce the data dimensionality to manageable quantities. A recent approach is to interpret the set of reconstructed particles as points in a point cloud. Point clouds represent a set of unordered objects, described in a well-defined space, and are often used in applications such as self-driving vehicles, robotics, and augmented reality. With this approach, information from each bunch crossing in a particle collider is interpreted as a point cloud, and the goal is to use this high-dimensional set of reconstructed particles to extract relevant information. However, extracting information from point clouds can also be a challenging task. One novel approach is to use the Transformer architecture [1] to learn the semantic relationship between objects. Transformers have been very successful in recent years in natural language processing (NLP), often outperforming previous well-established methods. The advantage of this architecture is its ability to learn semantic affinities between objects without losing information over long sequences. Transformers are also easily parallelizable, a considerable computational advantage over sequential architectures like gated recurrent [2] and long short-term memory [3] neural networks. The Transformer network has already been applied outside NLP, with examples in image recognition [4,5].
However, the original Transformer architecture is not readily applicable to point clouds. Since point clouds are intrinsically unordered, the Transformer structure has to be modified to define a self-attention operation that is invariant to permutations of the inputs. A recent approach introduced in [6] addresses these issues through the development of the Point Cloud Transformer (PCT). In this approach, the input points of the point cloud are first passed through a feature extractor to create a high dimensional representation of the particle features. The transformed data is then used as an input to a self-attention module. This module introduces attention coefficients per particle that learn the relative pairwise importance of each point in the set.
In this work, we first describe the key features of the PCT architecture, and then evaluate it in the context of a high energy physics task, in the form of jet-tagging. Due to the nature of our example task, the segmentation module of the PCT is not required. Results are compared with other benchmark implementations using three different public datasets.

Related works
Neural network architectures that treat collision events as point clouds have recently grown in number given their state-of-the-art performance when applied to different collider physics problems. A few examples of such applications are jet-tagging [7,8], secondary vertex finding [9], event and object reconstruction [10,11,12,13,14], and jet parton assignment [15]. A comprehensive review of the different methods is described in [16].
Two particular algorithms for jet-tagging are relevant for the following discussion of the PCT implementation: the ParticleNet [17] and ABCNet [18] architectures. The former introduces the EdgeConv operation for jet tagging, initially developed in [3]. This operation uses a k-nearest neighbors approach to create local patches inside a point cloud. The local information is then used to create high level features for each point that retain the information of the local neighborhood. ABCNet, on the other hand, uses the local information to define an attention mechanism that encodes the neighborhood importance for each particle. This idea was first introduced in [19] and applied in [20]. A similar attention mechanism is defined for PCT, where a self-attention layer is used to provide the relationship importance between all particles in the set.
Jet-tagging is a common task used to benchmark different algorithms applied to collider physics. While a number of algorithms have been proposed in recent years, special attention will be given to algorithms with results in public datasets. In [17], results are presented for both quark gluon and top tagging datasets, while [21] introduces a multiclassification sample containing five different jet categories. The description of each dataset is discussed in Sec. 5.

PCT network
The Transformer implementation applied to point clouds requires two main building blocks: the feature extractor and the self-attention (SA) layers. The feature extractor is used to map the input point cloud $F_{in} \in \mathbb{R}^{N \times d_{in}}$ to a higher dimensional representation $F_e \in \mathbb{R}^{N \times d_{out}}$. This step is used to achieve a higher level of abstraction for each point present in the point cloud. In this work, two different strategies are compared: an architecture consisting of stacked one-dimensional convolutional (Conv1D) layers, and a second option based on EdgeConv blocks. The EdgeConv block consists of an EdgeConv operation [3] followed by two two-dimensional convolutional (Conv2D) layers and an average pooling operation across all neighbors. To ensure that the permutation invariance between particles is maintained, all convolutional layers are implemented with stride and kernel size of 1 and, unless otherwise stated, are followed by a batch normalization operation and a ReLU activation function. All convolutional layers are implemented such that the particle information is preserved, i.e., for N particles with F features, the convolutional operations only modify F while N is unchanged. The EdgeConv operation uses a k-nearest neighbors approach to define a vicinity for each point in the point cloud. This enhances the ability of the network to extract information from a local neighborhood around each point.
The first strategy is referred to as simple PCT (SPCT), while the second is referred to simply as PCT.
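As a minimal NumPy sketch (not the TensorFlow implementation used in this work), a convolution with kernel size and stride of 1 can be viewed as one shared linear map applied to each particle independently, so that reordering the particles only reorders the output rows. The function name `pointwise_conv`, the array sizes, and the random inputs are illustrative assumptions, and batch normalization is omitted.

```python
import numpy as np

def pointwise_conv(points, weights, bias):
    """Kernel-size-1 convolution: the same linear map + ReLU for each particle.

    points:  (N, d_in) array, one row per particle
    weights: (d_in, d_out) array shared by all particles
    bias:    (d_out,) array
    """
    return np.maximum(points @ weights + bias, 0.0)

rng = np.random.default_rng(0)
n_particles, d_in, d_out = 5, 4, 8  # illustrative sizes only
points = rng.normal(size=(n_particles, d_in))
weights = rng.normal(size=(d_in, d_out))
bias = rng.normal(size=d_out)

features = pointwise_conv(points, weights, bias)

# Shuffle the particles and apply the same layer again.
perm = rng.permutation(n_particles)
features_perm = pointwise_conv(points[perm], weights, bias)

# N is unchanged, only the feature dimension changes, and permuting the
# inputs permutes the output rows in the same way.
equivariant = np.allclose(features[perm], features_perm)
```

The per-particle outputs are therefore independent of the particle ordering, which is the property the stride and kernel size choice is meant to guarantee.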
The second main building block is the offset-attention, implemented as a self-attention (SA) layer. The output of the feature extractor $F_e$ is used as the input of the first SA layer. The goal of the SA layer is to determine the relationship between all particles of the point cloud through an offset-attention mechanism. This approach differs from the one taken in ABCNet, where self-attention and neighboring attention coefficients are defined for a neighborhood of each particle.
Following the original Transformer work [1], three different matrices are defined, all built from linear transformations of the original inputs. These matrices are called query (Q), key (K), and value (V). Attention weights are created by matching Q and K through matrix multiplication. These attention weights, after normalization, represent the weighted importance between each pair of particles. The self-attention is then the result of the weighted elements of V, defined as the matrix multiplication between the attention weights and the value matrix. The linear transformations are accomplished through Conv1D layers such that:

$$Q = F_e W_q, \quad K = F_e W_k, \quad V = F_e W_v, \qquad Q, K \in \mathbb{R}^{N \times d_a}, \quad V \in \mathbb{R}^{N \times d_{out}}. \tag{1}$$

The dimension $d_a$ of the Q and K matrices is independent of the dimension $d_{out}$.
In this work, similarly to the original PCT implementation, $d_a$ is fixed to $d_{out}/4$. While other options can be used, we have not observed further improvements for different choices of $d_a$. To keep these operations linear, no batch normalization or activation function is used. The matrices ($W_q$, $W_k$, $W_v$) contain the trainable linear coefficients introduced by the convolutional operation. The attention weights (A) are then calculated by first multiplying the query matrix with the transpose of the key matrix, followed by a softmax operation:

$$A = \mathcal{N}\left(\mathrm{Softmax}\left(Q K^{T}\right)\right). \tag{2}$$

The softmax operation is applied to each row of A to normalize the coefficients for all points, while the columns are normalized using the L1 norm (represented by $\mathcal{N}$ in the equation) after the softmax operation. This decision follows the approach introduced in the PCT model and differs from the original Transformer implementation, which scales by $1/\sqrt{d}$, with d the embedding dimension. The attention weights are then multiplied by the value matrix, resulting in the self-attention $F_{sa}$ with

$$F_{sa} = A V. \tag{3}$$

While the original Transformer architecture uses $F_{sa}$ to derive an absolute attention per object, it was observed by the PCT authors, and is revisited in Sec. 6, that the usage of an offset-attention might result in superior classification performance. The offset-attention defines the self-attention coefficients as per-particle modifications of the inputs, rather than an absolute quantity per particle. To calculate the offset-attention, the difference between $F_e$ and $F_{sa}$ is first calculated as the initial offset. This initial offset is then passed through a Conv1D layer with the same output dimension $d_{out}$. The result of this layer is the offset that is added to the initial inputs $F_e$.

Different levels of abstraction can be achieved by stacking multiple SA layers, using the output of each SA layer as the input for the next.
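The SA layer described above can be sketched in NumPy for a single jet as follows. This is an illustration under stated assumptions rather than the implementation used in this work: the function names, the random weights, and the small constant `1e-9` guarding the column normalization are hypothetical, batch normalization is omitted, and a ReLU is assumed after the Conv1D layer acting on the offset.

```python
import numpy as np

def softmax_rows(scores):
    """Softmax over each row, shifted for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def offset_attention(f_e, w_q, w_k, w_v, w_o):
    """One SA layer with offset-attention for a single jet.

    f_e: (N, d_out) features; w_q, w_k: (d_out, d_a); w_v, w_o: (d_out, d_out).
    Returns the layer output (N, d_out) and the attention weights (N, N).
    """
    # Query, key and value: purely linear maps, no activation.
    q, k, v = f_e @ w_q, f_e @ w_k, f_e @ w_v
    # Softmax over each row, then L1 normalization of each column.
    attention = softmax_rows(q @ k.T)
    attention = attention / (1e-9 + attention.sum(axis=0, keepdims=True))
    # Self-attention features.
    f_sa = attention @ v
    # Kernel-size-1 convolution on the offset (ReLU is an assumption here).
    offset = np.maximum((f_e - f_sa) @ w_o, 0.0)
    # The offset is added back to the layer inputs.
    return f_e + offset, attention

rng = np.random.default_rng(1)
n_particles, d_out = 6, 8   # illustrative sizes only
d_a = d_out // 4            # d_a = d_out / 4 as in the text
f_e = rng.normal(size=(n_particles, d_out))
w_q, w_k = 0.1 * rng.normal(size=(2, d_out, d_a))
w_v, w_o = 0.1 * rng.normal(size=(2, d_out, d_out))

output, attention = offset_attention(f_e, w_q, w_k, w_v, w_o)

# Reordering the particles reorders the output rows in the same way.
perm = rng.permutation(n_particles)
output_perm, _ = offset_attention(f_e[perm], w_q, w_k, w_v, w_o)
```

Stacking several such layers amounts to feeding `output` back in as `f_e` with a new set of weights.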
To complete the general architecture, the outputs of the SA layers are combined through a simple concatenation over the feature dimension, followed by a mean aggregation that yields the mean of each feature across all the particles. This differs from the original PCT implementation, which instead uses a maximum pooling operation. The output of this operation is passed through fully connected layers before reaching the output layer, normalized through a softmax operation. Similarly to the convolutional layers, all fully connected layers besides the output layer are followed by a batch normalization and a ReLU activation function.
The general PCT network and the main building blocks are shown in Fig. 1. The training details are explained in Sec. 4.

Training details
The PCT implementation is done using TensorFlow v1.14 [22]. An Nvidia GTX 1080 Ti graphics card is used for the training and evaluation steps. For all architectures, the Adam optimiser [23] is used with a learning rate starting from 0.001 and decreasing by a factor of 2 every 20 epochs, down to a minimum of 1e-6. The training is performed with a mini-batch size of 64 and a maximum of 200 epochs. If no improvement is observed in the evaluation set for 15 consecutive epochs, the training is stopped. The epoch with the lowest classification loss on the test set is stored for further evaluation.
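The schedule and stopping rule above can be written down as a short plain-Python sketch. It assumes zero-indexed epochs and a staircase decay (the rate changes only at multiples of 20 epochs), neither of which is specified in the text; the function names are hypothetical and no TensorFlow API is implied.

```python
def learning_rate(epoch, initial=1e-3, factor=2.0, step=20, minimum=1e-6):
    """Step decay: divide the rate by `factor` every `step` epochs, with a floor."""
    return max(initial / factor ** (epoch // step), minimum)

def early_stopping(losses, patience=15, max_epochs=200):
    """Return (best_epoch, last_epoch) for a sequence of per-epoch losses.

    Training stops once `patience` consecutive epochs bring no improvement
    over the best loss seen so far, or after `max_epochs` epochs.
    """
    best_loss, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(losses[:max_epochs]):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, min(len(losses), max_epochs) - 1

# Toy loss curve: improves twice, then plateaus.
toy_losses = [1.0, 0.5] + [0.6] * 30
best_epoch, last_epoch = early_stopping(toy_losses)
```

Under the staircase assumption, the rate after 200 epochs is still above the quoted minimum, so the floor only matters for longer trainings or a different decay convention.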

Jet classification
Different performance metrics are compared for (S)PCT applied to a jet classification task on different public datasets. Jets are collimated sprays of particles resulting from the hadronization and fragmentation of energetic partons. Jets can show distinguishing radiation patterns depending on the elementary particle that has initiated the jet. Traditional methods use this information to define physics-motivated observables [24] that can distinguish different jet categories. The PCT architecture uses two EdgeConv blocks, each defining the k-nearest neighbors of each point with k = 20. The initial distances are calculated in the pseudorapidity-azimuth ($\eta$-$\phi$) space in the form $\Delta R = \sqrt{\Delta\eta^2 + \Delta\phi^2}$. The distances used for the second EdgeConv block are calculated using the full-feature space produced in the output of the last EdgeConv block.
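The neighbor search of the first EdgeConv block can be illustrated with the following NumPy sketch. Two choices are assumptions not stated in the text: the azimuthal difference is wrapped into $[-\pi, \pi)$, and a particle is not counted as its own neighbor. The function names and toy coordinates are hypothetical.

```python
import numpy as np

def delta_r_matrix(eta, phi):
    """Pairwise distances Delta R = sqrt(d_eta^2 + d_phi^2) between particles."""
    d_eta = eta[:, None] - eta[None, :]
    d_phi = phi[:, None] - phi[None, :]
    # Assumption: wrap the azimuthal difference into [-pi, pi).
    d_phi = (d_phi + np.pi) % (2.0 * np.pi) - np.pi
    return np.sqrt(d_eta**2 + d_phi**2)

def k_nearest_neighbors(distances, k):
    """Indices of the k closest particles for each particle, excluding itself."""
    d = distances.copy()
    np.fill_diagonal(d, np.inf)  # assumption: self is not a neighbor
    return np.argsort(d, axis=1)[:, :k]

# Toy jet with three particles; the first two sit on either side of phi = pi.
eta = np.array([0.0, 0.1, 1.0])
phi = np.array([3.1, -3.1, 0.0])

dr = delta_r_matrix(eta, phi)
neighbors = k_nearest_neighbors(dr, k=1)
```

With k = 20 as in the text, `neighbors` would have shape (N, 20); for the second block the same search would run on distances computed in the learned feature space instead of `dr`.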
Besides the feature extractor, PCT uses three SA layers while SPCT uses two. The outputs of all SA layers are concatenated in the feature dimension for both PCT and SPCT. The output of the last EdgeConv block is also added during concatenation with a skip connection. The detailed architectures used during training for PCT and SPCT are shown in Fig. 2.
PCT and SPCT receive as inputs the particles found inside the jets. The input features vary between applications, depending on the available content for each public dataset. For all comparisons, up to 100 particles per jet are used. If more particles are found inside a jet, the list of particles is truncated; otherwise it is zero-padded up to 100. The choice of 100 particles covers the majority of the jets stored in all datasets without truncation (more than 99% of the cases for all datasets). To handle the zero-padded particles, the original PCT implementation has been modified such that these particles are masked during the k-nearest neighbors calculation, and their entries in the matrix A defined in Eq. 2 are replaced by large negative values before the softmax operation, so that the softmax yields zeros for zero-padded entries.
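A minimal NumPy sketch of the padding and masking logic is given below. It assumes that truncation keeps the first 100 particles in the stored order and that the mask is applied to the columns of the score matrix before the row-wise softmax; the function names and the value `-1e9` are illustrative choices, not taken from the implementation.

```python
import numpy as np

MAX_PARTICLES = 100

def pad_or_truncate(particles, max_particles=MAX_PARTICLES):
    """Return a (max_particles, n_features) array and a boolean mask of real particles."""
    n_real = min(len(particles), max_particles)
    padded = np.zeros((max_particles, particles.shape[1]), dtype=particles.dtype)
    padded[:n_real] = particles[:n_real]  # assumption: keep the stored order
    mask = np.arange(max_particles) < n_real
    return padded, mask

def masked_softmax(scores, mask, large_negative=-1e9):
    """Row-wise softmax in which zero-padded columns receive zero weight."""
    scores = np.where(mask[None, :], scores, large_negative)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
jet = rng.normal(size=(3, 4))  # toy jet: 3 particles, 4 features
padded, mask = pad_or_truncate(jet)

scores = rng.normal(size=(MAX_PARTICLES, MAX_PARTICLES))
weights = masked_softmax(scores, mask)
```

Because the exponential of the large negative value underflows to zero, the padded entries do not contribute to the attention-weighted sums.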

HLS4ML LHC Jet dataset
For this study, samples containing simulated jets originating from W bosons, Z bosons, top quarks, light quarks, and gluons produced in √s = 13 TeV proton-proton collisions are used. The samples are available at [25]. This dataset is created and configured using a parametric description of a generic LHC detector, described in [26,27]. The jets are clustered with the anti-k T algorithm [28] with radius parameter R = 0.8, while also requiring that the jet's transverse momentum is around 1 TeV, which ensures that most of the decay products of the generated particles are found inside a single jet. The training and testing sets contain 567k and 63k jets respectively. The performance comparison is reported using the official evaluation set, containing 240k jets. For each jet, the momenta of all particle constituents are available. For each particle, a set of 16 kinematic features is used, matching the ones used in [21], to facilitate the comparison. All input features available in the dataset are used without modification, with the exception of the momenta and energy of the particles, to which a logarithmic function is applied to limit the feature range.
The area under the curve (AUC) for each class is calculated with the one-vs-rest strategy. The AUC results are shown in Table 1, while the true positive rate (TPR) at fixed false positive rates (FPR) of 10% and 1% is shown in Table 2. PCT has the highest overall AUC compared to other algorithms, followed closely by SPCT and JEDI-net. For fixed FPR values, PCT also performs well, achieving the best result for almost all categories and FPR values.

Table 1. Area under the curve and accuracy for each jet category reported on the HLS4ML LHC Jet dataset. Results for all algorithms are taken as the average of 10 trainings with random network initialization. If the uncertainty is not quoted, the variation is negligible compared to the expected value. Bold results represent the algorithm with the highest performance. All results besides (S)PCT are taken from [21].
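For reference, the one-vs-rest AUC can be sketched in NumPy as below. The pairwise formulation (the probability that a jet of the chosen class scores higher than a jet of any other class, with ties counted as one half) is one standard way to compute the AUC and is used here for clarity; the function names and toy scores are hypothetical and are not results from this work.

```python
import numpy as np

def binary_auc(is_signal, scores):
    """AUC as the fraction of (signal, background) pairs ranked correctly."""
    is_signal = np.asarray(is_signal, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    diff = scores[is_signal][:, None] - scores[~is_signal][None, :]
    # Ties contribute one half.
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def one_vs_rest_auc(true_class, probabilities):
    """One AUC per class: that class against all others combined."""
    true_class = np.asarray(true_class)
    return [
        binary_auc(true_class == c, probabilities[:, c])
        for c in range(probabilities.shape[1])
    ]

# Toy example with three classes and six jets (illustrative numbers only).
true_class = np.array([0, 1, 2, 0, 1, 2])
probabilities = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.2, 0.7, 0.1],
    [0.3, 0.6, 0.1],
    [0.1, 0.1, 0.8],
])
aucs = one_vs_rest_auc(true_class, probabilities)
```

The pairwise form scales with the product of the signal and background sample sizes, so a rank-based or library implementation would be preferable for the full evaluation set.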

Top tagging dataset
The top tagging dataset consists of jets containing the hadronic decay products of top quarks together with jets generated through QCD dijet events. The dataset is available at [29]. The events are generated with Pythia8 [30] with detector simulation done through Delphes [31]. The jets are clustered with the anti-k T algorithm with radius parameter R = 0.8. Only jets with transverse momentum p T ∈ [550, 650] GeV and rapidity |y| < 2 are kept. The official training, testing, and evaluation splits are used, containing 1.2M/400k/400k events respectively. For each particle, a set of 7 input features is used, based only on the momenta of each particle clustered inside the jet. The input features per particle are the same ones used in [17] to facilitate the comparison between algorithms. Considering jets containing top quarks as the signal, the AUC and background rejection power, defined as the inverse of the background efficiency for a fixed signal efficiency, are listed in Tab. 3, with a reduced number of algorithms as reported in [17]. A more complete, although slightly outdated, list is available at [32].
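The background rejection power defined above can be computed as in the following NumPy sketch. The threshold is taken as a quantile of the signal scores, which is one possible convention; the function name and the toy scores are hypothetical, and the 30% working point is used only as an example.

```python
import numpy as np

def background_rejection(is_signal, scores, signal_efficiency):
    """Inverse background efficiency at a fixed signal efficiency."""
    is_signal = np.asarray(is_signal, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    # Score threshold above which the requested fraction of signal is kept.
    threshold = np.quantile(scores[is_signal], 1.0 - signal_efficiency)
    background_efficiency = np.mean(scores[~is_signal] >= threshold)
    if background_efficiency == 0.0:
        return float("inf")  # no background passes the threshold
    return float(1.0 / background_efficiency)

# Toy scores: signal spread uniformly in [0, 1], four background jets.
signal_scores = np.linspace(0.0, 1.0, 101)
background_scores = np.array([0.05, 0.15, 0.25, 0.95])
scores = np.concatenate([signal_scores, background_scores])
is_signal = np.concatenate([
    np.ones(len(signal_scores), dtype=bool),
    np.zeros(len(background_scores), dtype=bool),
])

rejection = background_rejection(is_signal, scores, signal_efficiency=0.3)
```

In this toy case one of the four background jets passes the threshold, so the rejection is the inverse of a 25% background efficiency.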

Quark and gluon dataset
The dataset used for these studies is available from [33], providing for each particle the same information present in the top tagging dataset (the momenta of the particles) with additional information that identifies the particle type as either electron, muon, charged hadron, neutral hadron, or photon. It consists of stable particles clustered into jets, excluding neutrinos, using the anti-k T algorithm with R = 0.4. The quark-initiated sample (treated as signal) is generated using Z(νν) + (u, d, s) processes, while the gluon-initiated sample (treated as background) is generated using Z(νν) + g processes. Both samples are generated using Pythia8 [30] without detector effects. Jets are required to have transverse momentum p T ∈ [500, 550] GeV and rapidity |y| < 1.7 for the reconstruction. For the training, testing, and evaluation, the recommended splitting is used with 1.6M/200k/200k events respectively. Each particle contains the four-momentum and the expected particle type (electron, muon, photon, or charged/neutral hadron). For each particle, a set of 13 kinematic features is used. These features are chosen to match the ones used in [17,18]. The AUC and background rejection power are listed in Tab. 4.

Ablation study
To understand the benefits of the self-attention and the architecture choices for the development of (S)PCT applied to jet-tagging, we compare the results attained by (S)PCT shown in the previous sections with different implementation choices. We start by assessing the benefit the SA layers bring to the classification by training the (S)PCT model without the SA layers, effectively taking the output of the feature extractor and passing it through the mean aggregation layer before the fully connected layers. In the description of the PCT architecture, an offset-attention was used instead of the self-attention introduced in the original Transformer architecture. The comparison of both options has also been performed, where F sa , introduced in Eq. 3, is used as the output of the SA layer. These comparisons are summarized in Tables 5 and 6 for the HLS4ML LHC Jet dataset, Table 7 for the top tagging dataset, and Table 8 for the quark and gluon dataset. The values reported for the different scenarios are the result of a single training (hence no uncertainties are provided), but it has been verified that the results shown are stable with additional trainings. The implementation of PCT without the attention modules results in an architecture that is similar to the ParticleNet-Lite implementation from [17], also resulting in similar performance for both top tagging and quark gluon datasets. The differences for SPCT with and without the attention modules are much bigger compared to PCT. Without the attention layers, SPCT does not have access to the information shared between particles, resulting in worse performance. Since for PCT the neighboring information is already used in the feature extractor, this effect is mitigated. Compared to the original self-attention implementation, the offset-attention shows better performance, which is more noticeable in the quark gluon dataset.
The quark gluon dataset uses the same input features per particle as the top tagging dataset, with additional particle identifier (PID) information, encoded as binary entries that categorize the particles into five categories: muons, electrons, photons, charged hadrons, and neutral hadrons. Given that results without attention for the quark gluon dataset also seem to yield better performance compared to the standard attention, this might indicate that binary input features are not correctly handled by the standard attention implementation. In the offset-attention, the module calculates an offset of the inputs per particle, relying less on the absolute value of the PID flag.

Computational complexity
Besides the algorithm performance, the computational cost is also an important figure of merit. To compare the amount of computational resources required to evaluate each model, the number of trainable weights and the number of floating point operations (FLOPs) are computed. The comparison of these quantities for different algorithms is shown in Tab. 9.

Algorithm              Weights   FLOPs
ResNeXt-50 [17]        1.46M     -
P-CNN [17]             348k      -
PFN [33]               82k       -
ParticleNet-Lite [17]  26k       -
ParticleNet [17]       366k      -
ABCNet [18]            230k      -
DNN [21]               14.7k     27k
GRU [21]               15.6k     46k
CNN [21]               205.5k    400k
JEDI-net [21]          33

While PCT shows a better overall AUC compared to SPCT, the improvement in performance from the usage of EdgeConv blocks comes at a cost in computational complexity. SPCT, on the other hand, provides a good balance between performance and computational cost, requiring more than 100 times fewer FLOPs and almost 30 times fewer trainable weights than PCT. Although JEDI-net shows similar performance to SPCT, the implementation takes 150 particles as inputs, and the model based on O sums has a large FLOP count of 458M. This number, however, can be decreased with a sparse implementation, since many of these operations are ×0 and ×1 products.

Visualization
The SA module defines the relative importance between all points in the set through the attention weights. We can use this information to identify the regions inside a jet that have high importance for a chosen particle. To visualize the particle importance, the HLS4ML LHC jet dataset is used to create a pixelated image of a jet in the transverse plane. The average jet image of 100k examples in the evaluation set is used. For each image, a simple preprocessing strategy is applied to align the different images. First, the whole jet is translated such that the particle with the highest transverse momentum in the jet is centered at (0,0). This particle is also used as the reference particle from which attention weights are shown. Next, the full jet image is rotated such that the second most energetic particle is aligned with the positive y-axis. Lastly, the image is flipped in the x-coordinate if the third most energetic particle has a negative x-coordinate; otherwise the image is left as is. These transformations are also used in other jet image studies such as [34,18]. The pixel intensity for each jet image is taken from the attention weights after the softmax operation is applied, expressing the particle importance with respect to the most energetic particle in the event. A comparison of the extracted images for each SA layer and for each jet category is shown in Fig. 3.

Figure 3. Average jet image for each jet category (columns) and for each self-attention layer (rows). The pixel intensities represent the overall particle importance compared to the most energetic particle in the jet.
The different SA layers are able to extract different information for each jet. In particular, the jet substructure is exploited, resulting in an increased relevance to harder subjets in the case of Z boson, W boson, and top quark initiated jets. On the other hand, light quark and gluon initiated jets have a more homogeneous radiation pattern, resulting also in a more homogeneous picture.
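The alignment steps used to build the jet images (translation, rotation, and flip) can be sketched in NumPy as below. This illustration assumes that the particles are ranked by transverse momentum, that the image axes are the pseudorapidity and azimuth offsets, and that the image is a 32 × 32 grid covering ±0.8 around the leading particle; these choices, the function names, and the toy inputs are hypothetical.

```python
import numpy as np

def align_jet(eta, phi, pt):
    """Translate, rotate and flip a jet; returns aligned (x, y) coordinates."""
    order = np.argsort(pt)[::-1]  # hardest particle first
    # 1) Translate: leading particle at the origin.
    x = eta - eta[order[0]]
    y = phi - phi[order[0]]
    # 2) Rotate: second particle onto the positive y-axis.
    if len(order) > 1:
        x1, y1 = x[order[1]], y[order[1]]
        r = np.hypot(x1, y1)
        if r > 0:
            c, s = y1 / r, x1 / r
            x, y = c * x - s * y, s * x + c * y
    # 3) Flip: third particle on the non-negative x side.
    if len(order) > 2 and x[order[2]] < 0:
        x = -x
    return x, y

def jet_image(x, y, weights, n_pixels=32, half_width=0.8):
    """Pixelated image whose intensities are the summed per-particle weights."""
    image, _, _ = np.histogram2d(
        x, y, bins=n_pixels,
        range=[[-half_width, half_width], [-half_width, half_width]],
        weights=weights,
    )
    return image

# Toy jet with three particles (illustrative numbers only).
eta = np.array([0.1, 0.3, -0.2])
phi = np.array([0.2, 0.2, 0.5])
pt = np.array([100.0, 50.0, 20.0])
# Stand-in for the attention weights relative to the leading particle.
attention_row = np.array([0.5, 0.3, 0.2])

x, y = align_jet(eta, phi, pt)
image = jet_image(x, y, attention_row)
```

Averaging such images over many jets of one category and one SA layer gives one panel of the kind shown in Fig. 3.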

Summary of experimental results
In this section, a summary of the comparisons made using different datasets is given. In the HLS4ML dataset, both PCT and SPCT show state-of-the-art performance, resulting in superior AUC compared to other approaches. Although the AUC provides a general idea of the performance, the TPR at fixed FPR thresholds represents a more realistic application of the algorithm in high energy physics, where one is interested in maximizing the TPR while keeping the background level under control. In these comparisons, PCT and SPCT again show superior performance for the majority of the jet categories. In the top tagging and quark gluon datasets, a similar picture is shown, with PCT and SPCT again performing well compared to other algorithms. In the top tagging dataset, the results achieved with PCT and ParticleNet are similar and compatible within training uncertainties. This observation suggests that the information encoded through EdgeConv layers might already provide enough model abstraction to perform the classification. On the other hand, in the quark gluon dataset, the performance of PCT compared to ParticleNet shows a 20% improvement in background rejection power at 30% signal efficiency. This performance is similar to the one reported by the ABCNet architecture, which implements attention through graph attention pooling layers [35], evidencing some of the benefits of attention mechanisms.

Conclusion
In this work, a new method based on the Transformer architecture was applied to a high energy physics application. The point cloud transformer (PCT) modifies the usual Transformer architecture to be applied to a set of unordered points present in a point cloud. This method has the advantage of extracting semantic affinities between the points through a self-attention mechanism. We evaluate the performance of this architecture applied to several jet-tagging datasets by testing two different implementations: one that exploits the neighborhood information through EdgeConv operations, and a simpler form, called simple PCT (SPCT), that connects all points through convolutional layers. Both approaches have shown state-of-the-art performance compared to other publicly available results. While the classification performance of SPCT is slightly lower compared to the standard PCT, the number of floating point operations required to evaluate the model decreases by almost a factor of 20. This reduced computational complexity can be exploited in environments with limited computing resources or applications that require fast inference responses.
A further advantage of (S)PCT is that the self-attention coefficients can be visualized to understand which points have greater importance for the classification task. Traditional methods often define physics-motivated observables to distinguish the different types of jets. PCT, on the other hand, exploits subjet information by learning affinities on a particle-by-particle basis, resulting in images with distinct features for jets of different decay modes.