SYMBA: Symbolic Computation of Squared Amplitudes in High Energy Physics with Machine Learning

The cross section is one of the most important physical quantities in high-energy physics and among the most time-consuming to compute. While machine learning has proven highly successful in numerical calculations in high-energy physics, analytical calculations using machine learning are still in their infancy. In this work, we use a sequence-to-sequence model, specifically a transformer, to compute a key element of the cross section calculation, namely, the squared amplitude of an interaction. We show that a transformer model is able to predict correctly 97.6% and 99% of squared amplitudes of QCD and QED processes, respectively, at a speed that is up to orders of magnitude faster than current symbolic computation frameworks. We discuss the performance of the current model, its limitations, and possible future directions for this work.


Introduction
Most machine learning applications in high-energy physics focus on numerical data (see, for example, Refs. [1,2,3,4]), while only a few studies have investigated the application of machine learning to symbolic data [5]. Working with symbolic data is a challenging task for both humans and machines. Hand manipulation of large symbolic expressions is error-prone, hence the routine use of domain-specific symbolic manipulation software tools [6,7]. The question we address in this paper is whether it is feasible to encapsulate accurately, in a machine learning model, the highly non-trivial symbolic manipulations encoded in these tools. Our motivation is to amortize the time required to compute symbolic expressions by accepting an upfront cost in creating them, so that one can later reap the benefits of a symbolic calculation that is much faster than the software tools used to generate the symbolic data. We use a sequence-to-sequence (seq2seq) model, specifically a transformer [8], to compute symbolically the square of the particle interaction amplitude, a key element of a cross section calculation. To the best of our knowledge, this is the first application of such models to this task.
Cross sections are important quantities in high-energy physics because they link the real world of experimental physics with the abstract world of theoretical models. The calculation of a cross section can be an exceedingly complicated procedure, requiring many mathematical operations, including Lorentz index contractions, color factor calculations, matrix multiplications, Dirac algebra, traces, and integrals. Moreover, the complexity of these operations increases dramatically with the number of final-state particles. It is a testament to the high mathematical sophistication and the depth of domain knowledge of the developers of these symbolic manipulation tools that such analytical calculations have been automated (see, for example, FeynCalc [6], CompHEP [9] or MARTY [7]). The squaring of amplitudes can result in very long expressions and correspondingly long computation times, an important challenge to overcome in practical applications. In this work, we address this challenge using symbolic machine learning and demonstrate a proof-of-concept for the symbolic calculation of this key element of cross section calculations.
The paper is organized as follows. In Sec. 2 we review related work. In Sec. 3 we provide the context for this work and define the relevant quantities involved in high-energy physics cross section calculations; we also introduce seq2seq models. The pertinent details of the datasets, the transformer model, and its training parameters are given in Sec. 4. We present our results in Sec. 5, followed by a discussion in Sec. 6 and our conclusions in Sec. 7.

Related Work
Sequence-to-sequence models are the basis of natural language processing (NLP) systems that are routinely used to translate from one language to another [10], summarize text [11], and even map images to textual summaries of them [12]. Transformer-based models [8] are now considered the state of the art in NLP applications. Among the best-known NLP models are BERT (Bidirectional Encoder Representations from Transformers) [13] and GPT (Generative Pre-trained Transformer) [14]. These models have been used in medicine [15,16] and DNA analysis [17].
Recently, seq2seq models have been used to perform symbolic calculations in calculus and to solve ordinary differential equations symbolically, achieving excellent results [18]. This model was able to predict correctly the symbolic solutions of integrals, and it found solutions to some problems that could not be completed by standard symbolic mathematics systems within a time limit orders of magnitude longer than the model execution time. Sequence-to-sequence transformer models have also been used to infer the recurrence relation underlying a sequence of numbers [19]: the authors used a transformer model to successfully find symbolic recurrence relations given the first terms of the sequence. Transformer models have also been used for symbolic regression [20].
In physics, symbolic machine learning methods, such as symbolic regression, have been applied to classical mechanics problems [21,22,23]. In Ref. [5], symbolic regression was used to extract optimal observables, as analytic functions of phase space, for Standard Model Effective Field Theory (SMEFT) Wilson coefficients using LHC simulated data. In this work, we take a further step and show that symbolic machine learning based on transformers can be applied to complex, realistic calculations in high-energy physics, such as the calculation of squared amplitudes.

Background
We begin with an overview of the context of the work reported in this paper. More details can be found in Ref. [24]; here, for the benefit of readers who are not high-energy physicists, we provide a concise description.
The Standard Model, one of the great intellectual achievements of the 20th century, describes all known elementary particles and their interactions through three of the four known fundamental forces, namely, the weak, electromagnetic, and strong forces (see, for example, [24]). The Standard Model is a Quantum Field Theory (QFT) specified by a mathematical expression called a Lagrangian. In QFT, the elementary particles are described in terms of quantum fields in space-time, where each type of particle is associated with a different field. The interactions among these particles are governed by fields, sometimes referred to as force carriers, whose details are precisely determined by the symmetries imposed on the Lagrangian.
There is a well-defined procedure to extract from the Lagrangian all the possible particle interactions of interest, each associated with a mathematical quantity called an amplitude. These amplitudes can be represented by Feynman diagrams, as shown in Fig. 1.
Calculating the cross section, for example of the process depicted in Fig. 1 and represented symbolically in Fig. 2, requires computing the squared amplitude and averaging and summing over the internal degrees of freedom of the particles.
As noted above, these calculations are typically performed with domain-specific computer algebra frameworks, such as FeynCalc [6], CompHEP [9] and MARTY [7]. Figure 2 shows one of the "shortest" 2 → 3 quantum electrodynamics (QED) expressions after simplification; in this process, two incoming electrons e(p_1) and e(p_2) scatter into two electrons e(p_3) and e(p_4) and a photon γ(p_5). Typical expressions can have hundreds of terms, and the computation time can become a major challenge, especially if higher-order amplitudes are included to render predictions more precise.
The key insight in all current uses of machine learning for symbolic applications is that many tasks can be viewed as language translation problems. For example, a system that maps images to textual summaries of them can be viewed as translating from the language of images to a natural language. Likewise, an algebraic manipulation such as the mapping of amplitudes to their squared form can be conceptualized as a language translation task. Since this task maps one sequence of symbols to another, it is natural to consider seq2seq models.

Model Description and Training
In recent years, sequence-to-sequence transformer models have become one of the most powerful tools for a variety of seq2seq applications [8]. Transformers use a mechanism called self-attention, whereby the model identifies and makes use of long-range dependencies among the symbols (or tokens) of a sequence. The architecture of the model consists of two parts, an encoder and a decoder, as shown in Fig. 3. The encoder first maps the input sequence to vectors in a high-dimensional space, in a process called embedding, and encodes the position of each token in the sequence, an operation referred to as positional encoding. Next, a measure of the degree to which a given token is related to the other tokens is computed, in a mechanism called multi-head attention. During training, the decoder takes the output vector from the encoder, which encodes information about the input sequence, together with the encoded target sequence, one token at a time, and outputs a sequence, also one token at a time.
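The core of the self-attention mechanism can be illustrated in a few lines of NumPy (a minimal single-head sketch; the actual model uses multi-head attention with learned projection matrices):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head scaled dot-product attention."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # pairwise token similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax over keys
    return weights @ V                                # weighted mix of value vectors

# toy example: a sequence of 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
assert out.shape == (4, 8)
```

Each output row is a convex combination of the value vectors, with weights determined by how strongly each token "attends" to the others.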

Datasets, Model and Training
We use the symbolic computation program MARTY [7] to generate expressions for possible interactions in quantum electrodynamics (QED), the theory of the electromagnetic force, and quantum chromodynamics (QCD), the theory of the strong force. For our proof-of-concept, we restrict the scope to 2-to-2 and 2-to-3 particle tree-level processes. All interactions involving off-shell and on-shell particles, anti-particles, and gauge bosons are included, with triplet and anti-triplet color representations for QCD. Since it is possible for different amplitudes to yield the same squared expression, we include such amplitudes in our dataset. Many of the QED and QCD interactions are non-trivial and lead to complicated expressions. All expressions are simplified with the Python symbolic mathematics module SymPy [25], factorized by particle masses, and organized into a standard format (first the overall factors, then the terms of the numerator, and finally the denominator). For practical reasons, we exclude expressions longer than 264 tokens after simplification, which removes 5% and 26% of all QED and QCD expressions, respectively. This procedure yields 251,000 QED and 140,000 QCD expression pairs. The data are split into three sets, training, validation and test, in proportions of 70%, 15% and 15%, respectively, and we choose a random sample of 500 expression pairs (from the test set) to evaluate the performance of the trained model.
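The simplification and mass factorization step can be sketched with SymPy (a toy expression with illustrative symbol names, not actual MARTY output):

```python
import sympy as sp

# illustrative symbols: an electron mass and two momentum products
m_e, s12, s13 = sp.symbols('m_e s12 s13', positive=True)

# a toy "squared amplitude" before normalization
expr = (8*m_e**2*s12 + 4*m_e**2*s13 + 8*m_e**4) / s12**2

# simplify, then factor so that the overall mass factor is pulled out,
# mirroring the normalization applied to the training data
normalized = sp.factor(sp.simplify(expr))

assert sp.simplify(normalized - expr) == 0  # same content, canonical form
```

Canonicalizing every expression into a single standard form in this way is essential: the model must see one consistent target sequence per squared amplitude, not the many algebraically equivalent variants.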
The Tensorflow [26] package is used to map symbols to integers, specifically with the TextVectorization class, which performs the tokenization, that is, the assignment of an integer to each symbol, and pads the sequences to make them of equal length. Each sequence is then converted to a vector built from these integers. The amplitudes are tokenized by operator (tensor) and its indices, while the squared amplitudes are tokenized by each mass, product of momenta, and numerical factor (for example, 4*m_e^2*p_1.p_2 is three tokens), as there is a finite number of terms consistent with the physical dimension (in powers of mass) and the conservation laws.
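The tokenization scheme for squared amplitudes can be illustrated with a small regex-based tokenizer (an illustrative stand-in for the TextVectorization pipeline; the regex and symbol conventions are simplified):

```python
import re

def tokenize_squared(expr):
    """Split a squared-amplitude string into mass, momentum-product and
    numeric-factor tokens, so that e.g. '4*m_e^2*p_1.p_2' is 3 tokens."""
    return re.findall(r'p_\d+\.p_\d+|m_\w+\^?\d*|\d+', expr)

tokens = tokenize_squared('4*m_e^2*p_1.p_2')
assert tokens == ['4', 'm_e^2', 'p_1.p_2']  # three tokens, as in the text
```

Tokenizing by whole physical factors rather than by individual characters keeps the vocabulary small and the output sequences short, since only a finite set of such factors is dimensionally allowed.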
The transformer model is implemented using Tensorflow 2.8 [26] along with Keras 2.8 [27], without structural modifications. The model has 6 layers and 8 attention heads, with an embedding dimension of 512 and latent dimensions of 8192-16384. We use sparse categorical cross-entropy as the loss function and the Adam optimizer [28] with a learning rate of 10^-4 and a batch size of 64. Training was performed for 50-100 epochs on two CASCADE-NVIDIA V100 GPUs and took about 12-24 hours.

Results
The accuracy of the predicted symbolic expressions is assessed by taking a random sample of 500 amplitudes (from the test set) that have not been seen by the transformer model and predicting their squared amplitudes.We use three distinct metrics to assess the model accuracy.
1. Sequence Accuracy: the percentage of predicted symbolic expressions that identically match the correct expression.
2. Token Score: a measure of the fraction of tokens (symbols) predicted correctly and in the correct location in the sequence,

   Token Score = (n_c - n_ex) / n_act,

   where n_c is the number of tokens predicted correctly, n_ex is the number of extra tokens that the model predicts (if any), and n_act is the number of tokens in the correct sequence.
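In code, this score can be computed as follows (one concrete reading of the definition: positional matching, with a penalty for extra predicted tokens):

```python
def token_score(pred, actual):
    """Token score = (n_c - n_ex) / n_act."""
    n_c = sum(p == a for p, a in zip(pred, actual))  # correct tokens in place
    n_ex = max(len(pred) - len(actual), 0)           # extra predicted tokens
    n_act = len(actual)                              # length of correct sequence
    return (n_c - n_ex) / n_act

assert token_score(['4', 'm_e^2', 'p_1.p_2'], ['4', 'm_e^2', 'p_1.p_2']) == 1.0
```

A perfect prediction scores 1.0; each misplaced or spurious token lowers the score in proportion to the length of the correct sequence.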
3. Numerical Error: the relative difference between the values of the predicted and the correct expressions when random values in the interval [0, 100] are substituted for the momenta in the squared amplitude expression,

   δ = (x_pred - x_act) / x_act,

   where x_act is the numerical value of the actual (correct) squared amplitude and x_pred is that of the predicted squared amplitude. Repeating this process 100 times, varying the random numbers each time, we compute the Root Mean Square Error (RMSE), whose distributions are shown in Fig. 5.

Table 1 summarizes the performance of our models. For QED, the model trained on QED expressions correctly and identically predicts the squared amplitude for 493 out of 500 (98.6%) amplitudes. The remaining 7 squared amplitudes differ by one or more tokens. The overall token score is 99.7%, i.e., the model correctly predicts at least 99.7% of the tokens on average. For QCD, the model trained on QCD expressions has a sequence accuracy of 97.4% and a token score of 98.9%. The average numerical error is 1.3×10^-3 for QED and 8.8×10^-3 for QCD (excluding, for QCD only, two anomalous results with RMSE > 10). The model trained on the combined QED and QCD dataset achieves, for QED, a sequence accuracy of 99.0%, a token score of 99.4%, and an RMSE of 2.5×10^-3; for QCD, the combined model gives a sequence accuracy of 97.6%, a token score of 98.8%, and an RMSE of 6.8×10^-3.
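The numerical-error check can be sketched as follows (toy callables stand in for the predicted and actual squared amplitudes; in practice the symbolic expressions are evaluated at random momentum products):

```python
import math
import random

def numerical_rmse(f_pred, f_act, n_trials=100, lo=0.0, hi=100.0, seed=0):
    """RMSE of the relative error over random momentum assignments."""
    rng = random.Random(seed)
    sq_errs = []
    for _ in range(n_trials):
        p12, p13 = rng.uniform(lo, hi), rng.uniform(lo, hi)  # random momenta
        x_act = f_act(p12, p13)
        x_pred = f_pred(p12, p13)
        sq_errs.append(((x_pred - x_act) / x_act) ** 2)      # relative error
    return math.sqrt(sum(sq_errs) / n_trials)

# toy stand-in for a correct squared amplitude (m_e ~ 0.511 MeV)
f_act = lambda p12, p13: 4 * 0.511**2 * p12 + 8 * p13
assert numerical_rmse(f_act, f_act) == 0.0  # identical expressions: zero error
```

This is a useful complement to the token-level metrics: a single wrong token that carries a large numerical value shows up as a large RMSE even when the token score is high.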

Discussion
It has been demonstrated that, even for a mathematical problem as complicated as squaring an amplitude, averaging and summing over the internal degrees of freedom of the particles, and manipulating the result into a meaningful form, it is possible to encapsulate the required domain knowledge in a transformer model. Our model can "do the algebra" with a sequence accuracy between 97.6% and 99.0%. Inference takes, on average, one to two seconds. The prediction time for QED is slightly longer than (or similar to) the time taken by MARTY for the same calculation. For QCD, we find that our model is up to two orders of magnitude faster than MARTY for some amplitudes (especially those with anti-triplet color representations).
We observe, as shown in Table 2 and Table 3, that the accuracy of the mapping is primarily affected by two factors: the amount of training data and the sequence length. The accuracy of the QED results is slightly higher than that for QCD, which reflects the fact that the QED training sample is larger than the QCD sample and that the QED sequences are on average shorter than those for QCD. This is especially true of the amplitude expressions, which are typically more complicated for QCD than for QED.
It is noteworthy that, when trained on the combined QED and QCD datasets, the model achieves a better performance overall and a higher performance on the QED and QCD test sets separately. This indicates that our results are still data limited; therefore, we should anticipate performance improvements if we are able to create and use larger datasets. There are several ways to achieve this: for example, additional tree-level processes, such as 2-to-4 or 2-to-5, can be included, or one can go beyond tree-level processes. Additionally, we can add predictions from theories of beyond-the-Standard-Model (BSM) physics, as long as the mathematical structure of these theories is similar to that of the Standard Model.
The complexity of the transformer model increases quadratically with sequence length, which makes the training highly resource-intensive and affects the accuracy. This challenge can be addressed with variants of the basic transformer designed to scale to longer sequences. Another issue that needs to be addressed is how to identify a predicted squared amplitude as anomalous when there is no access to the correct answer. We have seen that a single token error involving a quantity with a large value, such as the top quark mass, impacts the numerical result severely. It would be highly desirable to have a built-in solution rather than rely on human intervention, for example by automating dimensional analysis (as each term should have the same dimension) or by checking whether the expression respects the conservation laws. While these practical solutions may be sufficient in some cases, a machine learning solution whereby a confidence level can be assigned to the overall expression, as well as on a per-token basis, may be needed for this task. We leave this to future work.
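As an illustration of the dimensional-analysis check, the mass dimension of each term can be verified automatically (a minimal sketch over a simplified token scheme; real expressions would need a fuller parser):

```python
import re

def mass_dimension(term):
    """Mass dimension of one term: each mass m_x^n contributes n,
    each momentum product p_i.p_j contributes 2."""
    dim = 0
    for m in re.finditer(r'm_\w+\^?(\d*)', term):
        dim += int(m.group(1) or 1)                  # bare mass: dimension 1
    dim += 2 * len(re.findall(r'p_\d+\.p_\d+', term))
    return dim

def dimensions_consistent(expr):
    """All terms in a sum should carry the same mass dimension."""
    dims = {mass_dimension(t) for t in expr.split('+')}
    return len(dims) == 1

assert dimensions_consistent('4*m_e^2*p_1.p_2 + 8*m_e^4')  # both terms: dim 4
```

A check of this kind flags expressions such as 4*m_e^2*p_1.p_2 + 8*m_e^2, whose terms have dimensions 4 and 2, without any reference answer.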
As sequence-to-sequence models can be applied to all types of sequences, we can consider another interesting direction. Since Feynman diagrams themselves can be written down as sequences, we can map a Feynman diagram to the squared amplitude, or to the cross section directly, as shown schematically in Fig. 6. A notable advantage of this approach is that the input Feynman diagram can be written by hand, if desired, without the need for a domain-specific tool to construct the amplitude. As a proof-of-concept of this idea, we test the transformer model on 2-to-3 QED and QCD processes following the same procedures as before. Remarkably, the QED Feynman diagram-based sequence model attains a sequence accuracy of 99.0% and a token accuracy of 99.7%, with an average RMSE of 9.3×10^-4 (see Table 1).
For QCD, the accuracy is much lower than for QED: the QCD model attains a sequence accuracy of 73.4% and a token accuracy of 82.0%, with an average RMSE of 0.3 (excluding anomalous results with RMSE > 10). The current QCD accuracy is lower than that of the models relying on amplitude sequence information, but this is expected, as the number of input tokens is much smaller; it may also indicate the need for a much larger training dataset, which we shall consider in future work.
Our results demonstrate the ability of a symbolic deep learning model to learn, to high accuracy, the mapping between a particle interaction amplitude and its square, despite the complexity of the mapping process. We observe that the accuracy depends strongly on the sequence length. The model has some limitations; however, the results obtained are sufficiently promising to motivate the search for further performance improvements, which will be the focus of future work.

Figure 1 :
Figure 1: A Feynman diagram of two incoming electrons e(p_1) and e(p_2) scattering into two electrons and a photon.

Figure 2 :
Figure 2: Amplitude and squared amplitude of the ee → eeγ scattering process.

Figure 4 :
Figure 4: Sequence length distributions for QCD and QED sequences after expression simplification.

Figure 5 :
Figure 5: (Left) RMSE distributions for QCD and QED expressions when models are trained separately for QCD and QED. (Right) RMSE distributions when a single model is trained using the combined QCD and QED dataset.

Figure 6 :
Figure 6: Amplitude to squared amplitude and Feynman diagram to squared amplitude workflows.

Table 1 :
Amplitude-to-squared-amplitude model results; the columns give the training size, sequence accuracy, token score, and RMSE.

Table 3 :
Model performance for different maximum sequence lengths on the QCD and QED datasets; for QCD, maximum sequence lengths of 170 tokens, 195 tokens, and the full length (256 tokens) are shown.