Unravelling physics beyond the standard model with classical and quantum anomaly detection

Much hope for finding new physics phenomena at the microscopic scale relies on the observations obtained from High Energy Physics experiments, like the ones performed at the Large Hadron Collider (LHC). However, current experiments do not indicate clear signs of new physics that could guide the development of additional Beyond Standard Model (BSM) theories. Identifying signatures of new physics in the enormous amount of data produced at the LHC falls into the class of anomaly detection and constitutes one of the greatest computational challenges. In this article, we propose a novel strategy to perform anomaly detection in a supervised learning setting, based on the artificial creation of anomalies through a random process. For the resulting supervised learning problem, we successfully apply classical and quantum support vector classifiers (CSVC and QSVC, respectively) to identify the artificial anomalies among the SM events. Even more promisingly, we find that an SVC trained to identify the artificial anomalies can also identify realistic BSM events with high accuracy. In parallel, we explore the potential of quantum algorithms for improving the classification accuracy and provide plausible conditions for the best exploitation of this novel computational paradigm.


I. INTRODUCTION
Current approaches for the description of elementary particles rely on the standard model (SM) of particle physics [1][2][3][4]. Despite its experimental success, the SM is theoretically incomplete and new physics is yet to be explored [5,6]. Since the discovery of the Higgs boson, the search for new physics, for example via experiments at the Large Hadron Collider (LHC), has become one of the main focuses of research in high energy physics (HEP).
Experimental data obtained by the LHC experiments can help address the theoretical shortcomings of the SM. If recorded, observations showing a significant deviation from the SM would indicate the existence of new physics. However, state-of-the-art LHC experiments do not yet show a clear indication of phenomena that could motivate and validate new theories. A key challenge in this quest is the problem of storing and processing the amount of data produced by the LHC, amounting to O(10^6) collisions per second, each consisting of O(1) MB. This challenge will be amplified significantly with the start of the High Luminosity LHC (HL-LHC) program planned for 2029 [7], which will reach a larger luminosity at the price of a larger data flow.
A possible remedy to handle this huge amount of data relies on the use of Machine Learning (ML) models, which digest large amounts of data in order to extract underlying patterns [8][9][10][11][12]. The use of anomaly detection techniques has been proposed as a valuable tool to identify Beyond SM (BSM) events among the dominating number of SM background events [13][14][15]. Different approaches have been investigated. In the supervised learning setting, the ML model is trained to distinguish between background and signal events using a dataset where the events were labelled with the corresponding class. However, to acquire the labels of an event, some preliminary knowledge about the studied processes is required, e.g. through numerical simulations of a BSM theory. This typically limits the generalization power of a supervised algorithm. In the context of the search for new physics at the LHC, this means that this approach is typically effective only when the considered signal is the correct one (e.g., for Higgs boson searches). In the unsupervised learning setting, the ML model is trained to learn the structure of a dataset largely dominated by SM events. Without being provided any additional information, it aims at identifying an anomaly as an outlier of some typicality measure learned during the training process. An unsupervised learning algorithm requires no labels. On one hand, this increases the generalization power of the model. On the other hand, it reduces the accuracy, since less information about any specific signal is used.
In this paper, we propose a strategy for anomaly detection that tries to retain the best of both the supervised and unsupervised settings. We propose a supervised learning setting in which the signal sample is built by perturbing the background sample, without relying on any specific BSM theory. The background events are represented by a selection of SM processes, and signal events are generated artificially through a random process, which is denoted here as scrambling. This approach ensures that we introduce as little physically inspired bias as possible into the types of signal events we are looking for, while guaranteeing that the defined processes are compliant with the conservation laws of physics and detector-specific constraints. We solve the resulting binary classification between background and signal events with the support vector classifier (SVC) approach. An illustration of the proposed anomaly detection pipeline is shown in Fig. 1.

FIG. 1. Illustration of the pipeline advised for our anomaly detection strategy. We start with HEP datasets of simulated SM, Higgs and Graviton events. The Higgs events are considered separately, because they are not included in the processes of the SM dataset. We apply a random process to the SM dataset to create artificial anomalies (see section II). Based on the four different datasets, we create balanced two-class datasets. One of the classes is always the SM. The other class is composed of either artificial anomalies (training and testing), or Higgs or Graviton events (validation). We apply two preprocessing steps to the classification datasets: feature extraction with PCA and normalization of the extracted features to the interval [−π, π]. Each training dataset is then used to train multiple quantum or classical SVCs, from which we select the best one based on the performance on the test dataset. The best SVC is then applied in the detection of unseen anomalies (Higgs or Graviton events).
In this work, we verify numerically the feasibility of distinguishing SM events from artificially created anomalies. We show that such a classifier preserves its discrimination power once the scrambling anomalies are replaced by events from realistic BSM theories. Furthermore, we demonstrate the application of a kernel-based quantum classification algorithm to the problem under study.
This paper is organized as follows: In section II we present the scrambling method, followed by a short description of the applied classification algorithm in section III. In section IV we demonstrate and discuss the effectiveness of the proposed anomaly detection workflow. Finally, in section V, we conclude with a general discussion of the proposed method and the application of quantum algorithms for the studied classification task.

II. DATASET SCRAMBLING
To generate a dataset for our supervised learning problem, we start from an existing collection of SM events and vary the different features in order to introduce anomalies (artificial events). The events are represented by features like the momenta or the number of particles, extracted from the collision data obtained from experiments or simulations (see section IV A and appendix A for details). The variation of these features is done under certain constraints imposed by physics conservation laws or experiment-related constraints. We call this process data scrambling. Its goal is to generate events that do not conform with the SM, without relying on a specific BSM theory, and therefore introducing as little bias as possible about the type of BSM events we would like to identify. We propose to do this via a random perturbation of the SM events. The main idea of the scrambling is illustrated in Fig. 2a. Starting from the SM dataset (yellow region), we generate artificial events outside the SM (blue region). Even though there is an overlap between the region of SM events and artificial events, we want to verify that a classifier, trained to separate the SM events from the artificial BSM ones, retains its discrimination power once applied to realistic BSM events (green region).

FIG. 2. a Schematic illustration of the scrambling idea. The scrambling process generates random events (blue region) based on SM events (yellow region). Being able to distinguish an SM event from a scrambled event enables us to identify events originating from physics beyond the SM (green region). b Comparison of initial SM data and scrambled data at medium scrambling strength, for a selection of the high-level features.
The scrambling is done by replacing a feature in the original SM dataset with a new value chosen according to a Gaussian distribution N(µ, σ), where µ = f is the initial value of the feature and σ is the standard deviation. Depending on the feature, the standard deviation of the scrambling distribution is chosen according to one of the following three options: (i) the standard deviation is fixed to a constant, σ = λ_f; (ii) proportional to the standard deviation σ_f of the initial feature distribution, σ = λ_f σ_f; or (iii) proportional to the feature value f, σ = λ_f f. The constant λ_f, in the following denoted as the scrambling factor, determines the strength of the scrambling. Some of the features are correlated (e.g., the transverse momentum of the lepton and the missing transverse energy) and therefore cannot be scrambled individually. Additionally, depending on the feature, we have to implement different strategies to respect conservation laws or detector-specific limitations. Therefore, the features are divided into four categories: momenta, isolations, jets and particle numbers, each with its own scrambling strategy. The scrambling strategies are presented in detail in appendix B. In Fig. 2b, the scrambling is visualized for some features in the SM dataset, each belonging to one of the categories mentioned above.
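As an illustration, the three choices of σ above can be sketched in a few lines of Python. This is a minimal sketch: the function name and the stand-in feature column are ours and not part of the paper's code, and the actual implementation must additionally respect the per-category constraints described in appendix B.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def scramble_feature(f, option, scrambling_factor, feature_std=None):
    """Replace feature values f with draws from N(mu, sigma), with mu = f.

    `option` selects how sigma is built from the scrambling factor lambda_f:
      "constant"     : sigma = lambda_f
      "feature_std"  : sigma = lambda_f * std of the original feature distribution
      "proportional" : sigma = lambda_f * f  (element-wise)
    """
    f = np.asarray(f, dtype=float)
    if option == "constant":
        sigma = scrambling_factor
    elif option == "feature_std":
        sigma = scrambling_factor * (np.std(f) if feature_std is None else feature_std)
    elif option == "proportional":
        sigma = scrambling_factor * np.abs(f)
    else:
        raise ValueError(f"unknown option: {option}")
    return rng.normal(loc=f, scale=sigma)

# Example: scramble a mock transverse-momentum-like column at a medium strength.
pt = rng.exponential(scale=50.0, size=1000)  # stand-in SM feature distribution
pt_scrambled = scramble_feature(pt, "proportional", scrambling_factor=0.3)
```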
The idea of the scrambling is not limited to the specific choice of the sampling distributions introduced above. In principle, any sampling distribution is valid, as long as some generated events lie outside the "event space" of the SM, and they respect the physical conservation laws and the constraints imposed by the detector. However, there is no guarantee that any chosen scrambling distribution will generate events resembling BSM events. Nevertheless, the hope is that by learning to distinguish between SM and artificial events, we obtain some level of generalization on out-of-distribution samples, and therefore the possibility to detect BSM events even if they would lie outside the space of scrambled events (no overlap between the blue and green regions in figure 2a).

III. SUPPORT VECTOR CLASSIFIER
a. Classical SVC The Support Vector Classifier (SVC), which belongs to the family of kernel methods, is a supervised learning model that draws separating hyperplanes between two classes of data points. By embedding data points into a high-dimensional feature space, where they become linearly separable, SVCs can successfully solve complex classification tasks. The success of the SVC results from the so-called kernel trick, which allows one to calculate a similarity measure between data points (i.e., the kernel) without explicitly performing the mapping to the high-dimensional feature space. An example of such a kernel is the widely used radial basis function (RBF) kernel (also known as the Gaussian kernel),

k(x_i, x_j) = exp(−γ ‖x_i − x_j‖²),  (1)

where x_i, x_j ∈ R^n, and γ is a hyperparameter that determines the bandwidth of the kernel function. Notice that this specific kernel function corresponds to a situation in which data points are effectively mapped into an infinite-dimensional feature space [35].

b. Quantum SVC For the quantum SVC (QSVC), the kernel values are evaluated in a Hilbert space of quantum states. Specifically, classical features x ∈ X are encoded in the quantum state space F via a feature map φ : X → F. Usually, the feature map is given in terms of a parameterized unitary U(x) applied to a fixed reference state, e.g.

|φ(x)⟩ = U(x) |0⟩.  (2)

The kernel used in the classical SVC optimization is then calculated as the overlap between two encoded quantum states,

K(x_i, x_j) = |⟨φ(x_i)|φ(x_j)⟩|².  (3)

A kernel of this form can be evaluated on a quantum device, and could bring an advantage over classical SVCs provided that the quantum feature map is hard to simulate classically [21,36]. The quantum circuit used to evaluate the kernel is schematically shown in figure 3a.

The overlap between the encoded quantum states is given by the probability of measuring the all-zero state at the end of the circuit. The details about the applied feature map are presented in appendix C. An example of a kernel matrix resulting from such a quantum feature map is shown in figure 3b. The kernel values were estimated with the IBM Quantum processor ibm cairo using 6 input features (corresponding to 6 qubits).

FIG. 3. a Quantum circuit used to evaluate the kernel, built from the encoding unitary U(x) and its inverse U(x′)†. b Kernel matrix for the classification between SM events and artificial anomalies calculated on the IBM quantum processor ibm cairo using 6 features (corresponding to 6 qubits).
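To make the kernel construction concrete, the following sketch evaluates the fidelity kernel K(x_i, x_j) = |⟨φ(x_i)|φ(x_j)⟩|² exactly via statevectors. Note the deliberately simple, non-entangling product-state R_y encoding: it is a stand-in for illustration only, not the (classically hard) feature map of appendix C.

```python
import numpy as np

def feature_state(x):
    """Product-state angle encoding |phi(x)> = (Ry(x_1) ⊗ ... ⊗ Ry(x_n)) |0...0>.

    A deliberately simple stand-in for the entangling feature map of the paper.
    """
    state = np.array([1.0])
    for xk in x:
        qubit = np.array([np.cos(xk / 2), np.sin(xk / 2)])  # Ry(xk)|0>
        state = np.kron(state, qubit)
    return state

def quantum_kernel(X):
    """K_ij = |<phi(x_i)|phi(x_j)>|^2, evaluated exactly with statevectors."""
    states = np.array([feature_state(x) for x in X])
    overlaps = states @ states.T  # amplitudes are real for this encoding
    return overlaps ** 2

# Identical inputs give kernel value 1; the diagonal is always 1.
X = np.array([[0.1, 0.7], [0.1, 0.7], [2.0, -1.3]])
K = quantum_kernel(X)
```

On hardware, each entry of K would instead be estimated from the frequency of the all-zero outcome after applying U(x_i) followed by U(x_j)†.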
IV. RESULTS

A. Datasets
In the studied anomaly detection problem, background events are represented by simulated samples of the SM processes typically observed at 13 TeV [37,38]. The included processes are (with relative occurrences): W bosons decaying into a charged and a neutral lepton (59.2%), multi-jet production from QCD processes (33.8%), Z bosons decaying into two charged leptons (6.7%), and tt̄ production (0.3%). The processes are described by 23 high-level features representing simulated measurement results as obtained with, e.g., the CMS detector. A detailed list of the features and their description is given in appendix A. For the validation of the proposed anomaly detection strategy, we use two different types of processes as signal events. The first one is the Higgs boson, which has not been included in the SM dataset and can therefore be interpreted as a BSM particle. The dataset for the Higgs boson consists of simulated high-mass Higgs particles produced via vector boson fusion [39]. Additionally, we also use a sample of simulated Randall-Sundrum Gravitons [40] decaying to two Z bosons, forcing each Z to decay to a lepton pair [41]. In both BSM datasets, the events are represented by the same 23 features as in the SM dataset.
Using the scrambling process introduced in section II, we create three different datasets with artificial anomalies, each with a different scrambling intensity, denoted as low, medium and high scrambling. The corresponding scrambling factors λ_f are listed in appendix B 4. Here, we limit ourselves to the scrambling of a subset of 17 out of the 23 high-level features, allowing us to satisfy physical constraints like energy conservation.
Following this methodology, we create training, testing and validation datasets for the binary classification. Unless stated otherwise, the classification datasets contain 1000 samples per class, where the background is labelled as the negative class and the signals as the positive class. We create 10 training datasets for each scrambling strength, and one test dataset, all consisting of a combination of SM data and artificial anomalies. Further, we prepare two validation datasets, one consisting of a combination of SM and Higgs data, and the other of a combination of SM and Graviton data. The datasets and their composition are schematically shown in the first two panels of Fig. 1.
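The dataset assembly described above can be sketched as follows. The helper and the stand-in arrays are hypothetical; in the actual workflow the events come from the simulated SM, scrambled, Higgs and Graviton samples.

```python
import numpy as np

def make_binary_dataset(background, signal, n_per_class=1000, seed=0):
    """Balanced two-class dataset: background -> 0 (negative), signal -> 1 (positive)."""
    rng = np.random.default_rng(seed)
    bkg = background[rng.choice(len(background), n_per_class, replace=False)]
    sig = signal[rng.choice(len(signal), n_per_class, replace=False)]
    X = np.vstack([bkg, sig])
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    perm = rng.permutation(len(y))  # shuffle the two classes together
    return X[perm], y[perm]

# Stand-in event tables with 23 high-level features each.
sm_events = np.random.default_rng(1).normal(size=(5000, 23))
anomalies = np.random.default_rng(2).normal(size=(5000, 23))
X_train, y_train = make_binary_dataset(sm_events, anomalies, n_per_class=1000)
```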

B. Numerical experiments
The training of a classifier consists of two main steps: the preprocessing of the data samples and the training itself. For the preprocessing we consider the following steps: standardization of the input features, feature selection or feature extraction to reduce the number of features used in the classification, and normalization, applied to balance the importance among the features before forwarding them to the classifier. In the classical case, the kernel matrix for the training of the classifier is calculated with equation (1). To calculate the kernel matrix with a quantum computer, we use a parameterized quantum circuit to encode the classical features in a quantum state, and obtain the kernel values by calculating the fidelity between two encoded data samples (equation (3)). Similarly to the classical case, and as proposed in [42,43], we introduce a hyperparameter γ in the quantum feature map to control its resolution in the Hilbert space. Figure 6 in the appendix shows an overview of the training workflow, and in appendix D we present a detailed hyperparameter optimization for the QSVC model.
For the simulations, we have fixed the steps of the training workflow in the following way. We use no standardization transformation prior to the feature extraction with PCA, since the tested standardization algorithms all lead to a reduced performance (see appendix D 1). In the normalization step, we scale all features to the interval [−π, π]. For the quantum classifier, we encode the data with a feature map similar to the one introduced in Ref. [21], with the hyperparameter fixed to γ = 0.5. A detailed description of the applied feature map and a justification for fixing its hyperparameter are given in appendices C and D 4, respectively. In the case of the classical SVC, the hyperparameter of the radial basis function kernel (equation (1)) is optimized for each classification individually, by selecting the value that achieves the best validation score on an independent test dataset.
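The classical side of this workflow maps naturally onto scikit-learn. The sketch below uses made-up data, and note one deliberate simplification: GridSearchCV tunes γ by cross-validation, whereas the paper selects it on an independent test dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# PCA down to n features, rescale to [-pi, pi], then an RBF-kernel SVC.
n_features = 6
pipeline = make_pipeline(
    PCA(n_components=n_features),
    MinMaxScaler(feature_range=(-np.pi, np.pi)),
    SVC(kernel="rbf"),
)

# Tune the RBF bandwidth gamma (illustrative grid).
search = GridSearchCV(pipeline, {"svc__gamma": [0.01, 0.1, 0.5, 1.0]}, cv=3)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 23))          # stand-in events, 23 features
y = rng.integers(0, 2, size=200)        # stand-in labels
search.fit(X, y)
```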
Using this workflow, we train (Q)SVCs for each scrambling strength and training dataset, and evaluate them on the corresponding test and validation datasets. All quantum computations were done with Qiskit [44]. The results are shown in figure 4.
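The evaluation itself reduces to scoring the classifier's continuous decision function with the ROC-AUC, e.g. (toy data, with labels defined by the sign of the first feature for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 6))
y_train = (X_train[:, 0] > 0).astype(int)   # toy labelling rule
X_val = rng.normal(size=(200, 6))
y_val = (X_val[:, 0] > 0).astype(int)

clf = SVC(kernel="rbf", gamma=0.5).fit(X_train, y_train)
scores = clf.decision_function(X_val)       # continuous scores for the ROC curve
auc = roc_auc_score(y_val, scores)
```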
In figure 4, we compare the AUC scores for identifying artificial anomalies (blue lines), Higgs events (red lines) and Graviton events (green lines) with a quantum (solid lines) and a classical (dashed lines) classifier at different scrambling strengths and for different numbers of features. A few observations are in order. Confirming our expectations, the validation score for identifying the artificial anomalies (blue lines) increases with the number of features and the scrambling strength. We can also confidently conclude that the proposed anomaly detection strategy is successful, since it generalizes to Higgs and Graviton events, even though the classifiers were trained to identify the artificial anomalies. For the low scrambling strength, using 8 features leads to the highest identification AUC for artificial anomalies and the highest detection AUC for the Higgs and Graviton events. For the medium and high scrambling strengths, the highest AUC values are reached with 6 features. A possible reason why the number of features with the highest AUC changes with the scrambling factor is that inducing more extreme anomalies produces, on average, events which are easier to distinguish from SM ones. In general, this leads to both an increased validation and detection AUC for the same number of features. In other words, for an increased scrambling strength, a lower number of features is required to achieve the same performance. However, the highest detection AUC value is reached for the low scrambling.
In most cases, the classifiers are even better at detecting the Higgs and Graviton events than at identifying the artificial ones. The numerical values of the highest validation AUC and the corresponding detection AUC are listed in table I.
The performance of the classical and quantum SVCs is very similar. Looking only at the classification between SM events and artificial anomalies (blue curves), the classical SVC outperforms the quantum SVC in all cases. However, the gap gets smaller with an increasing number of features and increasing scrambling strength. Focusing on the detection, we observe the opposite behaviour. For low scrambling strength (where the detection AUC is the highest), the quantum SVC is better at detecting the Higgs and Graviton events. However, increasing the scrambling strength closes the gap between the classical and quantum SVC when detecting Graviton events, and the order is reversed for the detection of Higgs events.
Overall, the results suggest that, although a quantum SVC can be better than a classical one in terms of detection ability, in general the two methods exhibit essentially comparable performances.
A possible explanation for the improved detection score of the QSVC for Higgs and Graviton events (which are not explicitly present in the dataset of anomalies) could lie in lower overfitting on the classification task and, hence, a better generalization power. However, we have observed that introducing a bias to the classical SVC by fixing the hyperparameter of the classical kernel to γ = 0.5 only leads to a minimal drop in its classification accuracy, but to a significant gain in the detection accuracy. The corresponding figures are shown in appendix E.

C. Hardware experiments
For all the results presented in the previous sections, the quantum kernel values were computed via the simulation of a perfect quantum computer, without errors due to finite measurement statistics or hardware noise. While the former can be included in numerical simulations without much effort [45], the latter is harder to capture, and it is therefore important to benchmark quantum algorithms directly on existing quantum processors.
We perform all hardware calculations on the IBM Quantum superconducting device ibm cairo, using the same experimental setup as above (same training workflow and hyperparameters). We use 10^4 repetitions (shots) for the estimation of the kernel values, and we apply a depolarization error mitigation method to the obtained kernel matrices (see appendix G 1). To reduce the requirements on the quantum device, we only use 50 events per class and only look at the classification between SM and artificial events. Additionally, we only consider the 6-feature case at medium scrambling strength.

FIG. 5. ROC-AUC curve of the classification between SM events and artificial anomalies. The kernel matrices for the classification were provided by a classical kernel function (blue), a simulated quantum kernel (orange), and a quantum kernel estimated using the quantum device ibm cairo (green).
One instance of a kernel matrix calculated with the quantum device is displayed in figure 3b. In figure 5, we report the average ROC-AUC curve of the quantum SVC trained on noisy kernel matrices. Performances are compared with the simulated quantum SVC and the classical SVC. The validation AUC of the QSVC is listed in table I.
The validation AUC evaluated on the hardware is lower than the AUC obtained with the classical SVC and the simulated quantum SVC. The most probable reasons for this discrepancy are hardware noise sources other than the mitigated depolarization errors, and the lower number of training and testing samples used in the hardware experiments.
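A minimal sketch of one common depolarization-style mitigation step: assuming a global damping model in which every measured overlap is uniformly suppressed, rescaling by the measured diagonal restores the ideal kernel. This is an illustrative simplification; the paper's exact procedure is the one described in appendix G 1.

```python
import numpy as np

def mitigate_kernel(K_noisy):
    """Rescale a noisy kernel so that its diagonal is exactly 1.

    Under a uniform-damping noise model, dividing entry (i, j) by
    sqrt(K_ii * K_jj) undoes the damping (a simple variant of
    depolarization mitigation for fidelity kernels).
    """
    d = np.sqrt(np.diag(K_noisy))
    return K_noisy / np.outer(d, d)

# Toy example: damp an ideal kernel and recover it.
K_ideal = np.array([[1.0, 0.8],
                    [0.8, 1.0]])
survival = 0.6                    # depolarizing survival probability
K_noisy = survival * K_ideal      # crude noise model (ignores the offset term)
K_mitigated = mitigate_kernel(K_noisy)
```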

V. CONCLUSIONS
In this work, we proposed and successfully tested a supervised learning strategy for anomaly detection. Instead of generating anomalies as prescribed by a BSM theory, we randomly generate them with a scrambling process based on an initial dataset containing SM processes. We verify the feasibility of the proposed anomaly detection strategy by identifying HEP processes that were not included in the SM dataset during the training of a classifier. The success of the strategy is confirmed with both classical and quantum SVC approaches.
Currently, the scrambling process generates events which have a high overlap with the initial data samples, and reducing this overlap could improve the classification between SM and artificial events. However, this does not guarantee that the detection of unknown events will also improve. A possible strategy to reduce the overlap would be to use a scrambling distribution different from the Gaussian distribution. The only constraints that the generated events have to respect are physical conservation laws and detector-related requirements. Therefore, it would in principle be possible to construct a scrambling process that generates events with less overlap with the initial data samples, or even events that lie in a desired region of the "event space". Additionally, we would like to stress that the proposed scrambling strategy is not limited to HEP datasets, and could in principle be applied to any anomaly detection problem.
While our results establish empirically a successful application of quantum kernel methods to a HEP anomaly detection task, we could not yet observe a generalized promise of quantum advantage. However, it is not possible to rule out individual problem instances where a quantum classifier could outperform a classical one, e.g. for different scrambling distributions, or a higher number of features. In fact, a similar study on anomaly detection in HEP has found evidence for scenarios where a quantum model can outperform the classical counterpart [46]. Such results emerge from a combination of unsupervised learning approaches applied to a dataset where the collision events are described by the 4-momenta of the involved particles, hence using a representation closer to the physical "raw data" than the dataset used in this work. Generally, there is mounting evidence that quantum advantage on classical datasets can only be strictly guaranteed when specific structure is present [36,47-49]. An analysis of our classification problem via the methods proposed in Ref. [50] is also not conclusive (see appendix F).
As the data constituting the target of our work originates from a quantum HEP process, the idea of using a quantum technique for its classification seems rather natural. However, the features currently used to describe the processes, collected from detector measurements, are fully classical. This loss of "quantumness" could in fact represent an important limitation to the use of more sophisticated QML techniques for the analysis and classification of quantum states [23,51-53]. We therefore believe that a different setup bypassing the extraction of classical features could be a promising road towards quantum advantage also in the context of HEP. While for collision events at the LHC such a setting is currently not possible, there already exist experiments at CERN where quantum sensors are studied for information extraction [54]. In the future, it could certainly be interesting to couple a quantum processor to quantum sensors embedded in detectors, hence enabling direct manipulation and classification of the quantum amplitudes produced in an experiment.
There are two different types of transverse momenta we can scramble: the transverse momenta related to the leptons and the transverse momenta related to the jets. In theory, due to conservation of momentum, these two types of momenta would be related. However, for the jets we only have the scalar sum of the transverse momenta as a feature in the dataset (H_T). We therefore scramble the

Feature | Description | Scrambled?
H_T | The scalar sum of the transverse momenta pT of all jets having pT > 30 GeV and |η| < 2.4. | x
M_J | The invariant mass of all jets entering the H_T sum. |
N_J | The number of jets entering the H_T sum. | x
N_B | The number of jets identified as originating from a b quark. | x
p^µ_T,TOT | The vector sum of the pT of all PF muons in the event having pT > 0.5 GeV. | x
M_µ | The combined invariant mass of all muons entering the sum in p^µ_T,TOT. |
N_µ | The number of muons entering the sum in p^µ_T,TOT. | x
p^e_T,TOT | The vector sum of the pT of all PF electrons in the event having pT > 0.5 GeV. | x
M_e | The combined invariant mass of all electrons entering the sum in p^e_T,TOT. |
N_e | The number of electrons entering the sum in p^e_T,TOT. | x
N_neu | The number of all neutral hadron PF-candidates. | x
N_ch | The number of all charged hadron PF-candidates. | x
N_γ | The number of all photon PF-candidates. | x
p^l_T | The transverse momentum of the highest-pT lepton in the event. | x
η_l | The lepton pseudorapidity. | x
q_l | The lepton charge (either −1 or +1). |
Iso^l_ch | The lepton isolation related to all other charged hadron PF-candidates. | x
Iso^l_neu | The lepton isolation related to all neutral hadron PF-candidates. | x
Iso^l_γ | The lepton isolation related to all photons. | x
MET∥ | The parallel component of the missing transverse energy with respect to the lepton. | x
MET⊥ | The orthogonal component of the missing transverse energy with respect to the lepton. | x
M_T | The combined transverse mass of the lepton and the missing transverse energy system. |
IsEle | A flag set to 1 if the lepton is an electron, 0 if it is a muon. |

TABLE II. High-level features used as description of the events in the HEP datasets [37,38]. The last column indicates whether the feature is considered in the scrambling process. The abbreviation PF stands for Particle Flow, and is related to the event reconstruction algorithm used to process the raw collision data. The outputs of the algorithm are the so-called PF candidates [38].
H_T independently of the transverse momenta of the leptons, and assume that the change in H_T can be absorbed in an appropriate change in the directions of the transverse momenta of the jets; the H_T itself is updated by drawing a new value from its scrambling distribution.

For the transverse momenta related to the leptons, we only scramble the transverse momentum p^l_T of the chosen lepton l with the highest transverse momentum. The features describing p^l_T are the transverse momentum p^l_T and the pseudorapidity η_l, where l ∈ {e, µ} is either an electron or a muon. Changing these two features has an effect on other features, which have to be adapted accordingly: specifically, the parallel and orthogonal components of the missing transverse momentum (MET∥, MET⊥), and the sum of the transverse momenta of the leptons p^l_T,TOT.

The total momentum of the lepton l is written in a coordinate system chosen such that the z-axis is along the beam line and the transverse momentum lies in the x-y plane. The polar angle θ is the angle between p and the z-axis, the azimuthal angle φ is the angle between the x-axis and the transverse momentum, p^l_T is the transverse momentum, and p^l_L is the longitudinal momentum. The pseudorapidity η_l is defined as a function of the polar angle, η_l = −log(tan(θ/2)). As an initial point we set φ = 0, as it is always possible to align the x-axis with the transverse momentum p^l_T. In this case, the momentum p^l is fully determined by the transverse momentum p^l_T and the pseudorapidity η_l, where the polar angle θ is determined by the pseudorapidity η_l. To sample a new momentum vector for lepton l, we randomly generate new values for the transverse momentum p^l_T, the pseudorapidity η_l and the azimuthal angle φ. The sampled transverse momentum (p^l_T)′ has to fulfil the requirement (p^l_T)′ > 23 GeV, and the sampling is therefore repeated until this constraint is respected.
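The rejection sampling of the lepton transverse momentum can be sketched as follows. The function name and the choice σ = λ · p_T are illustrative; appendix B 4 fixes the actual scrambling factors.

```python
import numpy as np

rng = np.random.default_rng(0)

def scramble_lepton_pt(pt, scrambling_factor, pt_min=23.0, max_tries=1000):
    """Draw a new lepton pT from N(pt, lambda * pt), repeating until pT > 23 GeV."""
    for _ in range(max_tries):
        new_pt = rng.normal(loc=pt, scale=scrambling_factor * pt)
        if new_pt > pt_min:
            return new_pt
    raise RuntimeError("rejection sampling did not converge")

new_pt = scramble_lepton_pt(45.0, scrambling_factor=0.3)
```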
Assigning a new value to the momentum of lepton l affects other quantities, which we have to adapt accordingly:

• Missing transverse energy: The missing transverse energy MET is specified by a vector with two components, where the parallel and perpendicular directions are defined with respect to lepton l. Sampling a new transverse momentum (p^l_T)′ for lepton l also changes the definition of the parallel and perpendicular directions. To update the components of the MET, we therefore first add p^l_T to MET∥, rotate the MET by the sampled azimuthal angle φ, and add (p^l_T)′ to the new parallel component.

• Sum of transverse momenta of leptons: In the SM dataset, the vector sum of the transverse momenta of all leptons is only characterized by its absolute value. We therefore miss the directional information required for an accurate compensation of the change in p^l_T, and we update the sum of transverse momenta as given in equation (B7). This does not take the directions of the momenta into account; however, p^l_T is usually the dominant component of p^l_T,TOT, and equation (B7) is a good approximation of the actual change in p^l_T,TOT.

a. Isolations
The isolation Iso of the leptons, photons and neutral hadrons is randomly assigned a new value according to

Iso ∼ |N(Iso, λ_Iso σ_Iso)| . (B8)

The absolute value is taken because the isolation is always positive. Additionally, as a requirement of the reconstruction process of a CMS event, the isolation has to be smaller than 0.45. Therefore, the sampling is repeated until this constraint is respected.
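The rejection sampling of equation (B8) can be sketched as follows (function name hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

def scramble_isolation(iso, lam_iso, sigma_iso, iso_max=0.45):
    """Resample an isolation value as |N(Iso, lam * sigma)| (equation (B8)),
    repeating the draw until the CMS reconstruction constraint Iso < 0.45 holds."""
    while True:
        new_iso = abs(rng.normal(iso, lam_iso * sigma_iso))
        if new_iso < iso_max:
            return new_iso
```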

Jets
The total number of jets N_J and the number of jets involving a b-quark N_B are randomly assigned a new value according to the following equations. The number of jets is a non-negative integer, and therefore we round to the nearest integer (denoted by [·]) and take the absolute value. Additionally, the number of b-jets cannot exceed the total number of jets. Therefore, the sampling is repeated until N_B ≤ N_J.
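A sketch of the jet-count scrambling, assuming (since the equations are elided here) Gaussian resampling around the original counts with a width set by the scrambling factors, as in the other scrambling steps:

```python
import numpy as np

rng = np.random.default_rng(2)

def scramble_jet_counts(n_j, n_b, lam_j=0.5, lam_b=0.5):
    """Resample the jet multiplicities N_J and N_B.

    Assumed noise model: Gaussian around the original counts; rounding [.]
    and the absolute value enforce non-negative integers, and sampling is
    repeated until N_B <= N_J.
    """
    while True:
        new_nj = abs(round(float(rng.normal(n_j, lam_j * max(abs(n_j), 1)))))
        new_nb = abs(round(float(rng.normal(n_b, lam_b * max(abs(n_b), 1)))))
        if new_nb <= new_nj:
            return new_nj, new_nb
```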

Particle Number
The particle numbers N for the neutral and charged hadrons, photons, electrons and muons are assigned a new value according to the following equation, where the standard deviation is proportional to the original value. The particle number is a non-negative integer, and therefore we round to the nearest integer and take the absolute value.

Scrambling factors
We create three different anomaly datasets, each with a different strength of the scrambling. The strength of the scrambling is controlled by the scrambling factors introduced in section II. The three scrambling strengths are denoted as low, medium and high. The specific values of the scrambling factors are listed in table III.

Standardization
The standardization is applied to a dataset prior to the feature selection/extraction, and has the purpose of balancing the importance of the different features. We consider two different standardization transformations: the transformation to mean zero and unit variance, and the transformation to the fixed interval [−1, 1], denoted as standard scaling and min-max scaling, respectively. As a reference, we also consider the training without any standardization transformation. For all other steps of the training workflow we choose the defaults given in figure 6. The results are shown in figure 7a.
As expected, the AUC increases with the number of features for all considered standardization options. Surprisingly, however, we get the best performance when we use no standardization transformation (green curve). Therefore, we will use no standardization transformation prior to the feature selection/extraction in the following experiments.
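The two standardization options compared above can be written in a few lines of numpy (a minimal sketch of the transformations; in practice library implementations such as scikit-learn's scalers would typically be used):

```python
import numpy as np

def standard_scale(X):
    """Standard scaling: per-feature transformation to zero mean, unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def minmax_scale(X, lo=-1.0, hi=1.0):
    """Min-max scaling: per-feature transformation to the fixed interval [lo, hi]."""
    Xmin, Xmax = X.min(axis=0), X.max(axis=0)
    return lo + (hi - lo) * (X - Xmin) / (Xmax - Xmin)
```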

Feature selection/extraction
The Principal Component Analysis (PCA) algorithm is a widely used algorithm for feature extraction. For comparison, we also consider three feature selection strategies: the decision tree classifier (DTC), the gradient boosting classifier (GBC), and the random forest classifier (RFC). We use no standardization transformation prior to the feature selection/extraction, and for the remaining steps in the training workflow we choose the defaults given in figure 6. The results are shown in figure 7b.
All feature selection strategies have about the same performance. However, all are outperformed by the feature extraction with PCA. Therefore, we will use PCA to extract the features in the following experiments.
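A minimal numpy sketch of the PCA feature extraction (eigendecomposition of the covariance matrix and projection onto the leading components; a library implementation would normally be used instead):

```python
import numpy as np

def pca_extract(X, n_features):
    """Project the centered data onto the top principal components.

    Unlike feature *selection* (DTC/GBC/RFC importances keep a subset of the
    original features), PCA *extracts* new features as linear combinations.
    """
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)                    # ascending order
    top = eigvec[:, np.argsort(eigval)[::-1][:n_features]]  # leading components
    return Xc @ top
```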

Normalization
The normalization transformation is applied after the feature selection, with a similar purpose as the standardization transformation: we want to balance the importance of the features before inputting them to the classification algorithm. We again consider the standard scaling and the min-max scaling, now to the interval [−π, π]. Additionally, we also consider a global min-max scaling, where the features are transformed to a fixed interval with a joint transformation, instead of individual transformations for each feature. The intuition behind the global min-max scaling is that the features retain the relative structure obtained through the feature selection/extraction algorithm. As a reference, we again consider the training without any normalization transformation. We use no standardization transformation, and PCA for the feature extraction. For the remaining steps of the training workflow we take the defaults given in figure 6. The results are shown in figure 7c.
Using a normalization transformation is clearly beneficial. Without the normalization, the trained classifier has the same performance as a random classifier (AUC of 0.5). All other normalization transformations have a similar performance, with the standard and min-max scaling having the highest AUC values. In the following we will use the min-max scaling to the interval [−π, π], in order to have some control over the range of values the features will take after the normalization.
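The per-feature and global min-max normalizations differ only in whether the minimum and maximum are taken per column or over the whole matrix. A sketch with the interval [−π, π] used here (function names hypothetical):

```python
import numpy as np

def minmax_per_feature(Z, lo=-np.pi, hi=np.pi):
    """Individual min-max scaling: each extracted feature is mapped to [lo, hi]."""
    Zmin, Zmax = Z.min(axis=0), Z.max(axis=0)
    return lo + (hi - lo) * (Z - Zmin) / (Zmax - Zmin)

def minmax_global(Z, lo=-np.pi, hi=np.pi):
    """Global min-max scaling: one joint transformation for all features, so the
    features keep the relative scale they obtained from the feature extraction."""
    return lo + (hi - lo) * (Z - Z.min()) / (Z.max() - Z.min())
```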

Scaling factor
The scaling factor, as introduced in [42, 43], is interpreted as a hyperparameter of the applied quantum kernel. This parameter should therefore be optimized. Here, we show an example of such an optimization, and how the optimal scaling factor is chosen. For the preprocessing we apply the optimal steps found in the previous sections, and for the feature map we use the default given in figure 6. The results are shown in figure 7d.
The figure shows a sweep over the scaling factor for different numbers of features used for the classification. Clearly, there is a region of scaling factor values where the resulting classifier has the best performance. Additionally, this region is similar for different numbers of features. Of the considered scaling factor values, γ = 0.46 results in the highest performance for all considered numbers of features. For this learning task and this specific choice of feature map, a scaling factor of γ ≈ 0.5 therefore leads to a good performance. This value of the scaling factor ensures that all angles entering the feature map will effectively be in the interval [−π/2, π/2]. For a different dataset and/or a different feature map the optimal scaling factor may be different.
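The role of the scaling factor can be seen directly: after the min-max normalization to [−π, π], multiplying the features by γ ≈ 0.5 compresses the feature-map rotation angles into [−π/2, π/2]. A minimal sketch (helper name hypothetical):

```python
import numpy as np

def apply_scaling_factor(Z_norm, gamma):
    """Multiply normalized features by the quantum-kernel hyperparameter gamma
    before they enter the feature map as rotation angles."""
    return gamma * Z_norm

Z = np.linspace(-np.pi, np.pi, 9)           # features normalized to [-pi, pi]
angles = apply_scaling_factor(Z, gamma=0.5)  # angles now lie in [-pi/2, pi/2]
```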

Feature map
We consider three different options for the entanglement layout of the feature map introduced in section C. We have full entanglement if we apply a two-qubit gate to each pair of qubits. We have linear entanglement if we apply a "ladder" of two-qubit gates only between neighbouring qubits (without periodic boundary). For the R_ZZ rotation applied in the feature map, the linear entanglement is equivalent to a more depth-efficient layout, where at most two layers of R_ZZ are required (even and odd connections between neighbouring qubits). We call this layout restricted entanglement, and apply it here instead of the linear entanglement, which is favourable when running the circuits on hardware. The last type of layout we consider is the separable encoding, without any entanglement between the qubits. For the preprocessing of the input data, we use the optimized steps presented in the previous sections. The results are shown in figure 7e.
The figure shows the validation AUC for an increasing number of features and different entangling strategies. Against our expectations, the entangling strategy does not have a significant influence on the AUC of the classification. Generally, adding entanglement between the qubits is expected to reveal correlations among the features, and we would therefore expect an improvement in the AUC for increasing entanglement. Possible explanations for why this cannot be observed here could be that the classification task is too easy, or that the preprocessing step (especially the PCA) removes all correlations between the extracted features, so that adding correlations in the form of entanglement does not improve the classification. It could also be that the dataset does not fall into the class of problems where using a quantum model could lead to an advantage over classical models. We present a corresponding investigation in section F.
Although we do not see a practical advantage of using entanglement for the specific datasets used in the evaluation, we still use the restricted entanglement for the experiments in the main text in order to keep the anomaly detection scheme as general as possible, also across different datasets.
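The four layouts amount to different sets of qubit pairs receiving an R_ZZ gate; the restricted layout reschedules the linear ladder into an even and an odd layer, so the same connections fit into at most two gate layers. A sketch of the pair lists (helper name hypothetical):

```python
def entangling_pairs(n_qubits, layout):
    """Qubit pairs for the R_ZZ entangling gates under the considered layouts.

    full: one gate per qubit pair; linear: nearest-neighbour ladder;
    restricted: the linear connections rescheduled into an even and an odd
    layer; separable: no entanglement between the qubits.
    """
    if layout == "full":
        return [(i, j) for i in range(n_qubits) for j in range(i + 1, n_qubits)]
    if layout == "linear":
        return [(i, i + 1) for i in range(n_qubits - 1)]
    if layout == "restricted":
        even = [(i, i + 1) for i in range(0, n_qubits - 1, 2)]
        odd = [(i, i + 1) for i in range(1, n_qubits - 1, 2)]
        return even + odd
    if layout == "separable":
        return []
    raise ValueError(f"unknown layout: {layout}")
```

Note that the restricted layout contains exactly the same connections as the linear one; only the gate scheduling changes, which is what makes it favourable on hardware.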

Appendix E: Restricting classical model
In the main text, the hyperparameter of the classical RBF kernel (equation (1)) was optimized for each combination of training and test dataset. Here, the hyperparameter is fixed to γ = 0.5. The resulting classification and detection scores are shown in figure 8. Compared to the results in figure 4, the drop in the classification AUC is minimal, but the detection AUC is significantly improved. The classical SVC now also outperforms the quantum SVC in the detection. However, this improvement is expected to be very specific to the classification and detection problem at hand, and cannot be expected in general (especially when the real anomalies are unknown).
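For reference, the RBF kernel with the fixed hyperparameter can be sketched as follows (assuming the standard form k(x, x') = exp(−γ‖x − x'‖²) for equation (1); in practice one would pass gamma=0.5 to a library SVC):

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=0.5):
    """RBF kernel k(x, x') = exp(-gamma * ||x - x'||^2), with the
    hyperparameter fixed to gamma = 0.5 instead of optimized per dataset."""
    sq_dist = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dist)
```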
Appendix F: "Power of data" metrics
In Ref. 50 the authors introduce a strategy to check whether a dataset falls into the class of problems where quantum ML models may perform better than classical models. In the following, we apply this strategy to the studied HEP dataset. The metrics introduced in Ref. 50 were calculated using the code provided in the software package QuASK [55]. First, we evaluate the geometric difference g, defined as a similarity measure between the classical kernel matrix K_C and the quantum kernel matrix K_Q,

g_gen = sqrt( ‖ √K_Q (K_C + λI)^{−1} K_C (K_C + λI)^{−1} √K_Q ‖_∞ ) ,

where λ is a regularization parameter. The geometric difference g_gen is also related to a training error g_tra, which is upper bounded in terms of g_gen. Similarly to Ref. 50, we report the geometric difference g_gen for a λ such that the training error is g_tra ≈ 0.0045.
In figure 9a we show the geometric difference, calculated from kernel matrices obtained with the workflow described in the previous sections, for the classification between the SM and anomalies (resulting from the medium scrambling), for different numbers of features and different numbers of samples N in the training and validation datasets. For each problem instance (specific number of features and samples), the geometric difference is calculated between the corresponding classical and quantum kernel matrix. In the figure, we additionally fit a function proportional to √N to the geometric differences (dashed lines), specifically g(N) = a·√N + b. The fit is done separately for each number of features. Visually, there is good agreement between the measured geometric differences and the dashed lines, which puts us in the regime where the geometric difference scales proportionally to √N. After finding this scaling, the next step in the assessment is to calculate the model complexity s_K(N), defined in equation (F3). Visually, we observe good agreement between the measured model complexities and the dashed lines, especially for the higher numbers of samples, at least for the classical SVC. For the quantum SVC, the calculation of the model complexity is not very meaningful: the model complexity is orders of magnitude bigger than in the classical case, and the values also seem to converge for larger numbers of features. However, they do not seem to be fully converged yet, and a fair assessment is not possible. Therefore, based on the geometric difference and the classical model complexity, we could end up either in the case with "potential quantum advantage" or in the case where the problem is "likely hard to learn".
From the results we obtained in the sections above, we would have expected to be in the case where both classical and quantum SVCs can learn well (either g_CQ ≪ √N or s_C ≪ N). Therefore, the results obtained from the measured metrics are not conclusive, and this test cannot be used to argue for (or against) quantum advantage in the studied classification problem. The results obtained on the quantum hardware are shown in figure 10. To mitigate some of the hardware errors, we apply a depolarization error mitigation strategy to the obtained kernel matrices.
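A sketch of how the model complexity of equation (F3) and the associated training error can be computed from a kernel matrix. The model-complexity formula follows Ref. 50; the training-error form t_K(N) = λ² y^T (K + λI)^{−2} y is our reading of the elided equation and is labeled as an assumption here:

```python
import numpy as np

def model_complexity(K, y, lam):
    """s_K(N) = sum_ij (sqrt(K) (K + lam*I)^-2 sqrt(K))_ij y_i y_j, eq. (F3)."""
    eigval, eigvec = np.linalg.eigh(K)  # K is symmetric positive semi-definite
    sqrt_K = eigvec @ np.diag(np.sqrt(np.clip(eigval, 0.0, None))) @ eigvec.T
    inv = np.linalg.inv(K + lam * np.eye(len(K)))
    return float(y @ (sqrt_K @ inv @ inv @ sqrt_K) @ y)

def training_error(K, y, lam):
    """Assumed form: t_K(N) = lam^2 * y^T (K + lam*I)^-2 y."""
    inv = np.linalg.inv(K + lam * np.eye(len(K)))
    return float(lam**2 * y @ inv @ inv @ y)
```

The regularization λ is then chosen so that both quantities are small, as described above.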

Depolarization error mitigation
Some of the errors occurring on the hardware can be modeled by a depolarization channel,

E(ρ) = λ ρ + (1 − λ) I / 2^n ,

where λ is the survival probability of the original quantum state ρ, and n is the number of qubits. To mitigate the depolarization error we can exploit that in the noiseless kernel matrix K all diagonal entries are 1. Therefore, if one measures the diagonal entries in a noisy setting, one can gather information about the device noise [56].
Under this noise model, the noiseless value K_ii = 1 is mapped to K*_ii = λ_i + (1 − λ_i) 2^{−n}, so the survival probability λ_i of the noisy kernel matrix element K*_ii is

λ_i = (K*_ii − 2^{−n}) / (1 − 2^{−n}) .

Modeling the off-diagonal element K*_ij with survival probability √(λ_i λ_j), the mitigated kernel values can then be obtained with

K_ij = (K*_ij − (1 − √(λ_i λ_j)) 2^{−n}) / √(λ_i λ_j) .

For the experiment in the main text, we assume that all survival probabilities λ_i have the same value, which can be estimated with

λ = (1/N) Σ_{i=1}^N λ_i ,

where N is the size of the symmetric training kernel matrix. This value can then also be used for the mitigation of the non-symmetric validation kernel matrix.
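Under the equal-survival-probability assumption, the mitigation reduces to a few lines of numpy. This is a sketch assuming the noise model K*_ij = λ K_ij + (1 − λ) 2^{−n} with a single shared λ:

```python
import numpy as np

def mitigate_depolarization(K_noisy, n_qubits):
    """Depolarization error mitigation of a symmetric training kernel matrix.

    A single survival probability is estimated by averaging the per-element
    values obtained from the diagonal (noiseless K_ii = 1), then the
    depolarization channel is inverted for every entry.
    """
    d = 2.0 ** n_qubits
    lam_i = (np.diag(K_noisy) - 1.0 / d) / (1.0 - 1.0 / d)
    lam = lam_i.mean()  # shared survival probability estimate
    return (K_noisy - (1.0 - lam) / d) / lam
```

The same estimate of λ can then be reused for the non-symmetric validation kernel matrix, whose diagonal carries no such calibration information.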
FIG. 1. Illustration of the pipeline devised for our anomaly detection strategy. We start with HEP datasets of simulated SM, Higgs and Graviton events. The Higgs events are considered separately, because they are not included in the processes of the SM dataset. We apply a random process to the SM dataset to create artificial anomalies (see section II). Based on the four different datasets, we create balanced two-class datasets. One of the classes is always the SM. The other class is either composed of artificial anomalies (training and testing), or of Higgs or Graviton events (validation). We apply two preprocessing steps to the classification datasets: feature extraction with PCA and normalization of the extracted features to the interval [−π, π]. Each training dataset is then used to train multiple quantum or classical SVCs, from which we select the best one based on the performance on the test dataset. The best SVC is then applied in the detection of unseen anomalies (Higgs or Graviton events).
FIG. 3. a Quantum circuit used by the Quantum Support Vector Classifier (QSVC) to measure the overlap between two encoded quantum states. The feature vector x is encoded into a quantum state through the parameterized unitary U(x). b Kernel matrix for the classification between SM events and artificial anomalies calculated on the IBM quantum processor ibm cairo using 6 features (corresponding to 6 qubits).
FIG. 5. ROC-AUC curve of the classification between SM events and artificial anomalies. The kernel matrices for the classification were provided by a classical kernel function (blue), a simulated quantum kernel (orange), and a quantum kernel estimated using the quantum device ibm cairo (green).

FIG. 7 .
FIG. 7. Hyperparameter analysis for the training of a QSVC. Comparing different standardization transformations (a), feature selection and feature extraction methods (b), and normalization transformations (c) for an increasing number of features, searching for the optimal scaling factor (d), and comparing different entanglement strategies of the feature map (e). More detailed information about the specific algorithms and evaluation settings can be found in the text of the corresponding sections.

FIG. 8 .
FIG. 8. Validation AUC for classification of BSM events among SM events with quantum and classical SVCs for different scrambling strengths and different numbers of features. Same experimental setup as in figure 4, except that the hyperparameter of the classical RBF kernel (equation (1)) is fixed to γ = 0.5. Note: only the curves of the classical SVC changed (dashed lines) compared to figure 4.
The model complexity is defined as

s_K(N) = Σ_{i,j=1}^N ( √K (K + λI)^{−2} √K )_{ij} y_i y_j ,  (F3)

where y_{i,j} are the labels of the classification, K is the classical or quantum kernel matrix, and λ is again a regularization factor. Related to the model complexity we can define a training error, t_K(N) = λ² Σ_{i,j=1}^N ((K + λI)^{−2})_{ij} y_i y_j, and we choose the regularization such that this training error and the model complexity are both minimized. The resulting model complexities for the classical kernel matrices and the quantum kernel matrices are shown in figures 9b and 9c, respectively. In both cases, we fit a function proportional to N to the model complexities (dashed lines), specifically s(N) = a·N + b.

FIG. 9 .
FIG. 9. Metrics for the characterization of the "hardness" of classifying SM events and artificial anomalies. a Geometric difference between classical and quantum kernels for different dataset sizes and an increasing number of features. b Model complexity of classical kernels for different dataset sizes and an increasing number of features. The inset shows the model complexities for 13 up to 17 features. c Model complexity of quantum kernels for different dataset sizes and an increasing number of features. The inset shows the model complexities only for 16 and 17 features.
FIG. 4. Validation AUC for classification of BSM events among SM events with quantum and classical SVCs for different scrambling strengths and different numbers of features. a Validation AUC for identifying artificial anomaly (blue), Higgs (red) or Graviton (green) events among SM events for a QSVC (solid lines) and a classical SVC (dashed lines). The classifiers were trained on artificial anomalies generated with the low scrambling strength. b, c Equivalent results, but for medium and high scrambling intensity, respectively.
TABLE I. Highest validation AUC for the classification between artificial anomalies and SM events with quantum and classical SVCs for different scrambling strengths (fourth column). The third column lists the number of features for which this validation AUC is achieved. The last two columns hold the corresponding detection AUC of the Higgs and Graviton events. *Quantum kernel estimation executed on the IBM Quantum device ibm cairo.