"Flux+Mutability": A Conditional Generative Approach to One-Class Classification and Anomaly Detection

Anomaly Detection is becoming increasingly popular within the experimental physics community. At experiments such as the Large Hadron Collider, anomaly detection is at the forefront of finding new physics beyond the Standard Model. This paper details the implementation of a novel Machine Learning architecture, called Flux+Mutability, which combines cutting-edge conditional generative models with clustering algorithms. In the `flux' stage we learn the distribution of a reference class. The `mutability' stage at inference addresses if data significantly deviates from the reference class. We demonstrate the validity of our approach and its connection to multiple problems spanning from one-class classification to anomaly detection. In particular, we apply our method to the isolation of neutral showers in an electromagnetic calorimeter and show its performance in detecting anomalous dijets events from standard QCD background. This approach limits assumptions on the reference sample and remains agnostic to the complementary class of objects of a given problem. We describe the possibility of dynamically generating a reference population and defining selection criteria via quantile cuts. Remarkably this flexible architecture can be deployed for a wide range of problems, and applications like multi-class classification or data quality control are left for further exploration.


Introduction
Nuclear and particle physics are often characterized by problems where one needs to identify particles or events that (i) belong to or (ii) deviate significantly from a specific 'reference' arXiv:2204.08609v1[cs.LG] 19 Apr 2022 class.In the first case we refer to one-class classification (OCC) -to identify objects of the reference class amongst all objects, in the latter to anomaly detection (AD) -which can leverage on OCC to detect abnormal data points compared to the reference class.
Examples of OCC can be found in an extensive literature review provided by [1].As for AD, there is a growing number of applications that span from accelerator operations to physics analyses, the latter being of great interest for example at the Large Hadron Collider (LHC) since new physics beyond Standard Model (BSM) remains elusive (as discussed, e.g., in [2][3][4][5]). 1 In both cases, one typically deals with multiple features that vary as a function of the phase space of the final state particles reconstructed in the detector.
In this paper we introduce a novel approach to cope with OCC and AD that leverages two different stages that we call "flux" and "mutability" (F+M).In the first stage, flux, we learn the distributions of the reference class data, and in doing that we utilize a combination of conditional autoencoders (cAE) and flow-based models, particularly conditional masked autoregressive flows (cMAF), which are conditioned to the kinematics of the particles or events we are trying to distinguish.As we will describe herein, this allows us to augment the space of the features with a gain in classification performance.The second stage, mutability, consists of addressing if the data in the inference phase -which undergoes a forward pass in the cAE has significantly deviated from the reference class.In other words, at a given kinematics one can dynamically generate a reference cluster in the augmented space and measure if an object belongs to the reference cluster or not. 2 Hierarchicalbased clustering (namely, Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) [6]) is used to fit these data and come up with a probability cut that provides the confidence level for an object to belong or not to the reference class.
In this work we focus on two applications: i) distinguishing neutrons from photons in the Barrel Calorimeter (BCAL) of the GlueX experiment [7] where neutrons in certain kinematic regions are difficult to simulate or isolate from real data and photons are therefore used as reference class; and ii) identifying possible rare BSM dijet events from QCD dijet background events at LHC [3], the latter representing our reference class.One of the advantages of our novel and flexible architecture described in the following sections, is that it relies only on the 'reference' class and remains agnostic to the class of objects complementary to the reference class during both the training and inference phases.The two classes can be thought of as 'signal' and 'background' in physics applications.When using strictly supervised methods instead, the model typically requires both signal and background as input in order to learn the feature space and produce a binary output.While these algorithms can be efficient and accurate, they are limited by the quality of data we inject.That is to say, they are prone to any bias we may introduce when constructing our training samples.This can be critical when our control over one of the two classes, (e.g., the background), is limited and we need to rely on the other class (e.g. the signal) or vice versa.Using an architecture that relies on the information of one class removes any assumptions we must make about the complementary class and can be extremely important for different applications: for example, it can be utilized for anomaly detection of rare events as well as a method to increase the purity of samples when the original set of real data is characterized by two classes only.
The remainder of the paper is structured in the following way: In Sec. 2 we will describe the developed architecture and provide a detailed discussion of the training and inference phases.In Sec. 3 we introduce the two problems that our architecture has been applied to: detection of neutrons within the GlueX BCAL, and tagging of Z → t t from QCD dijet background.In Sec. 4 we present our analysis and results.Finally, in Sec. 5 we conclude with a summary and perspectives on future work.

Flux + Mutability
The F+M approach can be broken into three components, namely: (i) a cAE, (ii) a cMAF, and (iii) a clustering algorithm.
(i) The cAE is trained to reconstruct features as a function of kinematic parameters.In this paper we will show two examples: a) identifying single neutral showers that depend on 14 reconstructed observables which vary as a function of the shower energy and location in the BCAL calorimeter at GlueX; and b) analyzing topologies of events at LHC characterized by 2 jets, which are described by 15 reconstructed observables that depend on the transverse momentum of the jets.Note that in both cases the kinematic variables are continuous.The reader can find more details about these datasets in Sec. 3.This model is trained first, independently of the cMAF, deploying Huber Loss (see Eq. 1): Using this trained model we forward pass all training samples and obtain both the reconstructed vectors (x ) and the residual vectors (x − x) which are then combined into an augmented space.Namely, the augmented space will consist of 28 dimensions for the GlueX and 30 dimensions for the LHC problems, respectively.
(ii) This augmented dataset is then used to train the cMAF.Let x ∈ X denote an element from the set of input vectors within the training dataset, k ∈ K the conditional vector for the kinematics, and z ∈ Z represent the transformed Gaussian vector given by the invertible bijection f . 3.A conditional flow with N layers can be described by: The logarithm of the transformed probability is then given by Eq. 3, where π denotes the probability under a Gaussian distribution: The loss function is then given by the negative log-likelihood: It has been found that using the features reconstructed by cAE (x ) instead of the original features (x) allows for a better separation of classes at inference.We also observed that an augmented space made by the recontructed features and the residuals (x −x) increases the separation power.As it will be shown in Sec. 4, our reference class data at each kinematics can be represented by a cluster in the feature space that is normalized on a hypersphere.It turns out that features provide localization in space, while residuals push events deemed "anomalous" radially outward, in otherwords, the augmented space with residuals allows the extraction of clusters originally nested within the main population.Hence, cMAF is trained on the augmented space of residuals and reconstructed features.The conditions remain the kinematic variables discussed prior, although now we must normalize them on the interval (0,1) to allow better convergence of the flow network.Normalization of the conditions for the cAE is not mandatory, although if the conditions exist over a large domain they should be normalized prior to injection.The flow network will be used as the conditional generator to form the reference cluster in the augmented space as a function of the kinematics. 4iii) The last part of the architecture consists in clustering based on HDBSCAN.This allows us to fit the objects in the inference phase with respect to the reference cluster on an object-by-object basis, i.e., on a particle-by-particle basis for the neutral shower identification problem of GlueX, and on an event-by-event basis for the LHC jet problem, as described in Sec. 4.
The following provides more details on the approach taken to deal with certain aspects that characterize the F+M architecture.Sec.2.1 describes the continuous conditional generation; Sec.2.2 covers the separation via clustering and the choice of the dynamic cuts that are applied; finally Sec.2.3 provides a global overview of the workflow during the inference phase with Fig. 1 depicting the connection of the components (i), (ii), and (iii) described in this section.

Continuous conditional generation
Continuous conditions give rise to the problem of sparsity within the dataset, meaning low numbers of events per condition, i.e.,in different kinematic domains.The obvious method to circumvent this issue is pre-binning of conditions such that they become discrete.We instead choose to take a different approach, in which we allow conditions to remain continuous but enforce sampling from restricted domains.This is achieved using Kernel Density Estimation (KDE) to model the transformed probability distribution of the training data in kinematic bins.That is to say, for each bin in the conditional space, we form a density estimation object in which we can call upon to sample sets of latent vectors from restricted domains.In inference, we use the inference particle's kinematics to map to a KDE model, fit on the training samples transformed distribution.This model then generates a sample in the 2N augmented space of transformed features and residuals, where N is the dimensionality of the feature space. 5We then concatenate the original inference kinematics to each generated 2N Gaussian vector, forming a vector of size 2N + L and forward pass this vector through the cMAF to generate our augmented space.

Separation via clustering and dynamic cut
We introduce an outlier score which is obtained comparing each object candidate (e.g., either a particle or a physics event) to the reference cluster.This can be thought of as a probability of being an outlier with respect to the reference class.In other words, it corresponds to the complementary probability of being an inlier: This metric is used to make a decision if the object is more likely to belong to the reference class or not.Every point of the reference cluster is characterized by an inlier probability and therefore by an outlier score.In what follows we detail the process of generating the outlier distribution for the reference cluster.We will then compare the outlier score of the candidate object to the reference distribution.
The following is a brief and concise description of the HDBSCAN algorithm.The reader can find more details about the algorithm in [9,10]; for our implementation in particular we utilized the documentation in [6].HDBSCAN utilizes the mutual reachability distance between the points to form a weighted graph, which is in turn used to build the clustering hierarchy. 6Heuristically the density corresponds to the inverse of a distance, λ = 1 distance ; the smaller the distance between points the higher the density.In order to get more information on the clustering, a condensed tree is formed.This tree contains clusters and each cluster contains underlying leaf clusters.This stems from the hierarchical nature of the algorithm.In the algorithm the user has control over hyperparameters such as "min samples" which sets a lower limit on the number of points needed to be considered a core point and for the algorithm to perform any mutual reachability calculation.The algorithm prioritizes regions of high density, eventually merging less dense regions with the main cluster if they are reachable under some threshold.Persistence is introduced defining a notion of membership based on how long the point was retained in the cluster, i.e., position in the spanning tree, in order to compare the relative distance scales between a fixed cluster and a point in question.We also want to consider a density-based notion of membership.This is done via a modification of the Global-Local Outlier Score from Hierarchies (GLOSH) algorithm [10] that allows us to perform the comparison of the points membership persistence with the maximum persistence of the cluster, in order to get a measure of how much of an outlier the point is relative to the fixed cluster.
Combining notions of both distance and density we can now obtain a membership distribution for the reference cluster at each kinematics.This is used to define an outlier metric when classifying new data points.This metric is dynamic in that requires generating a cluster representative of the reference population at any kinematics.Therefore one can define a quantile threshold, which can be some outlier score value corresponding to, e.g., keeping 95% of the population.
It should be stressed that the quantile cut defines the outlier score of the candidate, that is the probability that an object is an outlier with respect to the reference class.Thus, when we classify data via our quantile metric, we define an outlier score cut that corresponds directly to a certain confidence level in data.This metric allows us to remain completely agnostic with respect to the complimentary class, removing the need for semi-supervised methods-which require an example signal in mind during training, see, e.g., [11]-in defining the optimal selection threshold.

Workflow at the inference phase
The workflow of F+M is depicted in Fig. 1 which describes the flow of an individual object (a shower in the GlueX BCAL case or a dijet event in the LHC case) in which we wish to perform inference on.The object is initially fed through the cAE, producing both the reconstructed and residual feature vectors to be augmented.
Subsequently, the kinematics of the input object are mapped to a KDE instancepre-fit on the transformed distribution of training samples-which in turn produces a set of Gaussian vectors from a restricted domain.It is worth mentioning again that using KDE is necessary, given that we are using continuous conditions.For a given condition vector k, there are not enough data to reliably sample a full width Gaussian, so we must instead restrict the cMAF sampling domain at the generation stage, for which using KDE is an effective approach.The inference kinematics are concatenated to the vectors and fed to the cMAF for a forward pass.The model is then forced to interpolate over the restricted domain at the generation stage.The generated data is normalized (either on a hypersphere or by applying a standard scaler) and fed to an HDBSCAN clustering instance, forming the reference cluster for the given kinematics.The inference object is added to the cluster, and classification is performed via the dynamic quantile cut.We directly include the inference object in the initial reference cluster since this greatly improves the speed of the algorithm (saving a second clustering).In doing so we are careful to keep the true reference population large such that an individual point has no influence on the quantile metric.

Physics applications and corresponding datasets
Our approach can be applied to a plenitude of problems in different research areas.We selected two examples in particular, one related to the identification of neutral showers caused by neutrons in the GlueX BCAL, the other one to BSM dijet events that significantly differ from SM background at LHC.

Neutral showers in the GlueX BCAL
The GlueX experiment, located at Hall D Jefferson Lab, aims to confirm the predictions of Lattice Quantum Chromodynamics, searching for a class of particles known as exotic hybrid mesons [12]. 7,8The theory predicts multiplets of exotic mesons with different quantum numbers, and the unambiguous establishment of exotic hybrids requires the full mapping of the hybrid multiplet spectrum.This mapping demands the identification of neutral and charged particles in the final state in several topologies and the validation of the results through consistency checks between different decay modes of the same hybrid meson.Production of charged exotic mesons implies a particle other than a proton must be produced in the reaction.This limits the resulting products to be either a ∆ baryon or a neutron.Charge exchange can also occur in which the proton provides its charge to a positively charged exotic meson, resulting in the production of a neutral ∆ and a neutron.In practice, ∆ baryons are difficult to work with.This is due to large underlying physics background, accompanied by difficulties describing their kinematics, which are necessary for analysis purposes.Ideally, we would like to detect and isolate the neutrons as they do not require detailed modeling of production and decays, and provide constraints to theoretical predictions.
We focus on the Barrel Calorimeter of GlueX [13], a 400 cm long (about 115 • in polar angle coverage) electromagnetic calorimeter designed primarily for photon detection.The detector consists of scintillating fibres compressed between thin layers of lead (see Fig. 2).The detector is segmented into 48 azimuthal modules.Each module is partitioned into four readout channels, consisting of double-ended readout using Silicon Photo-Multipliers (SiPM).The GlueX photon beam is incident on a liquid hydrogen target (γ + p) which results in many different final states (termed "reaction channels").Many of these particles leave the target and strike the BCAL, creating electromagnetic showers within the detector.It is from these showers we attempt to classify particles based on their profiles.Low-level and high-level observables reconstructed in BCAL are injected in the algorithm as input features (the reader can find a detailed description of the features obtained from the BCAL detector response in Appendix A).
Dealing with neutrons is typically more complicated, in part due to the BCAL detector's response being more difficult to extract compared to a photon.As a result, the extraction of reconstructed neutrons can be affected by sizable uncertainties. 9Calibrations from real data are also easier for photons, since one can rely on large samples of standard "candles" like π 0 decaying into photons which are abundantly produced at GlueX and are detected in the BCAL.With this novel architecture, we are able to limit assumptions we impose on neutrons in the training phase of our algorithm as we rely on half the information of supervised algorithms, namely, the training will be based only on detected photons as they typically provide clear signatures of neutral showers.For the proof of principle of our architecture, we will focus on isolated neutral showers, so anything that is not recognized as a photon is then classified as a neutron. 10he neutral shower reconstruction problem is characterized by 14 features.The dimensionality of this space will be augmented utilizing residuals.The detector response depends on the kinematics of the particles, that is with which energy and at which position the particle interacts in the calorimeter. 11The dependence on the particle kinematics is encoded in our approach through conditional cAE and cMAF, as explained in Sec. 2. In fact, as it should clear from Fig. 2, the BCAL design has cylindrical geometry, so the two main kinematic parameters that characterize the reconstruction of the neutral shower are the energy E of the particle and the z position at which it strikes the innermost layer.
Training and testing data consists of simulated samples, in which generation of only a specific particle type occurs via Geant4 particle gun [14], allowing by construction high purity samples in both sets (neutrons and photons).
Particle samples are simulated in such a way to be approximately flat in reconstructed energy E = 200−2200 MeV and in z = 162−262 cm within the BCAL.This corresponds to a region of the phase space that is highly active within the calorimeter.
We initially deploy fiducial cuts on reconstructed widths of showers, namely, width in z, radial width and width in time.Doing so removes any artifacts left over from the reconstruction process after simulation, in which are careful to apply loose cuts such that we do not impose restrictions on the data and lose sensitivity to the tails of data distributions.A strict requirement of one shower per event is also required to further eliminate any other interactions (for example split-offs in the photon sample). 12Since we are interested in only neutrons that mimic photons to a high degree, i.e. those not easily separated via rectangular cuts, we deploy a tight pre-selection on the radius of a shower (must be within the first 3 BCAL layers) and the amount of energy within the BCAL 4 th layer (less than 0.1 GeV).The 14 features are comprised of detector response variables and their definitions can be found in Appendix A. The entire training dataset consists of ≈ 1.8M photons in which we reserve about 10% for validation.Testing samples of photons and neutrons are generated independently and each contain about the same number for validation.We condition the model on continuous values of E and z, modeling the cMAF latent representations in bins of 4 cm, 40 MeV, values that correspond to a coarse representation of the BCAL photon resolution.
The kinematics of the photons and neutrons detected by BCAL are displayed in Appendix B. A comparison of the 'original' feature distributions injected in the cAE, the 'reconstructed' by the cAE and those dynamically 'generated' by cMAF can be found in Appendix C, along with a comparison of the residuals and their corresponding generations.

BSM dijet events at LHC
Our architecture can be utilized in other problems too.Despite the multiple searches for physics beyond the Standard Model (BSM) conducted at the LHC, new physics remains elusive as of today.In the last few years many novel approaches have been developed for AD in order to detect signal events which would stand out as anomalous with respect to a reference background: these span from new unsupervised AD technique leveraging on neural density estimation [4] to tag-and-train techniques that can be applied to unlabeled data thus offering to be less sensitive to subtle features of jets which may not be well modeled in simulation [15].In this context, our architecture can be utilized to characterize how anomalous an individual event is with respect to a background events by remaining agnostic with respect to the individual events being analyzed.
In this paper we consider QCD dijet events as background and we look for BSM dijet events from the decay of a Z .We utilized a suite of jets for SM and BSM particle resonances which is available on Zenodo [16], provided by the authors of [17].Primarily, we isolate Z → t t jets (anomalous signal) from SM QCD dijet (background) in order to remove the varying length feature vectors seen in other BSM datasets, such as W → W + jj.The datasets have been generated with MADGRAPH [18] and PYTHIA8 [19].The DELPHES [20] framework has been used for fast detector simulation.For a detailed description of the dataset we refer the reader to reference [16].The simulated QCD [21] and BSM dijets [16] are produced with the same selection criteria.Clustering of the jets was done using FASTJET [22], deploying the anti-k T algorithm [23] with a cone size of R = 1.0.As stated in [16,21], events within the datasets must meet the requirement of the leading jet having transverse momenta p T > 450 GeV and the sub-leading jet having p T > 200 GeV.
In this case we use only a single conditional, namely the leading jet transverse momentum, and form a fixed length feature vector consisting of the remaining 4 vector properties of the leading jet, its n-subjettiness variables, the sub-leading jet 4 vector and its n-subjettiness variables. 13This feature vector then gets augmented with its residual vector from the cAE, resulting in a vector of 30 features at inference including residuals.We apply a further condition on the datasets, requiring the leading jet to have p T < 800 GeV in order to provide sufficient data as a function of the conditional parameter.We model the cMAF's transformed space in bins of 1 GeV.
The architecture is trained on ≈ 600k QCD dijet events and validated on ≈ 50k, retaining around 50k for testing.We use only a single top jet file from [16] (m t = 174 GeV), providing 50k anomalous events.More details on the feature distributions of both classes can be found in Appendix D which includes also a comparison of the 'original' feature distributions injected in the cAE, the 'reconstructed' by the cAE and those dynamically 'generated' by cMAF along with the residuals and their corresponding generations.

Analysis and results
In what follows, we deploy our architecture on the two different physics scenarios introduced in Sec. 3.

Neutral showers classification with the GlueX BCAL
We demonstrate the potential of the model as an OCC method for GlueX photons, which in turn allows to tag neutron candidates from the sample of isolated neutral showers described in Sec.3.1.As already explained, there are specific regions in the phase space of the BCAL where simulating the detector response to neutrons is challenging because it is characterized by large uncertainty.
Our strategy aims at isolating neutrons by applying cuts on the photon showers, the latter taken as the reference class.As described in Sec. 2, our approach is unsupervised and agnostic to the neutrons; it allows us to dynamically generate a reference cluster in the augmented space of the features as a function of the particle kinematics.The reference cluster is used to establish if a new particle is more likely to belong to the photon class or the neutron class.A quantile cut is applied on an outlier score to determine the probability of a particle to be a member of the cluster or to be an outlier.This approach can be useful when the uncertainty on the distributions of the complementary class (neutrons) is expected to be large compared to the reference class (photons), and the distributions of neutrons and photons cannot be easily separated by standard rectangular selections.In such scenarios, fully supervised approaches become less reliable without a proper assessment of the uncertainty quantification.Our approach allows to select the true positive rate (TPR) of the reference class which, by construction, is consistent with the quantile cut on the outlier score chosen for the selection.The outlier score of each particle corresponds to a probability of not belonging to the photon reference class, according to Eq. ( 5); given that we work with isolated neutral showers which can only be either a photon or a neutron, the outlier score is interpreted as how confident we are to have identified a neutron.
Fig. 3 depicts the average value of the outlier score in bins of E and z. 14 It is easily seen that the average outlier scores for neutrons are much higher across the entire phase space when comparing both plots.It is also apparent that the outlier score of photons is flat and close to zero as a function of the kinematics and for neutrons it is large in value and rather uniform in distribution too, despite the reconstructed features do largely depend on the kinematics of the particles (as displayed in Figs.B1, B2).This means the architecture has provided an approximately uniform and good separation power in the phase space we are covering.Deploying a 95% quantile cut, we obtain a True Positive Rate (TPR) for photons, and a True Negative Rate (TNR) for neutrons of 95.09% and 52.40%, respectively, as summarized in Table 1.TPR and TNR are quite large and to our knowledge exceed by far results obtained with traditional rectangular selections [25].
We note that by assuming both photon and neutron training datasets are reliable, then deploying a fully supervised model like XGBoost [26] would result in a TPR and TNR at about 92% each, but again this is not the scenario we are tackling here. 15Efforts have been made to improve the prediction head of the algorithm, in which we replaced the clusterer with a One-Class Support Vector Machine (OC-SVM).The conditional generations (see Fig. C1 for an example of generations produced via the cMAF) are then used to fit the OC-SVM and predict outliers (i.e., neutrons).In smaller scale tests it was found that the TPR of photons suffered drastically, at around 56%, while offering a slight increase in the TNR of neutrons at around 90%.These results are most comparable to the results obtained deploying a 68% quantile cut with clustering, yet pose no real performance increases.Using the OC-SVM also limits our ability to employ a quantile cut.The SVM-based method was deemed inferior to the clustering method used in our approach and not explored further.

Uncertainty in the neutron sample
We further showcase the utility of our architecture when large uncertainty affects the class complementary to the reference class.In our case, the complementary class is that of the neutron candidates, and we know that neutron simulations in the phase space of interest described in this study can differ significantly from real data.We show how this affects a fully supervised approach such as XGBoost and make a comparison to our OCC approach which benefits from the usage of residuals.
While the original injected neutron distributions overlap to a great extent with the photon distributions to begin with, we push this overlap to an extreme case: we artificially create an extra testing dataset of neutrons in which we 'scale' the neutron features in such a way to make them highly resemble photons.We pretend this new sample to be the actual data observed in an experiment representative of the true detector response, while we consider the original sample to be the simulated data in which we assume there exists ∼ 10 − 15% difference from actual data.Fig. C2 shows a comparison between photons and neutrons of the injected feature distributions.For the neutrons, we also show the case of the 'scaled' (i.e., actual) distributions.Similarly, Fig. C3 shows a comparison for the distributions of the features as reconstructed by the cAE; Fig. C4 shows the same comparison for the residuals.Finally, Fig. C5 shows a comparison for the distribution of the outlier scores.
XGBoost is then trained on the simulated neutron sample, and we compare  performance on both simulated and actual (scaled) samples in the inference phase.In this study we use the TPR obtained with XGBoost to define the quantile cut for our architecture and make a comparison.Table 2 shows the results of this study.XGBoost TNR performance drops from about 92% to 79% when deployed to what we consider the actual data in this example.This is the result of a decision boundary obtained in the training phase using an inaccurate neutron sample.On the other hand, F+M is an OCC approach that relies on photons only and is agnostic to the neutron sample which is 'seen' only during the inference phase.The TNR performance increases from 60% to 83% when deployed on the "actual" neutron sample.The reason for the increase is likely due to learning correlations in the kinematics in the augmented space.In fact, while the new neutron features are artificially made more photon-like by scaling their distributions and hence look harder to separate from photons, their correlations with kinematics which cAE is able to pick up changed, and the resulting residuals are more easy to separate.This is further confirmed via the results obtained using the feature space only, in which performance drops in with respect the the unperturbed neutron sample.
A result provided by XGBoost in this example would provide a large TNR of 92% but it is affected by a large systematic as indeed the true performance would be 79%.F+M training does not depend on the neutron sample, and therefore the TNR performance is not affected by the same large uncertainty and it actually provides in this particular case a larger value of 82%.We note that: • The result of the TNR for F+M depends on how the actual data look like in the inference phase, i.e., the opposite result applies by switching labels in Table 2; • OCC is agnostic to the complementary class, which is unlabeled.External physical information can be used to label neutron data, e.g., this may depend on the event topology containing the neutron; • The outlier score of each particle represents a proxy for the confidence level of its classification; it's worth reminding that the quantile cut is defined by the photon sample and is dynamical in that depends on the kinematics; we notice that the neutron's outlier score increased on average for the new distributions, meaning that the uncertainty on each individual classified neutron is also decreased.More details can be found in Fig. C5.

Anomaly detection of dijet events at LHC
We deploy our algorithm for anomaly detection of BSM dijets with respect to a background of QCD dijet events.We consider the datasets described in Sec. 3. The performance of our algorithm is compared to other works [5,17] that used the same dataset and considered as a metric the area under the receiver operating characteristic curve (AUC).However, it should be noted that the AUC by construction is not agnostic to the BSM signal in the anomaly detection problem: in fact it provides model performance as a function of threshold cuts.In order to pick the optimal threshold one must have prior knowledge of the anomalous class itself, that is not always possible and if it is, implies a bias towards the model which in turn becomes weakly supervised.There are of course other methods to identify a suitable cut remaining agnostic towards the anomalous sample, yet these values may be far from optimal.Thus AUC can be an inflation of true model performance.
The AUC is obtained by fitting our ROC curve.For each quantile between (0.02,0.98) in steps of 0.02, we compute the TPR and FPR on a random sample of size of 1k (approximately 50/50 of each class), propagating the uncertainty in both efficiencies.We report the AUC values in Table .3 and compare to the best values obtained by [5] and [17] using the same dataset.In [5,17] different AUC scores are listed based on different settings, loss functions, etc.Some knowledge of the anomalous class is utilized in order to define the optimal threshold.The AUC score of our architecture is slightly larger than in [5] and consistent within the uncertainty with that of [17].It should be noticed that our architecture can be further optimized by tuning its hyperparameters.This will possibly further improve the results shown in Table 3.

Ours
From [5] From [17] AUC 0.885 ± 0.003 0.87 0.89 Table 3: AUC score comparison: AUC score comparison between our architecture and two methods, Fraser et al. [5] and Cheng et al. [17].Our architecture performs on par with [17] within the uncertainty and slightly better than [5].It should be noticed that our architecture can be further optimized by tuning its hyperparameters.This will possibly further improve the results.
In the following we calculate TPR and TNR.The idea is that of remaining agnostic to a potential BSM dijet signal, and changing the quantile cut on the QCD dijet background.For example, one may consider setting the threshold of our architecture to ≈ 100% for   4: Summary of dijet results at LHC: results are obtained using different quantile cuts on the outlier score.Fluctuations in TPR are within the error.For comparison, we also include a baseline rectangular selection based on loose fiducial cuts on each feature, defined in such a way to select when combined 99% of the SM QCD dijet events.
QCD dijets, very naively letting only largely anomalous BSM samples to stand out from the selected distributions.For completeness we show different scenarios (including also low quantile cuts) in Table 4. Fig. 4 depicts the outlier scores of both QCD and BSM dijets at LHC as a function of the leading jet transverse momentum, in which the isolation of the tail of the BSM distribution is visualizable with respect to the 99% quantile threshold.One may be temped to instead opt for a global cut within the outlier space, however this explicitly demands prior knowledge of the complementary class, directly violating the conditions of anomaly detection frameworks.
As is clear in all our tables the TPR is consistent by construction with the quantile cut applied to the outlier score obtained using the generated reference cluster.

Benefits of the augmented space: residuals
We have demonstrated the good performance of our architecture for different physics problems.In order to illustrate the performance increase via residuals, we follow the same inference procedures discussed before except we remove the residuals as input to the cluster.Table .5 shows a comparison between results obtained using the feature space and the augmented space of features plus residuals.The comparison is done for both problems (neutron identification in GlueX BCAL and dijet anomaly detection at LHC).In both cases, we observe a systematic improvement in the TNR by using the augmented space.
We deem the residuals to be highly valuable in terms of separation and provide further evidence that features localize the space, and the residuals push nested clusters radially outward to be more easily extracted and seen as outliers.This interpretation is also represented in Fig. 5 which shows a t-SNE representation of the feature and augmented space for the GlueX problem.As shown in the figure, augmenting the space with residuals produces a better distinguishing power between photons and neutrons.5: Features vs augmented space for GlueX BCAL and LHC problems:.The benefits of residuals can be concluded via comparison of TNR at equal thresholds.Note the consistency between threshold and TPR by design in which fluctuations across differing spaces are within error.

Architecture specifications and computing resources
During the inference phase, in both problems, we are limited by the generation speed of the cMAF and the clustering.HDBSCAN is in fact not optimized to run on GPU and the fitting procedure dominates the computing time. 16Considering these limitations, we generate only 1.5k reference samples per particle in inference.We show in both cases the statistics are sufficiently high to run our analyses.
The architecture has not undergone a rigorous optimization and we expect better results could be obtained.Table 6 contains hyperparameter settings used throughout our experiments.One can imagine optimizing the pipeline end-to-end under a Bayesian process, in which we rely on the signal class only.Key hyperparameters to tune are those of the (middle row) same problem using 'scaled' feature distributions for neutrons; (bottom row) the QCD dijet problem at LHC (15 dimensions, 30 with augmentation).The residuals create nested clusters of higher density within the data space that are pushed radially outward from the main primary class cluster (middle column), thus allowing more accurate separation of the two classes at inference.There still exist nested clusters within the reconstructed feature space (left column), yet it is apparent that these are not as easily extracted, which explains the performance increase via the inclusion of residuals in the augmented space (right column).The number of parameters correspond to the GlueX BCAL problem.Additional technical details on the architecture can be found in Table 7.

Summary and conclusions
We have developed a novel architecture that consists of two steps called "flux" and "mutability" (F+M).In the first stage we learn the distribution of a reference class.In the second stage we address if data significantly deviates from the reference class.The backbone of this architecture consists of: (i) a conditional autoencoder that allows to reconstruct the injected features and calculate the residuals which combined to the first make an augmented space; (ii) a conditional Masked Autoregressive Flow, that combined with a kernel density estimation allows in the inference phase to dynamically generate the reference class in the augmented space; (iii) a hierarchical-based clustering algorithm that allows to estimate if an object belongs to the reference class by producing an outlier score, which can be also seen as an estimate of how confident we are about the object belonging to the reference class.We demonstrated its capability as a one-class-classification method when dealing with isolated neutral showers at GlueX BCAL, providing good separation between photons (reference class) and neutrons (complementary class) while relying on only information related to the reference class.We then proved the advantage of this approach which is agnostic to the neutron sample, in particular when the latter is affected by large uncertainty.
We also showed the capability of the algorithm as an anomaly detection method, isolating possible BSM Z → t t dijets topologies from SM QCD dijet background (reference class) at the LHC.We demonstrated that our model performs on par with other architectures using the same dataset, yet we are able to remain truly agnostic towards the complementary class using our final quantile metric.A possible extension of this problem that has been left for future exploration consists in conditioning on both p T and the invariant mass of the leading jet.We also note in general that our architecture has not undergone rigorous optimization of the hyperparameters and we therefore expect further increase in performance in doing so.
In both cases, we have demonstrated an increased performance via inclusion of the residuals.We have concluded that an augmented space (features + residuals) is ideal for inference.The features localize the space for a given kinematics, and the residuals push the complementary class radially outward in the hypersphere used for clustering.This allows nested clusters existing in the data space to be more easily extracted and increase distinguishing power.
The inference time per particle is fairly slow.As such, the architecture is best suited for offline analysis purposes in its current state, in which it can be optimized via parallelization.Reduction in inference time can potentially be obtained via the use of a different flow network as MAF is generally known for its slower generation speeds.Other NF models have not yet been tested and are left as further exploration.A large bulk of the inference time remains at the clustering stage as HDBSCAN is not optimized to run on GPU.Other clustering methods, or an improved GPU-optimized HDBSCAN build could potentially solve this issue.In the future we plan to extend this work to an application for data quality monitoring in an experiment, in that significant deviations from the expected quantiles of the reference distributions could determine if a new calibration/alignment is needed.

Appendix A. GlueX BCAL feature definitions
All showers are DBCALShower objects in the GlueX software package.We denote these showers as S, and label the quantity we use via a subscript (S x denotes the x position of the shower object for example).R denotes the inner radius of the BCAL (64.3 cm radially outwards from the center line of the target) and T z denotes the center z position of the hydrogen target (65 cm from the upstream edge of the BCAL).Showers are comprised of points, we define our features using the energy weighting of these points. 18 LayerM E = N i E i M ∈ {1, 2, 3, 4} is the layer number and E i is the energy of the i th reconstructed point in the layer.
is the layer number and E i is the energy of the i th reconstructed point in the layer.
and z i are the energy and z position of the i th point in the shower. 2, ∆r i = (R -r i ) E i and r i are energy and radial position of the i th point.
and t i are the energy and timing information of the i th point.
and θ i are the energy and polar angle (from the target center) of the i th point.
and φ i are the energy and azimuthal angle of the i th point.
The position at which the particle hits the inner radius of the BCAL.

Appendix B. GlueX kinematic plots
Plots contained in this section illustrate the 14 features (see Appendix A) as a function of kinematics for both simulated photons and neutrons, as described in the captions.These plots are integrated over the entire phase space used in the analysis.The functional dependence of features on the phase space (E,z) can be clearly seen in the simulated samples, more so for the photons as the simulation of a neutron interaction is more difficult.
As one moves to training on real data from standard "candles" such as π 0 → γγ, the dependencies become more pronounced.The distributions of the features with respect to the z position and energy in BCAL are shown in Fig. B1 and B2 for photons and neutrons, respectively.The dependence differs from that of photons, but overlaps at certain regions in the phase space making separation of the two classes more difficult.

Appendix C. GlueX generations
We conditionally generate data using the cMAF in a select kinematic region to demonstrate the quality of data produced.A central region of the of the phase space we are working with (1 GeV < E < 1.4 GeV, 206 cm < z < 218 cm) is chosen, and used to compare three different quantities, namely the original injected features, reconstructions from the cAE and the generations of the cMAF.For each original point within the phase space, we generate ten artificial data points, as such, generated distributions an order of magnitude larger in terms of sample size but have been normalized.Being that we are generating the reconstructed space of the cAE, found to be beneficial due to skewing of out-of-distribution Original, reconstructed and generated features (top two rows), residuals and generated residuals (bottom two rows), for 1 GeV < E < 1.4 GeV, 206 cm < z < 218 cm.The cMAF is trained to generate the reconstructed space of the cAE as it was found to give better separation power due to skewing of OOD samples (i.e.neutrons).The cMAF matches closely the distributions of the sampled, reconstructed photons.
(OOD) samples, one may argue the use of a Conditional Variational Autoencoder (cVAE) may be appropriate.We have developed a similar algorithm using cVAE's although it is not optimal for a few reasons, namely, the quality of generations and also the problem with dead nodes (referring to vanishing gradients, in which outputs tend to zero) during training.Due to small values in the residual space, the training process can be very tricky and can lead to Dirac delta distributions at zero in some variables if dead nodes occur.Using an NF instead was found to be more reliable and it can be seen from Fig. C1 the distributions are consistent.On a particle-by-particle basis we scaled the neutron features through this empirical formula: x + S * P * (x−x min ) * (xmax−x) , where S being the sign that controls which direction to nudge, P is the scaling effect in percentage (we used a 10% effect, that is P =90% or 110% depending on the sign S), and x min , x max are the minimum and maximum in the  feature range which are physically allowed on each feature.We applied this scaling to all features with the exception of the shower R for which it has been neglected.Fig. C2 shows the injected distributions for photons, neutrons and scaled neutrons; Fig. C3 shows the corresponding reconstructed features, whereas Fig. C4 shows the residuals; Fig. C5 shows the corresponding outlier score distributions obtained with F+M.The perturbed neutron outlier scores are on average higher than the original, given that the cAE is able to detect kinematic discrepancies introduced via the perturbation.

Appendix D. LHC generations
We conditionally generate data using the cMAF in a select kinematic region to demonstrate the quality of data produced.A central region of the of the phase space we are working with (600 GeV < p T j1 < 650 GeV ) is chosen, and used to compare three different quantities, namely the original features, reconstructions from the AE and the generations of the cMAF.For each original point within the phase space, we generate ten artificial data points, as such, generated distributions an order of magnitude larger in terms of sample size but Figure D1: Features of QCD dijet events at LHC: Original, reconstructed and generated features (top three rows), residuals and generated residuals (bottom three rows), for 600 GeV < p T j1 < 650 GeV .The cMAF is trained to generate the reconstructed feature space of the cAE as it was found to give better separation power due to skewing of OOD samples (i.e.Top Jets).The cMAF matches the distributions of the reconstructed QCD dijets to a very high degree.
have been normalized.We notice in this dataset the generation quality is not as good as for GlueX in Appendix C, this is due to the relatively low training sample size and  The training data set should ideally be more dense (in terms of continuous conditionals) which would allow smaller modeling with KDE, and overall a more robust learning phase for the cMAF.We can see that when training data is sufficient, the generations are extremely accurate (see Appendix C).The discrepancies in some variables undoubtedly affect performance although it can still be seen the clusterer is able to efficiently use injected data at the inference phase based off performance obtained.
Fig. D1 shows the distributions of the SM QCD dijet events at LHC, for the injected, reconstructed and generated data; Fig. D2 shows a comparison between the SM QCD and the BSM feature distributions.

Figure 1 :
Figure 1: Flowchart of the architecture in the inference phase: The flowchart is described from top to bottom, where the augmented object produced by the left column is compared to the reference cluster produced by the right column; (A): the inference object is sent through a cAE, in which the conditional learning as a function of the kinematics is done via concatenation.The cAE produces a reconstructed vector used to construct a residual vector (x − x).Reconstructed features and residuals are concatenated as a new augmented space; (B): cMAF is previously trained on the augmented space as a function of continuous conditions.The kinematic vector is mapped to a KDE functional, used to model sub-spaces of the flow-transformed distributions as a function of the kinematics.In the inference phase, the flow network is then fed a sample of Gaussian vectors produced via KDE.The data produced by the flow network is used to form the augmented reference cluster.(C): The comparison of the object with the reference cluster produces an outlier score used for classification.

Figure 2 :
Figure 2: Sketch of barrel calorimeter readout: (a) BCAL schematic; (b) a BCAL module side view; (c) end view of the BCAL showing all 48 modules and (d) an end view of a single module showing readout segmentation in four rings (inner to outer) and 16 summed readout zones demarcated by colors.More details can be found in [13].

Figure 3 :
Figure 3: Outlier scores: Average outlier scores (probability of not belonging to the photon class) in bins of 40 MeV and 2 cm.Signal particles (photons, left) display much lower outlier scores (color) on average than background (neutrons, right) across the grid of the phase space (x axis: Z position, y axis: Energy).There are no apparent kinematic dependencies on separation power.

Figure 4 :
Figure 4: Outlier scores as a function of leading jet transverse momentum: Outlier score as a function of leading jet transverse momentum for QCD dijets (left), and BSM dijets (right) at LHC.The 99% quantile threshold is overlayed.Opting for large values of the quantile results in isolating the tails of the complementary class distribution if features overlap to a high degree.The 1σ band has been calculated from reference clusters with 1.5k generated objects.

Figure 5 :
Figure 5: Dimensionally reduced representation of the reconstructed feature, residual and augmented spaces: t-SNE [27] is used to provide a 2D representation of the reconstructed features, residual and augmented space.(top row) the γ/n classification in GlueX BCAL (14 dimensions, 28 with augmentation);(middle row) same problem using 'scaled' feature distributions for neutrons; (bottom row) the QCD dijet problem at LHC (15 dimensions, 30 with augmentation).The residuals create nested clusters of higher density within the data space that are pushed radially outward from the main primary class cluster (middle column), thus allowing more accurate separation of the two classes at inference.There still exist nested clusters within the reconstructed feature space (left column), yet it is apparent that these are not as easily extracted, which explains the performance increase via the inclusion of residuals in the augmented space (right column).

Figure B1 :
Figure B1: 2D histograms of photon features (x-axis) and z position (or energy) (y-axis): We see a clear dependence on z and energy for most of the features within the space.(top two rows) Features such as T Width, and φ Width display less of a z dependence on average.(bottom two rows) Any feature corresponding to energy deposition within layers has a large dependence on z and E, along with width variables.By conditioning on z and E, we are able to capture the functional dependence of detector response at the generation stage.

Figure B2 :
Figure B2: 2D histograms of neutron features (x-axis) and z position (or energy) (y-axis): (top two rows) We see a lesser degree of z position dependence on neutron features in comparison to that of photons.Features with high dependence in the photon sample no longer exhibit the degree of functional relation in the neutron sample due to their differing interactions.(bottom two rows) We see a clear dependence on energy within the feature space: any feature corresponding to energy deposition within layers has a high dependence, along with width variables.The dependence differs from that of photons, but overlaps at certain regions in the phase space making separation of the two classes more difficult.

Figure C1 :
Figure C1: Features of reconstructed photon showers in GlueX BCAL:Original, reconstructed and generated features (top two rows), residuals and generated residuals (bottom two rows), for 1 GeV < E < 1.4 GeV, 206 cm < z < 218 cm.The cMAF is trained to generate the reconstructed space of the cAE as it was found to give better separation power due to skewing of OOD samples (i.e.neutrons).The cMAF matches closely the distributions of the sampled, reconstructed photons.

Figure C2 :
Figure C2: Photon and neutrons distributions: Photon and neutron distributions.Original and scaled neutron distributions are also shown for comparison.

Figure C3 :
Figure C3: cAE reconstructed photon and neutrons distributions: Photon and neutron distributions for the features reconstructed by the cAE.Original and scaled neutron distributions are also shown for comparison.

Figure C4 :
Figure C4: cAE residuals for photon and neutrons distributions: Photon and neutron distributions for the residuals obtained with the cAE.Original and scaled neutron distributions are also shown for comparison.

Figure C5 :
Figure C5: Outlier score distributions: Photon, neutron, perturbed neutron and quantile cut (coinciding with the TPR of XGBoost equal to 92.15%) distributions.The perturbed neutron outlier scores are on average higher than the original, given that the cAE is able to detect kinematic discrepancies introduced via the perturbation.

Figure D2 :
FigureD2: Features of QCD dijet and BSM dijet events at LHC: Feature distributions are integrated over the entire phase space.The features overlap to a high degree, yet the resulting means of distributions occupy different regions within the space.The resulting differences in kinematic correlations are able to be exploited via the residuals.

resulting large kinematic bins ( 1
GeV in transverse momenta) in the KDE modeling phase.

Table 2 :
Neutron sample study: The table shows a comparison of the performance obtained using two neutron samples, one assumed to be simulated and the other one considered as the "actual" detector response to neutrons.Differently from neutrons, photons simulations are more accurate and in agreement with real data.TPR is the true positive rate for photons and TNR the true negative rate for neutrons.XGBoost is trained on simulated photons and neutrons.F+M relies only on simulated photons.XGBoost TNR performance drops when deployed on actual data.The increased TNR of F+M depends on the residual space produced by the cAE which captures a different kinematic dependence of the neutron features compared to that of the photons in the actual case.

Table 6 :
Baseline hyperparameters of the F+M architecture: these values are not optimal, but have shown to be reliable as an initial starting point.Thorough optimization utilizing Bayesian processes is likely to further improve performance.

Table 7 :
Specs of the F+M architecture: the table reports the average inference time per particle, the inference memory and the training memory, i.e., the GPU memory required by the network during inference and training phases.Training was done utilizing google services on Nvidia A100-SMX4-40GB and Nvidia V100-16GB cards with TensorFlow 2.8.0 and TensorFlow-Probability 0.16.0 builds.Inference was supported via Compute Canada on Nvidia P100 PCIe 16GB card.The number of parameters correspond to the GlueX BCAL problem.The GlueX γ/n problem has slightly larger numbers due to extra layers in the cMAF needed to reproduce multi-modal distributions in the photon sample.clusterer.Training was done utilizing google services on Nvidia A100-SMX4-40GB and Nvidia V100-SMX2-16GB cards with TensorFlow 2.8.0 and TensorFlow-Probability 0.16.0 builds.Inference was supported via Compute Canada on Nvidia P100 PCIe 16GB card.17