Finetuning Foundation Models for Joint Analysis Optimization

In this work we demonstrate that significant gains in performance and data efficiency can be achieved in High Energy Physics (HEP) by moving beyond the standard paradigm of sequential optimization of reconstruction and analysis components. We conceptually connect HEP reconstruction and analysis to modern machine learning workflows such as pretraining, finetuning, domain adaptation and high-dimensional embedding spaces, and quantify the gains in the example use-case of searches for heavy resonances decaying via an intermediate di-Higgs system to four $b$-jets.


I. INTRODUCTION
Data analysis in High Energy Physics (HEP) aims to make inferences on fundamental theories of nature based on data recorded at large-scale experiments, such as those at the Large Hadron Collider (LHC). The observed data at such experiments originates from high-energy collisions, and their evolution is modeled by a deep hierarchy of physical models, describing e.g. the decay of particles, their subsequent radiation patterns and finally the interactions with the detecting instrument. Consequently, the primary approach in data analysis is that of hierarchical pattern recognition and inference: first, low-level patterns in the detector data are identified and used to reconstruct properties of particles that directly interacted with the detector. Based on these, the earlier stages of the data-generating process are reconstructed in a hierarchical fashion before inferences on the originating theory can finally be made. That is, the inference pipeline aims to approximately invert the data-generating process by progressively summarizing the data, reconstructing earlier latent states and subsequently analyzing those. Traditionally, the individual reconstruction and analysis algorithms are optimized sequentially (greedily), with late-stage algorithms being optimized on inputs of previously optimized earlier stages. While practical, it is unlikely that this strategy would yield the jointly optimal data analysis pipeline.
In this work, we show that significant gains in performance and data efficiency can be achieved by instead pursuing a more global gradient-based optimization strategy and modelling the data analysis approach after modern large-scale machine learning (ML) workflows with foundation models. As shown in Figure 1, these gains materialize as boosted performance at a fixed dataset size as well as improved data efficiency, i.e. fewer samples required to reach a desired level of performance. This paper is outlined as follows: In Section II we review relevant related work. In Section III we recall preliminaries from simulation-based inference and point out similarities between machine learning with foundation models and common practice in particle physics. Section IV introduces a demonstrator use-case for end-to-end optimization and discusses the datasets involved, whereas Section V discusses the neural network architectures and training strategies considered in the study. In Section VI we discuss the results, while giving an outlook towards future research directions in Section VII. Our main contributions are:
• We establish a correspondence between concepts in the HEP analysis workflow and those in modern deep learning such as foundation models, downstream tasks and finetuning to describe a general strategy for optimizing HEP data analysis pipelines.
• We demonstrate, to our knowledge for the first time, a finetuning workflow in the hierarchical setting of per-object representation and event-level inference within particle physics.
• We quantify the significant gains due to end-to-end optimization with respect to data efficiency and performance at fixed sample size.
• We provide evidence of successful domain adaptation in a hierarchical setting of HEP foundation models finetuned on datasets other than the one they are pretrained with.

II. RELATED WORK
This work connects to a larger body of research concerned with the optimization of HEP analysis and the role of processing low-level variables with deep-learning systems [1,2]. Early work on neural networks with inductive bias informed by quantum chromodynamics [3] investigated a hierarchical approach that jointly optimized a pipeline consisting of a neural embedding of jets followed by an event classification, but did not study performance under various pretraining strategies in detail. Increasingly, hierarchies of neural network algorithms are used within reconstruction for larger overall tasks, such as tracking [4][5][6] or particle flow reconstruction [7,8]. However, they are often greedily optimized due to nondifferentiable elements in the pipeline. To bridge this gap, approaches that enable gradient information to flow freely have grown into the rich research domain of differentiable programming, with e.g. differentiable vertexing [9], statistical inference [10][11][12][13], branching processes [14,15], matrix elements [16] or even detector design [17].
This work relies heavily on jet-level backbone models, which are primarily developed in the context of jet-tagging tasks [18,19]. Specifically, we use the transformer-based ParT [20] as a jet representation backbone, but the method can be extended to other jet-level models that access the full low-level constituent data, such as JetCLR [21], LorentzNet [22], or GN2X [23].
The notion of general-purpose foundation models that are pretrained and then finetuned is commonplace in computer vision [24][25][26] and natural language processing [27][28][29][30]. Often, such foundation models aim to develop a self-supervised pre-training strategy; however, supervised strategies are also common [31]. Increasingly, there are also efforts within the natural sciences to train and exploit general-purpose foundation models [32][33][34][35]. Domain adaptation has been investigated previously in high-energy physics in jet-tagging contexts [20,36] but to our knowledge not in hierarchical configurations. In parallel to the present effort on supervised backbones, investigations are ongoing on the potential of self-supervised backbones in HEP through masked particle modelling, which extends the masked language modelling approach from NLP to the HEP domain [37].

FIG. 2: Modern machine learning and HEP data analysis exhibit conceptual similarities. Reconstruction plays the role of a backbone or foundation model yielding a general-purpose representation of high-dimensional low-level data. The physics data analysis itself is a "head" that produces task-specific summary statistics.

A. Simulation-based Inference and Summary Statistics
The data analysis strategy described in Section I can be motivated and formalized through the lens of simulation-based (or likelihood-free) inference [38]. In HEP, the evaluation of the likelihood p(x|θ) of the observed data x given a theory θ is intractable due to the fact that the data-generating process proceeds through complex intermediate states that are not directly observed, such as particle decays, radiation effects and interactions with dense detector material. Formally, we can collect all such unobserved states into a single latent variable z. The likelihood-free nature then becomes apparent, as the evaluation of the likelihood would require the computation of a high-dimensional integral p(x|θ) = ∫ p(x|z) p(z|θ) dz. Inference in this setting is primarily enabled by the existence of high-quality simulators that encode the physics of the data-generating process, so that it is possible to obtain joint samples (x, z, θ) ∼ p(x|z) p(z|θ) p(θ) through ancestral sampling. A direct density estimation of p(x|θ) based on the resulting marginal samples x ∼ p(x|θ) is however impossible due to the high dimensionality of the data x, which denotes the readouts of O(10^8) sensors of modern physics experiments such as those at the LHC.
The dominant method to perform inference on the theory parameters θ is therefore through the density estimation of suitable low-dimensional summary statistics T : x → t, followed by standard statistical inference techniques. The computation of the summary statistic is often conceptually split into a reconstruction-level summary and an analysis-level summary. Formally, we can state that the goal of reconstruction is to map the low-level data x into an event record representation, i.e. an estimate ẑ of the latent state in the form of lists of particles in the event and their properties. The reconstruction step of the data analysis is generally thought of as a generic preprocessing step that already drastically reduces the dimensionality of the data and is highly interpretable. While it is important to note that reconstruction is not a monolithic neural network, but rather a complex composite of both non-neural and neural components, for the purposes of this discussion it can be thought of as a single parametrized function R_ρ : x → ẑ, where ρ stands for variable parameters that control the details of the reconstruction process.
The reconstruction phase is then followed by a more case-dependent analysis phase that drives the final inference. Again setting aside many important details of HEP analysis, we can define as its core the definition of a task-dependent summary statistic A_α : ẑ → t, where α denotes variable parameters. In many cases, the summary statistic is formed through training a neural network on an event-level binary classification task to distinguish signal events from background events. The final summary statistic is thus the composition T = A_α ∘ R_ρ : x → t. As summaries are in general lossy, results inferred from them are usually weaker than those that would be obtained if the full likelihood were available. An important question in particle physics is thus the optimization of the summaries and in particular their parameters (ρ, α).
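The two-stage structure of the summary statistic can be sketched in code. The functions below are purely illustrative placeholders for a reconstruction R_ρ and an analysis head A_α (the actual algorithms are complex composites, as noted above); only the compositional structure is the point here:

```python
# Illustrative sketch (not the paper's actual pipeline): a parametrized
# reconstruction R_rho followed by a parametrized analysis A_alpha.
def reconstruction(x, rho):
    # R_rho: low-level data x -> estimated event record z_hat
    return [rho * xi for xi in x]  # placeholder for a real reconstruction

def analysis(z_hat, alpha):
    # A_alpha: event record -> scalar summary statistic t
    return alpha * sum(z_hat)      # placeholder for a real analysis head

def summary(x, rho, alpha):
    # T = A_alpha o R_rho : x -> t
    return analysis(reconstruction(x, rho), alpha)

t = summary([1.0, 2.0, 3.0], rho=0.5, alpha=2.0)
```

Optimizing the pair (ρ, α) jointly, rather than fixing ρ first and then tuning α, is the central theme of the sections that follow.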
It is notable that both HEP and modern machine learning workflows based on foundation models exhibit a number of similarities regarding their use and optimization. The correspondence is sketched in Figure 2, and we describe it briefly in the next section.

B. HEP in the Language of Foundation Models
In modern ML practice based on foundation models, training often proceeds through two phases. In the first phase, models are trained on a large pretraining dataset using pretext tasks. These tasks often do not solve the task for which the model is ultimately used, but rather are designed to allow the model to create useful, semantically meaningful representations from low-level input data. Here, the models can be split into a backbone model that forms the representations and a pretext head that outputs the final prediction of this training stage.
In a second phase, the pretrained backbone (i.e. the model with the pretext head removed) is adjusted for the target downstream task by combining the backbone model with a suitable prediction head component, and the resulting composite model is trained on the target dataset. Here, two training strategies can be pursued that differ in computational complexity. In one mode, the backbone acts as a fixed feature extractor and only the head is optimized for the new downstream task. Alternatively, the backbone weights can be included in the second-stage optimization to yield a feature extractor that is finetuned to the downstream task at hand. Both of these strategies are to be contrasted with the "from-scratch" strategy, in which no pretraining occurs and the full composite model of backbone and target head is only optimized using the target dataset.
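Concretely, the strategies differ only in which parameters are trainable and how they are initialized. The following PyTorch sketch uses invented toy modules (not the networks studied in this work) to make the distinction explicit:

```python
import torch
import torch.nn as nn

# Toy backbone and downstream head; sizes are arbitrary illustrations.
backbone = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
head = nn.Linear(8, 1)

def configure(strategy, pretrained_state=None):
    """Set up 'frozen', 'finetuned', or 'from-scratch' training."""
    if pretrained_state is not None:          # frozen / finetuned start pretrained
        backbone.load_state_dict(pretrained_state)
    for p in backbone.parameters():           # freeze backbone only if requested
        p.requires_grad = (strategy != "frozen")
    # the head is always optimized on the downstream task
    trainable = [p for p in list(backbone.parameters()) + list(head.parameters())
                 if p.requires_grad]
    return torch.optim.Adam(trainable, lr=1e-3)

opt = configure("frozen", pretrained_state=backbone.state_dict())
```

For "from-scratch" training one would call `configure("from-scratch")` with no pretrained state, leaving all weights randomly initialized and trainable.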
The optimization of a reconstruction and analysis pipeline for data analysis in HEP proceeds along very similar directions. Reconstruction can be interpreted as a backbone model designed to provide physicists, interested in downstream physics tasks, with a useful general-purpose representation of the low-level event data. Viewed through this lens, we can recognize the reconstruction algorithms as feature extractors that are optimized on pretext tasks, usually in a supervised manner where the reconstructed event record ẑ is optimized to estimate the latent event record z.
Such pretext tasks include predicting (i.e. reconstructing) e.g. kinematic variables of the particles within the latent states, the particle type, or the true flavor of jets. Similarly to the pretraining dataset in ML workflows, the optimization of these algorithms is often carried out using large simulated samples of particle collisions that may not be used in the final analysis.
The downstream task in HEP is the analysis stage in the HEP pipeline, where the extracted features, i.e. the reconstructed event, are used as inputs to compute a suitable summary statistic. The setup is most similar to the "frozen backbone" model, where the event representation is fixed and only the downstream analysis itself is optimized for a physics task, such as a measurement of a particle property or a search for new particles. Here, the fixed reconstruction with parameters ρ* induces a distribution p_ρ*(ẑ, θ) for which samples are available to optimize the analysis parameters α. It is important to note that this sequential (greedy) strategy of first optimizing the reconstruction and then the analysis does not necessarily coincide with the joint optimum over (ρ, α). Thus, a joint optimization, e.g. through finetuning, would in general be desirable. While in general the reconstruction is thought of largely as a static summary, some sub-algorithms within it may be available in a discrete number of well-defined configurations referred to as "working points". It is common practice for analyzers to select a configuration particularly suited for their specific physics analysis among this discrete set of options. Viewed from a ML perspective, this may be interpreted as a basic approach to non-gradient-based finetuning.
Based on these correspondences, we can recognize an opportunity for a more complete and automated finetuning of reconstruction-level components, in the context of a joint optimization of the full analysis pipeline. In light of the general trends towards larger neural network components [39][40][41] and advances in differentiable programming, gradient information of the output of reconstruction algorithms with respect to their configuration parameters becomes increasingly accessible. Hence, a gradient-based finetuning and joint optimization as it is common in machine learning becomes possible by computing the gradient of the final event-level loss (e.g. binary signal vs. background classification) with respect to all differentiably connected components at both the analysis level and reconstruction level.
In addition to the optimization of the algorithms themselves, the choice of features that describe objects within the reconstruction is a regular target of optimization. For example, new jet-level observables, such as jet substructure variables [42,43] or jet-tagging scores, may be added to the reconstruction output if such features aid the downstream analysis-level processing. This choice is not unlike the choice of embedding dimension of the backbone output within ML foundation models. In this context, it is interesting to explore to what extent learned, instead of hand-engineered, features may aid downstream performance and how they can be finetuned.

IV. DEMONSTRATOR MODEL AND DATASET
We demonstrate the concept of analysis-level finetuning of neural reconstruction components in a simplified setting of a new resonance, a graviton G, decaying to two Higgs bosons, which in turn decay through the H → b b̄ channel. The final state to be analyzed is thus a multi-jet final state with G → HH → b b̄ b b̄. A typical analysis strategy would be split into two stages. At the reconstruction level, an "Xbb tagging algorithm" would typically be developed, i.e. a binary classifier that operates on the constituents of a large-radius jet and infers whether it originated from a H → b b̄ decay. At the analysis level, jets within the event would be analyzed to perform a full event classification to determine whether the event originated from a signal process or a background process such as multi-jet backgrounds. This work primarily investigates to what extent the reconstruction-level jet processing can be finetuned to yield an improved full-event classification performance. For the study two datasets are primarily used, which we describe briefly:

JetClass: This dataset [44] consists of 100M simulated anti-k_T R = 0.8 [45] "large-R" jets initiated from 10 different decay configurations of heavy states, including H → b b̄. This dataset is only implicitly used through the reuse of the published pretrained network weights in the domain adaptation studies. The decays were simulated through MadGraph [46] and the parton shower evolution of the final-state particles was simulated via Pythia [47]. The final data was then prepared through the Delphes [48] simulator.
CMS Open Data: This dataset [49] consists of 10M simulated events divided into QCD background and G → HH signal, where the signal is a mixture of mass points from 600 GeV to 4500 GeV. For pretraining on the Xbb jet task we use the dataset as a jet dataset with a total of 22M jets, while for end-to-end training we reshape the data such that data instances are full events with multiple jets. As the provided CMS dataset saves the jet-level information, we edited the HiggsToBBNtupleProducerTool released with the dataset to also save this event-level information. We consider a loose event selection criterion, keeping events with at least two large-R jets with p_T > 150 GeV. For the analysis classifier, we consider the five highest-p_T jets in the event, which keeps 99.5% of the true H → b b̄ jets in these graviton signal samples. When reporting performance of the analysis classifiers, these cuts define the denominator of the signal and background efficiencies. The dataset has been produced through the simulation and reconstruction pipeline of the CMS experiment [50].

V. ARCHITECTURE AND TRAINING STRATEGIES
We analyze the setup described above along two dimensions: architectural constraints and training strategies, with the goal to explore how much performance in downstream tasks can be gained by moving beyond the traditional HEP workflow.

FIG. 3: Hierarchical neural network structures considered in this work with decreasing levels of structural constraints and manually engineered features.

A. Architectures
Overall we investigate three possible architectures with a decreasing amount of interpretable physics structure to determine how much structure and manually engineered features are needed, or whether generic high-dimensional learned representations would suffice.
The networks consist of a reconstruction-level network for jets that operates on constituents and optionally may be augmented with physics-driven high-level features (HLF) to construct a jet representation. These representations of all jets within the event then enter a permutation-invariant analysis-level network. For the reconstruction-level network we use the transformer-based ParT architecture (sans the final softmax layer) to form embeddings of the jet constituents, which may optionally be projected to a scalar value through a linear layer. With the ParT network, we use the same inputs as proposed in Ref. [20]. For the analysis-level network a deep set [19,51] architecture is used to reflect the lack of inherent ordering of the reconstructed jets. The choice is made for simplicity, and additional performance may be achieved through more complex permutation-invariant architectures such as transformer networks. The architectures differ in the details of how the jet representation is formed, as shown in Figure 3, each progressively removing physics-motivated features and data flow in lieu of a less structured architecture.
Jet Scalar + HLF (S+HLF): This is the traditional HEP architecture, where particle jets are described by a small number of high-level features (HLF), which we keep to a minimum with five features: the kinematic variables p_T, η, ϕ, the jet mass m_j and the soft-drop mass m_sd [52]. In addition to these fixed features, we add a slot for one additional learned scalar feature. From the physics at hand, a natural use of this scalar value would be e.g. the classifier output of a pretrained Xbb-tagging algorithm. In the finetuned and from-scratch training configurations described below, the network may choose to use this learnable slot to propagate other summaries of the jet constituents through this bottleneck that may not correspond to an Xbb score.
Jet Vector + HLF (V+HLF): Instead of only allocating a single scalar, here the analysis-level network can circumvent the scalar bottleneck and access the raw latent vector representation of the reconstruction-network without the final projection to a scalar value.This may enable the analysis-level network to make use of a richer representation of the jet.
Jet Vector (V-Only): When the analysis-level network has access to a high-dimensional embedding of the constituents, one may hypothesize that the high-level jet features are not needed, as the corresponding information is already encoded within the latent embedding of the ParT backbone. In this architecture we drop all HLF and just use the latent jet embedding to pass information to the analysis-level network.
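The permutation-invariant analysis-level network common to all three architectures can be sketched as a minimal deep set: a shared per-jet map φ, sum pooling over jets, and an event-level map ρ. All dimensions and layer sizes below are illustrative assumptions, not the configuration used in the study:

```python
import torch
import torch.nn as nn

# Minimal deep-set sketch of the analysis-level network. Per-jet
# representations (e.g. ParT embeddings, optionally with HLF appended)
# are mapped by a shared phi, summed over jets (permutation invariant),
# then classified by rho into an event-level logit.
class DeepSetHead(nn.Module):
    def __init__(self, jet_dim=128, hidden=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(jet_dim, hidden), nn.ReLU())
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, jets):                  # jets: (batch, n_jets, jet_dim)
        pooled = self.phi(jets).sum(dim=1)    # sum pooling over the jet axis
        return self.rho(pooled)               # event-level logit

deepset = DeepSetHead()
logits = deepset(torch.randn(4, 5, 128))      # e.g. five highest-pT jets per event
```

Because the pooling is a sum over the jet axis, reordering the jets of an event leaves the output unchanged, which is the property motivating this choice in the text.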

B. Training Strategies
We pair the three architectures with three training strategies for the combined network consisting of reconstruction- and analysis-level components. The overall goal of the composition is to optimize the binary classification of the signal process against the multi-jet background, as measured through the standard binary cross-entropy.
Frozen Pretraining (frozen): This model resembles the traditional HEP workflow. The jet backbone model is trained on a reconstruction-level task and then frozen. The pretext task for this pretraining is the classification of jets as originating from an X → b b̄ decay chain, and the model is randomly initialized. The pretrained jet backbone is then integrated into the analysis as a frozen feature extractor, and only the analysis-level network is optimized on the resulting jet representation. In the S+HLF model the additional learned slot is then populated with the classifier output, whereas in the V+HLF and V-Only models the latent representation of the Xbb-tagger just before the classification head is passed to the analysis.

Finetuned Training (finetuned): In this model, the backbone is initialized to the pretrained weights, but during the training, gradient information is propagated to both the analysis- and reconstruction-level networks. That is, the jet backbone is allowed to adapt to the specific analysis environment to minimize the analysis-level loss. Thus, while e.g. in the S+HLF model, at initialization time, the scalar value passed to the analysis is exactly the Xbb score, during training the semantic meaning of this neuron may drift as the network learns to encode other types of information as well, making use of the notion of polysemanticity in neural networks [53].
No Pretraining (from-scratch): To assess the impact of pretraining and finetuning, we train the full composed network end-to-end from randomly initialized weights only on the final analysis-level classification task. In this model, the network is completely free to choose what information to propagate through the latent states. In particular, the scalar value in the S+HLF model need not be related to the probability of originating from an Xbb decay.
In order to avoid vanishing gradients impeding efficient gradient-based training, we remove the sigmoid activation in the S+HLF models for the finetuned and from-scratch configurations. The ParT backbone is trained following the training setup of the original paper: a binary cross-entropy loss is minimized by means of the Lookahead optimizer [54] with k = 6 and α = 0.5, and RAdam as inner optimizer [55] with β1 = 0.95, β2 = 0.999, and ϵ = 10^-5. The same setup is adopted when training the full pipeline end-to-end together with the analysis head network, while when training the head alone on a frozen jet representation we use the Adam optimizer [56]. A batch size of 512 for the backbone pretraining and 256 for the end-to-end analysis model, and a starting learning rate of 0.001, is employed, using a constant learning rate scheduler with warm-up whenever the backbone parameters are learnable. A model checkpoint is saved after every epoch and the one with lowest loss on the validation set is chosen for the final performance evaluation on the test set. The datasets are divided into training, validation and test datasets with a 45% / 5% / 50% split.
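The constant learning-rate schedule with warm-up mentioned above can be sketched as a simple step-dependent rule; the linear warm-up shape and the warm-up length are assumptions for illustration, not values quoted from the paper:

```python
# Sketch of a constant learning-rate schedule with warm-up: the rate
# ramps linearly to base_lr over warmup_steps, then stays constant.
# warmup_steps is an assumed hyperparameter.
def lr_at_step(step, base_lr=1e-3, warmup_steps=1000):
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps  # linear warm-up
    return base_lr                                  # constant afterwards
```

In a PyTorch training loop this rule would typically be wrapped in a `torch.optim.lr_scheduler.LambdaLR` applied to the optimizer.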

VI. RESULTS
We present the results primarily through comparing performance as a function of labeled examples in the final analysis-level signal-vs.-background (S/B) classification task. Reported signal efficiencies are inclusive over all graviton resonance masses. The shown uncertainty bands are based on the standard deviation of four independent runs. As a baseline model, we compare the results of the various combinations of architecture and training regimes to the S+HLF(frozen) setup, as this most closely resembles the standard HEP workflow of a fixed reconstruction on top of which an analysis is optimized. The main performance metric presented here is the background rejection (i.e. the inverse false positive rate). Results on alternative measures are presented in the appendix.
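The background-rejection metric can be computed directly from classifier scores, as sketched below; the score distributions and the helper function are purely illustrative, not taken from the paper's evaluation code:

```python
import numpy as np

# Background rejection (inverse false-positive rate) at a fixed signal
# efficiency, computed from per-event classifier scores.
def background_rejection(sig_scores, bkg_scores, sig_eff=0.90):
    # score threshold achieving the requested signal efficiency
    thr = np.quantile(sig_scores, 1.0 - sig_eff)
    fpr = np.mean(bkg_scores >= thr)          # background efficiency
    return 1.0 / fpr if fpr > 0 else np.inf   # rejection = 1 / FPR

rng = np.random.default_rng(0)
rej = background_rejection(rng.normal(2.0, 1.0, 10_000),   # toy signal scores
                           rng.normal(0.0, 1.0, 10_000))   # toy background scores
```

A larger rejection at fixed signal efficiency means fewer background events survive the selection, which is the sense in which the comparisons below are made.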
Fixed Performance Level: Alternatively, it is interesting to explore the number of training samples required to reach a given performance level. Two models may be able to reach the same performance, but the more data-efficient one will require fewer training examples, and thus computational resources, to reach it. We define the data efficiency as the ratio of the dataset size required by the baseline to reach a given performance level to that required by the model under consideration.
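A minimal sketch of this data-efficiency definition, interpolating performance-versus-size curves to find the dataset size at which each model reaches a target level; all curve values here are invented for illustration:

```python
import numpy as np

# Data efficiency: ratio of the dataset size the baseline needs to reach
# a target performance to the size the candidate model needs.
def samples_to_reach(sizes, performance, target):
    # interpolated smallest dataset size reaching `target` performance
    # (assumes `performance` is monotonically increasing with `sizes`)
    return np.interp(target, performance, sizes)

baseline_sizes = np.array([1e4, 1e5, 1e6, 1e7])   # toy training-set sizes
baseline_perf  = np.array([5.0, 20.0, 60.0, 100.0])   # toy rejection values
candidate_perf = np.array([40.0, 90.0, 150.0, 200.0])

target = 60.0
efficiency = (samples_to_reach(baseline_sizes, baseline_perf, target)
              / samples_to_reach(baseline_sizes, candidate_perf, target))
```

With these toy numbers the candidate reaches the target rejection with roughly 20x less data than the baseline; the real curves in the paper yield factors of up to 70x.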

A. Training Strategy Comparison
In Figure 4, we first compare the performance of each of the different architectures under the suite of training strategies described above. Here, we expect the pretraining to clearly outperform from-scratch training, as the pretext task is strongly suggested by the physics at hand. The relationship between finetuned and frozen backbones, however, is less clear. While the frozen backbone should provide a lower bound on the finetuning performance, the level of performance gain that finetuning may achieve depends strongly on the alignment of the pretext task and its learned representations with the downstream task. For example, if the pretrained representation of the jets within the event were a sufficient statistic for the inference target, finetuning would not be able to extract any more information from the low-level data. In the present example, however, we do observe a significant gain from finetuning, which manifests in an increase in background rejection of 1.5-4x at 90% signal efficiency, as shown in Table II. Expressed in terms of data efficiency, the finetuned models reach a high level of performance with up to 70x less data, as shown in Table III. Training the full architectures from scratch reaches high levels of performance but requires significantly more labeled examples. We point out that from-scratch training, when trained on sufficient data, eventually surpasses the performance of the frozen models.
We also explore the drift of the learned feature in the S+HLF models. In the frozen backbone, this scalar represents the probability of the jet to originate from a H → b b̄ decay. In the finetuned and from-scratch configurations this interpretation may not hold anymore, as the continued training of the jet-level backbone may overload this neuron semantically. We can investigate this learned scalar through the lens of an X → b b̄ classifier by adding a sigmoid activation to the scalar output of the non-frozen S+HLF models. We observe that indeed during finetuning the learned scalar feature drifted and its Xbb performance deteriorated, while the overall performance of the finetuned models surpasses the frozen model, as shown in Figure 5. Hence we hypothesize that during learning, the learned scalar is overloaded to encode multiple jet features relevant for the downstream task.
It is interesting to note that the learned scalar from the from-scratch configuration, despite never being supervised on the Xbb pretext task, ultimately achieves 43% of the performance of the supervised Xbb-pretraining, as shown in Table IV.

B. Architecture Comparison
We now compare the performance of the different architectures under a fixed training strategy to assess to what extent the models with less physics information can learn representations that are more effective at the downstream task. The results for the frozen training strategy are shown in Figure 6, indicating that with a fixed backbone the higher-dimensional embeddings do indeed carry more information than just the scalar Xbb score. However, they seem to not fully capture the information contained within the high-level features. This result renders the V+HLF(frozen) model the best performing, with a background rejection at 90% signal efficiency that is 14% higher than that of S+HLF(frozen). Furthermore, the model is up to 15× more data efficient than the baseline model. While the V-Only(frozen) model initially outperforms the baseline, with sufficient training data the baseline model eventually surpasses it in performance. For the finetuned and from-scratch trained models, where the latent representation of the backbone can be adjusted to the downstream task, the missing information can be recovered, as shown in Figure 8 in the appendix. Hence, both the V-Only and S+HLF models generally reach the same level of performance with only minimal differences.

C. Domain Adaptation
A key aspect of foundation models in modern ML practice is their ability to form representations that may be transferable to new datasets, and similar notions are also relevant in HEP. For example, while the target dataset in the above study is only of moderate size (22M jets), there are much bigger datasets available for a similar task, such as the JetClass dataset described in Section IV, which contains 100M jets. While those datasets are generated using different simulators and thus do not match directly at the distribution level, i.e. they represent different domains, the underlying physics is largely similar. Therefore, domain adaptation may be possible such that pretraining on datasets other than the target dataset benefits the overall performance. The parameters of a ParT network, optimized for a 10-way multi-class inference of the originating decay chain of the jets in the JetClass dataset, have been made publicly available together with the dataset release [20]. We can therefore add one additional variant to each of the three training strategies.
JetClass-pretrained Initialization (JetClass init): For the two strategies with pretraining on the Xbb task on the CMS Open Data dataset (frozen and finetuned), the pretraining itself is initialized not randomly but from the published weights resulting from the multi-class training on JetClass. Similarly, in the from-scratch case, where no pretraining happens on the target dataset, the end-to-end training is initialized with the published weights as well.
As shown in Figure 7 and Table V, we observe a significant improvement in performance for the models initialized from JetClass-pretrained weights. The performance gain is present in both finetuned and from-scratch models. We note that successful domain adaptation may open up interesting opportunities for cross-experiment pretrained foundation models in particle physics. The JetClass-initialized finetuning configurations are also shown as dashed curves in Figure 4, where this configuration is consistently the best performing one.

VII. CONCLUSIONS
In this work we investigated the possibility of adapting large-scale machine learning workflows from foundation models to particle physics. To this end we first developed a conceptual connection between ideas from modern machine learning, such as foundation models, pretraining, finetuning, pretext tasks and vector embeddings, and those that are common during the optimization of a particle physics analysis, such as reconstruction, tagging and analysis. We then explored these ideas in a case study of a Beyond Standard Model search, where the signal is defined as a heavy resonance decaying to two Higgs bosons, which in turn each decay via H → b b̄. In particular, we focused on establishing a performance hierarchy between training strategies: to what extent is finetuning advantageous over a frozen backbone trained on a physics-defined pretext task (here: Xbb-tagging), and how much does the physics-based pretraining help over a direct end-to-end training of the downstream task?
We observe that finetuning does indeed add significant performance to the models, measured both at fixed dataset sizes and in data efficiency. Depending on the model, finetuning can yield a background rejection up to a factor of two larger than the frozen backbone, and can reach a desired level of performance with 10-100 times fewer training examples. At the same time, the gap from the frozen to the from-scratch models is significant in both dimensions, but is reduced with sufficiently many training examples, where models trained from scratch can surpass frozen models due to being able to adjust the reconstruction-level representation of the low-level data.
We identify two important research questions that go beyond the scope of this work, but build on its results. First, in light of the apparent benefits of reconstruction-level finetuning with respect to a downstream analysis-level task, the question of integrating and automating calibration techniques becomes important. One of the major benefits of a common, frozen backbone is the ability to correct simulation towards calibration data, which would now have to be done in-situ. Second, we recognize the interplay between designing valuable pretraining tasks and the need for finetuning. Observing significant benefits from finetuning suggests it may be possible to re-capture parts of the additional performance by understanding their physical origin and designing better pretrained representations that go beyond e.g. simple Xbb-tagging. If successful, the gap between frozen and finetuned models may be closed. We leave both research questions to future work.

FIG. 1: Strategies from modern machine learning such as large-scale pretraining, finetuning, domain adaptation and high-dimensional embeddings (green curves) can lead to significant performance gains over the traditional HEP approach, denoted here as S+HLF (frozen). Top: Performance evolution as a function of training dataset size. Bottom: Final performance at 10M training samples.

FIG. 4: Performance as a function of labeled examples across three training strategies, shown for the investigated architectures. For all architectures we see a significant benefit from finetuning over a frozen backbone. Pretraining is significantly more performant than training from scratch. For very large datasets, from-scratch training can exceed a frozen backbone.

FIG. 5: Top: Performance metrics of S+HLF for pretext (left) and downstream (right) tasks. In finetuned training the learnable scalar in S+HLF trades off Xbb performance against downstream task performance. In from-scratch training Xbb-tagging emerges as a useful subtask without supervision. Bottom: Xbb performance of the learned scalar feature as a function of training samples.

FIG. 6: Performance metrics for the frozen configuration across architectures. We observe that higher-dimensional embeddings show improved performance.

FIG. 7: Initializing the jet-level networks in the Xbb pretraining (finetuned) or the end-to-end downstream task training (from-scratch) with the JetClass-trained network parameters boosts performance significantly.

FIG. 10: SIC performance as a function of labeled downstream examples. Methods from foundation models such as large-scale pretraining, finetuning and high-dimensional embeddings yield significant benefits in performance and data efficiency over the baseline (S+HLF).

FIG. 11: AUC performance as a function of labeled downstream examples. Methods from foundation models such as large-scale pretraining, finetuning and high-dimensional embeddings yield significant benefits in performance and data efficiency over the baseline (S+HLF).

FIG. 12: AUC performance as a function of labeled examples across three training strategies, shown for the investigated architectures. For all architectures we see a significant benefit from finetuning over a frozen backbone. Pretraining is significantly more performant than training from scratch. For very large datasets, from-scratch training can exceed a frozen backbone.

TABLE I: Shared concepts between modern machine learning with foundation models and current practice in High Energy Physics.

TABLE II: Background rejection at 90% signal efficiency for the nine investigated configurations.

TABLE III: Data efficiency with respect to the S+HLF frozen model at 90% signal efficiency for the nine investigated configurations.

TABLE IV: Performance of the scalar feature at 90% signal efficiency in trained S+HLF networks on Xbb-tagging.