FAIR AI Models in High Energy Physics

The findable, accessible, interoperable, and reusable (FAIR) data principles provide a framework for examining, evaluating, and improving how data is shared to facilitate scientific discovery. Generalizing these principles to research software and other digital products is an active area of research. Machine learning (ML) models -- algorithms that have been trained on data without being explicitly programmed -- and more generally, artificial intelligence (AI) models, are an important target for this because of the ever-increasing pace with which AI is transforming scientific domains, such as experimental high energy physics (HEP). In this paper, we propose a practical definition of FAIR principles for AI models in HEP and describe a template for the application of these principles. We demonstrate the template's use with an example AI model applied to HEP, in which a graph neural network is used to identify Higgs bosons decaying to two bottom quarks. We report on the robustness of this FAIR AI model, its portability across hardware architectures and software frameworks, and its interpretability.


Introduction
Breakthroughs in machine learning (ML) and artificial intelligence (AI) have had a major impact on a range of scientific disciplines, including high energy physics (HEP), which is the study of the fundamental constituents of matter and their interactions. In HEP, multiple experimental collaborations have used ML techniques extensively to address a broad range of problems. For example, they were integral to the 2012 discovery of the Higgs boson [1,2] and the subsequent observation of its decay to bottom quarks [3,4] at the CERN Large Hadron Collider (LHC), where they were used to identify the nature and origin of 'jets' of particles produced in proton-proton collisions. In another significant application, ML was used to identify in real time about 1000 events of interest from the forty million background events produced each second at the LHC [5,6]. To maximize the scientific impact and utility of AI models in HEP, we propose a set of findable, accessible, interoperable, and reusable (FAIR) principles for them.
Our approach is inspired by community-wide initiatives that have produced guiding principles to maximize the reuse and scientific reach of digital assets. Specifically, the FAIR principles were originally introduced [7] as guidelines for the management and stewardship of scientific datasets to optimize their reuse. Recently, the FAIR4RS working group has developed an interpretation of the FAIR principles specifically for research software [8][9][10][11], and FAIR principles have also been applied in the context of benchmarking and tool development [12] and to the creation of computational frameworks for AI models [13].
While these are important steps, these prior interpretations of FAIR principles are not readily applicable to AI models, which are conceptually and structurally different from data and research software. Elucidating the details needed for a robust and general definition of FAIR principles for AI models requires application-specific benchmarks. To address these challenges, we propose an operational definition of FAIR for HEP AI models, focusing on pre-trained models used to make predictions on HEP data. These principles are intended to promote research reuse and reproducibility, which are known challenges in AI-driven scientific application research [14]. In addition, we present a method to automate the production, standardization, and publication of Python-based FAIR AI models in HEP.
To illustrate our proposed FAIR AI model definition in the context of HEP, we use a FAIR dataset to create and publish a FAIR AI model. Specifically, we use a simulated Higgs boson dataset distributed by the CMS Collaboration [15][16][17]. This FAIR dataset has been used for ML studies [18], college courses [19,20], and tutorials [21]. We create a FAIR version of an interaction network (IN) AI model for Higgs boson identification [18], and show how adopting our FAIR principles simplifies porting the model across different hardware architectures and software frameworks and facilitates the study of its interpretability.
This paper is organized as follows. Section 2 outlines the methods used: Section 2.1 describes related work and a formulation of FAIR principles for AI models; Section 2.2 introduces an AI project template; Section 2.3 summarizes how the template maps to FAIR principles; and Section 2.4 describes an example of the application of FAIR principles, where we take a previously published AI model in HEP [18] and make it FAIR. Next, Section 3 discusses the portability and interpretability of this model, as enabled by the FAIR principles. Finally, Section 4 summarizes the paper.

FAIR principles for AI models in HEP
Substantial work has been done to investigate how to apply the FAIR principles to research software [8][9][10][11]. The design, optimization, and training of ML models combine disparate digital assets, including research software, data, libraries and tools, workflows, and an expanding ecosystem of hardware architectures. Depending on the use case, AI models can often be optimized to be faster, more parallel, or better utilize the underlying hardware within different software toolkits. To minimize misinterpretation, the reproducibility and reusability of AI models require details of provenance for the entire discovery cycle. In addition, to execute the AI model on a new dataset, including new data that has not been preprocessed, an exact recipe of the data preparation and preprocessing steps is required, such as the units used to express the data features [22].
Operationally, an AI model is usually instantiated in a software framework, such as Scikit-learn [23], TensorFlow [24], PyTorch [25], XGBoost [26], or ONNX [27], and may be serialized in a file on disk. The storage of models within these formats can vary from low-level hardware-optimized intermediate representations (IRs) to high-level IRs, leading to different inference results or performance. In addition, preparation and preprocessing steps, which can have an impact on the model, can be specified either in separate scripts or as layers integrated into the model. There are efforts to share such code as open-source GitHub repositories, for example through Papers with Code [28]. However, it has been observed that these repositories are often incomplete, lacking key information, and not maintained, making the results difficult to reproduce [29,30,14]. This has led to the establishment of AI reproducibility challenges [30,31]. In light of these considerations, we propose the following definition for a FAIR AI model, aimed at meeting the high-level goals of F, A, I, and R (the four foundational principles) in the original FAIR data principles [7] for AI models: An AI model consists of the architecture (computational graph) and a given set of parameters, which can be expressed as source code files or executables needed to run inference (i.e., produce outputs) on a data sample. A FAIR AI model is an AI model that satisfies the properties listed in Table 1. In brief, (F) the model and its associated metadata are easy to find for both humans and machines, (A) the model and its metadata are retrievable via standardized protocols, (I) the model interoperates with other models, data, and/or software, and (R) the model is both usable and reusable.
For an ML model to be FAIR, we stress that, first, the dataset used to train the model must be FAIR and follow domain-relevant community standards, because the dataset is an essential part of the ML model's provenance. In Table 1, we present a set of proposed FAIR AI principles, adapted from the FAIR principles created for research software [10] by the Research Data Alliance (RDA) FAIR for Research Software (FAIR4RS) Working Group [8-10, 32, 33]. This set of principles has been given to the RDA FAIR for ML (FAIR4ML) interest group [34] that formed in September 2022. We believe that these guidelines are the minimum criteria for a model to be considered FAIR. However, additional criteria may be necessary to truly ensure a shareable, reproducible, and extendable ML model.
A critical challenge in ensuring reproducibility is that of backend optimizations. The output of the AI algorithm can be affected by changes in the operation order, operation precision, and parallelization strategy. Currently, frameworks such as PyTorch and ONNX have different intermediate representations (IRs), which can lead to different outputs depending on how the model is initialized or compiled. These differences can be substantial even when the same hardware is used [35]. Moreover, specific processor types may have limitations in the bit precision of various operations. Differences in precision can lead to substantial deviations, rendering exact reproducibility across processors nearly impossible. As a consequence, for the purposes of this discussion, we refer to reproducibility as the ability to produce results that are statistically consistent in aggregate on a large scale but that, for a single inference on the same data, may deviate within a specified tolerance.
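The tolerance-based notion of reproducibility described above can be illustrated with a minimal sketch. The function name, the tolerance values, and the example scores are assumptions chosen for illustration; they are not taken from the model studied in this paper.

```python
import math


def outputs_agree(reference, candidate, rel_tol=1e-5, abs_tol=1e-5):
    """Return True if every pair of outputs agrees within the given tolerances."""
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


# Hypothetical class probabilities for the same input on two backends.
backend_a = [0.912345, 0.087655]
backend_b = [0.912347, 0.087653]

exact_match = backend_a == backend_b                   # strict equality fails
tolerant_match = outputs_agree(backend_a, backend_b)   # agreement within tolerance
```

In practice, the tolerance would be chosen per model and per numerical precision, and recorded alongside the model metadata so that others can repeat the comparison.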

Cookiecutter4fair: FAIR AI project template
Software templates can be used to encourage good practices; Cookiecutter Data Science [36] is one such template that is specifically oriented toward data science projects. It consists of a logical, reasonably standardized, but flexible project structure hosted on GitHub for performing and sharing data science work. We took inspiration from this and created a fork of this template generator, called cookiecutter4fair [37], with additional features to promote the adoption of our FAIR principles. Other tools, like Showyourwork [38], specifically address the issue of reproducibility in science.
Table 1. Proposed FAIR principles for fully trained AI models used for AI inference only, based on adapting the original FAIR principles by initially replacing data by AI models and then making further changes based on the characteristics of AI models versus datasets and the ways they are developed, shared, searched for, and used. These proposed principles could be further extended for retraining use cases by amending our proposed definition for the 'Reusability' principle.
F: The AI model, and its associated metadata, are easy to find for both humans and machines.
F1. The AI model is assigned a globally unique and persistent identifier.
F2. The AI model is described with rich metadata.
F3. Metadata clearly and explicitly include the identifier of the AI model they describe.
F4. Metadata and the AI model are registered or indexed in a searchable resource.

A: The AI model, and its metadata, are retrievable via standardized protocols.
A1. The AI model is retrievable by its identifier using a standardized communications protocol.
A1.1. The protocol is open, free, and universally implementable.
A1.2. The protocol allows for an authentication and authorization procedure, where necessary.
A2. Metadata are accessible, even when the AI model is no longer available.

I: The AI model interoperates with other models, data, and/or software by exchanging data and/or metadata, and/or through interaction via application programming interfaces (APIs), described through standards.
I1. The AI model reads, writes, and exchanges data in a way that meets domain-relevant community standards.
I2. The AI model includes qualified references to other objects, including the (FAIR) data used to train the model.

R: The AI model is both usable (for inference) and reusable (can be understood, built upon, or incorporated into other models and/or software).
R1. The AI model is described with a plurality of accurate and relevant attributes.
R1.1. The AI model is given a clear and accessible license.
R1.2. The AI model is associated with detailed provenance, such as information about the input data preparation and training process.
R2. The AI model includes qualified references to other models and/or software, such as dependencies.
R3. The AI model meets domain-relevant community standards.
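As an illustration of how some of the principles in Table 1 might be checked mechanically, the sketch below validates a metadata record against a handful of required fields. The field names and the mapping from fields to principles are hypothetical choices for this example; they are not a schema defined in this paper.

```python
# Hypothetical mapping from metadata fields to the Table 1 principles.
REQUIRED_FIELDS = {
    "identifier": "F1",
    "description": "F2",
    "license": "R1.1",
    "training_data_identifier": "I2",
    "dependencies": "R2",
}


def missing_principles(metadata):
    """Return sorted principle labels whose metadata field is absent or empty."""
    return sorted(
        principle
        for field, principle in REQUIRED_FIELDS.items()
        if not metadata.get(field)
    )


# Placeholder values only; these are not real identifiers.
record = {
    "identifier": "doi:10.0000/example-model",
    "description": "Example jet classifier",
    "license": "MIT",
    "dependencies": ["torch"],
}
gaps = missing_principles(record)  # the training data reference is missing
```

A check of this kind only confirms that fields are present; whether the metadata are rich and accurate still requires human review.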

Usage
The project template is designed to be used with the cookiecutter [39] program, a command-line utility that creates projects from project templates using the Jinja2 [40] templating engine, and that can be installed via pip. A new FAIR AI project can be made with the command cookiecutter https://github.com/FAIR4HEP/cookiecutter4fair, where the first argument corresponds to the project template that is hosted on GitHub. After asking the user for the project name, repository name, author name, author ORCID, description of the project, chosen license, DOI for the input data, DOI for the code (if available), and whether to include a template Dockerfile, cookiecutter will create the template structure as shown in Fig. 1.
The questions that the repository asks the user upon project creation can be found and modified in the file cookiecutter.json. The Makefile contains commands that allow the user to perform common project tasks, such as downloading the data, setting up the test environment, converting the dataset, and training and evaluating the model. It also contains global variables obtained from cookiecutter.json. This procedure makes it explicit that the analysis operations form a directed acyclic graph (DAG).
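To make the DAG structure concrete, the following minimal sketch orders a set of analysis steps by their dependencies using Python's standard graphlib module. The step names and their dependencies are hypothetical and only mirror the kinds of tasks listed above; they are not the actual Makefile targets of the template.

```python
from graphlib import TopologicalSorter  # available in Python 3.9 and later

# Hypothetical analysis steps; each step maps to the steps it depends on.
dependencies = {
    "sync_data": set(),
    "convert_data": {"sync_data"},
    "train": {"convert_data"},
    "evaluate": {"train"},
}

# static_order() yields the steps so that dependencies always come first.
order = list(TopologicalSorter(dependencies).static_order())
```

If a cycle were introduced by mistake, TopologicalSorter would raise graphlib.CycleError, which is the same constraint that a Makefile imposes on its targets.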
If the data is hosted on Zenodo [41], the user can download the data from the DOI link by invoking make sync_data_zenodo, which uses the zenodo_get command line utility [42] to download the data. The Dockerfile can be built and run to provide a Python environment for the project, which installs the dependencies specified in requirements.txt. When the Docker image is built, it can be run interactively with the command docker run -d -t <image name>. The pre-project and post-project scripts are automatically run before and after the project directory is generated and provide additional flexibility. After the project template has been generated, the user can organize their source code and documentation in order to follow the FAIR principles.

Design considerations based on FAIR principles
Findable There are many ways to ensure findability for AI models once they are created and published. Simple ways include uploading them to GitHub, GitLab, or BitBucket. Several efforts aim to create "model commons," hubs in which models can be shared. Among these are DLHub [43,44], OpenML [45], MLCommons [46], AI Model Share [47], and Hugging Face [48]. If a publication or arXiv preprint is associated with the software, the code repository can also be linked to it via Papers with Code [28]. However, this does not really support the findability principle.
To improve findability, Zenodo [41] can be leveraged to generate a DOI for the repository, as well as to store metadata. Recently, HuggingFace also enabled the ability to generate DOIs for both datasets and models [49]. Ideally, we would like a way to search all these repositories at the same time. This would require that they each expose a machine-accessible search mechanism, ideally using a common standard, and that there is a way to perform a federated search across the full set of repositories.
Accessible Accessibility is another place where standardization is needed. Specifically, we need a standard, open, free protocol for retrieving a model from an identifier. Then the various model repositories would need to implement the server side of this protocol, and community members would likely then implement the client side of the protocol in common tools in Python, R, and other programming languages.
Interoperable To ensure interoperability, the metadata describing the AI model must thoroughly document all aspects of its structure, training, and inputs, including any preprocessing needed for the raw data and the provenance of the data. To enable machine interoperability, standardized APIs, such as those associated with DLHub, HuggingFace, or NVIDIA Triton Server, can be used [50].
Reusable To enable reusability, it is important to specify the software, tools, and dependencies needed to seamlessly invoke an AI model to extract knowledge from datasets in a given computing environment. This process should be hardware agnostic. This may be accomplished by using container solutions, such as Docker [51] or Apptainer [52].
Reusability for inference only requires fully trained ML models. In this context, a trained ML model may be reusable as the backbone to develop another model or to fine-tune it to perform a different task; e.g., the WaveNet model [53], originally developed for text-to-speech and music generation, has been adapted for classification and regression tasks in astrophysics [54,55]. Recent approaches based on "foundation models" [56], in which large models (sometimes containing up to $10^9$ parameters) are pre-trained on unlabeled datasets and subsequently fine-tuned for downstream tasks, illustrate the need for reusability at large scale. These approaches envision the creation of a small collection of general-purpose AI models that may be reused for a large class of tasks.
Other considerations Optimally deploying models on a given hardware processor often involves modifying the internal structure of the model to better utilize the hardware resources. These optimizations correspond to transformations of IRs, specified, e.g., in ONNX or the more flexible Multi-Level IR (MLIR) [57]. These transformations can change the numerical output values of models, affecting their reproducibility. There has been limited broad-scale acceptance of a standard IR for AI models. In place of this, appropriate metadata describing the hardware used and any hardware-specific optimizations is needed to ensure the model can be reliably reproduced.
In some ways, a higher standard than FAIR is full reproducibility. Ensuring reproducibility requires clearly communicating the details of the full end-to-end AI cycle encompassing data collection and curation, API selection for model R&D, hyperparameter optimization, design of domain-inspired loss functions, distributed training schemes, optimizers, random/frozen initialization of weights, data split choices for training, validation, testing and quantization, data loaders, hardware used, and hardware-specific optimizations, among other details. The diverse and rather disparate portfolio of available choices, and the different levels of AI and computing skills of end users, may mean that full reproducibility is not possible. In this article, we propose a minimum and achievable standard of FAIR principles in the context of AI models used for inference.

Mapping to FAIR principles
Table 2 summarizes how the features of the cookiecutter4fair AI project template map to the proposed FAIR principles for AI models. Most aspects are fully automated, such as the creation of a license file and a Dockerfile for creating an environment. Some aspects are partially automated, such as uploading the model to Zenodo. In particular, the GitHub-Zenodo bridge can be enabled from the Zenodo web interface, which automates the generation of an updated entry for each new release on GitHub. The cookiecutter4fair repository template populates a CITATION.cff file [58] with citation metadata, which can then be used by Zenodo. Finally, other aspects are not automated and require additional manual steps, such as uploading the model to DLHub as described above.

FAIR implementation of H → bb interaction network
The Higgs boson is a linchpin of the standard model (SM) of particle physics. It is a byproduct of the mechanism that generates masses for all elementary particles. Studying its properties, such as its production and decay rates, is one of the overarching goals of the CERN LHC program, and any deviations measured with respect to the SM may give a hint to elusive new physics. The Higgs boson most commonly decays (about 58% of the time) to a bottom quark-antiquark pair (bb). Traditionally, this is a difficult decay of the Higgs boson to study because there is a large background consisting of jets produced through the strong interactions. These are known as quantum chromodynamics (QCD) multijet events. ML models, especially graph neural networks (GNNs) [59,18], have been shown to dramatically improve the rejection of this background while retaining high H → bb detection efficiency, thus enabling the study of this decay mode. In this section, we provide a concrete example of implementing one such model, which is an interaction network (IN) model described in Ref. [18], following our recommendations for a FAIR AI model. The data structure in HEP is defined around the concept of events. These are discrete moments where all the particles arising from a single proton-proton collision are measured by a detector and recorded. Each event is independent of all the other events. A dataset may consist of several millions of events. Several salient features used to identify events with a H → bb decay and separate them from the much larger QCD background are illustrated in Fig. 2.
At the LHC, for each event, particle candidates are reconstructed from detector measurements and clustered into cone-shaped jets, which attempt to capture most of the energy from a single particle produced in the collision, such as a Higgs boson. Charged particles produced in the collision are detected, and their momenta and directions are measured in a tracking detector. These tracks are collected to form jets. There is a special class of jets from bottom quarks where the particles travel a measurable distance from the collision vertex before decaying to other particles, forming a so-called secondary vertex (SV). It is this class of jets that we are searching for when we search for H → bb decays.
Figure 2. Illustration of a H → bb jet with two secondary vertices (SVs) from the decay of two bottom hadrons resulting in charged-particle tracks (including a low-energy, or soft, lepton) that are displaced with respect to the primary collision vertex (PV), and hence have a large impact parameter (IP) value.

Interaction network model
The IN model was first proposed [60] in order to explore the evolution of physical dynamics and was later adapted for the task of jet classification; in this case differentiating H → bb jets from QCD jets [18]. The dataset for training, validation, and testing is derived from the CMS open simulated dataset with 2016 conditions that is available from the CERN Open Data Portal [15]. It consists of jets, decomposed into constituent charged particle tracks, and SVs, labeled as either H → bb signal or QCD background. More information on the dataset can be found in Chen et al. [16]. Figure 3 shows the IN model architecture and Table 3 provides the values of the model hyperparameters as well as input data dimensions for the baseline model. For a detailed description of the model and chosen hyperparameters, see Moreno et al. [18]. As discussed in Moreno et al. [18], graphs are natural data structures to describe jets because they are permutation invariant (i.e., there is no preferred order to the constituents of the jet), they can accommodate variable-sized objects (i.e., jets may be composed of a few or many constituents), and they can describe entities as nodes (i.e., constituents) and their relations as edges. This network was trained on graph data structures based on up to $N_p = 30$ particle tracks, each with $P = 60$ features, and up to $N_v = 5$ SVs, each with $S = 14$ features, associated with the jet. The physical description of each feature is given in Appendix C of Moreno et al. [18].
Two input graphs are used: a fully connected directed graph with $N_{pp} = N_p (N_p - 1)$ edges between the particle tracks and a separate graph with $N_{vp} = N_v N_p$ connections between the particle tracks and the SVs. The node-level feature space of the fully connected track graph is transformed to edge-level features via two interaction matrices, identified as $R_R$ and $R_S$, each of size $N_p \times N_{pp}$, where the former accounts for how each node receives information from other nodes and the latter encodes the information about each node sending information to other nodes. The track-vertex graph is transformed by similarly defined interaction matrices: $R_K$ of size $N_p \times N_{vp}$ and $R_V$ of size $N_v \times N_{vp}$. The feature spaces of these graphs are transformed via nonlinear functions, respectively called $f_R^{pp}$ and $f_R^{vp}$, to obtain two $D_E$-dimensional internal state representations of these graphs. These nonlinear functions are approximated by fully connected multilayer perceptrons (MLPs).
These internal state representations, respectively given by the matrices $E_{pp}$ of size $D_E \times N_{pp}$ and $E_{vp}$ of size $D_E \times N_{vp}$, are transferred back to the particle tracks by transforming them with the matrices $R_R^T$ and $R_K^T$. These transformed particle-level representations are given by the matrices $\bar{E}_{pp}$ and $\bar{E}_{vp}$, each of size $D_E \times N_p$, respectively. Concatenating these particle-level internal state representations with the original track features creates a feature space with a dimension of $(P + 2D_E)$ for each of the $N_p$ tracks. The function $f_O$, represented by a trainable dense MLP, creates the post-interaction $D_O$-dimensional internal representation that is stored in the matrix $O$ of size $D_O \times N_p$. Finally, these track-level internal representations are summed to obtain a $D_O$-dimensional state vector $\bar{O}$ and linearly combined to produce a two-dimensional output, which is transformed to individual class probabilities via a softmax function.
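The receiving and sending interaction matrices for the fully connected track graph can be sketched in a few lines of plain Python. This is an illustrative construction under stated assumptions: the edge ordering and the function name are choices made for this example and are not necessarily those used in the published implementation.

```python
def interaction_matrices(n_nodes):
    """Build receiving and sending matrices for a fully connected directed
    graph without self-loops; each has shape n_nodes x n_nodes*(n_nodes-1)."""
    # Assumed edge ordering: loop over receivers, then over senders.
    edges = [
        (receiver, sender)
        for receiver in range(n_nodes)
        for sender in range(n_nodes)
        if receiver != sender
    ]
    # Entry [node][edge] is 1 when the node is the receiver of that edge.
    receiving = [
        [1 if receiver == node else 0 for receiver, _ in edges]
        for node in range(n_nodes)
    ]
    # Entry [node][edge] is 1 when the node is the sender of that edge.
    sending = [
        [1 if sender == node else 0 for _, sender in edges]
        for node in range(n_nodes)
    ]
    return receiving, sending


r_r, r_s = interaction_matrices(3)
n_edges = len(r_r[0])  # 3 * (3 - 1) directed edges
```

Each column of either matrix contains exactly one nonzero entry, because every directed edge has one receiver and one sender, and each row contains one entry per neighbor of the corresponding node.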

FAIR implementation
We created a FAIR implementation of the AI model hosted on GitHub and Zenodo [61].The repository was initialized using the template described in Section 2.2.
Features The repository includes a dataset processing script that converts the raw data from the CERN Open Data portal.It also has training and prediction scripts to reproduce the published results.As described above, Makefile contains all of these  These images are prebuilt and hosted on DockerHub.We also automated documentation generation, training and inference workflows, Docker container building, with continuous integration through GitHub Actions.Finally, a DOI is generated using the Zenodo-GitHub bridge, in which a new DOI is minted for each new release of the software on GitHub.

Deployment to DLHub
We have made the trained ML model accessible [62] and reusable for inference by making it publicly available via DLHub [63,43]. DLHub provides a custom software development kit (SDK) called dlhub_sdk that allows users to package and preserve a trained model with necessary dependencies, including packages with specific versions, custom modules, and serialized data and model files. Once a model has been published, its dedicated API can be used to run remote inference tasks using funcX, a fire-and-forget remote function execution service that elastically deploys workers and containers across nodes in clouds, clusters, and supercomputers [64]. The process of making a model available is simplified with a notebook template made available by DLHub developers. This notebook requires the user to implement the inference code as a function that is executed during model calls, and to declare model-specific dependencies and associated metadata. The notebook template is accompanied by a document template with necessary information about the model. The prescription for using these templates is user friendly: once both templates are filled out and the notebook successfully runs, they can be sent to the DLHub developers, who streamline the process of depositing and curating the model. The published model includes a DOI, list of authors, point of contact, relevant information about input and output data type and shape, and instructions to run the ML model with a sample test set. DLHub's SDK also allows users to explore the model's metadata, which encompasses dependencies and libraries used to create and containerize the model, and information about the tasks performed by the model, e.g., classification or regression.

Portability and performance across platforms
In this section, we examine the portability and extensibility of the IN model, a graph neural network used for the jet classification task. In Section 3.1.1, we reproduce the training and evaluation of the IN model with the same hyperparameters and dataset as Moreno et al. [18]. In Section 3.2, we retrain the model with different training-validation splits on different servers to test the reproducibility of the results under different conditions. In Sections 3.3 and 3.3.1, we explore the model's portability across software frameworks and hardware platforms. We convert the model from PyTorch to TensorRT, using ONNX as the intermediate format, and evaluate the model's inference speed and compatibility of results. We also create an Apptainer container [52] to improve the model's portability across platforms, and evaluate the model's inference performance within the container.
Table 4 shows a comparison of our training results and the results from Moreno et al. We repeat the training 10 times varying the random seed used for initialization and data shuffling, and report the mean and standard deviation of the validation accuracy and the AUC. We also report the one-sided (upper tail) p-value for the original model given the distribution of our trials. We find the reported performance of the original model is consistent (p-value > 5%) with our reproduction.
Table 4. The IN model's performance in this work and as reported in the original publication. In this work, we repeat the training 10 times varying the random seed used for initialization and data shuffling, and report the mean and standard deviation of the validation accuracy and the AUC. We also report the one-sided (upper tail) p-value for the original model given the distribution of our trials. We find the reported performance of the original model is consistent (p-value > 5%) with our reproduction.
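One way such a one-sided (upper tail) p-value could be computed is sketched below. It assumes that the repeated trials are well described by a Gaussian with the sample mean and standard deviation, which is an assumption of this example and not necessarily the procedure used for Table 4; the accuracy values are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def upper_tail_p_value(trials, reference):
    """One-sided p-value of `reference` under a Gaussian fitted to `trials`."""
    fitted = NormalDist(mean(trials), stdev(trials))
    return 1.0 - fitted.cdf(reference)


# Hypothetical validation accuracies (in percent) from ten repeated trainings.
trials = [95.5, 95.6, 95.4, 95.7, 95.5, 95.3, 95.6, 95.5, 95.4, 95.5]

# Hypothetical reference value standing in for a previously published result.
p_value = upper_tail_p_value(trials, reference=95.6)
consistent = p_value > 0.05
```

With only ten trials, the Gaussian approximation is rough; an empirical tail fraction or a Student's t distribution would be reasonable alternatives.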

Robustness
There are a variety of methods to quantify the stability of AI models. Smart data samplers may be developed to expose ML models to novel information at every training epoch. This may be a particularly challenging task if the parameter space is largely unknown, and the optimizer, loss function, and architecture do not encode domain information to properly constrain the ML model during the training stage. Even if the method used to sample the parameter space under consideration during the training stage is suboptimal, the ML model may eventually converge and attain optimal performance, even if the training stage takes longer. The performance of the fully trained model, however, should not be uniquely determined by the method used to split the training, validation, and test sets. In fact, an optimal model should be robust to the selection of training, validation, and test sets, unless the information contained in these datasets is not representative of the phenomena that it aims to describe. In view of these considerations, we have explored three different data split approaches to handle the HDF5 files that contain the jet data used to produce a new version of the IN model in this article, namely:
(i) Use k-fold cross-validation at the file level. In this approach, the data are split into folds, each containing five files. For training purposes, we select one fold as validation data and the rest as training data. We iterate over the entire dataset, and then calculate the average score of all training rounds.
(ii) Randomly select five files as the validation set and the rest as the training set.
(iii) Save the entire dataset as one NumPy array on disk and use the split function in Scikit-learn to randomly split the dataset to create training and validation sets.
We explored these approaches using the IN model on the HAL cluster. Our findings are summarized in Table 5. Briefly, the IN model is robust to any of the different methods used to train it, which furnishes evidence for its stability and reliability.
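The file-level k-fold approach (i) can be sketched as follows. The file names and the total number of files are hypothetical; only the fold size of five files is taken from the text.

```python
def file_level_folds(files, files_per_fold=5):
    """Yield (training_files, validation_files) pairs, one pair per fold."""
    folds = [
        files[start:start + files_per_fold]
        for start in range(0, len(files), files_per_fold)
    ]
    for k, validation in enumerate(folds):
        # All files outside the k-th fold are used for training.
        training = [f for j, fold in enumerate(folds) if j != k for f in fold]
        yield training, validation


# Hypothetical HDF5 file names.
files = [f"jets_{index:02d}.h5" for index in range(20)]
splits = list(file_level_folds(files))
```

A model would be trained once per split, and the validation scores would then be averaged over all splits.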

Portability across hardware platforms
To demonstrate the portability of our IN model implementation across different hardware architectures, we used the HAL and DGX systems at NCSA and the ThetaGPU supercomputer at the Argonne Leadership Computing Facility. The specifications of each of these platforms are summarized in Table 6. Our IN model implementation is produced using a CMS dataset with a suitable format to fit the model's input data size and type. Each file in the dataset includes $10^5$ data points. Table 6 provides results for each of the three training methods described in the previous section, in each of the three high performance computing platforms used for this exercise. Our findings indicate that our IN model implementation is hardware agnostic. Table 7 summarizes our key findings. We can see that the area under the curve (AUC) of the receiver operating characteristic (ROC) curve and the validation accuracy are stable at around 99% and 95.5%, respectively. These results are robust to data split methods, and agnostic to the underlying hardware used.

Portability across software frameworks
Here we explore the portability of the AI model across software frameworks that are extensively used for AI research, optimal assembly of software and hardware solutions, and containers.

ONNX and TensorRT conversion
Software frameworks such as PyTorch and TensorFlow are extensively used in the AI community. ONNX has emerged as a tool to ease the portability of models developed across software frameworks, and to optimize AI models for accelerated inference using tools such as NVIDIA TensorRT. ONNX has also become a common standard to share and publish ML models. Thus, we have quantified the performance of our IN model in three different implementations: PyTorch, TensorRT, and ONNX. The metrics used for this study are inference accuracy, running time, and AUC score.
We carried out these experiments on the ThetaGPU supercomputer using Python 3.6.3, ONNX 1.10.1, PyTorch 1.9.1, and TensorRT 8.2.1.8. For inference, we considered a CMS test set consisting of 1800 k test events/samples, and then quantified the performance and reliability of our three IN model implementations using the first 10 k events in the test data. We set the batch size to 1 for these comparisons. The output of the IN model in these experiments is an array with two values that indicate the probability for the classification of two types of jets. The results of these studies are summarized in Table 8.
We also tested these three different implementations using all 1800 k events, using a batch size equal to 128. The results of these two experiments are reported in Table 8. Inference with the ONNX model is done on the GPU while the data are stored on the host side. Thus, before inference, the data need to be copied from host to device. The time/batch column refers to the time used to run one batch, including the data transfer between the two sides (device, host) and the inference part on the GPU device. When we increase the batch size from 1 to 128, the running time per batch becomes larger because the time taken to transfer the data increases.
For the case using a subset of the test set, we can see that when converting from PyTorch to TensorRT, the inference accuracy and AUC score are similar, and the running time of ONNX and TensorRT is shorter due to the accelerating effect of these two formats. When we used the entire test set, the running time of ONNX and TensorRT increases because we use a larger batch size.

GPU utilization and throughput
Since NVIDIA TensorRT was developed to optimize AI models for accelerated inference, we have quantified the interplay between batch size, GPU utilization, throughput, and inference accuracy. In this context, throughput corresponds to the number of inferred events per second. In practice, throughput is calculated as the total number of inferences divided by the total time, or equivalently as the batch size divided by the average running time per batch. Here, running time corresponds to the time taken to complete the analysis of one batch, including data transfer between device and host and the inference part on the GPU. In our experiments, we increased the batch size from 100 to 2400 with a step size of 200, while from 2400 to 4200, we used a step size of 400. For each batch size, we ran inference 10 times and drew a boxplot of the throughput. Our findings are summarized in Figure 4. At a glance, we see that GPU utilization saturates at 100% for a batch size of 1000, while throughput peaks (35 k inferred events per second) at a batch size of 1200. These findings indicate the regime in which TensorRT is most applicable, i.e., large-scale ML inference workflows.
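The two ways of computing throughput described above can be sketched in a few lines. The function names and the timing values are made up for illustration; they are not measurements from this study.

```python
import statistics


def throughput_from_batches(batch_size, batch_times_s):
    """Events per second: batch size divided by the mean time per batch."""
    return batch_size / statistics.mean(batch_times_s)


def throughput_from_totals(n_events, total_time_s):
    """Events per second: total number of inferences divided by total time."""
    return n_events / total_time_s


# Illustrative (made-up) per-batch running times in seconds, each including
# host-to-device transfer and GPU inference.
batch_size = 100
batch_times_s = [0.010, 0.012, 0.008]

per_batch = throughput_from_batches(batch_size, batch_times_s)
overall = throughput_from_totals(batch_size * len(batch_times_s),
                                 sum(batch_times_s))
```

When every batch has the same size, the two estimates coincide, since the mean time per batch is the total time divided by the number of batches.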

Model interpretability
In recent years, advances in explainable artificial intelligence (XAI) [66] have made it possible to identify novel connections between an AI model's inputs, architecture, optimization, and predictions [67][68][69]. A substantial subset of XAI methods has been developed to analyze computer vision models, where intuitive reasoning can be extracted from human-annotated datasets to validate XAI techniques. However, for other data structures, like large tabular data or relational data constructs like graphs, the use of XAI methods is still quite new [70,71]. These XAI techniques have been harnessed across disciplines to quantify the reliability of AI models for science [72][73][74][75]. Recently, the scope of XAI has been expanded to include AI applications within HEP [76][77][78][79]. In HEP, XAI has been used to understand the output of AI models used in high energy detectors [80], including parton showers at the LHC [81], deep-neural-network-based classification of jets [82,83], and particle-based global event description algorithms [84]. Learnable randomness injection (LRI) [79] provides interpretability by identifying a subset of HEP detector hits in a particle cloud that is the most relevant to the prediction results. This method can also identify whether the existence or specific geometry of a point is important.

3.4.1. Evaluating feature importance
Identifying feature importance has been a significant component of XAI methods and has been thoroughly studied in the context of classification models [85]. In standard feature selection tasks, a reasonable subset of the features that excels in some model performance metric is chosen. Although it is conceptually different from feature ranking in post-hoc model interpretation, the latter usually also relies on minimizing a model's performance loss [86]. One of the most useful analysis tools for a binary classifier is the ROC curve, and the corresponding area under the curve (AUC) serves as a scalar metric for evaluating model performance. AUC-based feature ranking has been widely used in the AI literature [87][88][89]. We adapt those same principles for our model interpretation studies. One strategy for evaluating a feature's contribution in making predictions is to investigate the model's performance when that feature is masked, e.g., by replacing it with a population-wide average value or a zero value, whichever is contextually relevant to the model's relationship with the training dataset.
In order to identify the features that play the most important role in the IN model's decision-making process, we first train the model with its default settings, which we call the baseline model. During the training, for any event where certain input tracks or secondary vertices are absent for a given jet, the corresponding entries are marked with zeros. Hence, we mask one feature at a time for all input tracks or secondary vertices by replacing the corresponding entries by zero values. We obtain predictions from the trained model and evaluate the AUC score. The change observed in the AUC score when masking each of the features is presented in Figure 5. It shows that while the model has been trained to take into account the entire feature space, there are 14 track features and 4 secondary vertex features that, if removed one at a time, reduce the model's AUC score by less than 0.05%. Inspired by computer vision studies, we propose that the input features that cause the largest change in the AUC score may be regarded as the features that play the most important role in the model's decision-making process.
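The masking procedure can be illustrated with a small, self-contained sketch. It assumes a flat feature vector per sample and a toy scoring function standing in for the trained classifier; the IN model instead takes per-jet arrays of track and secondary vertex features, so the names and data below are purely illustrative.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive sample outscores a negative one.

    Ties count as one half. This pairwise form is quadratic in the sample
    size and is meant for small illustrative inputs only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mask_feature(samples, index, fill=0.0):
    """Return a copy of samples with feature `index` replaced by `fill`."""
    return [[fill if j == index else x for j, x in enumerate(row)]
            for row in samples]


def auc_drop_per_feature(predict, samples, labels):
    """Baseline AUC and the AUC drop when each feature is masked in turn."""
    baseline = roc_auc(labels, [predict(row) for row in samples])
    drops = []
    for j in range(len(samples[0])):
        masked = mask_feature(samples, j)
        drops.append(baseline - roc_auc(labels, [predict(row) for row in masked]))
    return baseline, drops


def toy_predict(row):
    """Hypothetical stand-in for a trained model; it only uses feature 0."""
    return row[0]


samples = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.3], [0.1, 0.9]]
labels = [1, 1, 0, 0]
baseline_auc, auc_drops = auc_drop_per_feature(toy_predict, samples, labels)
```

In this toy setting, masking feature 0 reduces the AUC to chance level, while masking feature 1 leaves it unchanged, which mirrors the ranking logic used for Figure 5.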
The weak dependence of the model on many of its input features indicates that the model can learn the jet classification task from a subset of input features. To further investigate the cumulative impact of removing these unimportant features, we mask multiple features at the same time based on a few arbitrary thresholds for the change in AUC score compared to the baseline. The set of masked features includes every track and secondary vertex feature that causes a change in AUC score below that threshold when independently masked. To compare how individual predictions vary on average, we compute the model fidelity score [90,71], defined as

$$F(M_1, M_2) = 1 - \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}^1_i - \hat{y}^2_i \right| . \qquad (1)$$

Here, $M_1$ and $M_2$ are two different models, $N$ is the number of data samples, and the corresponding classifier scores for the $i$-th data sample are respectively given by $\hat{y}^1_i$ and $\hat{y}^2_i$. The results are summarized in Table 9. The model's performance, both in terms of AUC and fidelity scores, remains very close to the baseline even when masking up to 14 particle track and 4 secondary vertex features.
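A fidelity score of this kind can be computed directly from the two models' classifier scores. The sketch below assumes that fidelity is one minus the mean absolute difference between the scores of the two models on the same samples; the function name and the example scores are illustrative.

```python
def fidelity(scores_1, scores_2):
    """One minus the mean absolute difference between two models' scores.

    Assumes both lists hold classifier scores in [0, 1] for the same
    samples, in the same order.
    """
    if len(scores_1) != len(scores_2):
        raise ValueError("both models must be evaluated on the same samples")
    n = len(scores_1)
    return 1.0 - sum(abs(a - b) for a, b in zip(scores_1, scores_2)) / n


# Illustrative scores from a baseline model and a masked variant.
baseline_scores = [0.5, 0.5]
masked_scores = [0.25, 0.75]
example_fidelity = fidelity(baseline_scores, masked_scores)
```

Identical predictions give a fidelity of 1, and the score decreases as the per-sample predictions of the two models drift apart, even if their AUC scores remain similar.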
Table 9. The performance of a baseline model when multiple features are simultaneously masked based on AUC score drop threshold. ∆P (∆S) represents the number of particle (secondary vertex) features that have been masked. The fidelity score, see Equation (1), is measured with respect to the baseline model.
Threshold [%] | ∆P | ∆S | AUC [%] | Fidelity

While the AUC and fidelity scores allow determining which features play important roles in the IN's decision-making process, we can inspect the importance of these features for individual tracks and vertices using the layerwise relevance propagation (LRP) technique [91,92]. The LRP technique propagates the classification score predicted by the network backwards through the layers of the network and attributes a partial relevance score to each input. The original LRP method was developed for simple MLP networks. Variants of this method have been explored to propagate relevance across convolutional neural networks [93,82] and graph neural networks [94,84].
Since some of the input features show a high degree of correlation with each other, we use the LRP-γ method described by Montavon et al. [92], which is designed to skew the LRP score distributions to nodes with positive weights in the network, thus avoiding propagation of large but mutually canceling relevance scores. In order to apply the LRP method to the IN model, we propagate scores across (i) the aggregation of the internal representation of track features obtained from the aggregator network and (ii) the interaction matrices that send edge-level representations to the individual particle tracks. The relevance scores for the output, $O_{[D_O \times N_p]}$, of the $f_O$ function can be obtained as where $r$ represents the LRP scores for the summed internal representation. On the other hand, the relevance scores $R_{kn}$ for the track-level internal representations in $\bar{E}_{pp[D_E \times N_p]}$ can be propagated to edge-level representations, $E_{pp[D_E \times N_{pp}]}$, using the relation where $R_R$ is the receiver matrix for particle-particle interactions. A similar expression allows translating the relevance scores of the track-level representation in $\bar{E}_{pv[D_E \times N_p]}$ to track-vertex edge representations in $E_{vp[D_E \times N_{vp}]}$ using the receiver matrix $R_K$ for vertex-particle interactions.

We show the average scores attributed to the different features for QCD and H → bb jets in Figure 6. When compared with the change in AUC score by individual features in Figure 5, the track and secondary vertex features with the largest relevance scores are also the features that individually cause the largest drop in AUC score. We additionally observe that the track features are generally assigned larger relevance scores for QCD jets, while secondary vertex features play a more important role in identifying the H → bb jets. This behavior is also justified from a physics standpoint, since the presence of high energy secondary vertices is an important signature for jets from b quarks because of the relatively long lifetime of bottom hadrons. This is also illustrated in Figure 6, where the cumulative relevance score for each track and vertex is shown. The tracks and vertices are ordered according to their relative energy, and our results show that the higher energy tracks and vertices are generally attributed higher relevance scores for both jet classes. However, the feature representing relative track energy, track_erel, itself does not carry notable relevance weight. On the other hand, the relevance attributed to sv_pt, which is strongly correlated with sv_erel, is very large.
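To make the effect of the γ parameter concrete, the following sketch applies the generic LRP-γ redistribution rule to a single bias-free dense layer. It is not the propagation through the IN aggregation and interaction matrices discussed above; the function name, weights, and relevance values are illustrative assumptions.

```python
def lrp_gamma_dense(activations, weights, relevance_out, gamma=0.25, eps=1e-12):
    """Redistribute relevance from the outputs of a dense layer to its inputs.

    activations:   list of J input activations a_j
    weights:       J x K nested list; weights[j][k] connects input j to output k
    relevance_out: list of K relevance scores R_k
    Biases are omitted in this sketch.
    """
    n_in, n_out = len(activations), len(relevance_out)
    # Modified weights w + gamma * max(w, 0) favor positive contributions.
    w_mod = [[w + gamma * max(w, 0.0) for w in row] for row in weights]
    relevance_in = [0.0] * n_in
    for k in range(n_out):
        z = [activations[j] * w_mod[j][k] for j in range(n_in)]
        denom = sum(z)
        if abs(denom) < eps:
            continue  # nothing can be redistributed through this output
        for j in range(n_in):
            relevance_in[j] += z[j] / denom * relevance_out[k]
    return relevance_in


# Illustrative layer with one negative weight.
activations = [1.0, 2.0]
weights = [[1.0, -1.0], [0.5, 1.0]]
relevance_out = [1.0, 2.0]

r_plain = lrp_gamma_dense(activations, weights, relevance_out, gamma=0.0)
r_gamma = lrp_gamma_dense(activations, weights, relevance_out, gamma=1.0)
```

In this example the total relevance is conserved for both settings, but the scores obtained with γ = 1 have smaller magnitudes than those with γ = 0, because the positive contributions are emphasized and large positive and negative scores no longer offset each other as strongly.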
We also note that while the secondary vertex features sv_ptrel and sv_erel are assigned relatively low relevance scores, masking them independently leads to very large drops in the AUC score. This apparent discrepancy can be explained by the very high correlation between these variables, each of which also displays a very large correlation (correlation coefficient of 0.85) with sv_pt, as shown in Figure 7. Because the LRP-γ method skews the relevance distribution between highly correlated features, it suppresses the LRP scores for those two variables while assigning a large relevance score to the variable sv_pt.
We make an additional observation regarding the importance attributed to the feature called track_quality. This feature is a qualitative tag denoting the track reconstruction status, and has an almost identical, doubly peaked distribution for both jet categories. In Figure 7, the peak at 0 represents absent tracks. With such an underlying distribution, this variable does not contribute to the classifier's ability to distinguish the jet categories. However, the large relevance score associated with it, along with the large drop in AUC score upon masking this feature, indicates that the classifier's output for each class receives a large contribution from the numerical embedding used to represent this feature, and that this contribution is eventually canceled by the softmax operation.
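One way to see how such a cancellation can arise is the shift invariance of the softmax function: a contribution that adds the same amount to every class score leaves the output probabilities unchanged. The sketch below illustrates this property under the assumption of a two-class output; the numerical values are arbitrary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Arbitrary pre-softmax scores for the two jet classes.
logits = [1.0, 2.0]
# The same scores after adding an identical contribution to both classes.
shifted_logits = [z + 5.0 for z in logits]

probs = softmax(logits)
shifted_probs = softmax(shifted_logits)
```

A relevance method that attributes the pre-softmax class scores can therefore assign a large score to a feature whose contribution is common to both classes, even though that contribution does not help separate them.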
By retraining the model without these variables, we have found that the two previously mentioned secondary vertex features, along with track_quality, have no discernible impact on the IN model's ability to tell the jet categories apart. The model that was trained without these variables, along with the 11 (3) track (secondary vertex) features that report a change in AUC of less than 0.01%, converged with an AUC score of 99.00%. In the absence of these redundant features, we observed some differences in the relative distribution of the relevance scores. Thus, we are better able to understand which features play a more important role in the identification of H → bb or QCD jets, respectively. This physics-informed validation of model explanations pinpoints two major drawbacks of the existing XAI methods. First, explanations for models trained with highly correlated input features can be inconsistent across approaches. Second, treating categorical and continuous variables on an equal footing in XAI methods might lead to misleading attribution of feature importance.

Inspecting the activation layers
Here we aim to gain new insights on the IN model's decision-making process at the layer level. As the IN processes the input, it is passed through three different MLPs that approximate arbitrary nonlinear functions identified as $f_R^{pp}$, $f_R^{vp}$, and $f_O$. In order to explore the activity of each neuron and compare it with the activity of neurons in the same layer, we define relative neural activity (RNA) [83] as

$$\mathrm{RNA}(j, k; S) = \frac{\sum_{i=1}^{N} a_{j,k}(s_i)}{\max_{j'} \sum_{i=1}^{N} a_{j',k}(s_i)},$$

where $S = \{s_1, s_2, \ldots, s_N\}$ represents a set of samples over which the RNA score is evaluated. The quantity $a_{j,k}(s_i)$ is the activation of the $j$-th neuron in the $k$-th layer when the input to the network is $s_i$. When summed over all the samples in the evaluation set $S$, this represents the cumulative neural response of a node, which is normalized with respect to the largest cumulative neural response in the same layer to obtain the RNA score. Hence, in each layer, there will be at least one node with an RNA score of 1. Since the neurons are activated with ReLU activation in the IN model, the RNA score will always lie between 0 and 1.
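The RNA score of a single layer can be computed from a table of activations, as in the following sketch. The function name and the activation values are illustrative; the activations are assumed to be non-negative, as is the case after a ReLU.

```python
def relative_neural_activity(layer_activations):
    """RNA scores for one layer.

    layer_activations[i][j] is the (non-negative) activation of node j
    when the network input is sample i.
    """
    n_nodes = len(layer_activations[0])
    # Cumulative response of each node over all samples.
    cumulative = [sum(sample[j] for sample in layer_activations)
                  for j in range(n_nodes)]
    largest = max(cumulative)
    if largest == 0:
        return [0.0] * n_nodes  # a completely inactive layer
    # Normalize by the largest cumulative response in the layer.
    return [c / largest for c in cumulative]


# Illustrative activations: two samples, three nodes in the layer.
activations = [[0.0, 2.0, 1.0],
               [0.0, 4.0, 0.5]]
rna_scores = relative_neural_activity(activations)

# Fraction of nodes with an RNA score below 0.2, a simple sparsity measure.
sparsity = sum(score < 0.2 for score in rna_scores) / len(rna_scores)
```

Here the second node has the largest cumulative response and therefore an RNA score of 1, while the first node is never activated and has a score of 0.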
At a qualitative level, this study aims to identify which neurons are most actively engaged when the IN model produces an output.Since the MLPs in the IN model consist of only fully-connected layers, each layer takes all the activations from the previous layer as inputs.As all nodes within a given layer are subject to the same set of inputs, we can reliably estimate how strongly they perceive and transfer that information to the next layer by looking at their activation values.For the same reason, we normalize the cumulative activation of a node with respect to the largest aggregate in the same layer.
Figure 8 (left) shows the neural activation pattern (NAP) diagram for the baseline model, showing the RNA scores for the different activation layers. The scores are separately evaluated for QCD and H → bb jets. To simultaneously visualize these scores, we project the RNA scores of the former as negative values. The NAP diagram clearly shows that the network's activity level is quite sparse. In some layers, more than half of the nodes show RNA scores less than 0.2. This implies that while some nodes are playing very important roles in propagating the necessary information, other nodes do not participate as much. We additionally observe that right until the very last layer of the aggregator network $f_O$, the same nodes show the largest activity level for both jet categories. This is better illustrated in Figure 8 (right), where the absolute difference in RNA scores for the two jet categories is mapped. For most nodes in every layer but the very last one, the difference in RNA scores is very close to zero. However, different nodes are activated in the last layer for the two jet categories, indicating an effective disentanglement of the jet category information in this layer. Even in this layer, the activity level appears to be sparse, with only a few nodes showing large activation for each category.

Model reoptimization
The studies presented in Sections 3.4.1 and 3.4.2 suggest that the baseline IN model can be made simpler by reducing both the number of input features it relies on and the number of trainable parameters. To explore this observation, we trained alternate variants of the IN model where the features sv_ptrel, sv_erel, and track_quality were dropped along with an additional 11 track and 3 secondary vertex features that reduce the AUC by less than 0.01%, as shown in Figure 5. The details and performance metrics of these models are given in Table 10. It should be noted that the ablated models presented here represent neither an exhaustive list of such choices nor the result of a rigorous optimization. These results demonstrate that a simpler IN model may be developed without compromising the quality of its performance. As can be seen from the results in Table 10, both the AUC score and the fidelity of the alternate models are very close to those of the baseline model, though the number of trainable parameters is significantly lower.

Discussion and conclusion
We have proposed a practical definition of findable, accessible, interoperable, and reusable (FAIR) principles for machine learning (ML) and artificial intelligence (AI) models in experimental high energy physics (HEP). To promote adherence to these principles, we have introduced a FAIR AI project template and demonstrated how to implement this template with a model to identify Higgs bosons decaying to bottom quarks. We studied the robustness of this FAIR AI model and its portability across hardware architectures and software frameworks, and reported new insights on the interpretability of AI predictions by studying the interplay between FAIR datasets and AI models.
These studies represent a step towards a FAIR ecosystem of data and AI models to enable and streamline automated AI-driven scientific discovery across disciplines [95]. Future work in this area will need to address many outstanding issues, such as providing documentation in a machine-readable way, as well as developing standardized application programming interfaces (APIs) for federated searching, accessing, and interoperating of AI models hosted on different platforms, such as GitHub, DLHub, AI Model Share, and HuggingFace. We also stress that the FAIR principles outlined in this paper are by no means an exhaustive prescription for shareable, reproducible, and extendable scientific AI research. Nonetheless, we recommend the adoption of this FAIR AI model standard to advance HEP research.

Figure 2. Illustration of a H → bb jet with two secondary vertices (SVs) from the decay of two bottom hadrons resulting in charged-particle tracks (including a low-energy, or soft, lepton) that are displaced with respect to the primary collision vertex (PV), and hence have a large impact parameter (IP) value.

Figure 3. Network architecture and dataflow in the IN model [18]. The choice of model hyperparameters and input data dimensions for the baseline model is given in the accompanying table.

3.1.1. Reproducibility
In this subsection we provide details of training the benchmark experiments of the IN model with the same data input and hyperparameter settings as used by Moreno et al. [18]. The training samples are saved in 57 HDF5 files, each of which contains about 100 k jets. We use 52 of them for training and 5 for validation. The testing dataset is saved as a set of NumPy array files (one feature per file), where each file contains 600 k jets. There are several differences in our experiment setting compared to Moreno et al. For the training platform, we use the Hardware Accelerated Learning (HAL) GPU cluster at the National Center for Supercomputing Applications (NCSA) [65] as a remote GPU cluster and train on the NVIDIA V100 GPU, while Moreno et al. trained their model on one NVIDIA GeForce GTX 1080 GPU. For the data splitting, we take the first five HDF5 files as validation data and the rest as training data. Moreno et al. split the data into training, validation, and test samples, with 80, 10, and 10% of the data, respectively. In our training process, each epoch takes about 450 s to finish. The training terminates following the early stopping condition when the validation loss fails to improve for 8 epochs. As a first check, Table

Figure 4. GPU utilization (shown as a blue line) and throughput (shown as box-and-whisker plots) as a function of batch size. GPU utilization saturates at 100% for a batch size of 1000, while throughput peaks at 35 k inferred events per second for a batch size of 1200. For the box-and-whisker throughput plots, ten runs are performed with a given batch size. The black line represents the median value of the throughput, the orange box represents the range from the first quartile to the third quartile, and the whiskers extend an additional distance of 1.5× the interquartile range. The white circles represent the outliers.

Figure 5. Change in AUC score with respect to a baseline model when each of the track and secondary vertex (SV) features is individually masked during inference.

Figure 6. Average relevance scores attributed to input track and secondary vertex features (upper) and individual tracks and secondary vertices (lower). The tracks and secondary vertices (SVs) are ordered according to their relative energy with respect to the jet energy.
Figure 7. Scatter plots of sv_ptrel and sv_pt (left) and of sv_ptrel and sv_erel (middle), and distribution of the categorical variable track_quality (right). sv_ptrel and sv_erel represent the relative transverse momentum and energy of the secondary vertex with respect to those of the jet. sv_pt is the transverse momentum of the secondary vertex. track_quality is a categorical variable representing the quality of track reconstruction, where the peak at 0 represents absent tracks.

Figure 8. 2D map of relative neural activity (RNA) score for different nodes of the activation layers (left). To simultaneously visualize the scores for QCD and H → bb jets, we project the RNA scores of the former as negative values. 2D map of absolute difference in RNA score for QCD and H → bb jets for different nodes of the activation layers (right). In both figures, the labels associated with the horizontal axis entries represent the nonlinear function and the layer associated with it.

Figure 9. Neural activation pattern diagrams for the IN model where the features sv_ptrel, sv_erel, and track_quality are dropped along with the additional 11 (3) particle track (secondary vertex) features associated with a change in AUC of less than 0.01%. In both models, the number of nodes in hidden layers is 32 while $D_E = D_O = 16$ (left) or $D_E = D_O = 8$ (right).

Figure 9 shows the NAP diagrams for the model with 15 (5) dropped track (vertex) features with 32 nodes per hidden layer, where the internal representation dimensions $D_E$ and $D_O$ are set to 16 and 8 for the left and right figures, respectively. Sparsity of the latter, as measured by the number of activation nodes with RNA < 0.2, is noticeably lower than that of the baseline model, though the former has increased sparsity. With reduced size for the post-interaction internal space representation, the alternate models do not completely disentangle the jet classes at the output stage of $f_O$.

Table 2. Map between existing capabilities of the cookiecutter4fair AI project template and our proposed FAIR principles for AI models. The * symbol indicates that the process is not yet fully automated and requires additional manual steps.

Table 3. The choice of IN model hyperparameters and input data dimensions for the baseline model. Dockerfiles that can create a reproducible environment for either CPU-based or GPU-based model training and inference are included in the repository.

Table 5. Stability of the IN model against different training methods on the HAL GPU cluster.

Table 6. Specifications of the DGX, HAL, and ThetaGPU systems.

Table 7. IN model portability is showcased using three different data split methods across three different high performance computing platforms.

Table 8. Inference results, produced on the ThetaGPU supercomputer, for different frameworks using partial test data, all test data, and all test data within an Apptainer container.

Table 10. The performance of a baseline and ablated models. ∆P represents the number of particle track features that have been dropped and h is the number of nodes in the hidden layers. The fidelity score is measured with respect to a baseline model. Sparsity is measured by the fraction of activation nodes with an RNA score less than 0.2.
∆P, ∆S | h, D_E, D_O | Parameters | AUC score [%] | Fidelity [%]