GWCloud: a searchable repository for the creation and curation of gravitational-wave inference results

There are at present ${\cal O}(100)$ gravitational-wave candidates from compact binary mergers reported in the astronomical literature. As detector sensitivities are improved, the catalog will swell in size: first to ${\cal O}(1000)$ events in the A+ era and then to ${\cal O}(10^6)$ events in the era of third-generation observatories like Cosmic Explorer and the Einstein Telescope. Each event is analyzed using Bayesian inference to determine properties of the source including component masses, spins, tidal parameters, and the distance to the source. These inference products are the fodder for some of the most exciting gravitational-wave science, enabling us to measure the expansion of the Universe with standard sirens, to characterise the neutron star equation of state, and to unveil how and where gravitational-wave sources are assembled. In order to maximize the science from the coming deluge of detections, we introduce GWCloud, a searchable repository for the creation and curation of gravitational-wave inference products. It is designed with five pillars in mind: uniformity of results, reproducibility of results, stability of results, access to the astronomical community, and efficient use of computing resources. We describe how to use GWCloud with examples, which readers can replicate using the companion code to this paper. We describe our long-term vision for GWCloud.


INTRODUCTION
The recent release of the third LIGO-Virgo-KAGRA (LVK) Gravitational-Wave Transient Catalog (GWTC-3) increased the total number of gravitational-wave events detected by the LVK to 90 (Abbott et al. 2021a). Olsen et al. (2022) detect an additional ten events with the so-called IAS pipeline, while Nitz et al. (2021) detect seven events not included in GWTC-3. The GWTC-3 catalog consists mostly of binary black hole mergers, with two binary neutron star mergers and two neutron star + black hole mergers. (The event GW190814 (Abbott et al. 2020a) could be a binary black hole or a neutron star + black hole binary.) Each event is characterized by ≈ 15-17 astrophysical parameters (Veitch et al. 2015). There are seven extrinsic parameters describing the location and orientation of the event with respect to the observatory and eight or more intrinsic parameters
including the component masses and spin vectors. Tidal parameters are often included for systems with a neutron star candidate (Lackey & Wade 2015; Abbott et al. 2018a; Chatziioannou 2020), and some binary black hole analyses now include parameters characterising orbital eccentricity (Lower et al. 2018; Romero-Shaw et al. 2019, 2020a, 2021; Gayathri et al. 2022; Lenon et al. 2020).
Each event is analyzed with a Bayesian inference pipeline (Veitch et al. 2015; Ashton et al. 2019b; Romero-Shaw et al. 2020b; Biwer et al. 2019; Lange et al. 2020) in order to determine its astrophysical parameters. The output of these pipelines typically includes posterior samples: discrete representations of the posterior distribution, which can be used to calculate credible intervals for different combinations of astrophysical parameters (corner plots). However, their usefulness does not end there, as they serve as a reduced data product for some of the most exciting gravitational-wave science. Posterior samples are used for population studies (e.g., Vitale et al. 2021) to probe how compact binaries are distributed in mass and spin, providing insights into stellar evolution and binary formation (see, e.g., Abbott et al. 2021b,c,d, 2019). They are used in standard-siren analyses (Schutz 1986; Holz & Hughes 2005) to measure cosmological expansion (see, e.g., Abbott et al. 2021c) and to determine the neutron star equation of state (Baiotti 2019; Lackey & Wade 2015; Landry & Essick 2019; Vivanco et al. 2020; Wysocki et al. 2020; Abbott et al. 2018b).
While the output of inference pipelines is crucially important to gravitational-wave astronomy, several aspects of these inference products make life complicated for gravitational-wave astronomers. First, it is often necessary to carry out many different inference calculations that analyse the same event with subtle differences. In particular, there are several popular waveform approximants used to model gravitational waveforms, each with different capabilities and different systematic errors. It is therefore common to analyse each event with multiple waveforms. Likewise, it is sometimes necessary to analyse the same event with different prior assumptions (for example, assuming one or both compact objects are not spinning; Galaudage et al. 2021), which can also lead to multiple runs. There may also be runs carried out with different samplers or different sampler settings, which occasionally yield qualitatively different results (see, e.g., Chia et al. 2021; Vajpeyi et al. 2022). Finally, the data itself can have different versions due to variations in calibration (Sun et al. 2020), cleaning (Abbott et al. 2021e), and/or glitch subtraction schemes.
The large number of inference results associated with each event creates a book-keeping problem. This problem is compounded by a second issue: the high computational cost of gravitational-wave inference. Even a "fast" run on an ordinary binary black hole event with a cheap approximant can take ≈ 10 hours. More ambitious analyses on longer signals and/or with cutting-edge approximants can take weeks. Given the substantial cost in time and CO2 emission associated with the generation of astrophysical inference results, it is becoming increasingly necessary to carefully curate gravitational-wave inference results.
Finally, the lack of a centralized repository for inference results makes the current workflow of gravitational-wave astronomers inefficient and susceptible to error. A researcher looking for the output of a particular inference run (e.g., using the IMRPhenomXPHM approximant to analyze GW151226 (Abbott et al. 2016a) with special sampler settings) may need to email collaborators to find the results. The results in question may have been subsequently moved or even deleted. And the researcher cannot be certain that the files she tracks down are precisely what she is looking for.
The situation is already challenging with 90 events. The difficulties will, of course, increase as the gravitational-wave catalog swells (Baibhav et al. 2019) to O(1000) events in the A+ era and O(10^6) events in the era of third-generation observatories such as Cosmic Explorer (Reitze et al. 2019; Evans et al. 2021) and the Einstein Telescope (Maggiore et al. 2020). In order to address these challenges, we introduce GWCloud, a searchable repository for the creation and curation of gravitational-wave inference results. There are five pillars underpinning its design philosophy:

1. Uniformity of results. Inference results are downloaded and uploaded in a uniform format. Uniformity facilitates validation: new results must pass checks to ensure that the inference output is complete and uncorrupted, with the necessary metadata to repeat the analysis.
2. Reproducibility of results. By curating the metadata and code version of each result, we insist that every entry in GWCloud can be reproduced.
3. Stability of results over time. Each result is assigned a permanent location. Users can locate previous results using a search engine. Before launching a new inference job, users can search to see if the analysis they want has already been performed. Avoiding duplicate analyses reduces the carbon footprint of gravitational-wave astronomy.
4. Access to results. While a large fraction of gravitational-wave astronomy effort takes place within the LVK collaboration, significant advances are now made by external groups. GWCloud provides multiple levels of access so that results can be shared both within the LVK collaboration and to the larger astronomical community, facilitating the exchange of ideas among a broad community.
5. Efficient use of computing resources. GW-Cloud enables users to submit inference jobs on multiple computing clusters through a single portal. Each cluster can use different batch queuing protocols (e.g., slurm versus condor) and allow for different user groups (e.g., LVK users versus the general public). In this way, GWCloud helps match users with computing resources.
GWCloud is not the only tool that has been created to tackle these challenges. The Gravitational-Wave Open Science Centre (GWOSC) provides access to most of the publicly available posterior samples used in LVK papers. The samples can be queried, discovered, and downloaded through the GWOSC Event Portal at https://gwosc.org/eventapi. These samples are also available through zenodo.org. Recently, Williams et al. (2021) introduced Asimov, a framework for coordinating parameter estimation workflows. It includes a number of useful features, including a review signoff system so that key results are vetted by humans. Meanwhile, the program PESummary (Hoy & Raymond 2021) has helped facilitate the dissemination of uniform results (while simultaneously providing a tool for the visualisation of inference results). It includes functionality to access result files (both public and private) and functionality to reproduce results. PESummary, Asimov, and GWCloud provide complementary services, although the way in which they will interact in the future is not yet clear.
The remainder of this paper is organized as follows. In Section 2, we cover the basics of GWCloud: how to submit new inference jobs and how to upload the results of an inference analysis. In Section 3, we provide the first of three case studies: we use GWCloud to reanalyze the iconic event GW150914 (Abbott et al. 2016b), but with the assumption that both black holes have negligible spin. In Section 4, we present the second case study: using GWCloud to investigate correlations between mass and spin parameters using events in the second gravitational-wave transient catalog (GWTC-2) (Abbott et al. 2021f). In Section 5, we describe the third case study: using GWCloud to download posterior samples for the remarkably high-mass event GW190521 (Abbott et al. 2020b) obtained using an eccentric waveform approximant. We conclude in Section 6 with a discussion of future development plans. Technical details are provided in the appendix (Section A). The case studies presented in this paper are supported by Jupyter notebooks available as part of the online supplement here: https://git.ligo.org/gwcloud/paper/.

What is GWCloud?
In this Section, we provide a high-level overview of GWCloud and instructions for the most basic tasks users are likely to perform. GWCloud consists of two components: a portal to launch inference jobs and a database to store the results of inference jobs. Both components can be accessed using a web-based graphical user interface (UI) at https://gwcloud.org.au/. Users who prefer to access GWCloud entirely with command-line programming may instead use the application programming interface (API). We anticipate that the UI's job submission feature will be most useful for new and casual users. However, the UI's search feature should be useful to any user searching for old inference results. The API is likely to be most useful to experts who sometimes need to submit large batches of jobs. It allows for more complicated job submissions with features that are not supported using the UI (for example, custom priors).
The portal launches inference jobs using Bilby (Ashton et al. 2019b; Romero-Shaw et al. 2020b). Jobs launched through the UI are at present run on the OzStar cluster based at Swinburne University. Jobs submitted by authenticated LVK users through the API can also be run on computers that form part of the LIGO Data Grid. It is also possible to upload jobs to the GWCloud database that were not run through the GWCloud portal, so long as they are in the standard Bilby format (see Section 2.3). This is useful for storing jobs that were run before the creation of GWCloud or jobs that require special resources to run, for example, computationally expensive Parallel Bilby analyses that require a high-performance computing cluster.
Users visiting the GWCloud landing page are met with a prompt requiring sign-in. Members of the LVK collaboration can sign in using their albert.einstein credentials, which also provide access to the LIGO Data Grid, while other users can create a GWCloud account. After logging in, the user is taken to the "public jobs" page, which lists the most recent GWCloud runs; see Fig. 1 for an example entry from this recent-job list. The search field allows users to find jobs based on their description, the user who submitted the job, the job name, and the event ID. Labels are available to distinguish some jobs as special. For example, the preferred label indicates that a job is used for an official LVK result. Previous jobs can be viewed and downloaded by clicking on the appropriate view link. Users may create a new job by clicking on start a new job and following the instructions.
In the next subsections we describe how to submit (or upload) a job using the API. Additional information is provided on the GWCloud web page by clicking on Python API. In order to implement the examples below, readers must install the GWCloud API:

pip install gwcloud-python

Submitting a new job with the GWCloud API
Here we describe a Python script for submitting a new GWCloud job using the API; see JobSubmission.ipynb for the corresponding Jupyter notebook.
The corresponding job can be viewed on the GWCloud UI by searching for the name: GW150914Example.
The first step is for the user to authenticate by initialising a token identifier generated by GWCloud. At the beginning of any GWCloud script, import the GWCloud API and set up your token (see JobSubmission.ipynb for the relevant lines). The next step is to create a Bilby .ini file, which is required to submit a job with GWCloud because it is required to run Bilby. The .ini file for this tutorial is GW150914_example.ini. The .ini file tells Bilby which data to analyze and how to analyze it. It contains local paths to noise power spectral density (PSD) file(s), spline calibration (.calib) file(s), and the .prior file. All of these files are uploaded to GWCloud for reproducibility. With the .ini file ready, we submit the job to the LIGO Data Grid's Caltech cluster (again, see JobSubmission.ipynb for the submission command). Once this command is executed, a new job with job name = GW150914Example becomes visible on the GWCloud UI. Since we set private=False, the job can be viewed by anyone using GWCloud. The last line of code tells GWCloud to run this job on the Caltech computing cluster. The progress of the new job can be monitored with the GWCloud UI. When the job is complete, the API can be used to retrieve the posterior samples from GWCloud with the following command:

job.save_result_json_files('/path/')

which saves the result files containing the posteriors to the specified path.
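For orientation, the sketch below shows what a minimal bilby_pipe-style .ini file can look like. The option names follow bilby_pipe conventions, but all values and file paths here are illustrative assumptions, not the contents of the tutorial's GW150914 configuration:

```ini
; Illustrative sketch of a Bilby (bilby_pipe) configuration file.
; All values and paths below are placeholder assumptions for
; demonstration, not the settings shipped with this paper.
label = GW150914Example
outdir = outdir_GW150914
detectors = [H1, L1]
trigger-time = 1126259462.4
duration = 4
sampling-frequency = 4096
waveform-approximant = IMRPhenomXPHM
prior-file = ./GW150914.prior
psd-dict = {H1: ./h1_psd.txt, L1: ./l1_psd.txt}
spline-calibration-envelope-dict = {H1: ./h1.calib, L1: ./l1.calib}
sampler = dynesty
sampler-kwargs = {nlive: 1000}
```

Note that the PSD, calibration, and prior entries point to local files; as described above, GWCloud uploads these supporting files alongside the job so that the analysis can be reproduced.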

Uploading the results of an existing Bilby run
Here we describe a Python script to upload existing results to a GWCloud job using the API; see JobUpload.ipynb for the corresponding Jupyter notebook. The corresponding job can be viewed on GWCloud by searching for job name = GW190412 by user = Asa Baker.
As our starting point, we need a Bilby output/ directory with the requisite subdirectory structure. We modify the label field in the *_config_complete.ini file to set the GWCloud job name. Next, create a tar-zipped file of the Bilby output/ directory, which can be accomplished by running this command from within output/:

tar -czvf archive.tar.gz .
Finally, the job is submitted by uploading the tar-zipped file to GWCloud:

gwc.upload_job_archive('Example upload with GW190412.', '/path/archive.tar.gz')

GWCloud checks the submission to make sure all the requisite results and supporting files are included.
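The archiving step can also be scripted with Python's standard-library tarfile module. This is an illustrative helper (the directory and archive names are placeholders), producing the same relative-path layout as running tar from inside the output directory:

```python
import tarfile
from pathlib import Path

def make_job_archive(output_dir: str, archive_path: str) -> str:
    """Create a gzipped tar archive of a Bilby output directory.

    Paths are stored relative to the directory itself, matching the
    effect of running `tar -czvf archive.tar.gz .` inside output_dir.
    """
    root = Path(output_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        for item in sorted(root.rglob("*")):
            # recursive=False: directories are visited by rglob already,
            # so we add each entry exactly once.
            tar.add(item, arcname=str(item.relative_to(root)),
                    recursive=False)
    return archive_path
```

The resulting archive.tar.gz can then be passed to the upload call shown above.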

CASE STUDY I: SUBMIT A JOB TO ANALYZE GW150914 WITH A ZERO-SPIN PRIOR
The GWCloud graphical UI allows users to submit inference jobs with various default prior settings. While these settings are probably adequate for new users, expert users will need the API in order to perform runs with custom priors. Here we provide an example of how the API can be used to carry out an inference calculation with a non-standard prior; see CaseStudy1.ipynb for the corresponding Jupyter notebook. The corresponding jobs can be viewed on GWCloud by searching for the names GW150914Example and GW150914NoSpin by Asa Baker. Specifically, we reanalyze the iconic first binary black hole event GW150914 (Abbott et al. 2016b), but assuming that both black holes have negligible dimensionless spins χ1 = χ2 = 0 (here the 1 subscript refers to the more massive "primary" black hole while the 2 subscript refers to the less massive "secondary" black hole). This example is motivated by work by Fuller & Ma (2019). We prepare two .ini files: GW150914.ini reproduces standard Bilby settings for a short-duration (high-mass) binary black hole signal; the prior for the dimensionless spins χ1, χ2 is uniform on the interval of zero to one. Meanwhile, in GW150914_nospin.ini, we set χ1 = χ2 = 0. We submit both jobs and download the results using the syntax described in Section 2.2. In Fig. 2 we provide a corner plot comparing the credible intervals for various parameters of GW150914 assuming a uniform prior for the dimensionless spins (blue) and a no-spin prior (orange). The different shading indicates one-, two-, and three-sigma credible intervals. The different choice of prior yields subtle but interesting shifts in the posterior distribution. Comparing the marginal likelihoods for each run, we find that the zero-spin hypothesis is preferred with a Bayes factor of BF = 3.
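The Bayes factor quoted above follows directly from the natural-log evidences (ln Z) reported by nested-sampling runs. A minimal sketch with placeholder log-evidence values (not the actual GW150914 numbers):

```python
import math

def bayes_factor(log_evidence_a: float, log_evidence_b: float) -> float:
    """Bayes factor BF = Z_a / Z_b computed from natural-log evidences."""
    return math.exp(log_evidence_a - log_evidence_b)

# Hypothetical log evidences for the zero-spin and uniform-spin runs;
# these are placeholder values, not results from the paper.
ln_z_nospin = -3432.1
ln_z_spin = -3433.2
bf = bayes_factor(ln_z_nospin, ln_z_spin)  # exp(1.1), roughly 3
```

Working in log space avoids overflow: the evidences themselves are astronomically small numbers, but their log difference is well behaved.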
CASE STUDY II: INVESTIGATE MASS-SPIN CORRELATIONS IN GWTC-2

In this case study, we provide an example of how the API can be used to download previous inference results to look for trends in the population of merging binary black holes; see CaseStudy2.ipynb for the corresponding Jupyter notebook. The corresponding jobs can be viewed on GWCloud by searching for the keyword GWTC-2. This example is motivated by work by Callister et al. (2021), suggesting that black-hole spin is correlated with mass ratio. We download the "preferred samples" (used for official LVK analyses) for 47 binary black hole events in GWTC-2 (Abbott et al. 2021f,d).
To retrieve these GWTC-2 jobs, we run the following command:

jobs = gwc.get_public_job_list(search="GWTC-2", time_range=TimeRange.ANY)

In Fig. 3, we plot the 90% credible intervals in the plane of total mass M and mass ratio q for events in GWTC-2. In Fig. 4, meanwhile, we plot credible intervals in the plane of chirp mass ℳ and the effective inspiral spin χeff. These two plots can be compared to Figs. 6-7 in Abbott et al. (2021f). By examining the distributions of events in two-dimensional planes, it is sometimes possible to see previously unknown correlations. In this case, there is not an obvious correlation present in either plot.
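The 90% credible intervals plotted in Figs. 3-4 can be estimated directly from posterior samples: a symmetric 90% interval spans the 5th to 95th percentiles. A dependency-free sketch (a real analysis would use numpy or PESummary):

```python
def credible_interval(samples, level=0.90):
    """Return (lower, upper) bounds of a symmetric credible interval
    estimated from posterior samples by sorting and indexing."""
    s = sorted(samples)
    alpha = (1.0 - level) / 2.0
    lo = s[int(round(alpha * (len(s) - 1)))]
    hi = s[int(round((1.0 - alpha) * (len(s) - 1)))]
    return lo, hi

# Example with synthetic "samples" uniformly spaced on [0, 1]:
# the 90% interval is approximately (0.05, 0.95).
samples = [i / 1000 for i in range(1001)]
lo, hi = credible_interval(samples)
```

Repeating this for each event's mass and spin posteriors yields the interval endpoints used in two-dimensional scatter plots like Figs. 3-4.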

CASE STUDY III: DOWNLOAD RESULTS FOR ECCENTRIC ANALYSIS OF GW190521
Binary black holes formed from stellar binaries are expected to merge with quasi-circular orbits. However, a non-zero eccentricity may indicate that the binary was assembled from previously unbound black holes, a process called "dynamical formation." We consider GW190521 (Abbott et al. 2020b), one of the most massive binary black hole events to date, which shows signs of non-zero spin precession and/or eccentricity.

FUTURE DEVELOPMENT
We close by considering the future of GWCloud, describing new functionality we hope to add in both the short term and long term. As we plan for the future, we invite input from the astronomical community; please visit our git issue tracker to leave a suggestion or to propose a new feature.

Short-term goals.
1. Making LVK jobs public. When LVK data is published, jobs previously marked as LVK-only can be changed to public.
2. GWCloud teams. Share jobs among a small team. Team members can add comments to different jobs, e.g., "this result does not look fully converged." Teams can combine jobs to create catalogs.

3. Additional data products. Gravitational-wave inference results do not exist in a vacuum. In order to generate and interpret them, we rely on a number of other data products, including estimates of the noise power spectral density (e.g., Littenberg & Cornish 2015), injection studies used to quantify selection effects (Talbot & Thrane 2022; Gerosa et al. 2020), and probabilities that a given event is astrophysical, p_astro (Kapadia et al. 2020). We hope to extend GWCloud to include these and other data products.

4. Visualization. Static and dynamic visualization of inference products is useful for understanding covariances. Such functionality is currently offered within the PESummary toolkit (Hoy & Raymond 2021); a short-term goal is full integration of these visualisation tools into the GWCloud workflow.
Long-term goals.
1. Identify similar jobs. Warn users if they are about to launch a job that is similar to one already in the database. Users may choose to use existing results rather than waiting for new ones (and potentially generating more CO 2 emissions). In some cases, importance sampling can be used to re-weight posterior samples to convert the results from a "proposal" distribution to a "target" distribution (Payne et al. 2019).
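The re-weighting idea can be sketched in a few lines: each proposal sample receives a weight proportional to the ratio of target to proposal densities, and the Kish effective sample size indicates how much information survives. This is a generic illustration, not the implementation of Payne et al. (2019):

```python
import math

def reweight(samples, log_target, log_proposal):
    """Importance weights w_i proportional to target(x_i)/proposal(x_i),
    normalized to sum to one, plus the Kish effective sample size."""
    log_w = [log_target(x) - log_proposal(x) for x in samples]
    max_lw = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(lw - max_lw) for lw in log_w]
    total = sum(w)
    w = [wi / total for wi in w]
    n_eff = 1.0 / sum(wi * wi for wi in w)
    return w, n_eff

# If the proposal already matches the target, every weight is equal
# and the effective sample size equals the number of samples.
w, n_eff = reweight([0.1, 0.2, 0.3], lambda x: 0.0, lambda x: 0.0)
```

A low effective sample size signals that the proposal and target distributions differ too much for re-weighting to be reliable, in which case a fresh run is warranted.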
2. Estimate job run time. Use machine learning to provide estimated time to completion for new jobs. Warn users if they launch a job that is likely to take more than a week to complete.
3. Connecting to other clusters. Currently, GW-Cloud provides users access to the computing clusters of the LIGO Data Grid and the OzStar clusters at the Swinburne University of Technology. However, GWCloud could be connected to other computing resources such as the Open Science Grid (Pordes et al. 2007).
4. Automated inference. The project could be extended to launch automated inference jobs for promising triggers, e.g., by integrating with Asimov (Williams et al. 2021). When extra computational resources are available, inference could be carried out on all data segments. The results can be used to carry out a statistically optimal search for the astrophysical background and to construct fully Bayesian detection statistics (Veitch & Vecchio 2010; Ashton et al. 2019a; Pratten & Vecchio 2021).

5. Beyond posterior samples. The majority of gravitational-wave inference relies on posterior samples. However, in some cases, it can be useful to work with other inference products, for example, machine-learning (and grid) representations of marginal likelihoods (Vivanco et al. 2019, 2020; Wysocki et al. 2020; Lange et al. 2020). Additional work is required to define a standardised format for such inference products.

A. TECHNICAL DETAILS
GWCloud leverages a variety of modern web technologies to provide seamless access via web browsers or an Application Programming Interface (API), exposed to researchers via terminal command line by a publicly available Python client called gwcloud-python (see https://pypi.org/project/gwcloud-python/).

A.1. Application Architecture
In the backend, GWCloud takes advantage of Django: a mature Python-based Model View Controller (MVC) web framework widely used in the commercial sector. Alongside Django, the chosen technology for API transport is GraphQL, which provides an efficient and effective way for clients (such as gwcloud-python or any web browser) to request the data they require and only the data they require. This is in contrast to other less efficient API transport architectures such as Representational State Transfer (REST), which can lead to a variety of issues (e.g., request cascades) when dealing with complex data.
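To illustrate why GraphQL avoids over-fetching, a client names exactly the fields it wants in a single request. The query below is a hypothetical sketch; the field names do not necessarily match the actual GWCloud schema:

```graphql
# Hypothetical query: fetch only the name, user, and event ID of
# public jobs matching a search term -- no more, no less.
query PublicJobs($search: String!) {
  publicBilbyJobs(search: $search) {
    name
    user
    eventId
  }
}
```

With REST, retrieving the same three fields might require downloading full job records, or issuing one request per job; with GraphQL, a single round trip returns exactly the requested data.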
In the frontend, the chosen technology is React.js, an industry-standard web framework that efficiently handles Document Object Model (DOM) updates to generate fully interactive and dynamic web applications. To complement React.js, Relay.js (which uses GraphQL fragments and caching to efficiently maintain a consistent state for web applications) makes representation of the data from the web server more efficient.
GWCloud exists as one of several projects core to the Gravitational Wave Data Centre (GWDC; see https://gwdc.org.au): a software engineering initiative of Astronomy Australia Limited (AAL; see https://astronomyaustralia.org.au), based at Swinburne University of Technology and operated alongside the Astronomy Data and Computing Services (ADACS) team. ADACS is tasked with providing software services to the Australian astronomy community, and the combined ADACS and GWDC teams presently consist of approximately fifteen dedicated software development professionals.
Several of the core GWDC projects are web applications (others include GWLab and GWLandscape) with similar infrastructure requirements (e.g., LVK user authentication; execution of "jobs" on compute clusters hosting LIGO data; and management, visualisation, and searching of these jobs). To easily expand and grow these projects and to reduce maintenance overheads, a microservice architecture was chosen for all GWDC web applications. When a new feature is to be added or a bug is found in an existing service, it is easy to identify and isolate the code involved, reducing complexity and technical debt. Such isolation also naturally simplifies testing, promoting enhanced reliability and uptime.
Microservice architectures consist of multiple discrete applications running behind the scenes to perform independent and unrelated tasks. The GWCloud application, for example, is just a single backend service and frontend Javascript bundle tasked purely with performing tasks related to the submission and management of Bilby jobs. Other notable services operating in parallel include an authentication service, a database search service, and a Job Controller. The authentication service facilitates integration with the LIGO IDP and manages accounts of non-LIGO-affiliated users. This service also provides details about the user (e.g., LVK membership status and user details such as name and email address). The database search service provides efficient and powerful job searching based on provided search parameters. Finally, the Job Controller provides a service that GWCloud and other projects can use to submit and monitor jobs on High Performance Computing (HPC) facilities, as well as to fetch the results and files of those jobs while running or once completed. GWCloud also loosely makes use of microservices in the frontend. The React host is responsible for loading services as required depending on the current URL of the web browser. This takes advantage of Webpack Federated Modules.

Figure 6. Gravitational-wave Data Center Frontend Architecture. This diagram shows how the frontend uses a microservice architecture depending on the URL being visited. The React host is always loaded, and is then responsible for loading Auth or application javascript bundles. This architecture prevents having a single monolithic javascript bundle that becomes difficult to maintain.
A limited amount of shared code is present, including the host React module, which is responsible for orchestrating which project to load depending on the currently visited URL. Other shared code exists between projects; however, it is not shared from one location but rather duplicated from a base project template, which contains the aforementioned basic core functionality.

A.2. User Experience and Design
The GWDC implements a Human Centred Design (Norman 2013) approach in the creation of user interfaces (UIs) and client-facing APIs. In most cases, the Design Thinking variant of Human Centred Design is used: a sequential process of empathising with the users, defining the issues they face, ideating solutions, creating a prototype, and then testing to determine if the prototype is successful. Often, insights and data discovered along the way can better inform previous steps, leading to an iterative workflow. The main goal of design efforts is to reduce the cognitive load of programming-related tasks to allow researchers to focus on scientific challenges. To our knowledge, this is the first time that Human Centred Design has been intentionally used to improve the UIs and APIs of gravitational-wave research applications.
GWCloud has undergone several Usability Tests (Nielsen 2000;Norman 2013) to inform and validate design choices for the UI and API. Initial Usability Tests were performed on the UI to develop an understanding of how researchers used the interface and to ultimately build empathy with their needs. This data was analysed to define the problems researchers faced and to prototype design solutions. Once a solution had been selected it was implemented and then validated with further testing. This iterative design process is ongoing but has already seen improvement in reducing the number of errors, lowering the barrier of entry, and increasing the user satisfaction, efficiency, and learnability of the UI and API.

A.3. Underlying infrastructure
To reduce maintenance complexity, Kubernetes is used as the underlying infrastructure for GWCloud applications.
Kubernetes is an open-source platform designed for containerised applications. It enables automated operations such as deployments, backups, rollbacks, horizontal virtual resource scaling, name-spaced role-based access control, and configuration decoupling from applications. Since virtual hosts are abstracted from deployments, applications can be redeployed as needed in an automated, self-healing manner.
Within Kubernetes, supporting tools are deployed in compliance with the cloud native roadmap. As per the roadmap, all applications involved with GWCloud are packaged and deployed in the form of Docker containers. Docker is an Open Container Initiative (OCI)-compliant containerisation platform, enabling the ingestion of container configurations by other OCI-compliant tools such as Podman or Buildah. The container images are stored in a container registry. These container images are then repackaged with default deployment configurations in the form of Helm charts and are stored in a Helm chart repository. Both container images and Helm charts are stored within Sonatype Nexus. As a prerequisite, all sensitive data are declared and initialised within the centralised secrets manager HashiCorp Vault, through its key-value pair secrets engine. Values required by deployments from HashiCorp Vault must meet the access requirements configured within Vault. The Helm charts are then ingested and deployed to the target Kubernetes cluster through ArgoCD: the management tool for the deployment lifecycle of GWDC applications.
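For concreteness, a Helm-managed deployment ultimately renders Kubernetes manifests like the following sketch; the image name, registry, and replica count are purely illustrative assumptions, not the GWCloud production configuration:

```yaml
# Hypothetical Kubernetes Deployment for a GWCloud backend service;
# names and values are illustrative only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gwcloud-backend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gwcloud-backend
  template:
    metadata:
      labels:
        app: gwcloud-backend
    spec:
      containers:
        - name: backend
          image: registry.example.org/gwcloud/backend:1.0.0
          ports:
            - containerPort: 8000
```

In the workflow described above, ArgoCD reconciles manifests like this from the Helm chart repository onto the target cluster, so redeployments and rollbacks are declarative rather than manual.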
A.4. The Job Controller

Earlier approaches required users to access files directly. These solutions scale poorly and are hostile to many contemporary practices for HPC facility management. A new solution was required.
The Job Controller itself comprises three discrete components: the Job Controller Server, the Job Controller Client, and Bundles. The Job Controller Server is deployed in the Kubernetes infrastructure and exposes an API that can be used by various modules, including GWCloud, for submitting jobs; retrieving the status of jobs; cancelling jobs; and retrieving file lists and downloading job files from remote clusters. The server is written in C++ for multi-threaded performance and uses a MySQL database for persisting information about jobs and their states, and for caching file lists for complete jobs.
The Job Controller Client is written in Python and runs as a daemon on all leveraged remote clusters, but can be deployed on any system supporting SSH communication with Python 3 installed. It communicates directly with the server via a WebSocket established when the client is initiated, which is instigated by the server via SSH. The client then forks itself to become a daemon and the SSH connection is dropped. This architecture has the advantage that the only communication needed by the remote cluster is a brief initiating inbound SSH interaction, with HTTPS for any subsequent communication with the server. The server can direct the client to submit a new job, cancel a running job, or delete data relating to past jobs. The client tracks the state of running jobs via Bundles (described below) and reports job-state updates to the server. Importantly, the client also provides the ability for the server to request a file list for a job in real time, and for the server to ask the client to send a job result file over the WebSocket for transfer to a user via browser or API. A single Job Controller server may have many clients on many remote clusters.
Communication between the client and the server happens over a single WebSocket connection and is scheduled using a multi-level priority queue: an algorithm taken from operating system design. This allows higher-priority data (such as job file lists) to be sent first over the WebSocket, while lower-priority data (such as file transfers) is sent last or as "best effort." This design keeps the client/server communication responsive for real-time events (such as when the user requests a file list for a job) at only a slight throughput cost to, for example, file transfers.
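The scheduling idea can be sketched with a heap keyed on (priority level, sequence number), so higher-priority messages always leave first while order within a level stays first-in, first-out. This is a generic illustration, not the Job Controller's actual C++/Python implementation:

```python
import heapq
import itertools

class MultiLevelPriorityQueue:
    """Dequeue messages strictly by priority level (0 = highest),
    preserving FIFO order within each level."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker within a level

    def push(self, priority: int, message):
        heapq.heappush(self._heap, (priority, next(self._counter), message))

    def pop(self):
        return heapq.heappop(self._heap)[2]

# File chunks (low priority) queued before a file-list reply (high
# priority): the reply is still dequeued first.
q = MultiLevelPriorityQueue()
q.push(2, "file-chunk-1")
q.push(2, "file-chunk-2")
q.push(0, "file-list-reply")
order = [q.pop(), q.pop(), q.pop()]
```

Because the heap compares priority first and the monotonic counter second, bulk transfers can never starve an interactive request queued behind them.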
If a WebSocket connection is dropped or broken, the client is terminated, and the server attempts to restart the remote client via SSH. If the SSH connection fails, the server will intermittently retry the SSH connection until it succeeds. This provides minimal downtime, and resiliency against cluster maintenance or transient connectivity issues between the client and server.
Bundles are the final component of the Job Controller. They contain the business logic required to prepare a job for submission, submit it, and check its execution status. A single Job Controller client may have many bundles, which could represent different projects (GWCloud, GWLab, etc.) or different versions of runtime codes (e.g., Bilby) leveraged by them. A versioned history of bundles is maintained by the client to support robust reproducibility of past jobs, if required. In the case of GWCloud, its bundle is responsible for rewriting the .ini file for the local cluster hosting a job, downloading and storing supporting files, and tracking the state of the job for Slurm and Condor batch schedulers.