The Astronomy Commons Platform: A Deployable Cloud-Based Analysis Platform for Astronomy

We present a scalable, cloud-based science platform solution designed to enable next-to-the-data analyses of terabyte-scale astronomical tabular datasets. The presented platform is built on Amazon Web Services (over Kubernetes and S3 abstraction layers), utilizes Apache Spark and the Astronomy eXtensions for Spark for parallel data analysis and manipulation, and provides the familiar JupyterHub web-accessible front-end for user access. We outline the architecture of the analysis platform, provide implementation details, rationale for (and against) technology choices, verify scalability through strong and weak scaling tests, and demonstrate usability through an example science analysis of data from the Zwicky Transient Facility's 1Bn+ light-curve catalog. Furthermore, we show how this system enables an end-user to iteratively build analyses (in Python) that transparently scale processing with no need for end-user interaction. The system is designed to be deployable by astronomers with moderate cloud engineering knowledge, or (ideally) IT groups. Over the past three years, it has been utilized to build science platforms for the DiRAC Institute, the ZTF partnership, the LSST Solar System Science Collaboration, the LSST Interdisciplinary Network for Collaboration and Computing, as well as for numerous short-term events (with over 100 simultaneous users). A live demo instance, the deployment scripts, source code, and cost calculators are accessible at http://hub.astronomycommons.org/.


INTRODUCTION
Today's astronomy is undergoing a major change. Historically a data-starved science, it is being rapidly transformed by the advent of large, automated, digital sky surveys into a field where terabyte and petabyte data sets are routinely collected and made available to researchers across the globe.
The Zwicky Transient Facility (ZTF; Bellm et al. 2019; Graham et al. 2019; Dekany et al. 2020; Masci et al. 2019) has engaged in a three-year mission to monitor the Northern sky. With a large camera mounted on the Samuel Oschin 48-inch Schmidt telescope at Palomar Observatory, the ZTF is able to monitor the entire visible sky almost twice a night. Generating about 30 GB of nightly imaging, ZTF detects up to 1,000,000 variable, transient, or moving sources (or alerts) every night, and makes them available to the astronomical community (Patterson et al. 2018). Towards the middle of 2024, a new survey, the Legacy Survey of Space and Time (LSST; Ivezić et al. 2019), will start operations on the NSF Vera C. Rubin Observatory. Rubin Observatory's telescope has a mirror almost seven times larger than that of the ZTF, which will enable it to search for fainter and more distant sources. Situated in northern Chile, the LSST will survey the southern sky taking ∼1,000 images per night with a 3.2 billion-pixel camera with a ∼10 deg² field of view. The stream of imaging data (∼6 PB/yr) collected by the LSST will yield repeated measurements (∼100/yr) of billions of objects.
node, among other features. Each Kubernetes object is described using YAML, a human-readable format for storing configuration information (lists and dictionaries of strings and numbers). 6 Figure 1 shows an example set of YAML-formatted text describing Kubernetes objects that together would link a Jupyter notebook server backed by a 10 GiB storage device to an internet-accessible URL.
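To give a flavor of what such YAML looks like, the sketch below shows two objects of the kind Figure 1 describes: a claim for a 10 GiB storage volume and a Service routing traffic to a notebook Pod. All names and labels here are illustrative, not the platform's actual manifests.

```yaml
# Illustrative sketch only; object names and labels are hypothetical.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: claim-user-1
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi        # the 10 GiB storage device backing the notebook
---
apiVersion: v1
kind: Service
metadata:
  name: jupyter-user-1
spec:
  selector:
    app: jupyter-user-1    # routes to the Pod carrying this label
  ports:
    - port: 80
      targetPort: 8888     # the Jupyter notebook server's default port
```

In a full deployment, an additional Ingress or load-balancer object would map this Service to the internet-accessible URL.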
The Kubernetes core service, a set of software called the control plane, is responsible for maintaining an API server accessible within (and potentially externally to) the cluster, maintaining a database of the objects created so far, assigning pods (applications) to nodes in the cluster in a way that respects their constraints, keeping track of the general state of the cluster, and handling aspects of networking within the cluster and through the cloud provider. Additional components of the Kubernetes core code (or third-party plugins) handle provisioning of virtual hardware from the cloud provider to satisfy requirements that cannot be met by current cluster resources. As an example on AWS, an outstanding request for a Service requiring a load balancer will be fulfilled by creating an AWS Elastic Load Balancer (ELB) or Application Load Balancer (ALB). Similarly, an outstanding request for a Persistent Volume will be fulfilled by creating an Amazon Elastic Block Store (EBS) volume. Finally, applications can be scheduled on the cluster that modify the cluster state. In particular, the Kubernetes Cluster Autoscaler interacts with the cloud provider to terminate underutilized nodes or add new nodes when there are pods that cannot be scheduled given the current number of nodes. 7 The handling of hardware provisioning from the cloud provider by administrative software in the Kubernetes control plane, through Kubernetes plugins, and through applications running in the cluster allows additional user applications to remain decoupled from the cloud provider's API.

System Architecture
Cloud systems offer unique infrastructure elements that help support a system for scalable science analysis. Virtual machines can be rented in the hundreds or thousands to support large computations, each accessing data in a scalable manner from a managed service. Orchestration layers, like Kubernetes, ease the process of running science software on cloud resources. In this section, we discuss how we leverage cloud infrastructure to build such a platform. Underlying this platform are four key components:

1. An interface for computing. We use the Jupyter ecosystem: a JupyterHub deployment based on the zero-to-jupyterhub project that creates Jupyter notebook servers on our computing infrastructure for authenticated users. A Jupyter notebook server provides a web interface to interactively run code on a remote machine alongside a set of pre-installed software libraries. 8

2. A scalable analytics engine. We use Apache Spark, an industry-standard tool for distributed data querying and analysis, and the Astronomy eXtensions for Spark (AXS).

3. A scalable storage solution. We use Amazon Simple Storage Service (S3). Amazon S3 is a managed object store that can store arbitrarily large data volumes and scale to an arbitrarily large number of requests for this data.

4. A deployment solution. We've developed a set of Helm charts and bash scripts automating the deployment of this system onto the AWS cloud. 9

Each of these components is largely disconnected from the others and can be mixed and matched with other drop-in solutions. 10 Aside from the deployment solution, each component consists of simple processes communicating with each other through an API over a network. This means that each solution for (1), (2), and (3) is largely agnostic to whether it runs on a bare-metal machine, inside a virtual machine (VM), inside a Linux container, or on a managed cloud service, as long as each component is properly networked. Figure 2 shows the state of the Kubernetes cluster during normal usage of a platform created with our Helm chart, as well as the pathway of API interactions that occur as a user interacts with the system. A user gains access to the system through a JupyterHub, which is a log-in portal and proxy to one or more managed Jupyter notebook servers spawned by the JupyterHub. This notebook server is run on a node of the Kubernetes cluster, which can be constrained by hardware requirements and/or administrator-provided node labels. A proxy forwards external authenticated requests from the internet to a user's notebook server. Users can use the Apache Spark software, which is pre-installed on their server, to create a Spark cluster using the Spark on Kubernetes API.

Figure 2. A diagram of the essential components of the Kubernetes cluster when the science platform is in use. Each box represents a single Kubernetes Pod scheduled on the cluster. The colors of the boxes and the dashed ovals surrounding the three groups are for visualization purposes only; each Pod exists as an independent entity to be scheduled on any available machine. The colored paths and letter markers indicate the pattern of API interactions that occur when users interact with the system. (a) shows a user connecting to the JupyterHub from the internet. The JupyterHub creates a notebook server (jupyter-user-1) for the user (b). The user creates a Spark cluster using their notebook server as the location for the Spark driver process (c). Scheduled Spark executor Pods connect back to the Spark driver process running in the notebook server (d). The Spark driver process accesses a MariaDB server for catalog metadata (e). In the background, the Kubernetes cluster autoscaler keeps track of the scheduling status of all Pods (f). At any point in (a)-(d), if a Pod cannot be scheduled due to a lack of cluster resources, the cluster autoscaler will request more machines from AWS to meet that need (g). Optionally, the user can connect to their running server with SSH (h).

6 See https://yaml.org/ for specification and implementations.
7 https://github.com/kubernetes/autoscaler
8 See https://zero-to-jupyterhub.readthedocs.io/ and https://github.com/jupyterhub/zero-to-jupyterhub-k8s.
9 For Helm, see https://helm.sh/.
10 Zeppelin notebooks, among other tools, compete with Jupyter notebooks for accessing remote computers for analysis and data visualization. Dask is a competing drop-in for Apache Spark that scales Python code natively. A Lustre file system could be a drop-in for Amazon S3. Amazon EFS, a managed and scalable network filesystem, is also an option. Kustomize is an alternative to Helm.
The user can also access their running notebook server using a Secure Shell (SSH) client.

An Interface to Computing
The Jupyter notebook application, and its extension JupyterLab, provide an ideal environment for astronomers to access, manipulate, and visualize data sets. The Jupyter notebook/lab applications, although usually run locally on a user's machine, can run on a remote machine and be accessed through a JupyterHub, a web application that securely forwards authenticated requests directed at a central URL to a running notebook server. 11 The authentication layer of JupyterHub allows us to block non-authenticated users from the platform. Our science platform integrates authentication through GitHub, allowing us to authenticate both individual users by their GitHub usernames and groups of users through GitHub Organization membership. For example, the implementation of this science platform described in Section 3 restricts access to the platform and its private data to members of the dirac-institute and ZwickyTransientFacility GitHub organizations.
Users can choose to bypass the Jupyter computing environment by accessing their running notebook server with an SSH client. The SSH client also facilitates file transfers between the user and their notebook server when using utilities such as scp or rsync. Access through SSH is implemented using a "jump host" setup: a single, always-running container on the cluster runs an OpenSSH server that is networked to the internet. When the user's notebook server is running, it additionally runs an OpenSSH server in the background. The user adds a cryptographic public key to a file system shared between the notebook server and jump host. The user connects to the jump host with their username and a cryptographic private key stored on their local machine. From the jump host, the user can connect to their running notebook server with no additional configuration. A properly formatted invocation of the ssh command can do this in one step. Additional public and private keys are generated automatically for each host and each user and placed on a shared file system with correct permissions.
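The one-step connection can be captured in an SSH client configuration file; the fragment below is a sketch, with all host names and the key path as placeholders rather than the deployment's actual values.

```
# Hypothetical ~/.ssh/config fragment; host names and paths are placeholders.
Host platform-jump
    HostName ssh.hub.example.com
    User <github-username>
    IdentityFile ~/.ssh/id_ed25519

# `ssh platform-notebook` then reaches the notebook server in one step,
# tunneling through the jump host via ProxyJump.
Host platform-notebook
    HostName <notebook-server-hostname>
    User <github-username>
    ProxyJump platform-jump
```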
Finally, a Virtual Network Computing (VNC) desktop is made available to the user for using graphical applications outside of the Jupyter notebook. The VNC desktop is provided through the Jupyter Remote Desktop Proxy software, an extension to the Jupyter notebook server. The in-browser desktop emulation offers reasonable interaction latencies over a typical internet connection.

A Scalable Analytics Engine
Apache Spark (Spark) is a tool for general distributed computing, with a focus on querying and transforming large amounts of data, that works well in a shared-nothing, distributed computing environment. Spark uses a driver/executor model for executing queries. The driver process splits a given query into several (one to thousands of) independent tasks, which are distributed to independent executor processes. The driver process keeps track of the state of the query, maintains communication with its executors, and coalesces the results of finished tasks. Since the driver and executor(s) only need to communicate with each other over the network, executor processes can remain on the same machine as the driver, to take advantage of parallelism on a single machine, or be distributed across several other machines in a distributed computing context. 12 The API for data transformation, queries, and analysis remains the same whether the Spark engine executes the code sequentially on a local machine or in parallel on distributed machines, allowing code that works on a laptop to naturally scale to a cluster of computers.
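To make the local-versus-distributed point concrete, only a handful of Spark properties change between the two modes; the sketch below uses placeholder values, not our deployment's actual configuration.

```
# Local mode: one process, all cores on the current machine.
spark.master                          local[*]

# Spark on Kubernetes: same user code, different master and executor settings.
# spark.master                        k8s://https://<kubernetes-api-server>:443
# spark.executor.instances            16
# spark.kubernetes.container.image    <registry>/<spark-image>:<tag>
```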
To support astronomy-specific operations, Zečević et al. (2019) developed the Astronomy eXtensions for Spark (AXS), a set of additional Python bindings to the Spark API that ease astronomy-specific data queries, such as cross-matches and sky maps, and add an internal optimization that speeds up catalog cross-matches using the ZONES algorithm, described in Zečević et al. (2019). AXS is included in our science platform to ease the use of Spark for astronomers and to provide fast cross-matching between catalogs.
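The intuition behind the ZONES optimization can be shown in a short, pure-Python sketch (not the AXS implementation): the sky is cut into horizontal declination strips, and each object only needs to be compared against objects in its own strip and the two neighboring ones. Here rows are (id, ra, dec) tuples in degrees, and the match radius `eps` and `zone_height` are free parameters of the sketch.

```python
# Pure-Python sketch of the ZONES cross-match idea (not the AXS implementation).
from collections import defaultdict
from math import radians, degrees, sin, cos, asin, sqrt

def zone_of(dec, zone_height):
    # Zones are horizontal declination strips of fixed height.
    return int((dec + 90.0) // zone_height)

def angular_sep(ra1, dec1, ra2, dec2):
    # Haversine great-circle separation, in degrees.
    ra1, dec1, ra2, dec2 = map(radians, (ra1, dec1, ra2, dec2))
    a = sin((dec2 - dec1) / 2) ** 2 \
        + cos(dec1) * cos(dec2) * sin((ra2 - ra1) / 2) ** 2
    return degrees(2 * asin(sqrt(a)))

def crossmatch(cat1, cat2, eps=1.0 / 3600.0, zone_height=1.0 / 60.0):
    # Bucket the second catalog by zone, so each object in the first catalog
    # scans only its own zone and the two neighboring zones instead of the
    # whole sky. In AXS/Spark the zone id doubles as a partitioning key,
    # which is what makes the join distributable.
    zones = defaultdict(list)
    for row in cat2:
        zones[zone_of(row[2], zone_height)].append(row)
    matches = []
    for id1, ra1, dec1 in cat1:
        z = zone_of(dec1, zone_height)
        for dz in (-1, 0, 1):
            for id2, ra2, dec2 in zones.get(z + dz, ()):
                if angular_sep(ra1, dec1, ra2, dec2) <= eps:
                    matches.append((id1, id2))
    return matches
```

Because the default `eps` (1 arcsecond) is much smaller than the default zone height (1 arcminute), checking adjacent zones is sufficient to catch matches across zone boundaries.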
AXS requires that tabular data be stored in the Apache Parquet format, a compressed, column-oriented data storage format. 13 The columnar nature and partitioning of the files in Parquet format allow for very fast reads of large tables. For example, one can obtain a subset of just the "RA" column of a catalog without scanning through all parts of all of the files. Apache Spark's flexible functionality for accessing data of different formats, exposed in Python through the pyspark.sql.DataFrameReader interface, allows one to convert a broad range of catalogs in different formats, including FITS (Peloton et al. 2018), to Parquet. AXS additionally requires that catalogs stored in Parquet be similarly partitioned in order to perform fast cross-matches. AXS provides a single function, exposed in Python as AxsCatalog.save, that will re-partition a data frame read using Spark, save it in Parquet format, and make the table available to a user through its Apache Hive metastore database. 14

11 As an example, one may access a JupyterHub at the URL https://hub.example.com which, if you are an authenticated user, will forward through a proxy to https://hub.example.com/user/username. When running a notebook on a local machine, there is no access to a JupyterHub and the single-user server is served at (typically) http://localhost:8888.
12 Creating executor processes on a single machine isn't done in practice; instead, Spark supports multithreading in the driver process that replaces the external executor process(es) when using local resources.
13 See https://parquet.apache.org/
14 See https://hive.apache.org/

A Scalable Storage Solution
Amazon S3 is a scalable object store with built-in backups and optional replication across geographically distinct AWS regions. Files are placed into an S3 bucket, a flat file system that scales well to simultaneous access from thousands of individual clients. Files are accessed over the network using a REST API over HTTP, supporting actions to retrieve and create new objects in the bucket. The semantics of the S3 API are not compliant with the POSIX specification that typical file systems adhere to. However, there are projects, such as s3fs, that allow for mounting of the S3 object store as a traditional file system and provide an interface layer that makes the file system largely POSIX compliant. 15 The names of S3 buckets are globally unique, which makes public and private sharing of data in a bucket easy: a user anywhere in the world can access public data from an S3 bucket by specifying only its name. To access private data, the user must additionally authenticate themselves with AWS. Access control lists provide object-level permissions for read/write access to certain users and the public. Additionally, there is no limit to the amount of data that can be stored, although individual files must be no larger than 5 TB, and individual upload actions cannot exceed 5 GB. In this platform, we store and access TB+ tabular datasets stored in Parquet format with a common partitioning scheme, making the data AXS compatible.

A deployment solution
We have created a deployment solution for organized creation and management of each of these three components. The code for this is stored in a GitHub repository accessible at https://github.com/astronomy-commons/science-platform. Files referenced in the following code snippets assume access at the root level of this repository.
To create and manage our Kubernetes cluster, we use the eksctl software. 16 This software defines the configuration of an Amazon Elastic Kubernetes Service (EKS) cluster from YAML-formatted files. An EKS cluster consists of a managed Kubernetes master node that runs the control plane software, along with a set of either managed or unmanaged nodegroups backed by Amazon Elastic Compute Cloud (EC2) virtual machines, which run the applications scheduled on the cluster. 17 To help us manage large numbers of Kubernetes objects, we use Helm, the "package manager for Kubernetes." Helm allows Kubernetes objects described as YAML files to be templated using a small number of parameters or "values," also stored in YAML. Helm packages together YAML template files and their default template values in Helm "charts." Helm charts can have versioned dependencies on other Helm charts to compose larger charts from smaller ones. After cluster creation, we use Helm to install the cluster-autoscaler-chart, which deploys the Kubernetes Cluster Autoscaler application. The cluster autoscaler scales the number of nodes in the Kubernetes cluster up or down when resources are too constrained or underutilized.
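For concreteness, an eksctl cluster definition has roughly the following shape; the cluster name, region, and nodegroup sizes below are placeholders, not our project's actual configuration.

```yaml
# Illustrative eksctl ClusterConfig sketch; all values are placeholders.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: science-platform
  region: us-west-2
nodeGroups:
  - name: core             # always-on nodes for the hub and proxies
    instanceType: t3.xlarge
    minSize: 1
    maxSize: 3
  - name: workers          # autoscaled nodes for notebooks and Spark executors
    instanceType: t3.2xlarge
    minSize: 0
    maxSize: 100
```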
We have created a Helm chart to manage and distribute versioned deployments of our platform. This chart depends on three sub-charts:

1. The zero-to-jupyterhub chart, a standard and customizable installation of JupyterHub on Kubernetes. The zero-to-jupyterhub chart uses Docker images from the Jupyter Docker Stacks by default and uses the KubeSpawner for creating Jupyter notebook servers using the Kubernetes API directly. 18

2. The nfs-ganesha-server-and-external-provisioner chart, which provides a network file system server and a Kubernetes-compliant storage provisioner. 19

3. A mariadb chart, which provides a MariaDB server used as an Apache Hive metadata store for AXS. 20

The Helm chart contains configuration of the three sub-charts. For example, the chart is configured to use a Docker image with installations of Spark/AXS, the OpenSSH client/server, and Jupyter notebook server extensions like the Jupyter Remote Desktop Proxy when the JupyterHub starts a notebook server. Additional configuration in the chart provides instructions for mounting the network file system and for setting up the Hive metastore, defines reasonable defaults for using Spark on Kubernetes, and defines notebook-server startup scripts that start the SSH server and set up the user's space on the file system (such as copying example notebooks to the user's home directory). We found it critically important to provide a way for users to easily share files with one another. The default Helm chart and KubeSpawner configuration creates, for each single-user server, a Persistent Volume Claim backed by the default storage device configured for the Kubernetes cluster, allowing a user's files to persist beyond the lifetime of their server. For AWS, the default storage device is an EBS volume, roughly equivalent to a network-connected SSD with guaranteed input/output capabilities. By default, this volume is mounted at the file system location /home/jovyan in the single-user container.
This setup makes it difficult for the users' results to be shared with others because: a) they are isolated to their own disk, and b) by default all users share the same username and IDs, making granular access control extremely difficult.
To resolve these issues, we provisioned a network file system (NFSv4) server using the nfs-ganesha-server-and-external-provisioner Helm chart, creating a centralized location for user files and enabling file sharing between users. To solve the problem of access control, each notebook container is started with two environment variables: NB_USER, set equal to the user's GitHub username, and NB_UID, set equal to the user's GitHub user ID. The start-up scripts included in the default Jupyter notebook Docker image use the values of these environment variables to create a new Linux user, move the home directory location, update home directory ownership, and update home directory permissions from their default values. Figure 3 shows how the NFS server is mounted into single-user pods to enable file sharing. The NFS server is mounted at the /home directory on the single-user server, and a directory is created for the user at the location /home/<username>. Each user's directory is protected using UNIX-level file permissions that prevent other users from making unauthorized edits to their files. System administrators can elevate their own permissions (and access the back-end infrastructure arbitrarily) to edit user files at will. The UNIX user IDs (UIDs) are globally unique, since each is equal to the user's unique GitHub ID.
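A non-runnable sketch of how such environment variables can be injected, written as a jupyterhub_config.py fragment: `pre_spawn_hook` and `spawner.environment` are standard JupyterHub/KubeSpawner hooks, while `lookup_github_id` is a hypothetical helper standing in for however the numeric ID is obtained from the GitHub authenticator's state.

```python
# Configuration sketch only; lookup_github_id is a hypothetical helper.
def set_identity(spawner):
    # Pass the GitHub identity into the notebook container so its start-up
    # scripts can create a matching Linux user and home directory.
    spawner.environment.update({
        "NB_USER": spawner.user.name,
        "NB_UID": str(lookup_github_id(spawner.user.name)),
    })

c.Spawner.pre_spawn_hook = set_identity
```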
In initial experiments, we used the managed AWS Elastic File System (EFS) service to enable file sharing. Using the managed service provides significant benefits, including unlimited storage, scalable access, and automatic back-ups. However, EFS had a noticeable latency increase per input/output operation compared to the EBS-backed storage of the Kubernetes-managed NFS server. In addition, EFS storage is 3× more expensive than EBS storage. 21 In addition to storing home directories on the NFS server, we have an option to store all of the science analysis code (typically managed as conda environments) on the NFS server. This has several advantages relative to the common practice of keeping the code in Jupyter notebook Docker images. The primary advantage is that installed software can be updated in real time, without the need to restart user servers. A secondary advantage is that the Docker images become smaller and faster to download and start up (thus improving the user experience). The downside is decreased scalability: the NFS server is a central point shared by all users of the system. Analysis codes are often made up of thousands of small files, and a request for each file when starting a notebook can lead to large loads on the NFS server. This load increases when serving more than one client, and may not scale beyond serving a few hundred users.
For systems requiring significant scalability, a hybrid approach of providing a base conda environment in the Docker image itself in addition to mounting user-created and user-managed conda environments and Jupyter kernels from the NFS server is warranted. This allows for fast and scalable access to the base environment while also providing the benefit of shared code bases that can be updated in-place by individual users.

Providing Optimal and Specialized Resources
Some users require additional flexibility in the hardware available to match their computing needs. To accommodate this, we have made deployments of this system that allow users to run their notebooks on machines with more CPU or RAM or with specialty hardware like Graphics Processing Units (GPUs) as they require. This functionality is restricted to deployments where we trust the discretion of the users and is not included in the demonstration deployment accompanying this manuscript.
Flexibility in hardware is provided through a custom JupyterHub options form that is shown to the user when they try to start their server. An example form is shown in Fig. 4. Several categories of AWS EC2 instances are enumerated with their hardware and costs listed. Hardware is provisioned in terms of vCPU, or "virtual CPU," roughly equivalent to one thread on a hyperthreaded CPU. In this example, users can pick an instance that has as few resources as 2 vCPU  and 1 GiB of memory at the lowest cost of $0.01/hour (the t3.micro EC2 instance), to a large-memory machine with 96 vCPU and 768 GiB of memory at a much larger cost of $6.05/hour (the r5.24xlarge EC2 instance). In addition, nodes with GPU hardware are provided as an option at moderate cost (4 vCPU, 16 GiB memory, 1 NVIDIA Tesla P4 GPU at $0.53/hour; the g4dn.xlarge EC2 instance). These GPU nodes can be used to accelerate code in certain applications such as image processing and machine learning. For this deployment, the form is configured to default to a modest choice with 4 vCPU and 16 GiB of memory at a cost of $0.17/hour (the t3.xlarge EC2 instance). This range of hardware options and prices will change over time; the list provided is simply an example of the on-demand heterogeneity provided via AWS.
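In the zero-to-jupyterhub chart, such a form can be expressed through KubeSpawner profiles. The fragment below is an illustrative sketch: the instance types and prices mirror the examples above, but the node-selector labels and resource limits are assumptions about the nodegroup setup, not our deployment's actual values.

```yaml
# Illustrative zero-to-jupyterhub values.yaml fragment; labels are assumptions.
singleuser:
  profileList:
    - display_name: "Standard: 4 vCPU, 16 GiB ($0.17/hr, t3.xlarge)"
      default: true
      kubespawner_override:
        node_selector: {"node.kubernetes.io/instance-type": "t3.xlarge"}
    - display_name: "GPU: 4 vCPU, 16 GiB, 1 NVIDIA T4 ($0.53/hr, g4dn.xlarge)"
      kubespawner_override:
        node_selector: {"node.kubernetes.io/instance-type": "g4dn.xlarge"}
        extra_resource_limits: {"nvidia.com/gpu": "1"}
```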

Multi-cloud support
It is unlikely, and perhaps undesirable, that all scientists and organizations will agree on a single cloud provider when storing data, accessing computing resources, or deploying our system. There are many clouds outside of Amazon Web Services that scientists may have already chosen for computing and data storage based on factors such as the availability of compute credits, academic institutional tie-ins, convenience, familiarity, or differences in product offerings. 22 Therefore, it is necessary to think about and accommodate multi-cloud support in our architecture. The system architecture outlined in Section 2.3 is sensitive in two places to the choice of cloud provider: the deployment solution and the storage solution.
The deployment solution is tied to the choice of cloud provider only during creation of the Kubernetes cluster with the deployment scripts. Helm interacts with the Kubernetes cluster directly rather than with the cloud provider, so it remains cloud-agnostic. Cloud providers that offer a managed Kubernetes service typically also offer command line interface (CLI) tools for creating and managing a Kubernetes cluster. In other cases, tools like kops and Kubespray enable cluster creation on a wide variety of public clouds as well as on private computing clusters. Our deployment scripts can be extended in the future to accommodate other clouds using these tools.
The storage solution has a tighter coupling to the cloud provider, which leads to potential lock-in as well as issues with sharing data across clouds. While we have chosen Amazon S3 as our storage solution, similar object storage products from other cloud providers can be used. Apache Spark can access data stored in any object store that uses the S3 API. Some cloud providers offer object storage products that expose APIs that are compliant or compatible with the S3 API, which makes accessing data across clouds feasible. For example, both Google Cloud Platform (GCP) and DigitalOcean provide storage services that have some or full interoperability with S3. This means a user can expect to deploy our system on a cloud with a compatible object store and use that object store with few or no changes in how the data are accessed. However, transferring large amounts of data between clouds through the internet (referred to as "egress") remains very costly, making multi-cloud data access infeasible in practice. AWS quotes data transfer fees of $0.05-$0.09 per GB depending on the total volume transferred in a month. This means that a user who has deployed our system in a cloud other than AWS cannot expect to access large amounts of data stored within Amazon S3. This sets the expectation that our system will be deployed in the cloud where a user's data, and potentially other relevant and desirable data sets, are located. Notably, this challenge persists, at a smaller scale, within an individual cloud due to data transfer costs between geographically distinct data centers (often called regions). AWS quotes data transfer costs between regions at $0.01-$0.02 per GB. However, unless very low latency for data access from many countries/continents is required, a user or organization can likely choose a single region for storing data and acquiring computing resources, and stick with it.
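The scale of the egress problem follows directly from the per-GB prices quoted above; the back-of-the-envelope arithmetic below uses those figures, keeping in mind that real cloud pricing is tiered and changes over time.

```python
# Back-of-the-envelope egress cost, using the per-GB prices quoted in the text.
def egress_cost_usd(data_gb, price_per_gb):
    return data_gb * price_per_gb

# Moving a ~4 TB catalog (e.g., the ZTF light curves) out of a cloud at the
# quoted high-end internet rate, versus moving it between regions:
cross_cloud = egress_cost_usd(4000, 0.09)   # $360 for a single transfer
cross_region = egress_cost_usd(4000, 0.02)  # $80 for a single transfer
```

A single cross-cloud copy of the catalog already costs several times more than the ∼$100 full-catalog analysis described in Section 3, which is why co-locating compute with the data is the practical choice.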
Slightly different system architectures allow for easier multi-cloud data access. For example, the Jupyter Kernel Gateway and Jupyter Enterprise Gateway projects can be used to access computing resources that are distributed across multiple clusters. 23 Both of these projects provide a method to create and access a running process in a remote cluster. This allows one to create several Kubernetes clusters in different clouds where desirable datasets are located and use a single JupyterHub as an entrypoint to access data stored in multiple clouds. While this proposed solution does not bring the data closer together, which would be desirable for applications that require jointly analyzing datasets from multiple sources in different clouds, it does allow for baseline multi-cloud data access. True multi-cloud data access is likely to remain infeasible without significant decreases in egress costs or prior agreement on where to store data from dataset stakeholders. New services, such as Cloudflare R2, that provide cloud storage with zero or near-zero egress cost brighten the prospects for cheap, multi-cloud data transfer and would lift the requirement for consensus among stakeholders.

Figure 4. A screenshot of the JupyterHub server spawn page. Several options for computing hardware are presented to the user with their hardware and costs enumerated. Of note is the ability to spawn GPU instances on demand. When a user selects one of these options, their spawned Kubernetes Pod is tagged so that it can only be scheduled on a node with the desired hardware. If a node with the required hardware does not exist in the Kubernetes cluster, the cluster autoscaler will provision it from the cloud provider (introducing a ∼5 minute spawn time).

A DEPLOYMENT FOR ZTF ANALYSES
To demonstrate the capabilities of our system and verify its utility to a science user, we deployed an instance of the platform to enable the analysis of data from the Zwicky Transient Facility (ZTF). Section 3.1 describes the catalogs available through this deployment, Section 3.2 demonstrates the typical access pattern to the data using the AXS API, and Section 3.3 showcases a science project executed on this platform. Table 1 enumerates the catalogs available to the user in this example deployment. We provide a catalog of light curves from ZTF, created from de-duplicated match files. The most recent version of these match files has a data volume of ∼4 TB describing light curves of ∼1 billion objects in the "g", "r", and "i" bands. In addition, we provide access to catalogs from the data releases of the SDSS, Gaia, AllWISE, and Pan-STARRS surveys for convenient cross-matching. The system allows users to upload, cross-match, and share custom catalogs in addition to the ones provided, using the method described in Section 2.3.2.

Typical workflow
Users can query the available catalogs through the AXS/Spark Python API. For example, a user loads a reference to the ZTF catalog like so:

    import axs
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    catalog = axs.AxsCatalog(spark)
    ztf = catalog.load('ztf')

The spark object represents a Spark SQL session and a connection to a Hive metastore database, which stores metadata for accessing the catalogs. This object is used as a SQL backend when creating the AxsCatalog, which acts as an interface to the available catalogs. Catalogs from the metastore database are loaded by name using the AXS API. Data subsets can be created by selecting one or more columns:

    ztf_subset = ztf.select('ra', 'dec', 'mag_r')

AxsCatalog Python objects can be cross-matched with one another to produce a new catalog with the cross-match result:

    gaia = catalog.load('gaia')
    xmatch = ztf.crossmatch(gaia)

The xmatch object can be queried like any other AxsCatalog object. Spark allows for the creation of User-Defined Functions (UDFs) that can be mapped onto rows of a Spark DataFrame. The following example shows how a Python function that converts an AB magnitude to its corresponding flux in janskys can be mapped onto all ∼63 billion r-band magnitude measurements from ∼1 billion light curves in the ZTF catalog (in parallel):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType
    from pyspark.sql.types import FloatType
    import numpy as np

    @udf(returnType=ArrayType(FloatType()))
    def abMagToFlux(m):
        flux = 10 ** ((8.90 - np.array(m)) / 2.5)
        return flux.tolist()

    ztf_flux_r = ztf.select(
        abMagToFlux(ztf['mag_r']).alias("flux_r")
    )
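The AB-magnitude-to-flux conversion, f = 10**((8.90 − m)/2.5) Jy, can be sanity-checked in plain Python with no Spark required: m = 8.90 corresponds to exactly 1 Jy, and every additional 2.5 magnitudes divides the flux by a factor of 10.

```python
# Scalar version of the conversion applied inside the Spark UDF.
def ab_mag_to_flux_jy(m):
    # AB magnitude zero point: m = 8.90 corresponds to 1 Jy.
    return 10 ** ((8.90 - m) / 2.5)

one_jy = ab_mag_to_flux_jy(8.90)     # 1.0
tenth_jy = ab_mag_to_flux_jy(11.40)  # ~0.1
```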

Science case: Searching for Boyajian star analogs
We test the ability of this platform to enable large-scale analysis by using it to search for Boyajian star (Boyajian et al. 2016) analogs in the ZTF catalog. The Boyajian star, discovered with the Kepler telescope, exhibits unusual dips in its brightness. We search the ZTF catalog for Boyajian analogs, i.e. other stars that show anomalous dimming events; the search will be fully described in Boone et al. (in prep.), and here we limit ourselves to aspects necessary for the validation of the analysis system. The main method for our Boyajian-analog search relies on querying and filtering large volumes of ZTF light curves using AXS and Apache Spark in search of dimming events. Objects of interest are then spatially cross matched against the other available catalogs, for example against the Gaia catalog to create a color-magnitude diagram and against the AllWISE catalog to identify whether there is excess flux in the infrared. This presents an ideal science case for our platform: the entire ZTF catalog must be queried, filtered, analyzed, and compared to other catalogs repeatedly in order to complete the science goals.
We wrote custom Spark queries that search the ZTF catalog for dimming events. After filtering the light curves, we created a set of UDFs for model fitting that wrap the optimization routines from the scipy package. These UDFs are applied to the filtered light curves to parallelize least-squares fitting of various models to the dipping events. Figure 5 shows an outline of this science process using AXS.
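To make this pattern concrete, the sketch below shows what a scipy-based model-fitting function of this kind might look like. The dip model, function names, and initial-guess heuristics are illustrative assumptions on our part, not the actual code from Boone et al. (in prep.):

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian_dip(t, t0, depth, width, baseline):
    # Toy dip model: constant baseline magnitude plus a Gaussian dimming
    # event centered at t0 (magnitudes increase as the star dims).
    return baseline + depth * np.exp(-0.5 * ((t - t0) / width) ** 2)

def fit_dip(mjd, mag):
    # Least-squares fit of the dip model to a single light curve.
    # Returns [t0, depth, width, baseline], or None if the fit fails.
    t = np.asarray(mjd, dtype=float)
    m = np.asarray(mag, dtype=float)
    # Initial guess: deepest point sets t0, depth from the median level.
    p0 = [t[np.argmax(m)], m.max() - np.median(m), 5.0, np.median(m)]
    try:
        popt, _ = curve_fit(gaussian_dip, t, m, p0=p0, maxfev=5000)
        return [float(p) for p in popt]
    except RuntimeError:
        return None
```

Wrapped with pyspark's udf, e.g. udf(fit_dip, ArrayType(FloatType())), a function like this can be mapped over the (mjd, mag) array columns of every filtered light curve in parallel.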
The use of Apache Spark speeds up querying, filtering, and fitting of the data tremendously when deployed in a distributed environment. We used a Jupyter notebook on our platform to allocate a Spark cluster consisting of 96 t3.2xlarge EC2 instances. Each instance had access to 8 threads running on an Intel Xeon Platinum 8000 series processor with 32 GiB of RAM, creating a cluster with 768 threads and 3,072 GiB of RAM. We used the Spark cluster to complete a complex filtering task on the full 4 TB ZTF data volume in ∼3 hours. The underlying system was able to scale to full capacity within minutes, and to scale down just as quickly once the demanding query was completed, providing extreme levels of parallelism at minimal cost. The total cost over the duration of the query was ∼$100.
This same complex query was previously performed on a large shared-memory machine at the University of Washington with two AMD EPYC 7401 processors and 1,024 GiB of RAM. The query utilized 40 threads and accessed the catalog from directly connected SSDs. It took a full two days to execute on this hardware, in comparison to the ∼3 hours on the cloud-based science platform. Performing an analysis of this scale would not be feasible at all on a user's laptop using data queried over the internet from the ZTF archive.
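The quoted ∼$100 figure can be sanity-checked with simple arithmetic. The hourly rate below is an assumption on our part (approximately the us-west-2 on-demand price for t3.2xlarge at the time of writing); actual rates vary by region and change over time:

```python
# Back-of-the-envelope check on the quoted ~$100 query cost.
hourly_rate = 0.3328   # $/hr per t3.2xlarge instance (assumed on-demand rate)
n_instances = 96       # instances in the Spark cluster
duration_hr = 3        # approximate query runtime in hours
total_cost = hourly_rate * n_instances * duration_hr
print(f"~${total_cost:.0f}")  # roughly consistent with the quoted ~$100
```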
In addition, the group was able to gain the extreme parallelism afforded by Spark without investing a significant amount of time writing Spark-specific code. The majority of coding time was spent developing science-motivated code/logic to detect, describe, and model dipping events within familiar Python UDFs and using familiar Python libraries. In alternative systems that provide similar levels of parallelism, such as HPC systems based on batch scheduling, a user would typically have to spend significant time altering their science code to conform with the underlying software and hardware that enables their code to scale. For example, they may spend significant time re-writing their code in a way that can be submitted to a batch scheduler like PBS/Slurm, or spend time developing a leader/follower execution model using a distributed computing/communication framework such as OpenMPI. Traditional batch scheduling systems running on shared HPC resources typically have a queue that a user's program must wait in before execution. In contrast, our platform scales on-demand to the needs of each individual user.

```python
catalog = axs.AxsCatalog(spark)
df = catalog.load("ztf")            # [1]
dippers = find_and_filter_dips(df)  # [2]
fits = fit_light_curves(dippers)    # [3]
```

Figure 5. An example analysis (boiled down to two lines) that finds light curves in the ZTF light curve catalog with a dimming event. (1) shows how the ZTF catalog is loaded as a Spark DataFrame (df), (2) shows the product of filtering light curves for dimming events, and (3) shows the result of fitting a model to the remaining light curves. This process exemplifies that analyses can often be represented as a filtering and transformation of a larger dataset, a process that Spark can easily execute in parallel.
This example demonstrates the utility of using cloud computing environments for science: when science is performed on a platform that provides on-demand scaling using tools that can distribute science workloads in a user-friendly manner, time to science is minimized.

SCALABILITY, RELIABILITY, COSTS, AND USER EXPERIENCE
Our system is expected to scale both in the number of simultaneous users and to the demands of a single user's analysis. In the former case, JupyterHub and its built-in proxy can scale to hundreds of users, as their workload is limited to routing simple HTTP requests. In the latter case, data queries by individual users are expected to scale across many machines, allowing for fast querying and transformation of very large datasets. Section 4.1 summarizes tests to verify this claim.

Scaling Performance
We performed scaling tests to understand and quantify the performance of our system. We tested both the "strong scaling" and "weak scaling" aspects of a simple query. Strong scaling indicates how well a query with a fixed data size can be sped up by increasing the number of cores allocated to it. On the other hand, weak scaling indicates how well the query can scale to larger data sizes; it answers the question "can I process twice as much data in the same amount of time if I have twice as many cores?" Figure 6 shows the strong and weak scaling of a simple query, the sum of the "RA" column of a ZTF light curve catalog, which contains ∼3 × 10^9 rows, stored in Amazon S3. This catalog is described in more detail in Section 3.1.
In these experiments, speedup is computed as

    S = t_ref / t_N,    (1)

where t_ref is the time taken to execute the query with a reference number of cores while t_N is the time taken with N cores. For the weak scaling tests, scaled speedup is computed as

    S_scaled = (P_N / P_ref) × (t_ref / t_N),    (2)

which is the speedup scaled by the problem size P_N with respect to the reference problem size P_ref. We chose to scale the problem size directly with the number of cores allocated; the 96-core query had to scan the entire catalog, while the 1-core query had to scan only 1/96 of the catalog. Typically, the reference number of cores is 1 (sequential computing); however, we noticed anomalous scaling behavior at low numbers of cores, and so we set the reference to 16 in Fig. 6.
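The two speedup definitions translate directly into a few lines of code; a minimal sketch, with timing values that are illustrative rather than measured:

```python
def speedup(t_ref, t_n):
    # Strong scaling (Eq. 1): how much faster a fixed-size problem runs
    # with N cores relative to the reference core count.
    return t_ref / t_n

def scaled_speedup(t_ref, t_n, p_ref, p_n):
    # Weak scaling (Eq. 2): speedup scaled by the relative problem size,
    # where the problem grows proportionally with the core count.
    return (t_ref / t_n) * (p_n / p_ref)

# Illustrative values: a 16-core reference run takes 120 s; with 6x the
# cores the fixed-size query takes 24 s, and a 6x-larger query takes 125 s.
print(speedup(120.0, 24.0))                  # 5.0 (ideal would be 6.0)
print(scaled_speedup(120.0, 125.0, 16, 96))  # 5.76 (ideal would be 6.0)
```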
In our experiments, we used m5.large EC2 instances to host the Spark executor processes, each with 2 vCPU and 8 GiB of RAM allocated. The underlying CPU is an Intel Xeon Platinum 8000 series processor. The Spark driver process was started from a Jupyter notebook server running on a t3.xlarge EC2 instance with 4 vCPU and 16 GiB of RAM allocated to it, also backed by an Intel Xeon Platinum 8000 series processor. Single m5.large EC2 instances have a network bandwidth of 10 Gbit/s, while the t3.xlarge instance has a network bandwidth of 1 Gbit/s. Amazon S3 can sustain a bandwidth of up to 25 Gbit/s to individual Amazon EC2 instances. Both the data in S3 and all EC2 instances lie within the same AWS region, us-west-2. The m5.large EC2 instances were spread across three "availability zones" (separate AWS data centers): us-west-2a, us-west-2b, and us-west-2c. This configuration of heterogeneous instance types, network speeds, and separate instance locations represents a typical use-case of cloud computing and offers insight into the performance of this system under these "worst-case" placement conditions.
The weak scaling test showed that scaled speedup scales linearly with the number of cores provisioned for the query; twice the data can be processed in the same amount of time using twice the number of cores. In other words, for this query, the problem of "big data" is solved simply by using more cores. The strong scaling test showed expected behavior up to vCPU/16 = 5: speedup increased monotonically, with diminishing returns, as more cores were added. Speedup then dropped from 2.50 at vCPU/16 = 5 to 2.05 at vCPU/16 = 6, indicating no further speedup can be gained beyond vCPU/16 = 5. Drops in speedup in a strong scaling test are usually due to real-world limitations of the network connecting the distributed computers. As the number of cores increases, the number of simultaneous communications and the amount of data shuffled between the single Spark driver process and the many Spark executor processes increase, potentially reaching the latency and bandwidth limits of the network connecting these computers.

Caveats to Scalability
As mentioned in Section 2.4, the use of a shared NFS can limit scalability with respect to the number of simultaneous users. We recommend that administrators of new deployments of our platform consider the access pattern of user data and code on NFS to guarantee scalability to their desired number of users. Carefully designed hybrid models of code and data storage that utilize NFS, EFS, and the Docker image itself (stored on EBS) can be developed, and will likely allow the system to scale to access from hundreds of users.

Reliability
In general, the system is reliable if individual components (i.e. virtual machines or software applications) fail. Data stored in S3 are in practice 100% durable; at the time of writing, AWS quotes "99.999999999% durability of objects over a given year." Data stored in the EBS volume backing the NFS server are similarly durable: 99.999% at the time of writing. We choose to back up these data using Amazon EBS snapshots on a daily basis so we can recover the volume in the event of volume deletion or undesirable changes.

Figure 6. Speedup computed in strong scaling (left) and weak scaling (right) experiments of a simple Spark query that summed a single column of the ZTF catalog, ∼3 × 10^9 rows. Speedup is computed using Eq. 1 and scaled speedup is computed using Eq. 2. For each value of vCPU, the query was executed several (3+) times. For each trial, the runtime was measured and speedup calculated. Each point represents the mean value of speedup and error bars indicate the standard deviation. The first row shows speedup computed using sequential computing (vCPU = 1) to set the reference time and reference problem size. The second row shows speedup computed using 16 vCPU to set the reference. With sequential computing as the reference, we observe speedup that is abnormally high in both the strong and weak scaling case. By adjusting the reference point to vCPU = 16, we find that we can recover reasonable weak scaling results and expected strong scaling results for a small to medium number of cores. Using the adjusted reference, we observe in the strong scaling case diminishing returns in the speedup as the number of cores allocated to the query increases, as expected. The weak scaling shows optimistic results; the speedup scales linearly with the catalog size as expected.
Kubernetes as a scheduling tool is resilient to failures of individual applications. Application failures are resolved by rescheduling the application on the cluster, perhaps on another node, until a success state is reached. When the Kubernetes cluster autoscaler is used, the cluster also becomes resilient to the failure of individual nodes. Pods that are terminated by a node failure become unschedulable, which triggers the cluster autoscaler to scale the cluster up to restore its original size. For example, if the user's Jupyter notebook server is unexpectedly killed due to the loss of an EC2 instance, it will re-launch on another instance in the cluster, with loss of only the memory contents of the notebook server and the running state of kernels. The same is true of each of the individual JupyterHub and Spark components. Apache Spark is fault-tolerant by design, meaning a query can continue executing if one or all of the Spark executors are lost and restarted due to loss of the underlying nodes. Loss of the driver process (on the Jupyter notebook server), however, results in the complete loss of the query.
We have run different instances of this platform for approximately three years in support of science workloads at UW, the ZTF collaboration, a number of hackathons, and for the LSST science collaborations. Over that period, we have experienced no loss of data or nodes.

Costs
This section enumerates the costs associated with running this specific science platform. Since cloud computing costs can be variable over time, the costs associated with this science platform are not fixed. In this section, we report costs at the time of manuscript submission as well as general information about resource usage so costs can be recomputed by the reader at a later date.
We describe resource usage along two axes: interactive usage and core hours for data queries. Interactive usage encompasses using a Jupyter notebook server for making plots, running scripts and small simulations, and collaborating with others. Data queries encompass launching a distributed Spark cluster to access and analyze data provided on S3, similarly to the methods described in Sec. 3.3. Equation 3 provides a formula for computing expected monthly costs given the number of users N u , the cost of each user node C u , the cost of the Spark cluster nodes C s , the estimated time spent per week on the system t u , and the number of node hours used by each user for Spark queries in a month t s : Fixed in the equation are constants describing the amount (200 GB) and cost of ($0.08/GB/month) of EBS-backed storage allocated for each virtual machines. Additionally, the term (30/7) converts weekly costs to monthly costs. Node hours can be converted to core hours by multiplying t s by the number of cores per node. Table 2 enumerates the fixed costs of the system as well as the variable costs, calculated using Eq. 3, assuming different utilization scenarios, varying the number of users (N u ), the amount interactive usage per week (t u ), and amount of Spark query core hours each month (t s ). The fixed costs of the system total to $328.51/month, paying for:  Each of these costs are minimal, and so we don't include them in our analysis. However, they are worth mentioning because they can scale to become significant. Spark queries requiring GB/TB data shuffling between driver and executors should restrict themselves to a single availability zone to avoid the costs of (1). Costs from (2) are unavoidable, but care should be taken so no S3 requests occur between different AWS regions and between AWS and the internet. Finally, (3) can balloon in size if one allows arbitrary file transfers between Jupyter servers and the user or allows large data outputs to the browser. 
The number of core hours for queries is a parameter that will need to be calibrated using information about real-world usage of this type of platform. The upper-limit guess of 2048 core hours per user per month is roughly equivalent to each user running an analysis similar to that described in Sec. 3.3 each month. By monitoring interactive usage of our own platform and other computation tools, we estimate that realistic usage falls closer to the lower limits we provide; few users will use the platform continuously in an interactive manner, and even fewer will be frequently executing large queries using Spark.

Figure 7. A screenshot of the job timeline from the Spark UI when dynamic allocation is enabled. A long-running query is started, executing with a small number of executors. As the query continues, Spark adds exponentially more executors to the cluster at a user-specified interval until the query completes or the max number of executors is reached. Once the query completes (or is terminated, as shown here), the Spark executors are removed from the cluster.

Dynamic Scaling
Recent versions of Apache Spark provide support for "dynamic allocation" of Spark executors for a Spark cluster on Kubernetes. Dynamic allocation allows the Spark cluster to scale up to accommodate long-running queries and to scale down when no queries are running. Figure 7 shows this scaling process pictorially for a long-running query started by a user. This feature is expected to reduce costs associated with running Spark queries, since Spark executors are added and removed based on query status rather than at cluster creation. This means the virtual machines hosting the Spark executor processes will more often be free, either to host the Spark executors for another user's query or to be removed from the Kubernetes cluster completely.
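As a concrete sketch, dynamic allocation is enabled through standard Spark configuration properties; the executor counts and timeouts below are illustrative, not a tuned production configuration:

```properties
# Sketch of spark-defaults.conf settings for dynamic allocation on
# Kubernetes (Spark 3.x); executor limits here are illustrative.
spark.dynamicAllocation.enabled                   true
# There is no external shuffle service on Kubernetes, so shuffle files
# must be tracked to decide which executors are safe to remove:
spark.dynamicAllocation.shuffleTracking.enabled   true
spark.dynamicAllocation.minExecutors              0
spark.dynamicAllocation.maxExecutors              96
# How long tasks must be backlogged before new executors are requested:
spark.dynamicAllocation.schedulerBacklogTimeout   10s
# Idle executors are released after this timeout:
spark.dynamicAllocation.executorIdleTimeout       60s
```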

User Experience
While the experience of using the science platform is largely identical to using a local or remotely-hosted Jupyter notebook server, the use of containerized Jupyter notebook servers on a scalable compute resource introduces a few notable points of difference. First, as with any remotely-hosted Jupyter notebook, the filesystem exposed to the user has no direct connection to their personal computer, which can be unintuitive. File uploads and downloads can be facilitated through the Jupyter interface, but the process remains clunky. For streamlined file transfer, the user must fall back to an SSH client and utilities like scp or rsync. In future deployments of this system, it is likely that new user interfaces will need to be produced to maximize usability of the filesystem.
Additionally, in order to allow for scale-down of the cluster, notebook servers are typically shut down after a configurable period of inactivity using the jupyterhub-idle-culler service. A period of ∼1–8 hours is typical for deployments of this science platform. This has the positive effect of reducing costs, but at a detriment to the user experience. At the time of writing, inactivity is determined in terms of browser connectivity, so a user cannot expect to leave code running longer than the cull period (e.g., overnight). Juric et al. (2021) have implemented functionality to checkpoint the memory contents of the notebook server to disk before stopping, with the ability to restore the server to a running state at will. Such checkpoint/restore functionality solves the issue of interrupting running code when culling servers; however, it still does not allow code to run longer than the cull period. At the time of writing, additional functionality is being added to the jupyterhub-idle-culler service to allow fine-grained control over which servers are culled and when.
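For deployments based on the zero-to-jupyterhub Helm chart, this culling behavior is controlled through Helm values; a minimal sketch (the 4-hour timeout is one illustrative choice within the ∼1–8 hour range):

```yaml
# Sketch of zero-to-jupyterhub Helm values enabling the idle culler.
cull:
  enabled: true
  timeout: 14400   # seconds of inactivity before a server is culled (4 hr)
  every: 600       # how often (seconds) the culler checks for idle servers
```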
Finally, the underlying scalable architecture introduces server start-up latencies that are noticeable to the user. Virtual machines that host notebook servers and Spark cluster executors are requested from AWS on-demand by the user. The process of requesting new virtual machines from AWS, downloading the relevant Docker images to each machine, and starting the notebook/Spark Docker container can take up to ∼5 minutes. The user encounters this latency when logging onto the platform and requesting a server, and again when creating a distributed Spark cluster, as many machines are provisioned on-demand to run Spark executors. The log-in latency can be mitigated by keeping a small number of virtual machines in reserve so that an incoming user can be assigned to a node instantly. The zero-to-jupyterhub Helm chart implements this functionality through its user-placeholder option, which schedules placeholder servers on the Kubernetes cluster that are immediately evicted and replaced when a real user requests a server. Additionally, to avoid Docker image download times, relevant Docker images can be cached inside a custom-built virtual machine image (in AWS lingo, the Amazon Machine Image or AMI) that the virtual machine is started from. An alternative solution would be to place all incoming users on a shared machine, equivalent to a "log-in node", before moving them to a larger machine at the user's request or automatically once a new server is provisioned from the cloud provider, a process known as live migration. Juric et al. (2021) provides a path towards live migration of containerized Jupyter notebook servers, but this advanced functionality remains to be implemented with a JupyterHub deployment on Kubernetes.
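For reference, the user-placeholder mechanism is likewise configured through zero-to-jupyterhub Helm values; a minimal sketch (the replica count is an illustrative choice):

```yaml
# Sketch of zero-to-jupyterhub Helm values that keep warm capacity in
# reserve to hide log-in latency.
scheduling:
  userPlaceholder:
    enabled: true
    replicas: 2    # placeholder pods, evicted when real users arrive
```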

CONCLUSIONS
In this paper, we have described the architecture of a cloud-based science platform, as well as an implementation on Amazon Web Services that has been tested with data from the Zwicky Transient Facility. The system is shown to scale to, and allow parallel analysis of, O(10 TB) tabular, time-series-heavy datasets. It has enabled a science project that utilizes a 1 billion+ ZTF light curve catalog in full, while requiring minimal effort from domain scientists to scale their analysis from a single light curve to the full catalog. The system demonstrates the utility of elastic computing, the I/O capacity of the cloud, and distributed computing tools like Spark and AXS.
This work should be viewed in the context of exploring the feasibility of making more astronomical datasets available on cloud platforms, and providing services and platforms, such as the one described here, to combine and analyze them. Using this platform, it is both feasible and practical to perform large-scale cross-catalog analyses using any catalog uploaded to AWS S3 in the AXS-compatible format. This enables any catalog provider, whether large or small, to make their data available to the broad community via a simple upload. Additionally, other organizations can stand up their own services on the cloud, either use-case-specific services or broad platforms such as this one, to access the data using the same S3 storage API.
In this regime, the roles of data archive and data user can be further differentiated, to the benefit of the user and perhaps at reduced cost. Data archivers upload their data to the cloud and bear the cost of storage. These costs are manageable, even by small organizations; storing 1 TB of data in S3 costs ∼$25 per month. Costs scale dramatically, however, when considering datasets at the PB level and timescales extending over years: a 1 PB dataset will cost ∼$3,000,000 over 10 years, assuming storage costs do not decrease. At these scales, special pricing contracts may have to be negotiated between the cloud provider and the archive. Additional cost scales with the number of requests for the data and the amount of data transferred. So-called "requester-pays" pricing models, supported by some cloud providers, can offload access and data transfer costs to the user. A user, or perhaps an organization of users, can deploy a system like ours at reasonable cost to access the data in a given cloud. In this case, the cost of analysis decouples from the cost of storage: it is the user who controls the number of cores utilized for the analysis, and any additional ephemeral storage used for the analysis. It is easy to imagine the user, as a part of their grant, being