gadfly: A pandas-based Framework for Analyzing GADGET Simulation Data

Published 2016 October 12 © 2016. The Astronomical Society of the Pacific. All rights reserved.
Citation: Jacob A. Hummel 2016 PASP 128 114503. DOI: 10.1088/1538-3873/128/969/114503

Abstract

We present the first public release (v0.1) of the open-source gadget Dataframe Library: gadfly. The aim of this package is to leverage the capabilities of the broader python scientific computing ecosystem by providing tools for analyzing simulation data from the astrophysical simulation codes gadget and gizmo using pandas, a thoroughly documented, open-source library that provides high-performance, easy-to-use data structures and is quickly becoming the standard for data analysis in python. Gadfly is a framework for analyzing particle-based simulation data stored in the HDF5 format using pandas DataFrames. The package enables efficient memory management, includes utilities for unit handling, coordinate transformations, and parallel batch processing, and provides highly optimized routines for visualizing smoothed-particle hydrodynamics (SPH) data sets.


1. Introduction

In the past decade, astrophysical simulations have increased dramatically in both scale and sophistication, and the typical size of the data sets produced has grown accordingly. However, the software tools for analyzing such data sets have not kept pace, such that one of the primary barriers to exploratory investigation is simply manipulating the data. This problem is particularly acute for users of the popular smoothed particle hydrodynamics (SPH) code gadget (Springel et al. 2001; Springel 2005). Both gadget and gizmo (Hopkins 2015), which uses the same data storage format, are widely used to investigate a range of astrophysical problems; unfortunately, this also leads to fragmentation of the data storage format, as each research group modifies the output to suit its needs. This state of affairs has historically forced significant duplication of effort, with individual research groups independently developing their own analysis scripts to perform similar operations.

Fortunately, the issue of data management and analysis is not endemic to astronomy, and the resulting overlap with the needs of the broader scientific community and the industrial community at large provides a large pool of scientific software developers to tackle these common problems. In recent years, this broader community has settled on python as its programming language of choice due to its efficacy as a "glue" language and the rapid speed of development it allows. This has led to the development of a robust scientific software ecosystem, with packages for numerical data analysis like numpy (van der Walt et al. 2011), scipy (Jones et al. 2001), pandas (McKinney 2010), and scikit-image; matplotlib (Hunter 2007) and seaborn for plotting; scikit-learn for machine learning; and statistics and modeling packages like scikits-statsmodels, pymc, and emcee (Foreman-Mackey et al. 2013).

Python is quickly becoming the language of choice for astronomers as well, with the Astropy project (Robitaille et al. 2013) and its affiliated packages providing a coordinated set of tools implementing the core astronomy-specific functionality needed by researchers. Additionally, the development of flexible python packages like yt (Turk et al. 2011) and pynbody (Pontzen et al. 2013), capable of analyzing and visualizing astrophysical simulation data from several different simulation codes, has greatly improved the ability of computational researchers to perform useful, insight-generating analysis of their data sets.

Recently, the scientific python community has begun to converge on the DataFrame provided by the high-performance pandas data analysis library as a common data structure. As a result, once data is loaded into a DataFrame, it becomes much easier to take advantage of the powerful analysis tools provided by the broader scientific computing ecosystem. With this in mind, we present a pandas-based framework for analyzing gadget simulation data, gadfly: the GAdget DataFrame LibrarY. Rather than provide an alternative to the existing yt and pynbody projects, gadfly aims to ease interoperability with the python ecosystem at large, lowering the barrier to accessing the tools created by this broader community.

In this paper we present the first public release (v0.1) of gadfly, which is available at http://github.com/hummel/gadfly. The framework design and organizational structure are outlined in Section 2, followed by a description of the included SPH particle rendering in Section 3. Our plans for future development are outlined in Section 4, and a summary is provided in Section 5.

2. A Framework Built on Pandas

There are several motivations for building an analysis framework around the pandas DataFrame. The guiding principle underlying the design of this framework is to enable exploratory investigation. This requires both intelligent memory management for handling out-of-core data sets and a robust indexing system to ensure that particle properties do not become misaligned in memory. Using the pandas DataFrame as the primary data container, rather than separate numpy arrays, makes it much easier to keep different particle properties indexed correctly while still affording the flexibility to load and remove data from memory at will. In addition, pandas itself is a thoroughly documented, open-source, BSD-licensed library providing high-performance, easy-to-use data structures and analysis tools, with a strong community of developers working to improve it. More broadly, as pandas is becoming the de facto standard for data analysis in python, building on it simplifies interoperability with the rest of the available tools.

Gadfly is designed for use with simulation data stored in the HDF5 format. While we otherwise aim to keep gadfly as general as possible, some assumptions about the storage format are necessary. Each particle type is expected to be contained in a different HDF5 group, labeled PartType0, PartType1, etc.; a Header group is also expected, containing metadata for the simulation snapshot as HDF5 attributes. All particles are expected to have the following fields defined: particle IDs, masses, coordinates, and velocities. SPH particles are additionally expected to have a smoothing length, density, and internal energy. Additional fields not included in the original gadget specification, such as chemical abundances, are also supported.
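For concreteness, this expected layout can be inspected directly with h5py; a minimal sketch follows (the file name is hypothetical, and the field names shown are the standard gadget HDF5 names):

```python
import h5py

# Open a GADGET/GIZMO snapshot (file name is hypothetical).
with h5py.File("snapshot_100.hdf5", "r") as f:
    # Snapshot metadata is stored as HDF5 attributes on the Header group.
    redshift = f["Header"].attrs["Redshift"]
    # Each particle type lives in its own group: PartType0 (gas),
    # PartType1 (dark matter), and so on.
    gas = f["PartType0"]
    # Every particle type carries IDs, masses, coordinates, and velocities;
    # SPH particles also carry a smoothing length, density, and internal energy.
    ids = gas["ParticleIDs"][:]
    rho = gas["Density"][:]
    hsml = gas["SmoothingLength"][:]
```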

Here, we provide an overview of the design and capabilities of the gadfly framework, including the Simulation, Snapshot, and PartType DataFrame objects at the core of gadfly (Section 2.1), the usage of which is demonstrated in Figure 1. Our approach to file access and intelligent memory management (Section 2.2), our handling of unit conversions (Section 2.3) and coordinate transformations (Section 2.4), and the included utilities for parallel batch processing (Section 2.5) are also described.

Figure 1. Initializing a simulation, defining a PartType data set, and loading data.
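Since Figure 1 ships as an image, the sketch below illustrates the kind of workflow it depicts. Simulation and load_snapshot() are described in Sections 2.1.2-2.1.3; the path and the field-loading method names are hypothetical stand-ins, not necessarily gadfly's exact API:

```python
import gadfly

# Initialize the simulation; this sets the defaults (file paths, units,
# coordinate system) used by all subsequent analysis (Section 2.1.3).
sim = gadfly.Simulation("~/sims/minihalo/snapshot")  # path is hypothetical

# Load a single snapshot file (Section 2.1.2).
snap = sim.load_snapshot(100)

# Load selected fields into the SPH PartType dataframe; the attribute and
# method names here are illustrative.
gas = snap.gas
gas.load_coords()
gas.load_density()

# PartType dataframes retain full pandas functionality, and slicing
# returns a standard pandas.DataFrame (Section 2.1.1).
dense_gas = gas[gas["density"] > 1e-20]
```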

2.1. Organizational Structure

2.1.1.  PartType Dataframes

Data for each particle type (e.g., dark matter, gas) is stored in a separate PartType dataframe and indexed by particle ID number. Individual fields can be loaded into the dataframes and deleted at will, with coherent slicing across multiple data fields, courtesy of pandas. The base PartType objects, PartTypeNbody and PartTypeSPH, are standard pandas dataframes with additional functionality for loading standard gadget particle fields from HDF5, converting units, and performing coordinate transformations. As such, PartType dataframes retain all the capabilities of the pandas.DataFrame from which they inherit, and any operation that creates a new dataframe instance returns a standard pandas dataframe.

Nonstandard particle fields (e.g., chemical abundances) can be easily loaded into gadfly as well; fields loaded in this manner simply lack automated unit conversion. Alternatively, a custom dataframe class inheriting from PartTypeNbody or PartTypeSPH as appropriate can be defined to provide methods for loading both nonstandard particle fields and additional derived properties (e.g., temperature). An example of such a custom class—PartTypeCustom.py—is provided in the examples distributed with gadfly, and the usage of such a custom class is demonstrated in Figure 2.

Figure 2. Loading non-standard particle fields and defining custom PartType classes. Within gadfly this is straightforward.
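Figure 2 likewise ships as an image; the following is a minimal sketch of what such a custom class might look like, with illustrative method, field, and import names (the actual example is distributed as PartTypeCustom.py):

```python
from gadfly import PartTypeSPH  # import path assumed for illustration

class PartTypeChemistry(PartTypeSPH):
    """SPH particles extended with chemistry fields and derived properties."""

    def load_h2_fraction(self):
        # Nonstandard fields lack automated unit conversion, so this simply
        # reads the raw HDF5 field into the dataframe (names illustrative).
        self.load_field("ChemicalAbundances", column="h2frac")

    def calculate_temperature(self):
        # Derived property computed from fields already in memory, assuming
        # an ideal gas with internal energy u in cgs units; the mean
        # molecular weight mu is a placeholder value.
        gamma, mu = 5.0 / 3.0, 1.22
        k_B, m_H = 1.3807e-16, 1.6726e-24  # erg/K, g
        self["temperature"] = (gamma - 1.0) * mu * m_H * self["u"] / k_B
```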

2.1.2.  Snapshot

Each Snapshot object represents a single HDF5 snapshot file on disk. File access, provided by h5py (Collette 2008), is handled via the Snapshot object, and the actual particle data is gathered here, loaded as needed into a separate PartType dataframe for each particle type (Section 2.1.1). The information contained in the gadget header is also maintained here in a Header object, along with access to the additional metadata and unit information stored in the relevant Simulation object.

2.1.3.  Simulation

Metadata relevant to the simulation as a whole, such as file paths and unit information, is stored in a Simulation object. Initializing a Simulation object is the first step in any analysis using gadfly, as this is where the defaults for all subsequent analysis are set, including locating all relevant snapshot files, choosing a coordinate system, and setting the field names of the default particle properties expected by gadfly. Snapshots are loaded via Simulation.load_snapshot(), and the parallel batch processing functionality described in Section 2.5 is implemented at this level as well.

2.2. Memory Management and File Access

One of the primary goals of the gadfly project is to enable the analysis of large data sets on machines with limited memory. This requires intelligent memory management, loading only the particle data of interest from disk. Fortunately, the HDF5 format is well suited to such non-contiguous file access, allowing not only individual data fields to be accessed independently, but also select entries to be loaded from the field in question.

Gadfly employs two complementary approaches to minimizing the memory footprint. The first relies on an optional, user-defined refinement criterion, such as selecting particles above a given density threshold. The resulting "refined" index can then be used to select only the corresponding values from subsequent particle fields as they are loaded. While this method efficiently minimizes I/O operations, it is fairly rigid: attempting to load additional fields into a dataframe from which particles have been manually dropped will fail, as the particle indices will no longer match. As such, this approach is poorly suited to exploratory analysis, where the proper refinement criterion may not be known a priori, and is best suited for use in scripts where the analysis to be performed is well defined.
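A sketch of the idea behind the refined index, expressed directly in h5py/numpy terms rather than through gadfly's interface (the file name and threshold are hypothetical):

```python
import h5py
import numpy as np

with h5py.File("snapshot_100.hdf5", "r") as f:
    gas = f["PartType0"]
    # Define the refinement criterion on a single field...
    refined = np.nonzero(gas["Density"][:] > 1e-24)[0]
    # ...then read only the matching entries of each subsequent field.
    # HDF5 supports this kind of non-contiguous access natively, so the
    # unselected particles are never loaded into memory.
    ids = gas["ParticleIDs"][refined]
    u = gas["InternalEnergy"][refined]
```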

To mitigate the indexing issues that can arise in situations where data is loaded incrementally, gadfly performs an intermediate step, first loading new fields into a pandas.Series data structure, using the particle ID numbers as an index. This allows pandas to properly align the particle fields, dropping any particles not in the existing PartType dataframe from the newly loaded field as it is appended. This approach, which can be used in tandem with the refinement index, affords gadfly the flexibility needed to allow incremental manual refinement of the data stored in memory. Additional cuts can be made as subsequent fields are loaded, resulting in the selection of a precisely targeted primary data set from which derived properties (e.g., temperature) may be calculated, which serves to reduce computational overhead as well.
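This alignment step relies entirely on standard pandas index semantics; a minimal illustration:

```python
import pandas as pd

# An existing dataframe holding three particles, indexed by particle ID.
df = pd.DataFrame({"density": [1.0, 2.0, 3.0]}, index=[11, 42, 7])

# A newly loaded field covering a larger set of particles.
new_field = pd.Series([0.1, 0.2, 0.3, 0.4], index=[7, 11, 23, 42])

# On assignment, pandas aligns on the index: particle 23 is dropped from
# the new field because it is absent from the existing dataframe.
df["temperature"] = new_field
print(df)
#     density  temperature
# 11      1.0          0.2
# 42      2.0          0.4
# 7       3.0          0.1
```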

2.3. Units

Gadfly implements a minimal unit-handling system for converting between the native code units in which gadget stores data and the physical units of interest to astronomers. Gadget's code units may be modified to suit the problem at hand, and therefore must be specified at initialization if they differ from the defaults listed in Table 1. Conversion from code units to the physical unit system of choice is handled at load time. The default units for length, time, mass, etc. can be specified at initialization, along with whether to use physical or comoving coordinates (for cosmological simulations) and whether to factor the Hubble constant out of the reported units (i.e., units of Mpc versus Mpc/h). While defaults are set at startup, they can be modified at any time, either by calling units.set_length(), which alters the default length unit for all subsequently loaded fields, or by calling a particle field's load function with the optional units keyword argument. The units of an existing field can be changed by simply reloading it; the field will remain properly indexed as described in Section 2.2. Examples of both approaches to unit conversion are shown in Figure 3.

Figure 3. Unit handling in gadfly.
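Figure 3 also ships as an image; below is a hedged sketch of the two routes it illustrates, continuing from the Figure 1 sketch above (units.set_length() and the units keyword are described in the text; the access path and argument forms are illustrative assumptions):

```python
# Route 1: change the default length unit for all subsequently loaded
# fields (access path and argument form assumed).
sim.units.set_length("kpc")
gas.load_coords()

# Route 2: override the default for a single field via the optional
# units keyword on that field's load function.
gas.load_coords(units="cm")

# Changing the units of an existing field is done by simply reloading it;
# the index alignment of Section 2.2 keeps the dataframe consistent.
```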

Table 1. Default Code Units Expected by gadfly

Unit        Conversion to cgs
Mass        1.989 × 10^43 g
Length      3.085678 × 10^21 cm
Velocity    1.0 × 10^5 cm s^-1

2.4. Coordinate Transformations

A full suite of coordinate transformation utilities is included in gadfly, with methods for converting between Cartesian, spherical, and cylindrical coordinates, performing linear coordinate translations, and rotating about arbitrary axes. These methods are accessible both as independent library functions and via the PartType dataframe object through the .orient_box() functionality.
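As an illustration of the underlying math (not gadfly's own function names), a Cartesian-to-spherical conversion in plain numpy:

```python
import numpy as np

def cartesian_to_spherical(xyz):
    """Convert an (N, 3) array of Cartesian coordinates to (r, theta, phi)."""
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)
    r_safe = np.where(r > 0, r, 1.0)  # avoid division by zero at the origin
    theta = np.arccos(np.clip(z / r_safe, -1.0, 1.0))  # polar angle
    phi = np.arctan2(y, x)                             # azimuthal angle
    return np.column_stack((r, theta, phi))
```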

2.5. Parallel Batch Processing

Investigating the results of a simulation often requires running an identical analysis on many snapshots in order to study how the system changes with time. These operations are typically completely independent, and thus amenable to parallelization—so long as the machine being used has sufficient memory. To aid in the parallelization of such analyses, gadfly includes utilities for performing parallel batch processing jobs by making use of python's multiprocessing module.

One issue with performing such operations in parallel is that they are often I/O-bound. To mitigate the problems that arise when multiple processes attempt to read large amounts of data from disk simultaneously, gadfly's parallelization utilities are designed to load the data needed for a given analysis serially, in a separate thread from the main execution. As the necessary data from each snapshot file is loaded, it is passed to a pool of worker processes, which then perform the remainder of the analysis in parallel. An example of such a parallel batch processing script is included in the examples distributed with gadfly.
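A schematic of this pattern using only the standard library; gadfly's actual utilities wrap this at the Simulation level, and the function names below are illustrative stand-ins:

```python
import multiprocessing as mp
import queue
import threading

import numpy as np

def analyze(density):
    """Stand-in for the per-snapshot analysis, run in a worker process."""
    return density.max()

def load(fname):
    """Stand-in for reading one HDF5 snapshot; here it fabricates data."""
    return np.random.random(10**6)

def run(snapshot_files, nproc=4):
    q = queue.Queue(maxsize=2)  # a small buffer keeps memory use bounded

    def loader():
        # A single thread reads snapshots serially, so the worker processes
        # never contend with one another for disk I/O.
        for fname in snapshot_files:
            q.put(load(fname))

    t = threading.Thread(target=loader)
    t.start()
    with mp.Pool(nproc) as pool:
        # Hand each snapshot's data to the pool as it becomes available.
        results = [pool.apply_async(analyze, (q.get(),))
                   for _ in snapshot_files]
        out = [r.get() for r in results]
    t.join()
    return out

if __name__ == "__main__":
    print(run(["snap_%03d.hdf5" % i for i in range(8)]))
```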

3. Visualization

Visualization plays a major role in any analysis of simulation data, and the ability to directly visualize the spatial structure and evolution of a system is often crucial to generating insight. While the guiding principle of the gadfly project is to avoid reinventing the wheel, instead relying on the tools developed by the broader scientific python community wherever possible, SPH particle rendering is one critical area where that broader ecosystem proves insufficient. To mitigate this shortcoming, gadfly includes a minimal library of visualization tools.

The particles in an SPH simulation are best thought of as fluid elements sampling the continuum properties of the gas they represent (Gingold & Monaghan 1977; Lucy 1977; Monaghan 1992; Springel 2010). They accomplish this by serving as Lagrangian tracers over which the continuum properties are interpolated using a smoothing kernel W. While it is possible to use alternative kernels, most modern SPH implementations (including gadget) utilize a cubic spline kernel (Springel 2014):

$$W(r, h_{\rm s}) = \frac{8}{\pi h_{\rm s}^{3}} \begin{cases} 1 - 6\,(r/h_{\rm s})^{2} + 6\,(r/h_{\rm s})^{3}, & 0 \le r/h_{\rm s} \le \tfrac{1}{2}, \\ 2\,(1 - r/h_{\rm s})^{3}, & \tfrac{1}{2} < r/h_{\rm s} \le 1, \\ 0, & r/h_{\rm s} > 1, \end{cases} \tag{1}$$

where r is the radius and ${h}_{{\rm{s}}}$ is the characteristic width of the kernel, otherwise known as the smoothing length. The physical density at any point, $\rho ({\boldsymbol{r}})$, is then represented by the sum over all particles,

$$\rho({\boldsymbol r}) = \sum_{j} m_{j}\, W(|{\boldsymbol r} - {\boldsymbol r}_{j}|, h_{j}), \tag{2}$$

where $m_j$ is the mass of particle j, located at ${{\boldsymbol{r}}}_{j}$, and ${h}_{j}$ is its smoothing length. As such, creating visualizations that faithfully reproduce the actual density distribution requires performing this sum over all particles of interest; this can be quite computationally intensive, depending on the number of particles involved and the desired resolution. The SPH particle rendering algorithm included in gadfly performs this summation over two dimensions, producing a density-weighted projection. An example of such a visualization produced by gadfly is shown in Figure 4.

Figure 4. Example of a visualization produced by gadfly. This particular image shows the density distribution in a minihalo forming at z = 25 on $1\,{\rm{kpc}}$ scales, and is from a simulation run for Hummel et al. (2015).
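A direct, unoptimized transcription of this projection in pure python/numpy is sketched below; it evaluates the kernel of Equation (1) for each particle on a 2D grid. Gadfly's optimized implementations, described next, follow the same logic (note that a rigorous column-density projection would integrate the kernel along the line of sight; this sketch simply evaluates the 3D-normalized kernel in the projection plane):

```python
import numpy as np

def cubic_spline(q):
    """Unnormalized cubic spline kernel of Equation (1), with q = r / h."""
    w = np.zeros_like(q)
    inner = q <= 0.5
    outer = (q > 0.5) & (q <= 1.0)
    w[inner] = 1.0 - 6.0 * q[inner]**2 + 6.0 * q[inner]**3
    w[outer] = 2.0 * (1.0 - q[outer])**3
    return w

def project_density(x, y, m, h, extent, npix=512):
    """Density-weighted projection of SPH particles onto an npix x npix grid."""
    xmin, xmax, ymin, ymax = extent
    X, Y = np.meshgrid(np.linspace(xmin, xmax, npix),
                       np.linspace(ymin, ymax, npix))
    image = np.zeros((npix, npix))
    # Direct summation of Equation (2) at every pixel; O(N_particles * npix^2).
    for xj, yj, mj, hj in zip(x, y, m, h):
        q = np.sqrt((X - xj)**2 + (Y - yj)**2) / hj
        image += mj * (8.0 / (np.pi * hj**3)) * cubic_spline(q)
    return image
```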

Gadfly includes three separate implementations of this algorithm, each of which is best suited to different conditions:

  1. The primary implementation is derived from Navratil et al. (2007): written in C, parallelized using OpenMP, and wrapped in python using scipy.weave. This method must be compiled locally, and will fail if the python interpreter cannot locate a C compiler or if the requisite libraries are not installed. However, it is the most powerful implementation, best suited to rendering many particles and to machines with many processors available.
  2. A second implementation makes use of numba (Lam et al. 2015) to perform just-in-time compilation of pure python code using the LLVM compiler infrastructure (Lattner & Adve 2004). The resulting serial routine is highly optimized, providing performance within a factor of two of the parallel method on a 16-core machine. As such, this method is preferable on smaller machines with fewer processors, and for visualizations including fewer particles, where the additional overhead associated with the parallel implementation is significant.
  3. The final implementation is a pure python routine included only for situations where the other methods have unmet dependencies. It is over 500 times slower than the others, and as such is suitable only for visualizing small numbers of particles, or as a last resort.

These methods are all available as independent library functions, and can be used either with particle data in a PartType dataframe or independently of the rest of the gadfly framework, with data stored in numpy arrays. Future versions of gadfly will greatly simplify the visualization of data stored in a PartType dataframe through the use of a .visualize() method (see Section 4).

4. Future Development

The gadfly package is under active development, and in addition to incremental improvements of the existing functionality, a few significant upgrades are planned for future releases. First and foremost, the current gadfly units system is fairly inflexible, with limited support for additional units. For the next release we plan to replace the existing units system with the far more versatile astropy.units framework (Robitaille et al. 2013) as the backend for handling unit conversions. Future releases will also more seamlessly integrate gadfly's visualization tools with the rest of the framework. At present, the user must pass each field required by the visualization methods individually. We intend to integrate this functionality directly into the PartType dataframe, allowing the user to simply call PartType.visualize() and be presented with a default density projection of the SPH particles currently in the dataframe, similar to the .plot() functionality of a standard pandas.DataFrame.
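For reference, the astropy.units framework handles exactly this sort of conversion natively; for example, gadget's default length unit from Table 1 round-trips to a kiloparsec:

```python
from astropy import units as u

length = 3.085678e21 * u.cm       # gadget's default length unit (Table 1)
print(length.to(u.kpc))           # -> 1.0 kpc (to within rounding)

velocity = 1.0e5 * u.cm / u.s     # gadget's default velocity unit
print(velocity.to(u.km / u.s))    # -> 1.0 km / s
```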

In the longer term, the dask project presents an intriguing option for handling very large out-of-core data sets through the use of blocked algorithms and task scheduling. Dask manages this by mapping high-level numpy, pandas, and list operations on large data sets to many operations on smaller chunked data sets that can fit in memory. Future releases of gadfly may migrate the data structure on which the PartType dataframe is built from the pandas.DataFrame to dask.dataframe to take advantage of this functionality.
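As a rough illustration of the blocked-algorithm idea (not gadfly functionality), dask.dataframe mirrors the pandas API while splitting the data into partitions and evaluating lazily:

```python
import dask.dataframe as dd
import numpy as np
import pandas as pd

# An in-memory pandas dataframe, split by dask into eight partitions; for
# truly out-of-core data, the partitions would instead be loaded on demand.
pdf = pd.DataFrame({"density": np.random.random(10**6)})
ddf = dd.from_pandas(pdf, npartitions=8)

# Operations build a task graph over the chunks and only run on .compute().
mean_density = ddf["density"].mean().compute()
```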

5. Summary

In this paper we have presented the first public release of gadfly. We have described the framework's structure, which is designed around three core abstractions: the PartType dataframe (Section 2.1.1), the Snapshot object (Section 2.1.2), and the Simulation object (Section 2.1.3). Additional functionality includes intelligent memory management and file access (Section 2.2), basic unit handling (Section 2.3), coordinate transformations (Section 2.4), utilities for parallel batch processing (Section 2.5), and SPH particle visualization (Section 3).

Gadfly is fully open-source, is released under the MIT License, and can be downloaded and installed from its repository at http://github.com/hummel/gadfly. Users can submit bug reports via GitHub and, if they know how to fix them, are welcome to submit pull requests.

Thanks to Jason Jaacks and Alex Fitts for providing additional test data and bug reports, and to Thomas Robitaille for advice on creating the included figures.
