Scaling neural simulations in STACS

As modern neuroscience tools acquire more details about the brain, the need to move towards biological-scale neural simulations continues to grow. However, effective simulations at scale remain a challenge. Beyond just the tooling required to enable parallel execution, there is also the unique structure of the synaptic interconnectivity, which is globally sparse but has relatively high connection density and non-local interactions per neuron. There are also various practicalities to consider in high performance computing applications, such as the need for serializing neural networks to support potentially long-running simulations that require checkpoint-restart. Although acceleration on neuromorphic hardware is also a possibility, development in this space can be difficult as hardware support tends to vary between platforms and software support for larger scale models also tends to be limited. In this paper, we focus our attention on Simulation Tool for Asynchronous Cortical Streams (STACS), a spiking neural network simulator that leverages the Charm++ parallel programming framework, with the goal of supporting biological-scale simulations as well as interoperability between platforms. Central to these goals is the implementation of scalable data structures suitable for efficiently distributing a network across parallel partitions. Here, we discuss a straightforward extension of a parallel data format with a history of use in graph partitioners, which also serves as a portable intermediate representation for different neuromorphic backends. We perform scaling studies on the Summit supercomputer, examining the capabilities of STACS in terms of network build and storage, partitioning, and execution. We highlight how a suitably partitioned, spatially dependent synaptic structure introduces a communication workload well-suited to the multicast communication supported by Charm++. We evaluate the strong and weak scaling behavior for networks on the order of millions of neurons and billions of synapses, and show that STACS achieves competitive levels of parallel efficiency.


Introduction
Modeling and simulation have been an integral part of neuroscience research since the earliest computer systems were designed.However, large high-performance computing (HPC) systems have only achieved limited success for neuroscience tasks such as full-scale neural circuit modeling, despite their critical importance in other biological fields (ranging from drug discovery and biochemistry to ecology).Much of the debate around large-scale neural simulations has reasonably focused on the empirical question of whether there is sufficient data to model the brain; however, neuroscience is experiencing exciting innovations in large-scale data collection from both structural and functional imaging tools [1].For this reason, attention is again turning to modeling these systems at scale.However, the best approach to parallelizing neural simulations at biologically-realistic scales remains an area of open research [2].
At first glance, today's largest HPC systems should be sufficient for human-scale brain simulations.Viewed as a graph, the human brain consists of roughly 100 billion neurons (i.e.vertices) and nearly a quadrillion synapses (i.e.edges).Biological speeds are also slow, with neuron action potentials, or 'spikes' being roughly 1 millisecond in length, many orders of magnitude below the gigahertz speeds of computers.While neuron and synapse models vary considerably in detail (e.g.computational complexity to match the biophysics), there is also considerable research in developing reduced order models for 'point-neuron' simulations [3], where the focus is on network-level behavior.
There has also been a growth in the development of neuromorphic computing platforms, which employ specialized hardware to accelerate the execution of spiking neural network (SNN) models.Alongside the platforms, there have also been expansions to the software frameworks, network simulators, and engineering tools available to the community [4][5][6][7][8][9][10][11].In the software space here as well, there has also been a continual push toward larger scale experiments in line with the advancements in HPC and use of accelerators such as graphics processing units (GPU) [12][13][14][15].
Of course, there are many challenges to scaling up network models on both traditional or emerging hardware.Unlike computations for many other large physical systems that are dominated by local interactions and heavy arithmetic workloads or artificial neural networks that can largely be reduced to tensor processing, biological neural simulations are more similar to graph analytics applications where data locality is an issue and timestep-to-timestep communication demands are unpredictable and irregular [16].These challenges are largely due to the structure of synaptic interconnectivity, which is globally sparse but has a relatively high connection density and non-local interactions of neurons.Furthermore, neural activity is also generally sparse and chaotic, with any given neuron only spiking (i.e.producing discrete event messages to be communicated) at a rate of less than 1 Hz on average in typical models [17], and with the subset of neurons spiking at any given timestep unknown.On the other hand, at a network level, there is a significant volume of spikes that still need to be communicated across the synaptic structure at any given time step.Even then, the volume may fluctuate at a system level, such as with oscillatory behavior in terms of spike density (e.g.theta [~8 Hz] or gamma oscillations [~80 Hz]).
In this work, we explore the parallel capabilities of the SNN simulator, Simulation Tool for Asynchronous Cortical Streams (STACS) [15], which seeks to address some of these challenges to scaling neural simulations.STACS was developed to be parallel from the ground up, building on top of the Charm++ parallel programming framework, which expresses a paradigm of asynchronous message-driven parallel objects [18,19].Here, STACS takes advantage of the multicast communication pattern supported by Charm++ to match the irregular communication workload of biological scale models.In addition to the parallel runtime, STACS also leverages a distributed network data structure for network construction, simulation, and serialization.Built on top of the distributed compressed sparse row (dCSR) format used in graph partitioners, the spiking neural network extension to distributed compressed sparse row (SNN-dCSR) also supports partitioning a network model along its connectivity structure to facilitate more efficient communication by reducing the volume of spikes that are exchanged across compute resources.As a data format for serialization, it also serves as a portable intermediate representation for interoperability between tools within the neural computing ecosystem and for different neuromorphic backends.The code repository for STACS may be found at https://github.com/sandialabs/STACS,and is developed by the authors.
We evaluate the scaling behavior of STACS on the Summit HPC system at Oak Ridge National Laboratory.We look at both strong and weak scaling for network models on the order of millions of neurons and billions of synapses.From our experiments, we show that STACS can effectively benefit from the increased processing resources provided by parallelization.With respect to our distributed network data structure, we demonstrate a linear scaling for parallel network construction, partitioning, and storage.For network simulation, we also performed a comparative study with a community standard neural simulator, NEural Simulation Tool (NEST), which uses a simpler communication approach (all-to-all) [20].We found that despite the increased communication burden of larger simulations, STACS was able to maintain attractive levels of parallel efficiency.
In section 2, we provide a brief overview of related large-scale neural simulation tools, as well as some additional background of STACS.In section 3, we dive into the importance of spatial partitioning as it relates to communication, and put forward our distributed data format.In section 4, we provide details of our scaling study, such as the network model, our experimental environment, and methods.In section 5, we present the scaling results and compare against related work.In section 6, we offer some discussion and opportunities for future work and integration.

Neural simulation
The complexity of neural simulations has long been influenced by the need for abstraction to mitigate high computational costs of biophysically-detailed models.Although the level of neural detail that is required within a model to inform the understanding a given brain region can vary considerably depending on the study, several modeling considerations can increase the need for more computational resources.These include more complex neuron and synapse models (e.g.multi-compartment neurons, biophysical realism), scope (e.g.single region or multiple region), bigger simulations (e.g. more neurons), behaviors (e.g.random inputs or task-driven inputs) and longer simulation times that allow examination of plasticity across time-scales.Each of these dimensions of complexity presents distinct costs in terms of processing, memory, and communication that have made forecasting the HPC requirements for use in neuroscience difficult [21].
While current simulations still fall well short of the brain's immense complexity and scale, in recent years biological neural simulations have been increasing in scale due both to rapid increase of experimental data and powerful computing resources and tools.The flagship effort of the Human Brain Project by Markram, et al used the NEURON simulator to investigate a full-scale model of a somatosensory cortical column, requiring substantial computational resources for simulating the highly sophisticated biophysics models of roughly 30 thousand neurons and 40 million synapses [22].Using much simpler neuron models and foregoing any synaptic plasticity, but also exhibiting slightly larger scale, a 'point-neuron' network organized as sets of coupled excitatory-inhibitory reservoirs and consisting on the order of 80 thousand neurons and roughly 0.3 billion synapses has been developed as a model the cortical microcircuit [23], and simulated using NEST on HPC systems [24].More recently, NEST has also been used to simulate a scaffold model of the CA1 area of the hippocampus brain region consisting of 5.28 million neurons and 40 billion synapses with complex spatial connectivity patterns [25].

Simulation tools
In addition to computational neuroscience studies such as in the above use cases, efforts to design performant simulation tools for SNNs have also gained momentum with the rapid rise in the field of neuromorphic computing.In fact, there exists a quite a large selection of available tools which are suitable for different types of workloads, which range from randomly connected networks, networks with structured delays, and even the layered networks (e.g.feed-forward) found in deep learning style models.Different tools also aim to mimic the dynamics of neurons and synapses to different levels of biological accuracy [26].
For example, simulators such as NEST and Brian2 are suitable for modeling SNNs with structured delays and flexible connectivity patterns and provide support for different dynamics of spiking neurons [6,20].These dynamics are represented as having point-like dynamics (as opposed to having complex morphologies) with discrete event-like spike messages passing between them.Simulators such as Brian2GeNN and BindsNET, while having the capability to leverage GPUs for accelerated compute, restrict the flexibility in defining the network topology, and lack the support for temporal dynamics such as delays in their network models [27,28].
A recent study looked at a selection of these available simulators, such as Brian2, Brian2GeNN, Nengo, NEST, and BindsNET, and compared them in terms of their scalability and runtime in simulating diverse SNN workloads [29].Simulation tools were also evaluated on different platforms such as workstation, multi-threaded, and multi-node systems, including the Summit supercomputer.In this work, it was shown that many simulators still lacked the capability to be able to simulate full-scale cortical models and leverage HPC capabilities, especially among the GPU-based simulators that typically focus on single-node.Amongst these, the popular NEST simulator was found to be capable of extending its framework onto both multi-core and multi-node systems [20].Apart from this study, we also note that more specialized simulators such as MONET have also been developed for multi-node systems [30].
Alongside the expansion of frameworks and tools is the question of their compatibility and interoperability, and a common thread of discussion is how to adequately compare and benchmark between different simulators and potential target neuromorphic platforms [29,31,32].Efforts such as Fugu have attempted to resolve the interoperability between a spiking neural algorithm and different neurormophic backends by employing only the most widely supported neuron and synapse models [7].More recently, efforts such as NIR (neuromorphic intermediate representation) aims to provide a standard for describing network models and moving between computing platforms, whether in simulation or on hardware [33].Another related work is the Scalable Open Network Architecture TemplAte data format for large-scale network models and simulation output, which leverages common file formats (e.g.csv, hdf5, json) and table-based data structures [34].

STACS
For STACS (Simulation Tool for Asynchronous Cortical Streams) [15], we employ a partition-based data format.In the context of large-scale simulations, this means each parallel process is only responsible for its own partition of state, which becomes especially useful when the size of an network exceeds the memory resources of a single compute node.This partitioning also naturally transfers to neuromorphic computing platforms which consist of multiple interconnected chips.We describe this format in more detail in section 3.2, but in summary, the SNN-dCSR format organizes additional network information, such as neuron and synapse state, in alignment to their corresponding adjacency lists.As a relatively straightforward data structure with a history of use in graph partitioners, it also serves as a portable intermediate representation for different neuromorphic backends where a network may need to be split across multiple chips.
STACS was developed as a parallel simulator from the ground up, and is built on top of the Charm++ parallel programming framework [19].Charm++ expresses parallelism within a paradigm of message-driven migratable parallel objects (called chares), and provides parallel application support through its adaptive runtime system, asynchronous communication methods over a variety of interconnects, and system portability.In an HPC context, serialization to the SNN-dCSR format supports STACS application capabilities by providing fault-tolerant computing through checkpointing-restarting, the potential for offline analysis, and parallel graph distribution through partitioning tools.STACS also opts to serialize to plain-text files for portability reasons.Even though this is generally less memory efficient for on-disk storage, it enables external, potentially parallel, tools to interface with the data format through straightforward file parsers.
As with many other neural simulators, STACS supports the simulation of user-defined neuron and synapse dynamics models and their network connectivity structure.This is formulated as a graph of stochastic differential equations with time-driven computation and event-driven communication, where neuron models correspond to the vertices, and the synapse models connecting neurons correspond to the edges.The network as a whole is partitioned to a chare array, and the data dependencies to each partition are determined for coordination.Because vertex distribution information may be found from the partition-based offsets, it is straightforward to compute the source or target partition on any given incoming or outgoing edge, respectively.Although the resulting neighborhoods are dependent on the adjacency structure, the process is conceptually similar to determining ghost regions that are shared between neighboring parallel processes of more regular HPC applications.
Spike events that are generated on a given partition are aggregated and serialized into Charm++ messages and sent to the subset of chares within the array that composes its communication neighborhood.This outgoing communication leverages Charm++'s multicast method [35], where a communication tree is generated from the chare array subset so as to minimize the amount of messages to be sent across the HPC system interconnect as a whole.Although more related to hardware implementations, earlier studies have shown that multicast tends to also be suitable for SNN architectures [36].To take advantage of this multicast communication pattern, a requirement of the network model is that it should permit a partitioning that reduces the edge-cut (i.e.number of edges between partitions) as compared to a randomly partitioned network.This translates to reducing the volume of spikes that must be communicated between partitions and communication workload.We show that this may be achieved for biological scale models with spatially dependent structure.

Spatial partitioning
Due to its relevance to many problem domains in scientific computing, methods for efficient and high-quality graph partitioning have been extensively studied [37].There are also multiple software tools to partition large graphs in parallel [38,39].Current state-of-the-art methods generally involve a multi-level approach where a large graph is successively approximated through heuristics until it reaches a manageable size such that an optimal partition may be found combinatorically.This base partition is then propagated and refined back up to obtain a partitioning of the original graph.Throughout this process, information about the edge weights or the geometry of the vertices may be used to help guide the heuristic approximations, if applicable.
On the contrary, we note that the default (and often only) approach to parallelizing SNNs is to take a 'round-robin' assignment of neurons onto the multiple compute processes.For example, this is the strategy used by NEST [40].This is where neurons are filled into partitions one-by-one in a modular pattern (p i = v i mod N p ).Although a round-robin assignment results in a relatively even distribution of an SNN for load balancing the per-process compute costs, we expect this approach will also limit scalability on HPC due to communication costs.This is because a round-robin assignment of neurons will tend to result in a fairly expensive all-to-all communication step, where each process will need to send spike information to all other Toy network illustrating spatial partitioning and round-robin partitioning.For a total of 9 partitions, spatial partitioning follows the inner-grid lines.For round-robin partitioning, neurons are assigned to same color partitions (which repeat in a round-robin pattern).As can be seen, the round-robin partitioning can lead to considerably increased communication costs as networks becomes larger.
processes.While perhaps nothing is lost for a randomly connected network, biological scale models typically have a spatially dependent structure where we expect there is a more efficient approach to assigning neurons so as to reduce the communication costs.For example, if assuming a layered sheet structure such as in cortex, a tile-based partitioning method may be effective [30].
As an illustrative example, we may look at the toy network in figure 1.There are a total of 144 neurons aligned to a 12 × 12 grid, with each neuron connected to neighboring neurons that are within a Manhattan-distance of 2 away (no periodic boundaries).This results in roughly 7.2% connectivity with a total of 1492 synapses.A 9-way spatial partitioning into a 3 × 3 grid results in some edges being maintained within-partition (solid arrows) and those that cross partition boundaries (dashed arrows).Under this partitioning method, we obtain an edge-cut of 448 edges that must be communicated between partitions, or 32.7% of the total connections.Furthermore, we note that some partitions may have no edges between them (e.g. the top-left and bottom-right partitions).We may contrast this with a round-robin assignment of the neurons to partitions, illustrated as neurons of the same color which repeat in a round-robin pattern.Here, the same underlying graph connectivity results in an edge-cut of 1316, or 88.2% of the total connections.Furthermore, it is clear to see that each partition contains at least an edge to every other partition, increasing the communication burden.

Distributed compressed sparse row
To support more elaborate partitioning schemes, we take a quick detour into distributed graph data structures.Compressed sparse row (CSR) is a widely used format to efficiently represent sparse matrices [41].Although there are different implementations for CSR, the central idea is to store the non-zero values of a matrix corresponding to indexical arrays over the rows and columns.A relatively standard implementation is to represent a spare matrix as three distinct arrays: the row, column, and value.For an (n × n) matrix with m non-zero entries, the value array is relatively straightforward to construct, and is simply all of the m non-zero entries listed in row-major order.The associated column array is also of size m, and contains the column indices of their corresponding non-zero entries within their respective rows.Finally, the row array, sometimes also called the row pointer array, provides a prefix or cumulative sum over the number of non-zero entries per row.By convention, this is of size n + 1, where the first entry is 0 and the last entry is m, which allows it to be used as an offset into the column and value arrays.As an extension to this, the distributed CSR (dCSR) format was most notably introduced for storing and ingesting large matrices (i.e. when n is very large) in prior work in parallel graph partitioning algorithms [42].
Here, each of the row, column, and value arrays of CSR are partitioned with respect to the rows into multiple arrays.For a k-way partitioning, we may distribute along the rows as .k]), each providing a local prefix sum of their corresponding partition.Finally, an additional indexical array of size k + 1 is constructed which provides the global partition offsets.By convention, the first entry of this partition array is 0 and the last entry is n, which allows it to be used as an offset into the multiple partitioned arrays.In figure 2, we provide an example illustration of a sparse matrix and its representation in CSR and dCSR formats, respectively.

SNN extension to dCSR
For an SNN, a natural representation of the connectivity and overall network structure is as a directed graph [43].Vertices V(G) = {v 1 , v 2 , . . ., v n } correspond to the spiking neurons, and edges E(G) = {e 1 , e 2 , . . ., e m } correspond to the synaptic connections.The direction of an edge e l : v i → v j , for i, j ∈ 1, . . ., n and l ∈ 1, . . ., m corresponds to the propagation of an event (e.g. a spike) from the presynaptic neuron v i to the postsynaptic neuron v j .We may further partition this graph by splitting up the vertices such that known as a k-way partition.Here, edges may be assigned to a given partition based on their source vertex or their target vertex.With respect to the communication and computational patterns for SNNs, typically with synaptic weights applying current on their target neuron, collocating a directed edge with its target vertex is more sensible.
We can see that there is essentially an isomorphism between the rows and columns of a sparse matrix to the vertices and edges of an SNN graph.The main difference is that for an SNN, the vertices and edges typically carry far more additional information (e.g.connection delays, neural and synaptic states) than may be afforded by the standard, single non-zero entry in the value array.To address this, we develop an extension to simply allow for tuples of values to be associated with the column array (as well as tuples of values to be associated with the row array).Here, the adjacency structure is indexed alongside tuples of values corresponding to the network state.This includes the mutable states of vertices and edges (e.g.membrane voltage, synaptic weights, connection delays), unique states (e.g.spatial coordinates of neurons), and transient states (e.g.'in-flight' spiking events).Because the amount of necessary unique state for any given vertex or edge will depend on its specific model dynamics, we may also introduce an additional model dictionary to provide tuple sizes.
Instead of representing a graph adjacency as two contiguous arrays of row offsets and column indices, respectively, one of the shortcuts implemented by ParMETIS during serialization is to implicitly index the row by its line number in a given .adjcy.k (adjacency) file and then list its column indices as space-separated values.Because the entire file must be read in from disk for processing, the initialization of data structures such as the row offsets can be computed at runtime.The partitioning offsets are still needed, however, and these are supplied through a separate .dist(distribution) file.As additional information to a geometric partitioner, we also initialize a .coord.k (coordinate) file corresponding to the spatial coordinates of a vertex (x, y, z) within a Cartesian coordinate system.This becomes especially useful when network sizes exceed the memory requirements for advanced partitioners and may need to fall back to simple voxel-based partitioning.For the main network .state.k (state) files, we supply a space-separated list of string-based model identifiers and tuples of state data.We concatenate vertices and their associated edges, resulting in a file that begins with a vertex identifier and its state, and followed by edge identifiers and state for each of the incoming connections.Of note here is that because the adjacency file for graph partitioning is typically undirected as opposed to directed, we additionally include special 'none' model identifiers with no associated state for edge where there is an outgoing connection but no incoming one.We also introduce a .model(model) file which provides a mapping between the string-based model identifiers and the size of its state tuple, as well as shared model parameters.There are also .event.k (event) files which provide serialization of any simulation events 'in-flight' that have not yet been processed on the target vertex due to connection delays.These include space-separated tuples of the source vertex, arrival time, the event type, and any associated data.
An example partitioning of a network under this extended SNN-dCSR format can be seen in figure 3. Here, a simple network of 12 vertices (numbered) and 21 directed edges (lettered) are distributed into 3 partitions (colored).Comparing this to the example in figure 2, we may draw some similarities.The dist file contains the partition offsets, but extended for both the vertices and the edges.The adjcy files contain the column indices explicitly, but also the row indices implicitly (by line number).The state files contain the corresponding values (as tuples) for both the vertices (indexed simply by row, implicitly by line number) and edges (indexed by row and column, ordered according to the adjcy file).

Network model definitions
To support the creation of custom, user-defined vertex (e.g.neuron) and edge (e.g.synapse) models, the model parameterization and state tuples that are required to snapshot a network are relatively generic.Regardless of whether a model pertains to vertices or edges, we may decouple the model information that needs to be stored from their dynamics (i.e.how they are used).Information that is related to the model dynamics, but are otherwise shared across all instantiations of a given model type (e.g. the membrane leakage rate for a given neuron population) are simply identified as name-value pairs as model parameters.
Information that are unique per model instantiation and/or are mutable throughout a network simulation (e.g. the membrane voltage for a given neuron) are identified as model states.These can furthermore be given initialization methods (e.g.drawn from a distribution).The former is stored in the model file, whereas the latter has appropriate headers in the model file, but the actual values are stored as ordered tuples in the state files.The model file follows the YAML specification format for portability [44].
To instantiate a fully realized network model, STACS also uses a .graph(graph) configuration file, which also follows the YAML specification format.Here, vertex populations and how they connect to each other through edge builders may be defined.Vertex populations simply reference a vertex model and determine how many instances of that model type to create.Population instantiation parameters such as how the vertices may be spatially distributed (e.g.within a rectangular area) can also be provided.Edge builders are defined between source and target vertex populations, as well as which edge model connects to them.There is also connection information provided to determine whether a given edge should be created when evaluating a pair of source/target vertices.This can be as simply as defining a probability of connection, but more complex functions that take into account the distance between vertices or the vertices' index within their population are also available.Pre-computed connections and/or values may also be provided through comma-separated files.

Network partitioning
In addition to the graph and model files, a simulation-specific configuration file which details information like the size of the timestep or the random number generator seed is needed.Importantly, information about how many partitions the network model should be split across is provided.This information is taken into the network generation capabilities of STACS to instantiate vertex populations and build the edges between them, propagating any information between the different partitions as needed.Although there are sample-based edge builders which can shortcut the connectivity evaluations, in the default scenario, STACS must pass through the full n 2 space of vertex-vertex pairs (where n is the total number of vertices in the network model).Fortunately, in a parallel context where each partition may be evaluated independently, this can ideally be performed in O(n) time when n/k is kept constant (weak scaling), where k is the number of partitions.
Although the number of partitions is provided to the network generator, STACS actually leaves the specific partitioning strategy to external partitioning tools like ParMETIS or Zoltan [38,39].This is of course supported by the adjacency and coordinate files provided by the SNN-dCSR format.Alternatively, simpler voxel-based partitioning can also be implemented based solely on the coordinate information, which can be a reasonable heuristic for biologically-derived network models.In either case, for STACS to (re)partition a network model, the output of the partitioner is formatted as .part.k (partition) files where each row corresponds to the vertices of the original partition order and the entry of that row contains the index of the new partition that the vertex moves to.
Similar to network generation, STACS is capable of repartitioning (i.e.migrating the vertices) based on the partition files in parallel.Here, the vertices and their associated edges are distributed and collected from their source to their target partitions (similar to the MPI scatter and gather operations, respectively).Global bookkeeping information such as partition offsets are then recomputed and edges are reordered accordingly.Unlike network generation, because the new partition-based offsets must be computed in order, there is a bit of a serial dependency for the repartitioning.That being the case, under weak scaling where the number of synapses per neuron is held constant, it is still expected to ideally exhibit O(n) linear scaling as each partition only scans through the vertex indices once.
For the curious reader, the default partition assignment employed by the STACS network generator follows a deterministic order based on the number of partitions, vertex populations, and vertices per population.In the absence of any additional partitioning information, STACS seeks to generate a load-balanced partitioning where vertex populations are split as evenly as possible across partitions and vertices are distributed in contiguous ranges based on their index.This is distinct from a round-robin assignment (see figure 1).Rather, what this means is that for a given partition, vertices with similar index ranges as a percentage of their population are collocated.In general, this benefits network models that are more regularly constructed (e.g. with vertex distributions along a grid and with edge builders that make connections based on vertex index).

Data format interoperability
What makes the extended SNN-dCSR format particularly appealing is that it draws from pre-existing, widely used formats that are intuitive to understand.For our implementation, all of the network state is serialized into essentially four main types of parallel files (adjacency, coordinate, state, and event), and there is also little overhead in storage costs with the introduction of few additional metadata files (dist, model).Due to its simplicity, it also becomes relatively straightforward to interoperate with popular graph analysis packages such as NetworkX and its directed graph data structure [45].In fact, this is precisely what is done to interface STACS as a simulator backend to the Fugu neuromorphic computing framework [7].The network configuration files can also be generated through network modeling tools like N2A [9].
Beyond its simplicity, we also contend that a partition-based distribution of network state makes SNN-dCSR more immediately suitable for computational parallelism, whether its target platform is between different nodes for simulation or between different chips on neuromorphic hardware.Furthermore, as a result of its lineage in graph partitioners, a given serialization may also be readily used to inform a potential repartitioning of the SNN model such that it may optimally fit to different backend platforms.

Scaling study
For our scaling study, we use STACS to generate, partition, and simulate a relatively standard excitatory-inhibitory reservoir (E-I balanced network) of neurons using the Izhikevich neuron model and spike-time-dependent plasticity (STDP) as model dynamics.Additionally, we compared against a balanced network provided as an HPC benchmark by the NEST simulator (see section 4.2).Although there are a number of SNN simulators available, we noted that NEST was really the only one relevant for multi-node applications at large scales [29].Simulations were performed on the Summit HPC system at the Oak Ridge Leadership Computing Facility (OLCF) [46].

Network model description
We construct our network as two interacting populations of Izhikevich neurons, consisting of 80% excitatory, regular-spiking neurons and 20% inhibitory, fast-spiking neurons, following the proportions of these cell types observed in-vivo and commonly used in spiking network simulations [47,48].The Izhikevich model is a two-dimensional nonlinear dynamical system that is useful for large-scale simulations because the same equations model several different excitability patterns observed in biological neurons using different parameters [3].
These cells were randomly and independently distributed on a two-dimensional region with periodic boundary conditions, to yield a constant excitatory and inhibitory neuron spatial densities of ρ = 1 (i.e. one neuron per unit area, µm 2 ).The connectivity between individual cells was initialized randomly and independently, with the probability of connection between any two cells as a function of their distance.We modeled this as a sigmoid function similar to what has been found in visual cortex [49].More concretely, if r is the Euclidean distance between a pair of cells, then probability of connection was given by ) . (1) Here, the parameters P MAX , µ, and σ determine the maximum connection probability, the midpoint, and the slope of the connection probability as a function of distance.By integrating this equation with respect to the constant neuronal spatial density, we computed the expected number of synaptic connections between individual neurons of each type.
Here, Li 2 is the polylogarithm function.We chose the P MAX , µ, and σ parameters to yield either 100 or 1000 total presynaptic connections per neuron.Inhibitory connections were tailored to be local, with high connection probabilities for cells in a small neighborhood, dropping off quickly outside of that neighborhood.In contrast, excitatory cells were tailored to connect with cells over a wider physical distance, as shown in figure 4. In addition to this parameterization, we also determined maximum distance cutfoffs, D, between any two given neurons.These parameters are provided in table 1, and result in a local connection neighborhood of 25 partitions, which stays constant as the network scales up.
The initial connection weights were chosen to yield stable neural activity.Too much excitation or too little inhibition yields networks prone to numerical instabilities.We estimated weights for a stable network by looking for self-consistent solutions for a simplified homogeneous population of neurons.For a single neuron embedded in a large network receiving approximately Poisson input, the mean and variance of the input current summed over all inputs is given by The output firing rate of a neuron is given by a function of the mean and variance of the input current.For our Izhikevich model neurons, this function is not known in closed form.We estimated the output firing rate by simulating a Izhikevich neuron driven by random input with specified mean and variance.This function depended on the model parameters such that we estimated separate firing rate functions for excitatory (FR E (µ, σ 2 )) and inhibitory (FR I (µ, σ 2 )) neurons.
By assuming every neuron in the population fires at the same Poisson rate, we can combine our mean and variance equations with our firing rate equations and, for a given set of weights, determine the firing rates at which an individual neuron would receive random synaptic drive that would cause it to fire at the same rate as the rest of the population.In other words, the steady-state firing rates r E and r I of the population must satisfy We chose the initial weights such that these putative steady-state rates were r E = 5, r I = 20, resulting in an average network firing rate of 8 Hz.For the synapse dynamics, we implemented STDP according to [48].Briefly, weight changes ∆w were accumulated and applied over each second of simulation time according to the relative timing and magnitude difference between the presynaptic and postsynaptic spike times, t pre and t post , with respect to each synapse.Model dynamics were computed using a discrete, event-based formulation that made use of trace variables to approximate the typical asymmetric decaying exponential model for STDP.Ultimately however, for benchmarking stability, even though we evaluated the STDP dynamics to simulate the computational workload, we ended up modifying the network model to essentially ignore the resulting weight changes and artificially lock the average spike rate (see section 4.4).

NEST scaling comparison
We compared our simulation scaling performance in STACS with a similar random balanced SNN provided by the NEST simulator, specifically the one implemented by its included HPC benchmark.The NEST network model also consisted of 80% excitatory and 20% inhibitory neurons, but with a larger fan-in of 11 250 connections per neuron as well as a base scaling factor of 11 250 neurons.Similar to the steps taken to facilitate benchmarking stability in STACS, the instructions provided by the NEST script specifies collecting benchmarking results from only the short, initial 250 ms of simulated time before the network was expected to become unstable.The neurons of the NEST network model have slightly simpler dynamics, and are modeled as integrate-and-fire with alpha-shaped post-synaptic input currents following the parameterization from [50].The synapse model for excitatory-to-excitatory connections also exhibited STDP according to [51], while the remaining connections use static synapses defined by weight and delay.Unlike STACS, these connections are simply randomly sampled from the network based on the known network size.As a result, the choice of a round-robin partitioning with all-to-all connectivity is essentially as good as any other.For further details about the NEST model and the simulation parameters refer to [52].
Although there are certainly differences between the the network model described in section 4.1, because an application is normalized with respect to itself when evaluating scaling metrics such as parallel efficiency (see section 4.4), we note that these details will become inconsequential.That is, if we abstractly model the computational workload per process as the neuron and synaptic updates and the communication workload as handling the volume of spikes between processes, we expect the computational workload to effectively cancel itself out when scaling.Particularly for weak scaling, where the per-process computational workload is meant to be held constant, the scaling behavior is largely the result from the changes to the communication workload.That being said, the network models between STACS and NEST are still quite similar.They are both using a point-neuron approach to simulating network dynamics corresponding to a standard 80-20 E − I reservoir network model receiving Poisson inputs.Furthermore, because the global spike rates of each model are maintained to about 8 Hz to facilitate benchmarking stability, both simulators also communicate the same amount of spikes per interval of simulated time.The key difference is of course how the two different simulators manage their communication workload.

Summit system
We conducted our scaling studies on the Summit HPC system at the OLCF [46].Each Summit compute node contains two IBM POWER9 22-core CPUs (42 cores available for compute, and 2 set aside for overhead), with 256 GB of DDR4 RAM per CPU, and each CPU is connected to three NVIDIA V100 GPUs, with 16 GB of HBM2 per GPU.Each node also contains a 1.6 TB NVMe storage device providing peak write bandwidth of 2.1 GB s −1 (2.0 GiB s −1 ) and read bandwidth of 5.5 GB s −1 (5.1 GiB s −1 ) provided through Alpine, an IBM Spectrum Scale parallel file system providing 250 petabytes of scratch storage capacity and peak sequential I/O bandwidth of 2.5 TB s −1 [53].Summit compute nodes are networked to each other and to Alpine using a Mellanox EDR Infiniband network, and each node has a 12.5 GB s −1 maximum data transfer rate to the Alpine servers over the network.Summit also provides IBM Spectrum MPI, which is an enhanced version of OpenMPI that integrates the Infiniband network and NVIDIA GPUs and uses the ROMIO MPI-IO implementation.
For our study, we built version 7.0.0 of Charm++ using the GCC compiler with Spectrum MPI as the communication layer and using the build option for the IBM POWER9 hardware.Being a relatively lightweight application built on top of the Charm++ framework, STACS only required a few additional modules (e.g.YAML, FFTW) to compile thereafter.NEST was built within the Open-CE computing environment provided by Summit, which is the Python Anaconda environment supporting various machine learning frameworks for the Summit's POWER9 hardware.Our NEST simulations are carried out on version 3.4 of the simulator [40].

Scaling configuration
We performed strong and weak scaling experiments in the range of 8-256 nodes of Summit.For each configuration, network simulations were run 5 times and we report simulation times averaged over these multiple runs.For both STACS and NEST, we chose a partition size of 2500 neurons.For STACS this translated to a square-shaped spatial area of 50.0 µm × 50.0 µm.This also allowed us to perform our strong scaling experiments of the low synapse density network for STACS at the smallest scale of 8 nodes.For STACS, we opted to used 40 processes per node (out of the 42 available) for a total of 100 K neurons per node.This was allow for any overhead needed by the Charm++ adaptive runtime system.For NEST, we ended up using 36 processes per node.This allowed for a multiple of 8 times the base scaling factor of 11 250 neurons to split evenly across a whole number of cores, for a total of 90 K neurons per node.For both STACS and NEST, each of the above processes were mapped to a unique MPI rank, corresponding to individual CPU cores.At the larger scale end of our experiments, this resulted in networks consisting of 25.6 M neurons (25.6B synapses) for STACS and 18.24 M neurons (205.2Bsynapses) for NEST.

Communication workload
Although both network models of STACS and NEST compute model dynamics for STDP, the effects of this plasticity is effectively ignored during our scaling experiments.This is because the change in distribution of weights will in general affect the distribution of the spiking activity, resulting in inconsistent compute loads and potentially instability in the network.In the instructions provided by the NEST scripts, this is simply accomplished by only simulating the network for a short time (250 ms of simulated time) before the network was expected to become unstable.For STACS, we were able to artificially set the spike frequency regardless of the incoming weight, while still performing the synapse model dynamics updates to simulate the compute workload.Because we found that the NEST model spikes at roughly 8 Hz at a network level, we set the 2000 excitatory neurons per partition in STACS to spike at 5 Hz and the 500 inhibitory neurons per partition to spike at 20 Hz, for an average spike rate of 8 Hz, while also generating spiking activity in line with the reference model in [48].
Regarding their communication workload, both simulators communicate events with respect to a global minimum delay defined by the network model.This allows for the maximum amount of independent computation to occur in parallel, interleaved between communication.Here, we define and use a communication-based timestep, t step , as the simulated interval containing one communication cycle.For STACS, this is every 1 ms of simulated time, and for NEST, this is every 1.5 ms.This may be different from the simulation resolution within a communication cycle, for example, the NEST network model uses a resolution of 0.1 ms, meaning it exchanges spiking events every 15 computational timesteps.For STACS, the simulation resolution (of the Izhikevich model) is simply 1 ms.With 2500 neurons per partition spiking at 8 Hz on average, this translates to roughly 20 events per partition per t step for STACS, and 30 events per partition per t step for NEST, with the latter performing communication less frequently with respect to simulated time.
The main difference in communication strategies is of course between how STACS uses a multicast pattern alongside spatial partitioning the network model, and NEST uses an all-to-all broadcast exchange alongside a round-robin distribution.Based on this, we should expect that the volume of messages between partitions should ideally grow linearly with network size for STACS, as opposed to quadradically as with NEST, and should lead to better scaling.

Experimental procedure
More comprehensive weak scaling experiments were carried out for both STACS and NEST.For strong scaling on the other hand, we only looked at the lower-density, 100-synapse per neuron network configuration for STACS due to per-node memory limitations.For NEST, we reference prior work which looked at strong scaling results on the JUQUEEN and K supercomputers [52].On both of these HPC systems, the NEST strong scaling was sub-linear as a function of number of MPI processes and threads for a 5 million neuron network.
For the weak scaling experiments, the experimental procedures were relatively straightforward.For STACS, we first generated a high-level description of the network model as described in section 3.3.For example, on 64 nodes of Summit, this included information such as the use of 2560 processes, the instantiation of 5.12 million excitatory and 1.28 million inhibitory neurons (6.4 million neurons total), connectivity parameters for 1000-synapses per neuron, and the size of the two-dimensional region to maintain spatial density of one neuron per unit area (e.g.2000 × 3200).We then procedurally built the network model from this high-level description, generating an initial snapshot of the network state according to the default population-based partitioning.
From this initial snapshot, we ingested the neuron/vertex coordinate information into a simple spatial partitioning script which split the network into non-overlapping sub-regions (50 × 50 each), and generated partition files for the repartitioning.We then performed the parallel graph (re)distribution based on these partition files by migrating vertices from their original partition to their new partition, and reordering their adjacency lists accordingly.The initial network snapshot was overwritten by this spatially repartitioned snapshot.Finally, we simulated the network for multiple repeated runs.For the STACS network, this was for 1 min of simulated time (60 000 timesteps), and we collected the elapsed wallclock times to compute a mean and standard deviation across the runs.
For the weak scaling experiments for NEST, we parameterized the HPC benchmark script according to the scaling configuration through its command line arguments.This allowed us to select both a scale (with respect to the base case of 11 250 neurons) and number of virtual processes (with respect to the available compute resources).For example, on 64 nodes of Summit, this included the use of 2304 processes, with a scale of 512, for a total of 5.76 million neurons (and resulting in 2500 neurons per process).A single run of the HPC benchmark script consisted of a total of 250 ms of simulated time (2500 timesteps), of which the last 200 ms (2000 timesteps) were used to report the wallclock time.The NEST script was run multiple times for each scaling configuration to compute a mean and standard deviation across the runs.
For strong scaling in STACS, we were able to take the network snapshot that was generated for the largest network configuration at 256 nodes and simply concatenate the partitions to generate a strongly scaled snapshot.While we still had to perform a spatial repartitioning of the network, this allowed us to forego the lengthy network generation step, and is also an illustration of the utility of employing a partition-based data format.The repartitioned network was then simulated as usual for multiple runs.
The parallel efficiency (Eff) was computed as the percentage of the base case (t step on 8 nodes) according to the expected weak and strong scaling laws.For weak scaling this is simply t step for the base case divided by the t step for the non-base case (e.g. on 64 nodes).For strong scaling this is t step for the base case divided the t step for the non-base case and also multiplied by the ratio of the compute resources for the non-base case to the base case (e.g.64 nodes divided by 8 nodes).

Results
The main focus of our scaling experiments was on weak scaling, which is the expected practical approach for scaling up to biological scale network models simply due to the per-node memory constraints.From our experiments, we looked at the network generation, repartitioning, and storage using our extended SNN-dCSR format, where we demonstrate linear scaling with respect to the network size.For simulation time and parallel efficiency, we compared the low and moderate synapse density networks in STACS (see section 4.1) with the scaling behavior of a similar network in NEST (see section 4.2).Here, we demonstrate state-of-the-art parallel efficiency in the scaling behavior of the STACS simulation.We also looked at the strong scaling behavior of STACS for the low synapse density network, and show that there may be an optimal balance point between computation and communication workloads.We first present our collected experimental information in tabular format here for reference, and discuss the different scaling results in the following subsections, where relevant figures are additionally provided.We collected the following information:

Network build
For building a network, unless leveraging sample-based methods for determining connectivity (which STACS has some functionality for as well), there is effectively an N 2 V set of comparisons to determine if a given vertex should connect to another given vertex based on information such as distance or per-population index.If this workload is properly distributed between the parallel partitions, because the ratio between the compute resources to the number of vertices is kept constant for weak scaling, we should ideally expect the network generation to scale linearly with the network size (see section 3.4).Indeed, this is what we see with T build in tables 2 and 3, and plotted in figure 5.
While the quantity T build for network construction is not necessarily related to scaling behavior, it is certainly important for total experiment time, potentially overtaking the simulation time.For the HPC benchmark, NEST had used a random sampling based approach to instantiating the network connectivity, which can be computed independently per neuron.For a more spatially dependent build process of a biological scale network, a potential point of comparison could be made with the scaffold model found in [25].This was a NEST network consisting of roughly 5.3 M neurons and 40B synapses with a complex spatial connectivity pattern, and it was reported to take 240 h (or 10 d) on 20 CPUs (400 total cores) to build.Although we had a relatively simple algorithm to evaluate connectivity (as a combination of max distance  and probability function), at our highest scaling configuration of 25.6 M neurons, we were able to evaluate the space of 655.36 T synapses in a just a few hours, being well able to take advantage of 256 nodes of 10 240 total cores.

Network repartitioning
Given the same weak scaling assumptions, we expect repartitioning the parallel graph distribution to also ideally scale linearly with respect to the network size.We can see this behavior with T part in tables 2 and 3, and plotted in figure 6.This scaling behavior is due to slightly different reasons than for the network build (see section 3.4): instead of partitioning an O(n 2 ) computation, it duplicates an O(n) computation.This can be seen in figure 7 the strong scaling behavior when the network size is held constant, where the repartitioning time is roughly comparable between the different node configurations.
To our knowledge, STACS is the only parallel spiking simulator that possesses a built-in parallel repartitioning functionality.It is facilitated by the partition-based SNN-dCSR data format, of which STACS instantiates runtime data structures that are populated accordingly with the network snapshot.On disk, the snapshot is split across several files, and because the vertex index is implicit by line number, an easy method of joining, splitting, or resizing the partitions is to modify the distribution files with different offsets.Just like the adjacency lists, the other information for the network state is also organized line-by-line per vertex, which means that as long as their ordering is preserved, simple file processing methods may be used to redistribute them for partitioning on different compute resources (e.g.neuromorphic hardware) than how the network was originally generated.

Network storage
Despite using plain-text for portability, the SNN-dCSR data format on disk is still relatively compact due to its adjacency list structure.The majority of the network state is contained in the synaptic state variables, and we found that the storage costs scaled roughly linearly with respect to the number of synapses, as plotted in figure 8.There is little overhead in storage costs with the introduction of few additional metadata files (e.g.partition distribution, model dictionary).In addition to the E − I reservoirs we constructed for our scaling experiments, we also built, serialized, and scaled the cortical microcircuit model from [23].The base model consists of roughly 76 K neurons and 0.3B synapses, resulting in about 12GB on disk (regardless of the number of partitions).For a 2× model in the number of neurons, but with the same connectivity parameters, this resulted in a network of 154 K neurons and 1.2B synapses (4× the base model), and about 49GB on disk.
At the largest scale, we instantiated a model with 25.6 M neurons with 5 mutable state variables each, and 25.6B synapses with 7 mutable state variables each, resulting in a network snapshot of about 2.4TB.To compare it to a rough estimate of how much in-memory requirements are needed during STACS runtime, using a single floating point representation (4 bytes) for each mutable state, we get about 0.7TB in synaptic state alone, which is about a 3× reduction compared to on-disk.Of course, it is possible to compress the on-disk data or use binary formats in places, but the flexibility of space-separated plain-text for a potentially diverse set of arbitrary neuron and synapse models is also quite valuable for ease of interoperability (e.g.converting to/from NetworkX, using custom partitioners).

Weak scaling
We first provide a comparative summary plot of the parallel efficiency we computed under weak scaling from our experiments in figure 9. We note that depending on the computational workload, of either the low or moderate synapse density networks, the parallel efficiency of STACS goes either below or above that of the NEST HPC benchmark, respectively.Here, although the amount of communication of spikes between each partition is actually the same, because the overall workload of the simulation is more dominated by computation in the moderate synapse density case (i.e. more synapses to compute over), we found that the effects from communication are proportionally less during weak scaling.We can see from tables 2 and 3 that the per-communication timestep t step for the higher synapse density network is about 10× greater at the baseline configuration.Incidentally, for the NEST simulated network (table 4), we noticed a much steeper slope despite its larger t step of about 3× greater than the moderate density STACS network at the baseline configuration.
For the NEST experiments, we also noticed a sharp inflection point at the higher end of the scaling experiments where a significant deterioration to parallel efficiency.In contrast, the STACS parallel efficiency   curves are relatively smooth throughout.This is likely indicative of a threshold in computational resources being exceeded (e.g.cached memory).Although the synaptic computations are part of the computational workload, and should be fixed per neuron, it may follow a different scaling at a partition level.Due to its all-to-all exchange pattern, it is possible that the data structures that need to be instantiated per process need to increase in size proportional to the network size.In contrast, the local communication neighborhood in STACS was expected to remain constant under weak scaling.We plot the weak scaling timing results for STACS in figure 10, and for NEST in figure 11.Here, we note that another item of comparison is the variability in simulation times across multiple runs.Here, we note that while the NEST simulations remained relatively consistent, the STACS simulations experienced more variability at the higher end of the scaling experiments.Unlike the relatively structured phases of computation and communication found in NEST, STACS takes advantage of an asynchronous multicast   communication pattern provided through the Charm++ runtime system.Although the goal is to hide interconnect latency by overlapping computation with communication, we speculate that nontrivial interactions between the communication strategy and network contention pushed the different partitions of the STACS simulation out of phase with each other.Despite this, we found that STACS was capable of maintaining competitive levels of parallel efficiency for reasonable computational workloads.

Strong scaling
Due to memory limitations per Summit node, we have results for strong scaling in the lower density, 100 synapse per neuron network.For these experiments, we forego building the network at each scale.Owing to the use of an intermediate serialized representation of the network state on disk, we may simply build the 256 node case and concatenate files and repartition the network to get networks for the lower nodes.For this reason, we do not report network build times in table 5. We plot the strong scaling timing information in figure 12.
From these results, we see a fairly reasonable linear reduction of simulation times with increased processing at the lower end, where the simulation is dominated by compute workload.In fact, we are actually able to achieve slightly higher than 100 percent parallel efficiency in the first few increases in scale, potentially supported by the overlap in computation and communication from Charm++.At this end, because the partitions also cover a larger simulated area, the number of adjacent partitions is also smaller.We also note an inflection point where the simulation moves from being compute dominated to communication dominated (from parallel overhead), as we have seen previously in the weak-scaling experiments for the low synapse density network.We expect this inflection point to occur later for models with higher compute workloads.
Although we did not perform strong scaling experiments with NEST, we point to some prior work in [54] and [52] that used the same NEST HPC Benchmark networks.On the JUGENE and K supercomputers, the authors in [54] looked at networks of about 1 M and 3.5 M neurons, respectively, and estimated parallel efficiency of around 0.9.On the JUQUEEN supercomputer, the authors in [52] looked at a network of about 1 M neurons and noted sub-linear scaling up to around 2 K MPI processes and a similar inflection point to the STACS strong scaling behavior afterwards.

Discussion
The brain's immense scales and complexity should naturally position the neuroscience field to take advantage of today's advanced HPC capabilities, however the challenges associated with collecting neurobiological data, the heterogeneity of the brain, and the limitations in neural theoretical frameworks have held back the usefulness of large-scale simulations [55].As a result, many computational studies in neuroscience have historically focused on either data analysis or detailed biophysical simulations of small populations of neurons.Only in recent years has data been collected at a scale that justifies more detailed simulations of large populations of neurons.While the historic use of computation as a tool for data analytics has motivated an increased level of attention in large scale AI methods as a model of neural activity (artificial neural networks to model biological neural networks [56]), it is also reigniting a need to perform bottom-up simulations of neural circuits [22,57].
The results in this paper demonstrate that more favorable scaling can be achieved in parallelizing neural simulations if the challenges of neural communication are suitably accounted for.Overall, we see this work as a key step in addressing the communication workloads for large-scale SNN simulations.Through leveraging a graph-aware partitioning approach and the asynchronous multicast communication of Charm++, the STACS SNN simulator shows competitive weak and strong scaling as networks reach into the millions of neurons and billions of synapses.Our results suggest that the density and distribution of synapses, which determines the per-node communication workload, becomes the dominant cost for larger simulation sizes.In addition to simulation runtime, we also contribute a graph-based data format, SNN-dCSR, for network serialization and interoperability within a growing neuromorphic computing ecosystem.

Scaling further
While the results here are promising, the simulations shown are still several orders-of-magnitude short of human scale, and the complexity of neurons, synapses, and inputs are considerably lower than known biophysics.To move to human scale and complexity, it likely will be important to move past existing tools for scaling numerical simulations that have at their heart some fundamental assumptions about data locality, time discretization, and communication patterns.
Arguably, up to now the community has focused on using conventional HPC tools to get as far as possible can on large scale neural simulations.Many of the methods and design paradigms for HPC applications were developed with physics simulations in mind.At first glance, neural simulations are similar in many ways: they involve elements with a set of variables related by differential equations, and often neural models can be abstracted to be even simpler than most physics models.However, unlike finite element models of physical systems, neuron 'elements' are coupled with a large number of neighbors (on the order of 10 000), and some of these connections can be long range, extending over more than half of the simulated volume.This makes partitioning and communication more difficult, introducing unique challenges for scaling up neural workloads.This presents opportunities for developing tools that are more specialized for neural simulation, and furthermore these can be used for other irregular applications, such as graph analytics, that are not characterized by predictable memory and network access patterns.
While this work addresses some of these challenges, there is also plenty of room for improvement.We expect an important area of development will be heterogeneous communication approaches to accommodate the different connectivity patterns for short-and long-range synapses.In general, we would expect that longer range connections have longer transmission delays and would also allow for increased message latency.Instead of requiring spiking events to be exchanged between distantly connected partitions every timestep, the communication interval could be increased or the partitions could also have intermediate routing.In either case, we predict that a spatially determined partitioning of the network model such as provided through STACS could be used to support more complex source-target communication strategies.Communication primitives such as user-defined multicast groups such as provided through Charm++ would also be useful.
In addition to the execution of model dynamics, the increase in complexity of neural models will likely begin to stress other aspects of simulations as well.For example, a recent full-scale model of one subregion of the human hippocampus [25] highlights that the initialization of the model can present as much of a computational challenge as the simulation itself.For this reason, the characterization of the network generation and partitioning (see figures 5 and 6) may likely prove to be as important as simulation costs.Furthermore, as in other fields, large-scale simulations present challenges associated with data analytics of simulated neuron outputs.Although a tool like STACS is able to provide network snapshots, recorded states, and spiking activity for offline analysis, the volume of data generated this way can be overwhelming.Fortunately, data analysis has been an area of considerable progress in neuroscience, and there are also opportunities to develop in situ analysis methods of the data generated during the simulation.

Accelerators
Depending on the composition of the workload, offloading computation to accelerators such as GPUs could also be a valuable option.Indeed, a major part of the strategy in scaling up to exascale HPC systems has been the use of GPUs to handle numerically intense calculations on the compute nodes [58].Frameworks such as PyTorch supporting GPU acceleration have also only grown with the advancement of deep learning [59].In the space of SNN simulators, there are a number of options leveraging GPUs available as well, such as GeNN or BindsNET [28,60].For these GPU-accelerated simulators, however, the computation is mainly limited to a single node, as the move to GPU brings additional challenges for communication in a multi-node setting.
Because large-scale network models will necessarily be distributed across nodes due to memory requirements, this introduces two layers of communication overhead: between nodes, and between the CPU and GPU.Due to the added latency in moving data between GPU and CPU, maintaining data within GPU memory as much as possible yields the greatest speedup benefits.The irregularity of which synapses need updates and which neurons spike on a timestep-to-timestep basis also poses challenges for efficient data movement.Here, more architecture-aware graph embedding of an SNN onto the compute platform may become important [61].As these challenges are addressed, we expect that one of the significant benefits of GPUs will be in allowing more biophysically realistic neuron and synapse models to be used within the time-windows defined by the communication fabric.Of course, because there are compute limits on GPU, methods for addressing communication between nodes, such as explored in this paper, will still remain relevant as models reach toward human-scale.Just like the results found from strong scaling (see figure 12), we expect there will be inflection points balancing between the computation and communication demands per node depending on the network model.
These scaling challenges also clearly motivate the use of neuromorphic computing hardware for acceleration at scale.These platforms offer fine-grained parallelism and emphasize efficient event-based communication, not only for the neuroscience-based simulations explored here, but also for other graph-based computations of complex systems, network analyses, and future AI systems that mix broad communication with local processing.While neuromorphic computing originally emerged from the analog computing community, the large-scale neuromorphic platforms today are primarily digital CMOS systems that are specialized for the neural computations explored here.Either through use of ARM (SpiNNaker) [62] or specialized processors (Loihi) [63], these platforms leverage simple dynamics in each computing element and communication of very small packets.The trend with the newer generations of these systems is toward more general computing of the elements, which allow for more sophisticated neuron models.This is comparable to the move into general purpose GPU computing.Just like in the multi-node case for GPUs, computation on neuromorphic platforms can be spread across multiple chips or boards, where additional communication overhead may be incurred when moving between them, and there remains challenges on how to best utilize resources to achieve efficient scaling [64][65][66].
As suggested by our results, we expect that a significant challenge in fully leveraging both GPUs and neuromorphic platforms for large-scale neural simulations will be the development of partitioners that are better able to take into account the communication costs between compute nodes or chips.Ideally, these would be able to find a workload to resource mapping that balances between the different requirements to optimize compute throughput.In the neuromorphic architecture design space, emerging tools like SanaFe which provide cost metrics for latency and energy between different hardware configurations may be helpful [67].Of course, we believe that a partition-based data format such as SNN-dCSR will also be valuable in coordinating for compute parallelism and managing the hardware implications of a software model.

Figure 1 .
Figure1.Toy network illustrating spatial partitioning and round-robin partitioning.For a total of 9 partitions, spatial partitioning follows the inner-grid lines.For round-robin partitioning, neurons are assigned to same color partitions (which repeat in a round-robin pattern).As can be seen, the round-robin partitioning can lead to considerably increased communication costs as networks becomes larger.

Figure 2 .
Figure 2. Example sparse matrix represented in the compressed sparse row (CSR) and distributed CSR (dCSR) formats (for a 3-way partitioning), respectively.Numbered values correspond to row and column coordinates, lettered values correspond to non-zero entries, and Roman numerals correspond to partitions.Entries within a row are uniquely colored for ease of correlation within their corresponding arrays.
which correspondingly distributes the value and column arrays into sizes m 1 + m 2 + • • • + m k = m.The partitioned row arrays then take on sizes n p + 1 (where p ∈ [1.

Figure 3 .
Figure 3. Example 3-way partitioning (colored) of a simple network with 12 vertices (numbered) and 21 directed edges (lettered) using the SNN-dCSR format.Directed edges with associated state are bolded.Additional description in text.

Figure 4 .
Figure 4. Spatial connection probability as a function of the distance between cells.Inhibitory cells were specified to have close-range postsynaptic connections, while excitatory cells were connected to a broader spatial distribution of postsynaptic partners.Parameters of each curve were chosen such that the expected number of synapses per neuron was either 100 (left) or 1000 (right).

•
Node-Number of Summit compute nodes used • P-Number of parallel partitions or processes • N V -Number of vertices (neurons) of the network model • N E -Number of edges (synapses) of the network model • T build -Time elapsed (wallclock) to build a network model • T part -Time elapsed (wallclock) to repartition a network model (e.g.migrating vertices w.r.t. a parallel graph distribution) • T sim -Time elapsed (wallclock) to simulate a network model • t step -Time per communication-based timestep (this is effectively T sim divided by the number of event message exchanges) • Eff-Parallel efficiency with respect to ideal weak/strong scaling laws

Figure 5 .
Figure 5. Weak scaling in STACS, showing T build network build times following a linear scaling pattern in the size of the network.

Figure 6 .
Figure 6.Weak scaling in STACS, showing Tpart network repartitioning times following a linear scaling pattern in the size of the network.

Figure 7 .
Figure 7. Strong scaling in STACS, showing Tpart network repartitioning times following a nearly constant pattern alongside a constant network size.

Figure 8 .
Figure 8. On-disk storage costs show linear scaling for extended SNN-dCSR format for common network models.

Figure 9 .
Figure 9. Parallel efficiency for different network models and simulators under weak scaling.We show that STACS demonstrates improved parallel efficiency at larger scales and moderate compute workloads.Error bars correspond to one standard deviation.

Figure 10 .
Figure 10.Weak scaling in STACS, showing T sim simulation time for both the low and moderate synapse density networks.The communication dominant workload of the 100-syn/neuron network scales roughly as the square-root of the size of the network, whereas the computation workload of the 1 K-syn/neuron benefits more from parallelization.

Figure 11 .Table 5 .
Figure 11.Weak scaling in NEST, showing T sim simulation time for the reference network in[52].We note a sharp slowdown in simulated times at 256 nodes.

Figure 12 .
Figure 12.Strong scaling of the 100-syn network in STACS.We may identify an inflection point where the inter-node communication workload overtakes the per-node computation workload.

Table 1 .
Spatial connectivity parameters for determining average synaptic connections per neuron, for excitatory (E) and inhibitory (I) neuron types.Distance is given in units µm.

Table 2 .
Weak scaling results for STACS on a low synapse density (100 synapse per neuron) network, simulated for 1 min of biological time.Times for build, repartitioning, and simulation are in seconds, time for each communication-based timestep is in milliseconds.Parallel efficiency is in percentage (of the base case on 8 nodes).STACS Weak Scaling -100 N E /N V -2.5 K N V /P

Table 3 .
Weak scaling results for STACS on a moderate synapse density (1 K synapse per neuron) network, simulated for 1 min of biological time.Times for build, repartitioning, and simulation are in seconds, time for each communication-based timestep is in milliseconds.Parallel efficiency is in percentage (of the base case on 8 nodes).

Table 4 .
Weak scaling results for NEST HPC benchmark with a high synapse density (11.25 K synapse per neuron) network, simulated for 250 ms of biological time.Times for simulation are in seconds, time for each communication-based timestep is in milliseconds.Parallel efficiency is in percentage (of the base case on 8 nodes).NEST Weak Scaling -11.25 K N E /N V -2.5 K N V /P STACS Strong Scaling -100N E /N V -25.6 M N V total