Exploiting deep learning accelerators for neuromorphic workloads

Spiking neural networks (SNNs) have achieved orders-of-magnitude improvements in energy consumption and latency when performing inference with deep learning workloads. Error backpropagation is presently regarded as the most effective method for training SNNs, but in a twist of irony, training on modern graphics processing units becomes more expensive than for non-spiking networks. The emergence of Graphcore's intelligence processing units (IPUs) balances the parallelized nature of deep learning workloads with the sequential, reusable, and sparsified nature of operations prevalent when training SNNs. IPUs adopt multi-instruction multi-data parallelism by running individual processing threads on smaller data blocks, which is a natural fit for the sequential, non-vectorized steps required to solve spiking neuron dynamical state equations. We present an IPU-optimized release of our custom SNN Python package, snnTorch, which exploits fine-grained parallelism by utilizing low-level, pre-compiled custom operations to accelerate the irregular and sparse data access patterns that are characteristic of SNN training workloads. We provide a rigorous performance assessment across a suite of commonly used spiking neuron models, and propose methods to further reduce training run-time via half-precision training. By amortizing the cost of sequential processing into vectorizable population codes, we ultimately demonstrate the potential for integrating domain-specific accelerators with the next generation of neural networks.


Introduction
Repurposing graphics processing units (GPUs) from graphics rendering to training deep neural networks has effectively shaped an entire decade of advances in artificial intelligence (AI) [1][2][3][4][5]. This can be attributed to the numerous processor cores in GPUs that enable high parallelization of easily decomposable instructions, which are essential for the large number of matrix operations that take place in neural networks.
But a significant discrepancy arises: the cost of training deep learning algorithms in data centers sits between hundreds and hundreds of thousands of watts, whereas brain-driven cognition is bounded to approximately 10-20 W. This efficiency gap has driven the neuromorphic engineering community to explore new algorithms, architectures, circuits, and devices that apply principles of neural processing to modern neural networks [6][7][8][9][10][11]. Spiking neurons transmit information in voltage bursts known as 'action potentials', which are characterized as discrete events in many neural coding studies. As such, spiking neural networks (SNNs) distribute information over time, where most neurons are dormant at any instantaneous moment. This reduces memory access frequency, which is one of the dominant costs in deep learning workloads [12][13][14][15][16].
When it comes to training via gradient descent, there are next to no accelerators optimized for SNN workloads. The most common uses for neuromorphic hardware are: (1) inference using fixed weights, where training takes place 'offline', or (2) online learning using simple plasticity rules, such as spike time-dependent plasticity (STDP). If SNNs are so efficient, why are there no accelerators that can perform backpropagation on SNN models? While feedforward computation is cheap, in a twist of irony, gradient-based optimization of SNNs is less efficient than its non-spiking counterpart. There are several reasons for this drop in efficiency: (1) the time complexity of backpropagation through time (BPTT) means each time step instantiates an additional neural network, so memory usage scales linearly with time; (2) biological neurons are more complex than artificial neurons; and (3) the non-differentiability of spikes means that a direct application of automatic differentiation is incompatible with SNNs. In effect, GPUs and many accelerators have not been optimized for the sequential instruction sets required by spiking neurons: multiply-accumulate → state update → thresholding → surrogate gradient calculation.
While the current market of accelerators is tailored to conventional DL workloads, this paper explores the use of accelerators that are better suited to the types of operations characteristic of SNNs [17][18][19]. In particular, intelligence processing units (IPUs, Graphcore) include a feature set that is a natural fit for training SNNs via error backpropagation. By coupling highly parallel multi-instruction multi-data (MIMD) processing to sparse, spike-based tensors, we take a stride towards extracting the benefits of DL accelerators and porting them to neuromorphic algorithms.
The contributions of this paper are as follows:
• our SNN Python framework, snnTorch, is released for IPU compatibility using low-level, pre-compiled custom operations;
• a variety of benchmarks are assessed to demonstrate up to 21.3× peak improvement in throughput over NVIDIA A100 GPUs when training SNNs;
• a series of corner cases are identified where GPUs converge to accelerator performance in recurrent SNNs;
• in much the way that brains distribute firing rates across pools of neurons, we demonstrate how the use of population codes can significantly accelerate the training process.
This paper presents the first analysis of the suitability and performance of IPUs in handling neuromorphic workloads when trained using approaches prevalent in deep learning.

SNN
The adoption of deep learning-based techniques for training SNNs can be dated back to 2002, when Bohte et al treated the firing time of a spiking neuron as a trainable regression problem [20]. Since the advent of CUDA-accelerated Python packages with built-in automatic differentiation (autodifferentiation) engines (e.g. PyTorch [21], TensorFlow [22], JAX [23]), the broader approach in recent years has been to apply a generalized backpropagation algorithm to an unrolled computational graph of spiking neurons (figures 1(a) and (b)) [24][25][26][27][28][29]. BPTT adopts techniques used to train recurrent neural networks, where sequences are instead interpreted as discrete time steps of finite duration [30,31]. While a variety of detailed models are used to accurately emulate biological neurons, the simplest models are more commonly used in large-scale simulations. This can be attributed to several reasons: (i) calculating the solution is computationally cheap, (ii) simplifying an action potential to a single-bit event promotes sparse computations, and (iii) applying gradient descent to stiff equations (e.g. with sharp bifurcations) can lead to instability when training a network.
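As a framework-agnostic sketch of what unrolling implies for SNNs, the toy loop below (pure Python; the neuron parameters and inputs are illustrative, not taken from this paper) stores one membrane value per time step, which is exactly why BPTT memory grows linearly with T:

```python
def unroll_lif(inputs, beta=0.9, threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron over T time steps,
    retaining the membrane trace as BPTT would for the backward pass."""
    u, spikes, trace = 0.0, [], []
    for x in inputs:                        # one iteration per time step
        u = beta * u + x                    # leaky integration of input
        s = 1.0 if u > threshold else 0.0   # hard threshold -> spike
        u -= s * threshold                  # reset-by-subtraction
        spikes.append(s)
        trace.append(u)                     # stored state: memory ~ O(T)
    return spikes, trace

spikes, trace = unroll_lif([0.6, 0.6, 0.6, 0.0, 0.0])
```

In a real training run, the autodifferentiation engine retains this per-step trace so that gradients can flow backwards through every time step of the unrolled graph.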
SNNs adopt the same topological structure as non-spiking networks. The main difference is that artificial neuron models are swapped out for time-varying spiking neurons. Time-evolution is modeled in a sequential structure. Specific details regarding the types of neuron models used are provided in the experimental results (section 4).

Neuromorphic processors
The neuromodulatory processes in the brain that leverage spikes to promote learning remain somewhat shrouded in mystery, which has inspired the development of several research-based neuromorphic processors. Several examples include Loihi developed by Intel Labs [32,33], IBM's TrueNorth [34,35], Neurogrid from Stanford University [36], SpiNNaker initiated at the University of Manchester [37,38], the National University of Singapore's Shenjing [39], and memristor-based accelerators like RENO [40], Harmonica [41], and MNSIM [42], some of which have roused neuromorphic research ecosystems where hardware access is offered both remotely and physically to the broader research community. While such neuromorphic processors remain to be optimized for gradient-based learning, they have incited much interest in how neurobiological processes can be modelled in-silico. These processors allow users to explore how programmable learning rules can modulate plastic synapses. The push towards data-driven benchmarks from deep learning has led to the adoption of gradient-based learning rules for SNNs, which are well-suited for non-convex optimization when combined with stochastic gradient descent, but demand far more computational resources than biophysically motivated learning rules. Training SNNs via gradient descent compounds several challenges:
• Temporal credit assignment: the BPTT learning rule requires storage of all gradients over time, where memory complexity scales with O(nT), where n is the number of neurons and T is the duration of time.
• Weight credit assignment: routing gradients from the network's output back to plastic synapses requires the data path of the forward operation to be stored. The gradient of every synapse has an independent pathway, which scales the cost of communicating gradients to apply weight updates.
• Non-differentiable operations: in leaky integrate-and-fire neuron models, a hard threshold is often applied to the membrane potential to elicit a voltage spike at the axon. This is a non-differentiable operation, and thus incompatible with gradient descent.

Temporal credit assignment
The temporal credit assignment problem can be addressed by adopting real-time recurrent learning (RTRL) techniques to avoid having to store gradients in time [43]. The cost of doing so is that memory complexity now scales with O(n³), where the cubic term discourages broad adoption in large-scale networks. Approximations of RTRL recently inspired the development of a lightweight SNN training accelerator for fixed, dense architectures [44,45].
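The trade-off between the two learning rules can be made concrete with a back-of-the-envelope sketch (the network sizes below are illustrative assumptions, not measurements from this paper):

```python
def bptt_memory(n, t):
    # BPTT: one stored state per neuron per time step -> O(nT)
    return n * t

def rtrl_memory(n):
    # RTRL: tracks the influence of every state on every other -> O(n^3)
    return n ** 3

# for n = 1000 neurons, RTRL only breaks even once T reaches n^2 steps,
# far longer than any realistic simulation window
n = 1000
assert rtrl_memory(n) == bptt_memory(n, n ** 2)
```

For any practical simulation length the cubic term dominates, which matches the observation above that RTRL discourages broad adoption in large-scale networks.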

Non-differentiable operations
Surrogate gradient descent is used to bypass non-differentiable operators, where the final calculated gradients are a sufficient approximation [6,46]. This adds to computational cost, as analytical methods for computing derivatives (e.g. dual numbers [47]) must be supplemented with manually-determined heuristics (surrogate gradients); i.e. training SNNs via surrogate gradients is not as modular as training non-spiking networks.

Low-cost inference
The high cost of training SNNs using non-local learning algorithms can be partially offset by the incredibly cheap cost of using SNNs in solely feedforward operations. It has been shown that SNNs can offer a 2-3 orders of magnitude improvement over non-spiking alternatives [13]. In general, this motivates offline training of SNNs, typically using GPUs, where deployment can take place on low-power SNN accelerators. Several recent studies have leveraged the programmable microcode of neuromorphic research processors to adopt BPTT variants on a single chip [48,49]. Training SNNs has traditionally been slow compared to ANNs. This is due to the additional sequential operations needed in recurrent networks, along with stateful computations implemented at the neuron node. Accelerating offline training of SNNs can: (1) enable more rapid, iterative testing and deployment of low-power models, and (2) guide hardware manufacturers towards alternative computational substrates that are valuable beyond GPUs. It is thought that deep learning proliferated due to the 'hardware lottery': deep learning models were the right models for the right hardware at the time. Exploration of alternative architectures can bring alternative models to the front stage of machine learning.

IPUs
IPUs are designed to facilitate deep learning workloads by processing fine-grained operations across a large number of parallel threads. The ability to process individual threads on sub-blocks offers a two-fold benefit for SNN workloads over single-instruction multiple-data/thread (SIMD/SIMT) GPUs: (i) instructions from different network layers can be concurrently processed, where the constraint of contiguous vectorized data is no longer a performance bottleneck, and (ii) MIMD processing can accelerate applications with irregular and sparse data access without incurring performance degradation. This is optimal for spike-based workloads, which include additional processing overhead in computing the state-driven dynamics of spiking neuron models (figures 1(c) and (d)).
Each Mk2 IPU consists of 1472 high-performance processor cores, where each processor core and a locally accessible in-processor memory unit form a tile. An IPU tile consists of one computing core and 624 KB of local memory. Each core supports six processor threads, totaling 8832 processor threads when operating in parallel. This amounts to a total of roughly 900 MB of memory and 250 TeraFLOPS of compute for the Mk2 GC200 IPU hardware that ran the experiments in this paper. Each core is connected directly to the IPU-Exchange, which is capable of transferring 8 TBps of data between IPU tiles. There is no global memory, and specialized hardware is incorporated for common neural network operations, such as convolutions and matrix multiplications.
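The headline figures quoted above can be cross-checked directly from the per-tile numbers; only the tile count, per-tile memory, and threads-per-tile are taken from the text:

```python
TILES = 1472            # tiles per Mk2 GC200 IPU
MEM_PER_TILE_KB = 624   # local in-processor memory per tile
THREADS_PER_TILE = 6    # parallel processor threads per core

total_threads = TILES * THREADS_PER_TILE            # 8832 threads
total_memory_mb = TILES * MEM_PER_TILE_KB / 1024    # ~897 MB, i.e. roughly 900 MB
```

The distributed layout means this ~900 MB is spread evenly across tiles rather than pooled in a global memory, which is why the exchange phase matters so much for inter-tile communication.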
IPUs follow a graph processing pipeline where programs are compiled into a logical execution graph.This graph is composed of alternating state and computation vertices.Each vertex consists of machine instructions that can execute in parallel, provided they write to independent parts of a tensor.Upon completion of a compute step, data is exchanged between tiles as part of the exchange phase of the bulk synchronous parallel (BSP) execution model.
Adopting this BSP execution model benefits bandwidth-limited neural networks, as overlapping memory-bound computation and communication can lead to bandwidth contention and data collisions [50,51]. BSP eliminates the need for message buffers and global memory, though as a result, all inter-core communication must be planned during model compilation [52]. In practice, once the model has been compiled, it can be cached and subsequently reused.

snnTorch
A variety of gradient-based SNN libraries have been open-sourced, most of which are written in Python for syntactical ease, and several of which are built on top of commonplace deep learning packages [25,[53][54][55][56][57]. Most approaches compose primitive functions together wrapped as a spiking neuron node, where gradients are analytically calculated using reverse autodifferentiation in the backend. As spikes are represented as discontinuous voltage bursts, they are non-differentiable. PyTorch allows users to override gradients with custom functions, and so has become a common backend for the implementation of surrogate gradient descent in SNNs [6,46].
snnTorch is adopted as the toolbox because: (i) it is designed with PyTorch as its backbone, so pre-existing interfaces can be used to lower composable PyTorch functions onto IPUs, (ii) several features are unique to snnTorch in the context of gradient-based learning, such as using population-based embeddings to accelerate the training process, and (iii) quantization-aware training has been integrated into the state-space of spiking neuron models, which can be used in mixed- and low-precision accelerators.
Several alternative options are available for accelerating SNNs using CUDA-based libraries.SpikingJelly provides a CuPy backend [55], GeNN uses CUDA-generated code to implement an approximate form of BPTT [45,58], and lava-dl incorporates the most commonly used functions/neurons as optimized CUDA code, while other libraries mostly depend on the deep learning package's CUDA acceleration.
To summarize, prior approaches for faster gradient-based training of SNNs include:
• utilizing microcode to enable neuromorphic processors to track gradients,
• using custom CUDA backends to accelerate SNNs on GPUs, and
• using pre-existing interfaces to CUDA via pre-existing deep learning libraries (e.g. PyTorch).
The first option is burdened with instruction set-level definitions that must be tailored to a given network architecture, and the latter two are limited by SIMD/SIMT processing. We take a wholly different approach by adapting Python-level SNN descriptions to leverage low-level, pre-compiled operations customized to an IPU accelerator's MIMD architecture. This approach to distributed memory amongst IPU cores can be used to reduce data movement, thus amortizing the costs of weight and temporal credit assignment.

Neuron models

Leaky integrate-and-fire neuron
The dynamics of a leaky integrator neuron are as follows [59,60]:

τ du/dt = −u + ri,    (1)

where u is the membrane potential of the neuron, i is the current injection to the neuron, r is the equivalent resistance of the ion channels of the neuron, and τ = rc is the time constant of the neuron, where c is the capacitance of the passive membrane. Equation (1) can be solved using the forward Euler method:

u_t = βu_{t−1} + (1 − β)i_t,    (2)

where β = e^{−1/τ} is the inverse time constant of the neuron membrane potential, and the subscript t refers to time. When the membrane potential exceeds the threshold u_thr, an output spike is generated:

s_t = Θ(u_t − u_thr),    (3)

where Θ is the Heaviside step function. To introduce learnable parameters, the current injection term is replaced with a weighted input, (1 − β)i ← wx. For notational brevity, the contribution of a single weighted input is used:

u_t = βu_{t−1} + wx_t − s_{t−1}u_thr.    (4)

The final term introduces a reset mechanism to the neuron. The unrolled computational graph depicting the operation of the neuron is shown in figure 2(a).
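A minimal, dependency-free sketch of one discrete-time update of equations (3) and (4) follows; the parameter values are illustrative only:

```python
def lif_step(u, wx, s_prev, beta=0.9, u_thr=1.0):
    """One LIF update: decay the membrane, integrate the weighted input,
    and subtract the reset if the neuron fired at the previous step."""
    u = beta * u + wx - s_prev * u_thr   # eq. (4)
    s = 1.0 if u > u_thr else 0.0        # eq. (3): Heaviside at u_thr
    return u, s

u, s = lif_step(0.0, 1.2, 0.0)   # strong input pushes u past threshold
u2, s2 = lif_step(u, 0.0, s)     # reset-by-subtraction: 0.9*1.2 - 1.0
```

This multiply-accumulate, state update, and thresholding sequence is the kind of per-time-step kernel referred to in the introduction.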

Current-based leaky integrate-and-fire neuron
If the leaky integrate-and-fire neuron can be thought of as a low-pass filter, the current-based variant can be thought of as a pair of low-pass filters. The input synaptic current is modeled as an AMPA receptor with a rapid rise time and gradual decay, which then modulates the membrane potential of the neuron:

i_t = αi_{t−1} + wx_t,    (5)

u_t = βu_{t−1} + i_t − s_{t−1}u_thr,    (6)

where α = e^{−1/τ_syn} is the inverse time constant of the synaptic current, and τ_syn is the equivalent time constant of the synaptic current in an analogous way to τ, with the computational graph illustrated in figure 2(b).
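As a sketch, the pair of filters in equations (5) and (6) composes directly with the same thresholding step as before (parameter values are illustrative):

```python
def cuba_lif_step(i, u, wx, s_prev, alpha=0.5, beta=0.5, u_thr=1.0):
    """One current-based LIF update: the synaptic current i is itself a
    low-pass filter whose output feeds the membrane potential u."""
    i = alpha * i + wx                  # eq. (5): synaptic current filter
    u = beta * u + i - s_prev * u_thr   # eq. (6): membrane filter + reset
    s = 1.0 if u > u_thr else 0.0
    return i, u, s

i, u, s = cuba_lif_step(0.0, 0.0, 2.0, 0.0)  # current and membrane both charge
```

The extra state variable is what makes the membrane trace smoother (and differentiable with respect to time), at the cost of one more sequential update per time step.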

Recurrent spiking neurons
Both of the above neuron types can be adapted to include explicit recurrent connections. The output spikes are weighted and appended to the input. Formally, a recurrent leaky integrate-and-fire neuron is represented by:

u_t = βu_{t−1} + wx_t + vs_{t−1} − s_{t−1}u_thr,    (7)

where v is the recurrent weight.

Custom operations on IPUs
The 'Poplar SDK' interfaces popular deep learning frameworks directly with IPU programming. The IPU uses an autodifferentiation engine independently of PyTorch's backend, and as such, spiking neuron models that depend on surrogate gradient descent are not compilable by default. Custom operations must be written in C++ and pre-compiled into machine-level codelets that are accessible to users via Python.
The approach here pre-compiles the surrogate gradient operator at the time snnTorch is imported. A custom operation is defined for the threshold-shifted Heaviside function (see (3)), implemented in C++ and compiled to generate a shared library object that can be dynamically linked in Python at runtime. This allows the IPU build of snnTorch to be syntactically near-identical to CPU/CUDA-based usage, abstracting away machine-level complexities from the user. The surrogate gradient operator is co-located in the same IPU core, which reduces the impact of the non-modular function calls that are needed when overriding the autograd module in PyTorch. This is sequenced via pseudo-code in algorithm 1 and illustrated in figure 3. Specifically, (3) is a non-differentiable function. This function is replaced in the backward pass with the user's choice of approximation. For example, a straight-through estimator simply bypasses the non-differentiable operator [61]. Alternative approaches use functional approximations of the Heaviside operator by smoothing out the discontinuous step, e.g. the fast-sigmoid function:

z ← z̃ = (u − u_thr) / (1 + k|u − u_thr|),

where the left-arrow denotes substitution, the tilde in z̃ represents an approximation, and k modulates the smoothness of the approximation.
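Conceptually, surrogate gradient descent means the forward and backward passes use different functions. A framework-agnostic sketch (the slope constant k = 25 is an assumption for illustration, not a value reported in this paper):

```python
def heaviside(u, u_thr=1.0):
    """Forward pass: eq. (3). Its true derivative is zero almost
    everywhere, which would kill all gradient flow."""
    return 1.0 if u > u_thr else 0.0

def fast_sigmoid_grad(u, u_thr=1.0, k=25.0):
    """Backward pass: derivative of the fast-sigmoid surrogate,
    1 / (1 + k|u - u_thr|)^2, substituted for the Heaviside derivative."""
    return 1.0 / (1.0 + k * abs(u - u_thr)) ** 2
```

The surrogate is largest at the threshold and decays away from it, so neurons close to firing receive the strongest learning signal; a straight-through estimator would instead pass gradients through unchanged.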

Network architecture
For this paper, two network types were tested on four different types of hardware: the NVIDIA A100, NVIDIA V100, NVIDIA GTX 1080, and the Graphcore IPU Mk2. The networks tested are designed to fit on a single processor to avoid comparisons that are I/O-limited, as large networks (and datasets) need to be moved in and out of memory, where the throughput will be limited by the memory controller. The architectures include a three-layer dense SNN (DSNN) and a three-layer convolutional SNN (CSNN). Despite the small size of the networks, these were trained over multiple time steps, which led to near-full memory utilization. Leaky integrate-and-fire neurons are used for all experiments unless otherwise specified, and most spiking simulations are performed across 25 time steps. For experiments measuring throughput, the MNIST dataset is used in the interest of speed [62]. For experiments that account for loss-based metrics (e.g. accuracy), CIFAR-10 is used [63]. The various architectures used are specified in table 1, where 5C12 refers to a 5 × 5 convolutional kernel with 12 channels and MP2 refers to a 2 × 2 max-pooling operator. Unless otherwise specified (e.g. in experiments that sweep across different architecture parameters), these networks are used for the experiments that follow, with the AdamW optimizer used in all cases [64]. The size of these models is limited by the memory available on a single IPU processor. In general, many SNNs are developed with edge-based architectures in mind, and are representative of the lightweight models often reported in the neuromorphic literature [44,45,65,66]. As models scale up in size, the performance gap between IPUs and GPUs will decrease until the GPU VRAM capacity is reached. Scaling models further beyond that point will see performance that is heavily dependent on how processor cores are interconnected. Where relevant, experiments were repeated five times to generate error bars.

Experimental results
The following experiments were conducted to benchmark IPU performance, beginning with baseline FLOPS (floating point operations per second). All experiments that follow account for the entire training process using BPTT, including the forward pass, gradient calculation, and weight update.

Baseline FLOPS
Before performing IPU versus GPU performance comparisons, we first assess the performance of a spiking network against equivalent, non-spiking artificial neural networks (ANNs) on the IPU. One FLOP is defined as one fused multiply-add floating point operation, calculated using the fvcore Python library. The FLOPS comparison can be seen in table 2. On average, the IPU improves FLOPS by 4.6× when compared to the A100, 6.4× over the V100, and 10× over the GTX 1080. Interestingly, the performance of the spiking network is marginally better for the dense case than the non-spiking network. This may be because the IPU has been optimized for handling different types of concurrent operations, where processing neuron state-based computations is a relatively simple operation compared to large-scale matrix-vector multiplication. On the other hand, the TFLOPS when running the convolutional SNN drops by approximately 57% from non-spiking to spiking networks on the IPUs.
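For intuition on what is being counted, a hand-rolled estimate for a dense spiking layer follows the fused multiply-add definition above (the layer and batch sizes are hypothetical, chosen only for illustration):

```python
def dense_layer_flops(in_features, out_features, batch_size, time_steps):
    """One fused multiply-add per weight, repeated for every sample in
    the batch and for every simulated time step of the SNN."""
    return in_features * out_features * batch_size * time_steps

one_step = dense_layer_flops(784, 500, batch_size=128, time_steps=1)
full_window = dense_layer_flops(784, 500, batch_size=128, time_steps=25)
```

The time-step multiplier is the key difference from the non-spiking case: an equivalent ANN layer pays the 784 × 500 cost once per sample rather than once per time step.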

Baseline throughput
Throughput is measured in thousands of images per second, and accounts for the wallclock time commencing from the forward pass, through the backward pass, and concluding once the weight update is completed. A batch size of 128 images is used by default. Each network is trained over 60 epochs. To obtain error bars, this is repeated 20 times on each hardware platform to obtain a measure of standard deviation. This variance is larger when training on the IPUs for both DSNN and CSNN workloads as compared to all GPUs. This is likely because GPUs are less influenced by dynamical sparsity than IPUs are. Dynamical sparsity, along with neuronal firing rates, is a function of software-level random processes, such as random weight initialization and stochastic batching.
The throughput is calculated by:
• measuring the wallclock time to process one minibatch,
• dividing the batch size by the wallclock time.
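The two steps above reduce to a one-line calculation; the numbers in the usage example are placeholders, not measured results:

```python
def throughput_images_per_s(batch_size, wallclock_s):
    """Images processed per second for one minibatch, timed from the
    start of the forward pass to the end of the weight update."""
    return batch_size / wallclock_s

# e.g. a 128-image minibatch completing in half a second
rate = throughput_images_per_s(128, 0.5)
```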

DSNN throughput
The results from the DSNN are tabulated in table 3. The IPU can train an average of 46 297 images per second, which is 3.1× higher than the A100, 6.4× higher than the V100, and 9.9× higher than the GTX 1080. Error bars across multiple trials are illustrated in figure 4(a). The standard deviation for the IPU is approximately 3623 images.

CSNN throughput
With respect to the CSNN (table 4), there is a much larger number of computations being performed, leading to a decrease in throughput for both IPUs and GPUs. The IPU training throughput is 15 566 images per second. This is 2.1× more than the A100, 4× higher than the V100, and 5.9× greater than the GTX 1080. The standard deviation is 1069 images per second. The performance margin between the DSNN and CSNN indicates that the IPU has been optimized for high memory usage. This is useful for experiments that require traces of membrane potential to be stored as with BPTT, for pre- and post-synaptic current traces as with STDP [67], and also for dynamically varying synapses.

Throughput across batch size
As networks increase in size, memory limits constrain the maximum permissible batch size. This problem is exacerbated in SNNs, which also consume memory for each additional simulated time step.
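The interaction between batch size and simulated time steps can be sketched as a simple activation-memory budget (the sizes below are illustrative, not the architectures of table 1):

```python
def bptt_activation_bytes(neurons, time_steps, batch_size, bytes_per_value=4):
    """BPTT must retain every membrane state: batch x T x n values."""
    return neurons * time_steps * batch_size * bytes_per_value

# doubling the number of time steps halves the affordable batch size
budget_a = bptt_activation_bytes(1000, time_steps=25, batch_size=128)
budget_b = bptt_activation_bytes(1000, time_steps=50, batch_size=64)
```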
To measure this effect, the batch size was swept from 8 to 128, with throughput results shown in figure 4(b). On inspection, there is far less variance in performance for IPUs. This is especially important where a large number of time steps must be simulated and the maximum batch size decreases. Close attention is given to the smallest tested batch size, as real-world batch sizes in continual learning workloads are often 1. For the smallest tested batch size (n = 8), the performance improvement of the IPU over the A100 for both the CSNN and DSNN is more than one order of magnitude (14×).

Throughput across architectures
Network architecture is varied for both the DSNN and CSNN and throughput is measured.For the DSNN, the number of neurons in the hidden layer is increased, and for the CSNN, the kernel depths of the first two convolutional filters are increased.

DSNN throughput
GPUs are completely insensitive to increasing the number of neurons, as shown in figure 5(a). This indicates that, for a small network, a large number of the available cores are underutilized. On the other hand, the margin of improvement with the IPU increases with smaller networks. This is because different operations can be parallelized to improve utilization of the large number of IPU cores available.

CSNN throughput
The throughput of varying CSNN architectures is illustrated in figure 5(b). In contrast to DSNNs, larger convolutional kernels decrease the throughput of GPUs. The larger number of computations involved in convolutions indicates that the GPU cores are now fully utilized.

Alternative neuron models
Several other spiking neuron models are increasing in usage in the context of SNNs. Recurrent spiking neuron models, e.g. (7), have been shown to achieve better performance on datasets with temporal complexity [65]. Current-based neuron models, e.g. (5) and (6), are better suited for learning precise spike timing, as the membrane potential trace is differentiable with respect to time. Throughput was also benchmarked for these alternative neuron models. The performance of V100s remains relatively unaffected by more complex neuron models (or more complex computational graphs in the case of RLIFs), which causes the performance gap with IPUs to narrow. It is likely that the GPUs show no significant change in performance with respect to the LIF baseline because they have underutilized memory resources, whereas IPUs aim to fully utilize all cores. This highlights a potential opportunity to improve resource allocation during compilation. There are more sequential steps required to process more exotic neuron models and computational graphs, and so more cores are allocated to handling those operations. This comes at the cost of fewer resources available to process synaptic operations, where computational complexity scales with O(n²).

Spiking vs. static dynamics
To verify the above theory, the ratio of time spent calculating the dynamics of spiking neurons (i.e. solving (3) and (4)) is compared against the amount of time spent on matrix-vector multiplication. The results are shown in figure 7(a), demonstrating that IPUs provide a better balance between neuronal and synaptic operations. In the CSNN, the amount of compute time allocated to solving state-driven dynamics is exactly equivalent to the duration of time spent on synaptic operations. This is beneficial for simple neuron models, but where more complex neurons are concerned, may require further optimization at compile time. Further improvements could be obtained by exploiting function outlining, which merges repeatable code-blocks for execution on identical cores in IPUs. This can reduce the overhead allocated to solving state dynamics, and free up more cores to run synaptic operations.

Mixed precision performance
Mixed-precision training reduces the bit-width needed for computation, which comes with an associated wallclock time reduction, often with minimal, if any, impact on network performance. The default full-precision (32b) mode is compared to half-precision (16b) training, with results shown in figure 7(b). The difference becomes non-negligible only for IPUs and A100s on convolutional SNNs, which have a significantly higher operation count than dense architectures. All other training instances show marginal improvement, because gradients continue to be calculated in full precision.
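From a storage standpoint alone, the saving is easy to sketch; whether it translates into wallclock gains depends on the hardware's half-precision datapaths, as the results above show (tensor size is illustrative):

```python
def tensor_bytes(num_values, precision_bits):
    """Memory footprint of a tensor at a given numeric precision."""
    return num_values * precision_bits // 8

fp32 = tensor_bytes(1_000_000, 32)  # full-precision activations
fp16 = tensor_bytes(1_000_000, 16)  # half precision: half the traffic
```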

Biological plausibility
At present, the most common approach to determining the predicted class is to select the neuron with the highest firing count. This is equivalent to using a rate-coded SNN. In neurophysiology, it is thought that rate codes alone cannot be the dominant encoding mechanism in the primary cortex. One of several reasons is that the background neuronal firing rate is roughly 0.1-1 Hz, which is far slower than the reaction time of animals and humans [68,69].
But if multiple neurons are grouped with their collective spikes counted cumulatively, then it becomes possible to measure a firing rate for a population of neurons in a very short window of time. Assigning a population of neurons to individual classes is also known as using a 'population code'. Population coding adds credibility to the biological plausibility of rate-encoding mechanisms.
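A sketch of the pooled readout (pure Python; the class count and population size are illustrative): spike counts are summed per population before taking the maximum.

```python
def population_predict(spike_counts, neurons_per_class):
    """Pool output spike counts into per-class totals, then pick the
    class with the highest pooled firing count."""
    totals = [
        sum(spike_counts[i:i + neurons_per_class])
        for i in range(0, len(spike_counts), neurons_per_class)
    ]
    return max(range(len(totals)), key=totals.__getitem__)

# 2 classes x 3 neurons each; class 1 fires more in aggregate
pred = population_predict([1, 0, 2, 3, 1, 1], neurons_per_class=3)
```

With neurons_per_class = 1 this reduces to the standard rate-coded readout described above.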

Population codes in unsupervised learning
In the past, it has been common practice to increase the number of neurons at the output layer of a network and cluster the responses of various neurons together. This practice has largely been limited to networks trained using spike timing-dependent plasticity. Neurons would be assigned classes based on which assignments led to the highest accuracy. As such, using a population code where multiple neurons were assigned per class typically led to a boost in classification accuracy in unsupervised learning tasks, because more neurons means more permutations of neuron assignments that can increase accuracy. The shift from unsupervised learning to gradient-based supervised learning has made population codes a diminishing practice when training SNNs, as targets are pre-assigned before training commences. We find that using population codes offers alternative benefits when training SNNs, and others have shared similar findings [70].

Population codes in gradient-based learning
These benefits are grounded in the fact that accelerators are optimized for parallel operations rather than sequential operations. Using a population of neurons redistributes the time cost over space instead, i.e. larger-dimension matrix-vector multiplications can be used instead of repeatedly applying matrix-vector multiplications with smaller dimensions. We run a series of experiments to show that population codes further accelerate throughput with a marginal impact on accuracy. Because accuracy is now of interest, we assess performance on the CIFAR-10 dataset, as MNIST is broadly recognized as being too simple.
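The redistribution can be made concrete by counting multiply-accumulates at the output layer (the sizes are illustrative): a rate code pays for T small sequential products, while a population code of size P = T pays the same total cost in one wide, vectorizable product.

```python
def macs_rate_code(hidden, classes, time_steps):
    # T sequential, small matrix-vector products
    return time_steps * hidden * classes

def macs_population_code(hidden, classes, population_per_class):
    # one larger matrix-vector product, executed in a single step
    return hidden * classes * population_per_class

rate = macs_rate_code(500, 10, time_steps=25)
pop = macs_population_code(500, 10, population_per_class=25)
```

The arithmetic budget is identical, but the population version exposes all of it to the accelerator at once instead of serializing it across time steps.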

Experimental setup
Network architectures similar to those described in table 1 are used. For the DSNN, the number of input neurons is increased to account for the larger dimensionality of CIFAR-10 (32 × 32 × 3) relative to MNIST (28 × 28 × 1). The same holds true for the terminal layer of the CSNN.
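For reference, the resulting input-layer sizes follow directly from the flattened dataset dimensions:

```python
# Flattened input sizes implied by the dataset dimensions above
mnist_inputs = 28 * 28 * 1    # MNIST: 784 input neurons
cifar10_inputs = 32 * 32 * 3  # CIFAR-10: 3072 input neurons
print(mnist_inputs, cifar10_inputs)  # 784 3072
```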

Training throughput
A comparison of throughput across varying numbers of output neurons and different precisions (half and full) is shown in figure 8. A single simulation time step is used. Performance follows a trend very similar to that of varying network architectures in figure 5, where GPU performance remains unchanged as the output population increases. At its best, the IPU reaches an optimal throughput of approximately 145 000 images per second. This is 37× better than the original baseline performance (despite using larger images with three channels), and approximately twice as fast as the best GPU. At its lowest throughput, the performance of the IPU and the A100 converges in the DSNN experiment. When population codes are applied to CSNNs, A100 performance skyrockets and becomes invariant to architectural changes. The large number of terminal synaptic operations dominates the total cost of the network, completely outweighing state-based neuronal operations. This places the A100 in the lead in population-based CSNN benchmarks.

Accuracy
As a coarse-grained measure of accuracy, the DSNN model was used to gauge how population codes impact training performance. The DSNN is trained over 5 epochs to determine whether it is possible to train networks in a single time step, where each neuron is constrained to fire at most once. Results are illustrated in figure 9, where a baseline accuracy of 52.2% is obtained without using population codes (i.e. 10 output neurons simulated over 25 time steps). This accuracy is almost reached when 500 output neurons are used, assigning 50 output neurons per class. Notably, indefinitely increasing the output neuron count does not continue to improve performance. Based on prior approaches to constructing models, network depth should be increased commensurately with network width to avoid leaning towards either end of the bias-variance trade-off [71].
We note that the target here is not state-of-the-art accuracy, but rather to assess whether single time-step learning is possible at all. Our results indicate that, in this specific case, performance similar to that of multiple time steps can be met by using population codes, verified on a simple DSNN architecture. This may degrade when scaling up to deeper architectures without additionally extending the width (or number of channels) of hidden layers, though the work in [70] shows how this may scale to larger architectures. The work presented here follows a similar approach in using population codes to that of Wu et al [70], with some minor differences. Rather than randomly assigning labels to the multiple output neurons, our approach indexes them evenly. Furthermore, we show there is a limit to how far population coding can go, potentially a result of overfitting. This may be less of an issue in the era of large-scale datasets, where each sample of data is only ever presented to a model once.
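One plausible reading of this even indexing, sketched with randomly generated single-step output spikes (neuron `i` is assigned to class `i % n_classes`; all shapes and spike probabilities below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
batch, n_classes, pop = 32, 10, 50  # 500 output neurons, 50 per class

# Hypothetical single-step binary output spikes for a batch of samples
spk = (rng.random((batch, n_classes * pop)) < 0.2).astype(float)

# Evenly indexed assignment: neuron i belongs to class i % n_classes,
# so class scores are spike counts pooled over each class's population
scores = spk.reshape(batch, pop, n_classes).sum(axis=1)
pred = scores.argmax(axis=1)
print(scores.shape, pred.shape)  # (32, 10) (32,)
```

Because the pooling is a fixed reshape-and-sum, it adds negligible overhead while letting the cross-entropy loss operate on one score per class as usual.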

Outlook and conclusion
SNNs and conventional neural networks have overlapping features that can be concurrently optimized, and IPUs have demonstrated promising suitability for most operations characteristic of training SNN workloads. We flip the conventional approach to ASIC-driven SNN training by tailoring pre-compiled microcode to a domain-specific accelerator, rather than reconfiguring neuromorphic chips to handle backpropagation approximations on fixed network architectures. Our results show that promising performance gains (throughput, TOPS/W, accuracy) can be made over GPUs by porting the advances of deep learning accelerators to SNNs. We also identify the types of hardware optimizations that benefit gradient-based learning in SNNs, such as MIMD processing, functional outlining, and balanced compilation of neuronal and synaptic operators, and show how population encoding can better utilize parallelism across both IPUs and GPUs. Together, these features enable high training and inference speeds for SNNs on IPUs. As model sizes rapidly scale up, we note the small-scale limitation of the models shown here. At present, edge-based hardware is limited to similarly sized models [72], which matches the upper memory limit for training a model on a single IPU. But it will become increasingly important to demonstrate how long-range data movement impacts such models as multi-accelerator and multi-GPU systems take center stage. Large-scale, bandwidth-limited training workloads across different technologies are a subject for future work.
Finally, we note that this work is portable to various other Python libraries that emulate dynamical systems and SNNs.SNN models can be converted to and from snnTorch via the Neuromorphic Intermediate Representation [73]: for example, they can be trained on IPUs, and exported to a format compatible with Intel's Loihi neuromorphic processor [32].
All code used to generate these results is made openly accessible to enable the research community to accelerate their own custom SNNs on IPUs, and can be installed via PyPI. Population encoding has been integrated into snnTorch, with a corresponding interactive notebook that enables users to train population-encoded SNNs on both IPUs and GPUs alike.

Figure 1. Mapping neural networks to hardware. (a) Dynamics of a spiking neuron model. (b) The computational graph of the neuron is unrolled over time to enable compatibility with the BPTT training algorithm. (c) SIMD/SIMT is used to perform parallel computations for one layer at a time on GPUs. (d) MIMD is used to distribute layers and custom activations, such as spiking dynamics, across IPU cores to improve concurrency. Layer-to-memory maps are color-coded.

Figure 3. Data path of input tensors on GPU and IPU. (a) GPU: one instruction is applied to all elements of an input tensor while spiking neuron state and surrogate gradient computations are stalled in the instruction pipeline. (b) IPU: spiking neuron state and surrogate gradient computations are pre-compiled into machine-level codelets and concurrently processed with the neural network (NN) matrix-vector multiplication step.

Figure 4. (a) Baseline throughput for DSNN and CSNN. The error bars shown in red indicate the standard deviation of throughput. (b) Throughput with varying batch size. By varying the batch size, it can be seen that IPU throughput remains consistently high, whereas the conventional GPUs show a wider fluctuation range. The legends describe the GPU and SNN used.

Figure 5. Throughput with varying network architectures. (a) DSNN with increasing network width. (b) CSNN with increasing convolutional kernel depth. For (N1, N2), N1 corresponds to the depth of the first-layer kernel, and N2 is the depth of the second-layer kernel.

Figure 9. Accuracy performance with population coding.
At present, these methods are constrained to fixed neuron models and network architectures, and are not yet generalized to convolutional networks. Despite these limitations, such methods offer a promising alternative for online deployment of BPTT-like training of SNNs to what we propose here. Rather than taking BPTT to processors optimized for SNNs, we use IPUs to compile and train SNNs using accelerators optimized for backpropagation.