Inferring population dynamics in macaque cortex

Objective. The proliferation of multi-unit cortical recordings over the last two decades, especially in macaques and during motor-control tasks, has generated interest in neural ‘population dynamics’: the time evolution of neural activity across a group of neurons working together. A good model of these dynamics should be able to infer the activity of unobserved neurons within the same population and of the observed neurons at future times. Accordingly, Pandarinath and colleagues have introduced a benchmark to evaluate models on these two (and related) criteria: four data sets, each consisting of firing rates from a population of neurons, recorded from macaque cortex during movement-related tasks. Approach. Since this is a discriminative-learning task, we hypothesize that general-purpose architectures based on recurrent neural networks (RNNs) trained with masking can outperform more ‘bespoke’ models. To capture long-distance dependencies without sacrificing the autoregressive bias of recurrent networks, we also propose a novel, hybrid architecture (‘TERN’) that augments the RNN with self-attention, as in transformer networks. Main results. Our RNNs outperform all published models on all four data sets in the benchmark. The hybrid architecture improves performance further still. Pure transformer models fail to achieve this level of performance, either in our work or that of other groups. Significance. We argue that the autoregressive bias imposed by RNNs is critical for achieving the highest levels of performance, and establish the state of the art on the neural latents benchmark. We conclude, however, by proposing that the benchmark be augmented with an alternative evaluation of latent dynamics that favors generative over discriminative models like the ones we propose in this report.


Introduction
The primary motor cortex of rhesus macaques comprises tens of millions of neurons [12], so inferring the firing rates of any one neuron on the basis of a hundred others' might seem quixotic.However, the state space of the body parts that motor cortex controls is much lower dimensional-on the order of hundreds of variables [28].And indeed, neural activity recorded by the Utah array [23], which spans essentially the entire anteroposterior extent of the precentral gyrus, is correlated-at least during the reaching tasks typically employed in motor-control experiments [7,29].
These observations have motivated the idea of "population dynamics," i.e. a (possibly lossy) description of the dynamics of motor cortex as a whole, or at least those neurons picked up by a typical microelectrode array.From this point of view, identifying population dynamics can be seen as a form of dimensionality reduction [7,9].More generally, the notion of a latent dynamical state maps naturally onto latent-variable generative models, and accordingly many attempts to capture neural dynamics have taken this form [34,1,2,15,22].
Nevertheless, ground truth for evaluating models of population dynamics remains the ability to predict the activity either of the observed neurons, but at future times, or of neurons altogether unobserved.Following [26], we refer to these as "forward prediction" and "co-smoothing," respectively.Still, adjudicating among different models has long been vexed by the use of proprietary data sets and idiosyncratic performance metrics.Recently, the Neural Latents Benchmark (NLB) was introduced to solve these issues [26], proposing a single measure of performance-essentially the cross-entropy of a model that predicts "instantaneous" firing rates of unobserved neurons-and assembling four benchmark neural data sets.The data sets all consist of spiking activity recorded from single units with microelectrode arrays implanted into sensorimotor cortices (broadly construed) of rhesus macaques, who were performing motor or (in one case) cognitive tasks (see Methods).
Here we show that standard recurrent neural networks (RNNs) achieve state-of-the-art performance on the benchmark.Surprisingly, these RNNs outperform "bespoke" models designed with neural data in mind, as well as more recent neural-network architectures like transformers.Furthermore, the RNN is not properly a latent-variable model; it is discriminative, rather than generative, and we train it essentially to map the observed to the held-out spiking activity.
We also show that the use of a transformer-encoder layer (i.e., with self-attention) [32] can in fact improve performance over the "standard" RNN, but only when used after layers of RNN, rather than in the transformer architecture.To our knowledge, this hybrid architecture is novel.Pure transformer-based models, in contrast, whether in our hands or the hands of other groups [33], cannot achieve the level of performance of RNNs.Altogether, these results suggest that, at least on data sets of the size found in the NLB, the autoregressive bias imposed by RNNs is worth the price of a less flexible model; but some of that flexibility can be usefully regained by augmenting an initially recurrent network with a layer of self-attention.

Results
Our goal is to construct models that capture the underlying dynamics of a population of neurons.Following the Neural Latents Benchmark (NLB) [26], neurons are taken to be a part of the same "population" if, roughly, all are located in the same Brodmann area and can be transduced with a single microelectrode array.The neural dynamics have been "captured" if the model can infer the spiking activities of some neurons in the population based on contemporaneous recordings from others ("co-smoothing"); or again can predict the future spiking activity of the entire population of recorded neurons ("forward prediction").In particular, to quantify performance, each model is given a short (∼500-1500-ms) sequence of spike counts across ∼50-150 neurons, and tasked with inferring the contemporaneous sequence of spike counts from a held-out set of neurons (about one third the size of the held-in set) simultaneously recorded by the array.The same model is also tasked with predicting future spike-count sequences, for the unobserved as well as the observed neurons (see Methods).
We trained all models identically-in short, to map the observed spike sequences to the unobserved spike sequences, both contemporaneous and future.There is one wrinkle: about 25% of the input bins were randomly set to zero ("masked") on each full pass through the data, and the model was also required to predict the spike counts in these masked bins.We explore the effect of this masking below.The training procedure is described in detail in the Methods.

Comparing RNN F, Trans F, and TERN
We consider three basic architectures (Fig. 1), the first two (RNN F, Trans F) being essentially standard time-series models in modern machine learning, albeit with a particular masking technique (see Methods).Our hypothesis is that state-of-the-art performance can be achieved with general-purpose architectures, not specifically designed for neural data.However, the RNN F and Trans F make different assumptions about the data; or, more precisely, they impose different biases: RNNs assume that temporal correlations in the data are mostly short-range, whereas the transformer is agnostic about which samples across the entire sequence are correlated with each other.A single layer of an RNN can be thought of as a first-order, nonlinear, IIR filter-although we use gated units and multiple bidirectional layers, which allow for higher-order, noncausal filtering.In contrast, a single layer of a transformer is essentially an arbitrarily powerful, non-causal, FIR filter.(See Methods for complete description of the models.)Nevertheless, in theory, either model could, with sufficient data, learn to model any time series; but we are operating in a "small-data" regime, where there is a considerable trade-off between bias and variance (see also the Discussion at the end of this report).
As we show below, the RNN F consistently outperforms the Trans F, suggesting that most correlations in the data are short-range.Still, we cannot rule out the existence of at least some long-range correlations, and accordingly we also investigate a third, hybrid architecture, consisting of an RNN (imposing a bias toward short-range dependencies), followed by a layer with self-attention (capable of finding long-range dependencies).We call this model TERN (Transformer Encoder over Recurrent Network; see Model Architectures in the Methods for full description of all three architectures).
Qualitative.Fig. 2 shows qualitative results for selected trials.In particular, Fig. 2a-d illustrates decoding from trained instances of all three architectures, on hand-picked neurons and trials from the test partition of the data.The models predict higher firing rates for regions with a high number of spikes and conversely, as expected.More importantly, predictions from the Trans F are noticeably less smooth than the other two models, presumably because it lacks the autoregressive bias of RNNs.A close inspection suggests TERN is subtly smoother and more accurate even than the RNN F, but this needs to be quantified (see next).(We also generate condition-averaged firing-rate predictions, i.e. model "PSTHs"; see Supplementary Figs. 3, 4, and 5.) Fig. 2e-g shows planar hand velocities that have been decoded (see Methods) from the models' firing-rate predictions.(The DMFC RSG data set does not involve arm movements.)More precisely, we fit a ridge regression on the training partition at each time step, mapping inferred firing rates for both held-in and held-out neurons to hand velocities.The velocities and their predictions are plotted here for handpicked trials from the testing partition.[33].(c) TERN (Transformer Encoder over Recurrent Network) is a multi-layer, bidirectional RNN followed by a single layer of a transformer encoder, again terminating in a (static) feedforward layer.Input and output data for all architectures are the same.
Qualitatively, we observed that the firing rates inferred from RNN-based models (RNN F, TERN) produce smoother and more accurate trajectories than the Trans F.
Quantitative: co-bps, fp-bps, and velocity R 2 .To quantify these apparent differences, we consider the 10 best instances (out of 125) of each architecture generated by our hyperparameter-optimization scheme (see Methods).Briefly, the "best" models are those that give the best inferences of unobserved firing rates, in terms of both co-smoothing and forward prediction.These inferences are evaluated on a "validation" partition of each data set, whereas the results (Fig. 3a, Fig. 4 and Table 1) are reported on a disjoint (local) test partition.
We consider first the ability of each model to infer the firing rates of unobserved neurons from the contemporaneous firing rates of observed neurons (co-smoothing [26]).Following the NLB, we ask how many bits per spike each model saves over a baseline model that simply predicts the average firing rates over thde whole trial (see Methods)-the so-called co-bps (Fig. 3a).We observe that the RNN F performs significantly better than Trans F on all four data sets.We also note that training of the Trans F, although faster (as we expect,  hyperparameter-optimization procedure was identical for all architectures.)Fig. 3a also shows the co-smoothing rates for our hybrid RNN/transformer model, TERN.Like the pure RNN, it significantly outperforms the Trans F on all data sets.Furthermore, TERN is generally superior to RNN F; the advantage is statistically significant for the MC Maze, MC RTT, and Area2 Bump datasets.Since TERN is essentially the RNN F plus a layer of a transformer encoder [32] (see Fig. 1), it might seem that its performance advantage consists simply in having more parameters, but this is not the case: The best instances of TERN found by our hyperparameter optimization employ one or two layers of RNN, whereas the best RNN F models use two to four layers of RNN.
Next we turn to the models' forward predictions, i.e. their ability to predict firing rates of observed and unobserved neurons at time points beyond the last observed sample (Fig. 4).Like co-smoothing, we quantify this ability in terms of the number of bits per spike (fp-bps) that each model saves over a baseline model that simply predicts that each neuron fire at its average rate over the entire set of future samples.For two of the data sets (MC Maze, MC RTT), we find similar results: the RNN-based models are evidently superior to the Trans F for most time points, for both observed (top row) and unobserved neurons  (bottom row).For simplicity, we quantify this statistically only for the comparison of TERN to the Trans F (broken red lines; TERN is superior for 100% [MC Maze] and 95% [MC RTT] of time points for observed neurons and 85% and 60% for unobserved neurons).
For the other two data sets (Area2 Bump, DMFC RSG), TERN is superior to the Trans F only about half the time in predicting observed neurons (50.0%[Area2 Bump] and 40.0%[DMFC RSG] of time points), and less still for unobserved neurons (27.5%, 22.5%).We suspect that the Trans F is more competitive on forward prediction than co-smoothing because long-range dependencies are more useful for the former: a transformer can use the observations at the last sample to predict directly firing rates 40 samples away (the end of the forward-prediction task), whereas an RNN must propagate information from these observations through all 40 steps.Nevertheless, this is sufficient to match performance of RNN-based architectures (RNN F and TERN) only on data sets that lack strongly autocorrelated motor-control tasks: Area2 Bump, in which the arm is unpredictably perturbed; and DMFC RSG, a primarily cognitive task.
Next we consider velocity decoding (Table 1).As a baseline, we consider predictions made from the inputs, rather than the outputs, of the models.That is, we ask how well the spike counts of the held-in and held-out neurons can predict velocities through a linear (technically, affine) model.Because the regression is fit on a sample-by-sample basis, predictions from raw spike counts are very poor (Table 1, final column), even though the model has access to ground truth data.Temporally smoothing the spike counts first [26] (penultime column) greatly improves velocity decoding.Nevertheless, the neural networks make much better predictions (second, third, fourth columns), even without access to the held-out neurons, because they incorporate information from multiple samples into a single estimate of the instantaneous firing rate at each time point.Furthermore, and as expected from Fig. 2e-g, TERN is superior to the Trans F on all data sets (p < 0.0005, Wilcoxon rank-sum test, Holm-Bonferroni corrected for two comparisons).TERN is likewise superior to the RNN F on MC Maze and MC RTT (p < 0.005), albeit by a smaller margin.(Its apparent superiority to the RNN F on Area2 Bump is not statistically significant.)The mean coefficient of determination (± standard deviation) for velocity predictions is given for each data set, across the best ten instances of each architecture.Predictions are made with an affine transformation of the firing rates of both held-in and held-out neurons that have been inferred by the models.The final column quantifies predictions made with an affine transformation of the inputs, rather than outputs, of the models; i.e., the spike counts rather than the inferred firing rates.
Finally, we note that during training of our models, the validation loss converged in fewer epochs for the RNN F than for the Trans F and TERN (not shown).Together, all these results strongly suggest that the bias imposed by recurrent neural networks toward short-term correlations is well matched to the population dynamics of neurons in monkey sensorimotor cortex.For the number of data in the Neural Latents Benchmark (see Table 1 in the Supplementary Material), this bias generates superior results to the more flexible transformer.As we have seen, co-smoothing can be improved further still with the addition of a self-attention layer after RNN layers, but not without the latter.

Comparison to other models
In addition to the basic RNN F and Trans F architectures and our TERN, many other models have been proposed for neural population dynamics, and tested on the Neural Latents Benchmark.To compare our models against these, we submit them to the NLB website, where they are tested according to the same metrics as those employed in the previous section, but on a "remote" set of test data inaccessible to users except through this submit/test procedure.(Performance can consequently differ by a small amount from the numbers reported above.)Fig. 3b shows co-smoothing scores for the best instance of each of our three architectures, along with the top published models on the NLB leaderboard.(We have excluded ensemble models, including our own.)To begin with, we note that our best Trans F models performed about the same as or better than the best Neural Data Transformer (NDT) constructed by another group, which suggests that our analyses apply to transformers more generally and not just our implementation.(Where our instantiations are superior, we attribute it to better hyperparameter optimization.) More importantly, our simple (GRU-based) RNN F evidently provides a better model of neuron population dynamics than the complex variants popular in the neuroscience literature, including the NDT, Gaussian-process factor analysis (GPFA) [34] and the latest version of LFADS (latent factor analysis via dynamical systems, a variant on the variational autoencoder) [17].This holds across all data sets except DMFC RSG, the data set with a cognitive task, where performance is approximately matched by AutoLFADS and MINT [27].Adding post-RNN self-attention-i.e., the TERN architecture-improves performance further still.In fine, the TERN architecture establishes state-of-the-art performance on the NLB, with the RNN F not far behind.(Standard errors are available for some of these models from [26]; see Table 2 in the Supplementary Material.)

Data-dependence of results
The point of the Neural Latents Benchmark is to establish a uniform and general evaluation of models of neural dynamics [26].Nevertheless, some facts about the benchmark data sets could be seen as arbitrary.Here we consider how the results presented heretofore are affected by two such facts: the size of the data sets, and the resolution of the spike counts (5-ms bins).
The effect of reducing training data.We have seen that a model with high autoregressive bias, such as an RNN, is well suited for capturing neuron population dynamics, at least on the Neural Latents Benchmark-indeed, better than other architectures including the transformer.But the competitiveness of the less-biased transformer is expected to increase with number of training data.Accordingly, we examine how our results are affected by the number of data available for training.We analyze the MC Maze dataset, since it is the largest (most trials).Keeping the test data the same as that of the original MC Maze dataset, we optimized and trained all three architectures with four different subset sizes of the original dataset (150, 300, 600, and all ∼1250 trials).The results are shown in Fig. 5a-c.As expected, performance generally increases with number of data for all architectures.But the models with high autoregressive bias (RNN F and TERN) are consistently significantly better than the Trans F in terms of co-bps, fp-bps, and velocity prediction, independent of the number of training data.The gap between the more flexible transformer and the RNNs, which are biased toward short-range correlations, is expected to close at some number of data (at the point where reducing bias is more important than reducing variance).And indeed, the gap between the Trans F and the RNN F appears to have reduced by 1250 training trials.But the gap between the Trans F and the hybrid TERN mode does not appear to be narrowing over the range of data in even this, the largest NLB data set.
The effect of bin size.For simplicity, we analyze only MC RTT.In particular, we resample the spike counts at 10, 15, 20, 25, and 30 ms (in addition to the original 5-ms bins), and then re-optimize models of all three architectures from scratch (see Methods).Fig. 5d shows the consequences for co-smoothing (co-bps, top row) and velocity predictions (coefficients of determination in bottom row).In fine, the results are essentially unchanged: The RNN-based models outperform the Trans F at all resolutions, and the hybrid TERN is typically superior to the pure RNN.

The effect of masking
A central contention of this report is that general-purpose RNNs, when properly optimized, perform as well-indeed, typically better-than more highly engineered models.Admittedly, however, our RNN training procedure does contain one "non-standard," albeit increasingly popular, component: dynamic masking of inputs [21].Here we assess the impact of this component by training models with (1) the random point mask used throughout, (2) a mask of random "strips" of three consecutive time steps, and (3) no mask at all.Note that in this last case, the output mispredictions are penalized at all samples, including the N in × M in sub-array of held-in neurons; whereas with masking, the only held-in samples the model is asked to predict are those that have been masked out of the input (see Methods).
Again for simplicity we focus on just MC RTT and the TERN architecture.Evidently (Supplementary Fig. 2), dynamic masking significantly improves co-bps and velocity predictions (up from Trans F-level performance; cf.Table 1).The latter in particular is consistent with the fact that masking requires the model to "fill in" samples based on their neighbors, rather than copying them straight through.Smoothing information across time is critical for velocity predictions since each is based on a single time point.On the hypothesis that improved performance results from forcing the model to interpolate across time, we also tested the "strip" masks lately described, which require longer interpolations; but these resulted in performance very mildly (but significantly) inferior to the point mask (Supplementary Fig. 2, first and second bars).

Discussion
RNNs provide state-of-the-art performance on the NLB.The notion of neural "population dynamics" arose as a consequence of multi-unit recordings: neural activity across scores of nearby (∼mm) neurons was discovered to be highly correlated, at least during typical behavioral (especially motor-control) experiments.One (but not the only) upshot is that it should be possible to "fill in" the contemporaneous activity of other nearby, but unobserved, neurons; and accordingly, the ability to do so was made the centerpiece of the Neural Latents Benchmark for modeling population dynamics.The benchmark additionally includes metrics for predicting future neural activity and kinematic parameters (from the inferred, contemporaneous firing rates).
As stated, these are supervised-learning problems: map observed to unobserved firing rates.Our investigation was premised on the hypothesis that general purpose ANNs, in particular RNNs with gated units, can provide high quality solutions to supervised-learning problems for time-series data, especially those that are dominated by short-range temporal correlations.Indeed, this has already been shown for kinematic predictions [11], albeit on a different data set.In this report, we have verified the hypothesis: our RNNs match or outperform all other state-of-the-art architectures on the NLB (Fig. 3b).
We note that one of the other models compared against in Fig. 3b is likewise based on RNNs.LFADS and its variants (in this case, AutoLFADS) use a pair of GRU-based RNNs to "encode" spike trains into a latent space and then decode them back into firing rates.But LFADS is trained as a generative, not a discriminative, model.In practice, generative models are most useful for discriminative tasks when extra, unlabelled data are available.That is not the case in the NLB, which is based on a straightforward discriminative-learning task (map observed neural activity to unobserved neural activity); and consequently it seems likely that discriminative models, like ours, will yield the best performance.On the other hand, the explicit latent space in generative models maps intuitively onto the notion of "population dynamics"; we return to this point below.
Self-attention helps in small doses.In the last five years, transformers have eclipsed RNNs in popularity in several domains, largely on the strength of their performance in natural-language processing.But the temporal dependency structure of natural language is not typical of time-series data in general: in the former, long-distance dependencies are common and there is no consistent short-term dependency structure.(Natural images also contain important long-distance dependencies, because e.g. of edges.)Transformers are ideal for such data because they make no assumptions about the order of their inputs.
In contrast, physical processes, and the time-series data describing them, typically do have consistent short-term dependencies (because they are governed by smooth differential equations).This class includes the movements of the body but also (to a lesser degree) neural activity.If the data were unlimited, these dependencies could be discovered by a transformer, and indeed the variance in parameter estimates incurred for its lack of bias could be driven down to arbitrarily small levels.But the data are not unlimited, particularly in the NLB data sets.Consequently, our RNNs consistently outperform transformers (from both our own and other groups) on the NLB (Figs. 3 and 4, Table 1).
Nevertheless, self-attention can be helpful when the autoregressive bias is maintained.Our TERN architecture, which augments the RNN with a single layer of a transformer encoder, consistently outperforms the basic RNN (Figs. 3 and 4, Table 1).This model sets the state of the art for the benchmark.To our knowledge it is novel, and we consider it to be one of the important contributions of this paper.Indeed, it may have useful applications in other domains: wherever short-distance dependencies dominate, but longdistance correlations also exist.
Still, we ultimately expect experimental recording times to increase, especially as fully sensorized cages and wireless transmission make continuous, days-or weeks-long recording possible.And for the largest data set (MC Maze), our transformers appear to narrow the performance gap with the RNN F as the number of training data are increased (Fig. ??).On the other hand, the gap with the TERN does not decrease.Furthermore, the number of neurons recorded simultaneously will also increase, so perhaps the autoregressive bias will continue to be important for the foreseeable future.
Evaluating models of population dynamics.Finally, we might ask (in a more philosophical key) what it means to have modeled population dynamics.As we have lately noted, the ability to infer unobserved neural activity is just one, and perhaps not the ideal, measure of the ability to capture latent dynamics, since it can be solved-indeed, at state-of-the-art levels-by discriminative models, as we have shown here.Of course, a model that can predict the activity of held-out neurons on the basis of held-in ones must be extracting useful information from the latter!But there are two possible objections: (1) If a (discriminative) model can predict held-out activity while (say) completely ignoring some of the held-in neurons (or some aspect of their behavior, or etc.), it will.(2) The discriminative model does not provide an unconditional description of the latent variables (p(z; θ)).It is possible in theory to assemble samples from this distribution (more precisely, from the "aggregated posterior," i.e. p(z|x; ϕ) averaged under the data, p(x)); but our hidden layers have on the order of 100 units-too large of a space to be sampled with these small data sets.
Alternatively (or in addition), then, one might ask how well a model can reconstruct the observed sequences of spikes (X), subject to a penalty on the size of the latent state (Z), to prevent the model from merely copying observations straight through to the output.More precisely, one could evaluate the sum of the (average) reconstruction error (in bits or nats) and the (average) "latent-code cost" (also in bits), ⟨− log p(X|Z; θ)⟩ p(x)p(z|x;ϕ) + ⟨− log p(Z; θ)⟩ p(x)p(z|x;ϕ) .
These average costs obviously depend on the distribution of observed spike sequences, p(x).But in generative models (like LFADS or GPFA), the "recognition model" [10] or "encoder" [19] is stochastic, so the costs also depend on the probability, p(z|x; ϕ), with which the model assigns spike sequences to latent states Z.
A probabilistic encoder is a feature, not a bug, of such models-it allows for uncertainty about the latent state-and the model should not be penalized for it.Indeed, if we were using the model for compression, we could in principle get these "bits back" [13] by employing some extra stream of data that we wish to communicate as the "random" seeds for draws from the recognition model.That is, in this thought experiment, the "sender" would select a latent code vector z for some input sequence x by passing uniformly distributed data (not noise) through the inverse cumulative distribution function (cdf) for p(z|x; ϕ).He would pass the latent code along with the reconstruction error on to a receiver, at a cost of − log p(x|z; θ) − log p(z; θ) bits.The receiver in turn could reconstruct x from z and the reconstruction error.But she could also recover the extra stream of data by passing the latent codes back through the cdf of p(z|x; ϕ).(Although this procedure was invented as a thought experiment [13], some groups have recently attempted compression along these lines [31,20].) This "refund" is worth − log p(z|x; ϕ) bits, so on average (across all observations and all assignments of latent codes) the total cost of the communicating information with the model is actually This is precisely the "free energy" (a.k.a. the negative of the variational bound or ELBO), which is an upper bound on the model fit to the data.Indeed, it is also precisely the loss that VAEs like LFADS are trained to minimize.Although during training it is intended to be a surrogate for the marginal cross entropy, H pp [X; θ], that it upper bounds, the free energy itself is a more appropriate measure of the model's ability to capture latent dynamics: The marginal cross entropy measures only the quality of the data generated by the model, taking no account of the inference/recognition/encoding mechanism.The free energy measures both.Of course, when the recognition model (p(z|x; ϕ)) is equal to the posterior under the generative model (p(z|x; θ)), as for generative models that are tractably invertible with Bayes' rule, the free energy collapses into the marginal cross entropy.But the free energy is more general.
It is perhaps surprising that, having designed and implemented models that outperform generative models on the benchmark, we conclude by arguing that the benchmark should be changed to be more favorable to the latter and less favorable to our own.Indeed, even if one were to identify a hidden layer of a discriminative model as the "latent variables," Z, it still would not be possible even to compute a free energy, since the models do not include a prior distribution, p(z; θ).On the other hand, our models may nevertheless have discovered low-entropy (if not low-dimensional) latent codes for their inputs-although exploring this is beyond the scope of the present paper.In any case, we think the free energy would make a useful addition to the NLB.

Data sets
The data consist of four sets of microelectrode array recordings from motor (broadly construed) or somatosensory cortex of rhesus macaques, made by different groups during different experiments, but collected and pre-processed identically for the Neural Latents Benchmark [26] (see Fig. 1 and Table 1 in the Supplementary Material).We briefly describe the NLB pre-processing.Raw voltages were spike sorted and binned at 5-ms resolution.From these continuous recordings, snippets were extracted around each behavioral trial, e.g. by aligning to movement onset, yielding data arrays of size (number of neurons [N ]) × (number of samples [M ])-although note that N and M vary by data set.Finally, approximately 25% of all neurons and the final 40 samples (200 ms) of each trial were designated "unobserved," thus dividing each trial's data array into four sub-arrays: held-in (N in ×M in ), held-out (N out × M in ), held-in-forward (N in × M fw ) and held-out-forward (N out × M fw ), as shown in Fig. 6.

Training
The same training and evaluation pipeline, depicted in Fig. 6, was used for all three architectures explored in this study (see Model Architectures below).Input consists of the array of observed spike counts from the held-in block (N in × M in ).At each epoch, a random mask is drawn (Fig. 6, center) and then used to mask (zero) out ∼27% of the held-in spikes, similar to the Neural Data Transformer (NDT) [33].The input data are then zero-padded out to the end of the entire trial (Fig. 6, right), i.e. to size N in × (M in + M fw ).Note that the mask points for held-in spikes are generated independently across channels and time points; this is illustrated in Fig. 6.
The output dimensions for all three models were set to (N in + N out ) × (M in + M fw ), i.e. spanning all four sub-arrays, and accordingly interpreted as predicted firing rates (λ) for all neurons at all times, both observed and unobserved.However, the loss is computed only on those portions not available at the input: the masked elements of the held-in sub-array, as well as the entirety of the held-out, held-in-forward, and held-out-forward sub-arrays (Fig. 6, bottom right).(For the special case of no masking explored at the end of the Results, the loss was computed on the entire held-in sub-array.)For the held-in spikes, this is equivalent to the "coordinated dropout" that has been applied to LFADS models [16].
Since we aim to predict spike counts, a suitable loss is the cross entropy of the model under the data with respect to a Poisson distribution (see Evaluation).We reduce this loss by stochastic gradient descent (with AdaM optimization [18]) in the space of the model parameters, always using the maximal possible batch size on our GPUs.(Typically this was one or two batches per epoch, but for the larger MC Maze and DMFC RSG data sets was 10 and 6, respectively.)All models were trained on 60% of the data (the training partition) and evaluated on a fixed 25% (the test partition), with the remaining 15% used for "validation."That is, training was terminated ("early stopping") and hyperparameters were chosen (see next) based on performance on this validation partition.
Hyperparameter optimization.The neural-network architectures we describe below depend on a number of hyperparameters, like number of layers and number of units per layer (see Table 3 for full list).We optimized these hyperparameters with the open-source Python package Optuna [3], which efficiently searches the space with a mixture of kerneldensity estimation (training runs that yield good results make the algorithm more likely to use nearby hyperparameters) and an evolutionary algorithm (that takes into account the relationships among hyperparameters).We trained ∼125 model instances for each (architecture, dataset) pair, under the criterion of minimizing cross entropy on all three of the held-out, held-in-forward, and held-out-forward sub-arrays of the validation partition.
When investigating how performance varies as a function of number of training data, spike-binning resolution, or masking (see Data-dependence of results), we repeated this entire procedure at each of the data sizes, bin sizes, or mask choices.To construct input data (right), the "held-in" block of spike counts is extracted, zeroed at random locations ("masked"; light blue), and zero-padded out to the total trial length M (upper right).This array of integers is then passed through the model, which predicts (real-valued) firing rates for all held-out and held-in neurons, for the entire length of the trial (right).To construct output targets (left), the mask is "inverted" and applied to the original data array (lower left), meaning that only the "missing" portions of the input, including the held-out neurons, are retained.The inverted mask is also applied to the output of the model (lower right), and the loss is computed on the retained elements.

Evaluation
The objective.When evaluating the model (on the validation or test sets), the input is the same as during the training phase, except that no mask is applied.That is, the input consists of the held-in sub-array (N in × M in ) (which is subsequently zero-padded out to size The model predicts firing rates for each neuron at all time steps (out to M ) and all neurons (out to N ), but following the NLB [26], we report performance separately on the three sub-arrays of unobserved data (see below).Nevertheless, the performance criterion [26] is the same for all three, a measure of information extracted per spike.In particular, since the model outputs firing rates, λ, it can be interpreted as assigning a conditional probability p to each input-output pair according to a Poisson distribution: p(y|x) . .= Pois(y; λ(x; θ)), where θ are the model parameters.During training (see above), the parameters are changed so as to reduce the conditional cross entropy of this distribution with the observed distribution of input-output pairs, p(x, y): where the expectation under the data has been approximated with a sample average.Intuitively, if a base-2 logarithm is used, this is the average number of yes/no questions (ideally, zero) one would have to ask in order to identify correctly all the unobserved spike counts, given the model predictions from the observed spike counts.
NLB metrics.The main two NLB evaluation metrics are identical to this cross entropy up to a shift and scaling.In particular, a crude baseline for performance would be the cross entropy under a "model" that simply predicts that each neuron fire at its mean rate (over the unobserved samples), λ: We can then quantify model performance in terms of the improvement it provides over this baseline.More precisely, the benchmark measures the reduction in bits (yes/no questions) that the model provides over the crude baseline, normalized by the total number of observed spikes: bits (Since the "baseline" is constructed from the test partition, it could in theory outperform a real model.In this, the NLB bits/spike resembles the coefficient of determination.)Our training procedure, which minimizes the cross entropy in Eq. 1, is clearly equivalent to maximizing the bits per spike in Eq. 3.
The benchmark proposes three main metrics: • co-bps: Eq. 3 applied to the (N out × M in )-sized sub-array of held-out neurons, which measures how well the model can "co-smooth" (infer) contemporaneous spiking from unobserved neurons.In the NLB, this is considered the primary evaluation metric.
• fp-bps: Eq. 3 applied to the (N ×M fw )-sized sub-array of all neurons in the 40 samples (200 ms) after the last observation, measuring how well the model can predict future activity of both observed and unobserved neurons ("forward prediction").In practice, however, the models are (unsurprisingly) better at predicting the future activity of observed, rather than unobserved, neurons.To avoid obscuring this difference, we evaluate our models separately on fp-bps-held-out and fp-bps-held-in, i.e. the forward predictions on (respectively) the (N in × M fw )-sized sub-array of observed neurons and the (N out × M fw )-sized sub-array of held-out neurons.
• velocity R 2 : the ability of the model's outputs to linearly predict the instantaneous velocities of the monkey's hand.The objective is to appraise the neural-prediction model, not the model that maps these to velocities, and accordingly the latter is supplied by the NLB and therefore uniform across all submissions.The linear model is fit (with ridge regression) on the model's responses to the training data (and the corresponding velocities), and evaluated in terms of coefficient of determination (R 2 ) on the testing partition.
The test data.Most results reported here are on the "local" test partition, constructed by holding out 25% of all trials.However, the NLB also includes a second, "remote" test set that is not publicly available but on which models can be evaluated according to the performance metrics lately described, by submitting a model to the benchmark's website.
When comparing to architectures from other groups, we report these numbers (as we make clear in the Results).Models tested on the remote data were trained from scratch on the local test as well as training partitions (still reserving the validation set for early stopping), since the test partition would otherwise be wasted.

Model Architectures
Guided by the hypothesis that standard machine-learning models, not inspired by or tailored specifically to neural data, can make competitive predictions on the Neural Latents Benchmark, we analyzed the three model architectures described here.In contemporary machine learning, there are three dominant model paradigms for time-series data, all based on artificial neural networks (ANNs): recurrent neural networks (RNNs), transformers [32], and convolutional neural networks (CNNs).We did not consider CNNs here and focused only on RNNs and transformers, including a hybrid RNN-transformer model, aiming to make use of the best features of both.All instantiations of these three architectures (Fig. 1) were constructed, trained, and tested in PyTorch [25].
RNN F. RNNs process their inputs sequentially, and consequently work better for data with shorter-length temporal autocorrelation.The flipside of this virtue is that they struggle to learn sequences with long-distance correlations [5].This can be greatly ameliorated, although not eliminated, by introducing units that interact multiplicatively, rather than additively, allowing the model to learn to gate the flow of information and therefore preserve (and dump) information across long time scales [14].The RNNs investigated in the present study used gated recurrent units (GRUs) [4]; multiple recurrent layers; and a "bidirectional" architecture, i.e. each layer consisted of a pair of RNNs running in opposite temporal directions.The final, purely feedforward layer used an exponential activation function to ensure that the predicted firing rates be positive.
Trans F. This model was inspired by the "neural data transformer" (NDT) introduced in [33], although our implementation is based on the RoBERTa formulation of the trans-former [21], as implemented in the HuggingFace library.It is essentially the encoder half of the transformer's encoder-decoder architecture [32] (see Supplementary Material for hyperparameters).In particular, any unit can transmit its activity at any time point to any unit in the next layer at any time point; but the flow of this information is governed by activity-dependent gating ("attention").We do not mask out future inputs, as in some implementations of the transformer, giving the network access to all neuron spikes at all (held-in) time steps.
Whereas in an RNN, the order of the input samples is implicitly coded by the order in which they enter the network, in an attention-based network this information is thrown away.Accordingly, we encode position information with the sine/cosine method from the original transformer [32].In our experiments, we place a feedforward layer (i.e., static affine transformation followed by pointwise nonlinearity) before and after the encoder.The pointwise nonlinearity for the latter feedforward layer is exponential (like the RNN F) in order to predict positive firing rates.
TERN.This (to our knowledge) novel architecture aims to combine the advantages of the RNN F and Trans F. The model consists of an RNN followed by a single layer of a transformer encoder [32].Our rationale is that the primary temporal correlations in neural data are local, and therefore an RNN provides a more appropriate model.This is especially true for data sets such as MC Maze and MC RTT, where the (motor) neural activity coincides with a continuous reaching task: we expect some (although not all) of the kinematic correlations to be reflected in the neural commands and plans.Nevertheless, we should like to allow the model to learn some long-distance correlations, after the shortdistance correlations have been exploited.Therefore we enable self-attention after the RNN, and accordingly call this model Transformer Encoder over Recurrent Network (TERN).
For consistency with the other two architectures, we use a bidirectional, GRU-based RNN, and a single layer of a transformer encoder [32].After the fashion of RoBERTa [21], we do not use any attention mask for the transformer (i.e., we allow it to attend to the RNN's output at all time steps).We also do not encode any position information between the RNN and Transformer layers.

Significance testing
In order to quantify the statistical significance of architecture comparisons, we consider 10 model instances of each architecture.Typically, for each architecture, one selects the set of hyperparameters that yielded best performance on the validation partition, and trains from scratch N (in this case, 10) instances of that architecture with these hyperparameters.Randomness in the initialization and the partition of training data into minibatches then yields a distribution of (non-hyper)parameterizations and consequently performances.
We consider this procedure somewhat brittle, since-at least in the present experimentvarious settings of the hyperparameters yield fairly similar performance levels for a single architecture.Consequently, the setting that yields the very best performance is not likely to generalize, especially since the validation partition on which the models were evaluated is arbitrary and not particularly large.
Therefore, for each architecture, we simply select the top 10 out of the 125 hyperparameter optimization runs (see above).Since these 10 instances typically have different hyperparameters (numbers of layers, units, etc.), this means that in practice our results will be slightly noisier than under the more standard procedure, and significance slightly harder to prove.Nevertheless, we believe this increase in variance is worth the reduction in bias, making the results more reliably informative about the different architectures.We could reduce variance as well by considering more than 10 instances of each architecture, but informally we notice that models outside of the top 10 can be far from optimal.
It is difficult to justify an assumption of Gaussianity for a distribution of just 10 data, so we do not use t tests for our comparisons.Instead, we use the Wilcoxon rank-sum test (a.k.a. the Mann-Whitney U test), which essentially asks whether the probability mass of one distribution (of X) is shiftward upward with respect to another distribution (of Y ).More precisely, it tests the null hypothesis that the probability that X > Y is the same as the probability that X < Y .Since we are hypothesizing that TERN is superior to the RNN F, and that both are superior to the Trans F, we use single-sided tests throughout.The exception is in our comparison of models with different types of masks: here we are agnostic as to which is superior and accordingly use a two-sided test.When comparing multiple pairs of architectures with respect to the same hypothesis (e.g., co-bps), we correct the p values with the Holm-Bonferroni method.[30] interval matching Supplementary Table 1: Description of data sets.M1 = primary motor area; PMd = premotor dorsal; S2 = secondary somatosensory cortex; DMFC = dorsolateral medial frontal cortex (including supplementary eye field, supplementary motor area, and pre-SMA); Nin = number of observed ("held-in") neurons; Nout = number of unobserved ("held-out") neurons; Min = number of observed ("held-in") samples; M fw = number of future ("forward") unobserved samples; Npre-movement = number of samples of the trial before movement onset (for MC RTT, trials were not aligned to movements).For all data sets, spike bins are 5 ms.

(a) (b)
Supplementary Figure 2: Effect of dynamic masking on TERN (MC RTT).For each of the three masking options (bars), hyperparameters for the TERN model were optimized from scratch (see Methods), and the top 10 (out of 125) models selected.Performance (mean with standard deviation) is quantified in terms of (a) co-bps and (b) velocity predictions (coefficient of determination).The "point mask" zeros out random samples, whereas the "strip mask" zeros out random "strips" of three consecutive samples (both for random neurons).Point and strip masking were compared with no masking (***: p < 0.0005, one-sided Wilcoxon rank-sum test, Holm-Bonferroni corrected) and with each other (+: p < 0.05, ++: p < 0.005) (two-sided Wilcoxon signed-rank test, Holm-Bonferroni corrected).Supplementary Figure 4: Peri-stimulus time histograms for Area2 Bump."PSTHs"-i.e., timealigned averages-for three neurons (rows) and five random conditions (colors) were constructed for the firing-rate predictions from all three architectures (columns).Shading indicates ± the standard error.For comparison, the true PSTH (constructed according to the procedure of [26]) is shown in the first column.Note that this procedure smooths the spike counts, so the true PSTH will typically be smoother than the model PSTHs (and standard errors are incomparable, so they have been omitted).All PSTHs were constructed on held-in neurons and time steps, but from held-out (evaluation) trials.

Figure 1 :
Figure1: Model Architectures.(a) RNN F is a multi-layer, bidirectional, recurrent neural network with gated recurrent units (GRUs), followed by a single non-recurrent (feedforward) layer.(b) Trans F is the encoder half of a Transformer with multiple self-attention layers and a final feedforward layer.This model is similar to the Neural Data Transformer (NDT)[33].(c) TERN (Transformer Encoder over Recurrent Network) is a multi-layer, bidirectional RNN followed by a single layer of a transformer encoder, again terminating in a (static) feedforward layer.Input and output data for all architectures are the same.

Figure 2 :
Figure 2: Example trials.(a-d) Spike counts (in 5-ms bins) and firing-rate predictions for three randomly selected neurons (subplots) in each of the four data sets.The color of the lines for each firing-rate prediction indicates the model (green: RNN F, blue: Trans F, orange: TERN).Note that the predictions have been converted from spikes/bin to spikes/s, shown on the left vertical axis.(e-g) Velocity decoding (see Methods) for MC Maze, MC RTT, and Area2 Bump data set.(DMFC RSG does not have ground-truth hand velocity data due to the nature of the behavioral task.)

Figure 4 :
Figure4: Forward-prediction performance (fp-bps-held-in and fp-bps-held-out).The models' performances in predicting future firing rates of observed (top row) and unobserved (bottom row) neurons is plotted as a function of time-or, more precisely, samples-since the final observation.Solid lines and shaded regions indicate the mean (across the ten best model instances of each architecture) and its standard errors.The RNN/transformer hybrid TERN outperforms the pure transformer at all points indicated by the (broken) red lines (p < 0.05, Wilcoxon rank-sum test), computed separately at each time step.

Figure 5 :
Figure 5: Data dependence of results.(a-c) Effect of number of training data (MC Maze).For each of four training-set sizes and each of the three architectures, hyperparameters were optimized from scratch (see Methods), and the top 10 (out of 125) models of each architecture selected.Mean and standard deviation of the (a) co-bps, (b) fp-bps, and (c) velocity R 2 are shown for each architecture.The RNN F and TERN were tested (one-sided Wilcoxon rank-sum test with Holm-Bonferroni correction for two comparisons) for superiority to the Trans F; significance at p < 0.005 is indicated with a star.(d) Effect of spike-count resolution (MC RTT).For each of the six bin sizes shown (columns) and the three architectures (abscissae), hyperparameters were optimized from scratch (see Methods), and the top 10 (out of 125) models of each architecture selected.Mean co-bps (first row) and velocity R 2 (second row) are shown along with their standard deviations.Significance (*: P < 0.05, **: P < 0.005, ***: P < 0.0005) is computed with a Wilcoxon rank-sum test, corrected for three comparisons.

Figure 6 :
Figure 6: Training Methodology.A single trial's data correspond to an array of size (number of neurons [N ]) × (number of samples [M ]) (top left).To construct input data (right), the "held-in" block of spike counts is extracted, zeroed at random locations ("masked"; light blue), and zero-padded out to the total trial length M (upper right).This array of integers is then passed through the model, which predicts (real-valued) firing rates for all held-out and held-in neurons, for the entire length of the trial (right).To construct output targets (left), the mask is "inverted" and applied to the original data array (lower left), meaning that only the "missing" portions of the input, including the held-out neurons, are retained.The inverted mask is also applied to the output of the model (lower right), and the loss is computed on the retained elements.

Table 1 :
Hand-velocity R 2 on the test partition.
name area task N in N out M in M fw N pre-movement

Table 2 :
[26]e of the art in co-smoothing.Comparison with the top models submitted to the Neural Latents Benchmark '21[26].The models introduced in this work are indicated by * .