Firing-rate-modulated spike detection and neural decoding co-design

Objective. Translational efforts on spike-signal-based implantable brain-machine interfaces (BMIs) are increasingly aiming to minimise bandwidth while maintaining decoding performance. Developing these BMIs requires advances in neuroscience and electronic technology, as well as low-complexity spike detection algorithms and high-performance machine learning models. While some state-of-the-art BMI systems jointly design spike detection algorithms and machine learning models, it remains unclear how detection performance affects decoding. Approach. We propose the co-design of the neural decoder with an ultra-low-complexity spike detection algorithm. The detection algorithm is designed to attain a target firing rate, which the decoder uses to modulate its input features, preserving their statistical invariance over the long term (several months). Main results. We demonstrate a multiplication-free fixed-point spike detection algorithm with an average detection accuracy of 97% across different noise levels on a synthetic dataset and the lowest hardware complexity among the studies we surveyed. By co-designing the system to incorporate statistically invariant features, we observe significantly improved long-term stability, with decoding accuracy degrading by less than 10% after 80 days of operation. Our analysis also reveals a nonlinear relationship between spike detection and decoding performance. Increasing the detection sensitivity improves decoding accuracy and long-term stability, which means the activity of more neurons is beneficial despite the detection of more noise. Reducing the spike detection sensitivity still provides acceptable decoding accuracy whilst reducing the bandwidth by at least 30%. Significance. Our findings on the relationship between spike detection and decoding performance can provide guidance for setting the spike detection threshold, rather than relying on training or trial-and-error.
The trade-off between data bandwidth and decoding performance can be effectively managed using appropriate spike detection settings. We demonstrate improved decoding performance by maintaining the statistical invariance of input features. We believe this approach can motivate further research focused on improving decoding performance through hypothesis-driven manipulation of the data itself, rather than through more complex decoding models.


Introduction
Brain-machine interfaces (BMIs) have shown great potential for providing new treatments for disorders [1,2], rehabilitation [3,4] and human-computer interaction [5,6]. These achievements are the result of the joint efforts of neuroscience, the development of neural signal processing and artificial intelligence, and advanced recording technology. Designing minimally invasive wireless implantable BMIs with high channel counts is the current trend in next-generation BMIs [7,8]. High-channel-count implants can offer high-resolution recordings, leading to improved decoding performance and enhanced reliability. Through miniaturisation and wireless connectivity, there is no need for percutaneous wires, making the BMIs highly portable with minimal risk of brain damage or infection.
Next-generation BMIs pose significant new challenges. On the implant end, enormous amounts of data are collected as channel counts increase. Bandwidth reduction and lightweight on-edge processing algorithms are essential to guarantee that the on-implant processing and transmission power stay within the power budget. This prolongs the battery's lifetime whilst preventing damage to brain tissue. On the back end, the decoding performance can be affected by implant degradation. Improving the long-term stability of neural decoding is another essential requirement for invasive BMIs. Designing advanced decoding models [9,10] and utilising long-term stable features are two keys to preventing performance degradation [10-12].
Multi-unit activity (MUA; unsorted spikes) is one feature that can fulfil the above requirements. Studies [13,14] have shown that MUA can contain up to 92% of the mutual information of single-unit activity (SUA; sorted spikes), suggesting that spike sorting could be skipped in BMI applications, which can significantly reduce the computation. Additionally, MUA decoding has already shown good performance in practice [6,15,16]. The spike detection in [17] is one of the pioneering works in the field, proposing a hardware architecture for a wireless BMI back in 2007. Although the hardware architecture has not changed much over the past 15 years, the hardware capability and the knowledge of how neural signals adapt over time have moved on significantly. Our approach has been, in part, motivated by the trend in recent translational efforts in implantable brain-computer interfaces based exclusively on unsorted activity, with a strong emphasis on adaptiveness, long-term stability, and hardware efficiency.
For MUA-based BMIs, on-implant spike detection is often combined with off-implant decoding. Such systems however typically only utilise the most basic spike detection methods (e.g. fixed threshold defined by training statistics). The system co-design must take into account different requirements/constraints for the spike detection algorithm and the neural decoder. The spike detection algorithm needs to be designed to minimise hardware complexity and power consumption as this will be performed on the implant. In contrast, the neural decoding will operate on an external node with more resources and a larger battery, such as a mobile phone or wearable device, where the computational requirements are more relaxed in order to improve decoding accuracy.
The co-design is shown in figure 1. It consists of an ultra-lightweight adaptive spike detector and a neural decoder with improved long-term stability, addressing the requirements mentioned previously. The proposed spike detection algorithm was also implemented on a field programmable gate array (FPGA) to demonstrate its ultra-low hardware complexity, making it suitable for on-implant use. The decoding algorithm, however, was evaluated on real recordings in offline simulation. Using the proposed system co-design, we also investigate the relationship between spike detection and neural decoding performance in real recordings, where no ground truth is available. Three findings emerge from this work.
1) Detecting only significant peaks causes less than 1% performance degradation. As fewer spikes are detected, less information needs to be transmitted. This can be the best trade-off between data bandwidth and decoding performance, which is often preferable for wireless BMIs. 2) Detecting small spike peaks around the noise floor can improve decoding performance. This reveals that activity from more distant neurons is still relatively informative. Utilising this extra activity improves decoding rather than harming it, despite the additional noise detections that are usually considered detrimental to spike detection performance. Therefore, applications that aim either to maximise decoding performance or to study the underlying science with undistorted neural information should set the threshold only slightly above the noise floor, i.e. sensitivity is more important than specificity. A similar finding was observed in [14,18]; we re-confirm it here in a practical setting. 3) Long-term stability can be degraded by inconsistent feature representation when using MUA features. Neural activity can drift over time, while scar tissue growth can weaken or distort the neural signal; neuron units can be lost and new connections can form. Normalising the input feature with the training set statistics is a standard step before feeding data into neural networks. However, these statistics become invalid as the neural activity drifts. We identify that the standard deep learning normalisation procedure is not suitable for long-term decoding with varying features. Tracking the feature statistics allows the model to adapt to such drift and reduces long-term performance degradation. Our approach of co-designing detection together with decoding is, in part, proposed to address this problem.
In addition, we found that using a broader spatial distribution of neural activities, i.e. spike signals from further away, also benefits long-term stability.

Figure 1. (a) Spike detection-neural decoding system co-design. After digitising the intracortical signal, spikes are detected at the pre-set target detection rate. Detected spikes are binned over a fixed period. Binned spike counts are then modulated using the target detection rate to align the feature representation over time. Finally, the decoding model predicts the target velocity, and its correlation coefficient with the real velocity is used to evaluate the system performance. (b) Logic circuit of the spike detection. The absolute difference filter takes the derivative and absolute value of the input signal to remove the LFPs. A spike is detected if the filtered signal exceeds the threshold. The threshold is updated in a duty-cycled manner according to the relationship between the target detection rate and the number of detections counted by the spike counter. Repeated detections are counted only once. (c) Flowcharts of the neural decoding modules. All modules use the LSTM decoder but have different input layers. The proposed rate-modulated LSTM (RM-LSTM) subtracts the target detection rate from the binned spike counts, followed by a dense layer. Compared to the proposed architecture, the conventional LSTM subtracts the mean values calculated from the training dataset to perform a standard normalisation of the input feature. The input dense LSTM (ID-LSTM) consists of a dense layer with the standard normalisation. The rate-subtracted LSTM (RS-LSTM) only modulates the input feature using the target rate, without the dense layer.
This paper is structured as follows: section 2 details the challenges in both spike detection and neural decoding. Section 3 introduces the datasets used in this paper. Section 4 introduces the proposed co-design system, including the novel firing-rate-based spike detection algorithm and the input-consistent decoding model. Section 5 shows the performance of the proposed spike detection algorithm on the synthetic dataset, the corresponding hardware cost and the comparison to other algorithms. It also covers the relationship between spike detection and neural decoding, the decoding performance of the co-designed system, and its improvement in long-term stability. Section 6 discusses concerns about data consistency in neural decoding and the opportunities this work provides for BMIs. Section 7 concludes this work.

Challenges in wireless invasive BMIs
Challenges in designing wireless invasive BMIs remain significant. One key challenge is power consumption. On the one hand, a massive amount of data is collected on the implant; the recording instrumentation, together with processing and transmission, can consume enormous power. On the other hand, the brain tissue around the implant is highly sensitive to heating and can be damaged by a temperature increase of only 1 °C [19]. Advanced low-power hardware design is one key to overcoming this challenge, while reducing the data bandwidth also plays an important role.

Neural spike detection
Feature extraction is essential for reducing the data bandwidth and keeping the implant power within budget. Using the MUA signal, the bandwidth can be reduced from more than 300 kbps/channel to 1 kbps per channel per neuron firing around the implanted probes [20,21], and even as low as 30 bps/channel after compression [22]. Meanwhile, most of the useful information is preserved, as spikes are the most informative component of intracortical signals.
In order to obtain MUA signals, spike detection is commonly used. Spike detection applies a threshold to the recordings, and spikes are recognised for the duration that the signal stays above the threshold. Various spike detection algorithms have been proposed using the wavelet transform [23-25], control theory [26,27], robust estimation of noise and spike distributions [28-30], and machine learning [31]. These algorithms can detect spikes with excellent accuracy. However, their computational requirements can be excessive for on-implant use, even though some have already been optimised towards low-power implementation. Simpler forms of spike detection, which use the nonlinear energy operator (NEO) [32,33] or an amplitude slope operator [34,35] to emphasise spikes and derive the threshold from gross signal statistics, can be ideal for on-implant use due to their simplicity.
Another increasingly important aspect is real-time processing and algorithm adaptiveness. Real-time signal processing is essential for BMI applications, guaranteeing a fast response of the whole system, while adaptive spike detection enables the threshold to adapt to the varying brain environment over time and across channels or electrodes. Various techniques have been used. For example, in [34], bit shifts replace multiplications, improving the real-time system response time. In [36], dual thresholds derived from smoothed and non-smoothed NEO are used to guarantee high accuracy at both low and high signal-to-noise ratio (SNR) levels, improving adaptiveness.

Challenges in real-time on-implant spike detection
Several challenges exist, and the trade-off between algorithm complexity and performance is still unresolved. Low-complexity spike detection algorithms consume little power, which is preferred for wireless implantable BMIs. In [28,37], different low-complexity spike detection algorithms were tested on FPGAs or ASICs, addressing this trade-off.
However, the adaptiveness of statistics-based approaches is a concern. The adaptiveness of different spike emphasisers was evaluated in [38], and [39] reported that a threshold based on noise statistics alone can never achieve consistent performance across different noise levels, while jointly using noise and spike statistics can. However, spike peak statistics are difficult to estimate, as spikes are unknown before they are detected; estimating them also requires extra computation.
We proposed a firing-rate-based algorithm in [40], based on the fact that neurons in a given brain region fire at a typical rate. If the average detection rate is close to that typical rate, the current threshold can be considered reasonable. We developed several techniques enabling the threshold to adapt towards the target spike rate, eliminating the effect of inaccurate initial settings. The proposed online method achieves better detection performance than offline thresholding algorithms, while its hardware cost is even lower than that of thresholding algorithms based on calculating a local average.
Another challenge lies in evaluating spike detection algorithms. There is generally no spike ground truth in neural recordings without manual labelling. Though some studies have tried to design auto-labelling algorithms [41], new metrics [42], or use paired intra-/extracellular recordings [43], most spike detection algorithms are evaluated using synthetic data [28, 34, 44-49]. There are considerable differences between synthetic data and real recordings with respect to noise level, signal variety, etc, so it is not guaranteed that the performance of a spike detection algorithm is maintained in practice.
In the context of MUA-based BMIs, the output of spike detection is the input of neural decoding. An even more significant and under-investigated challenge is to design a spike detection algorithm that improves neural decoding performance, rather than focusing solely on spike detection performance; these two elements are usually optimised in isolation. According to [20], randomly missed spikes in MUA have only a minor effect on decoding performance. The authors of [18] identified the optimal threshold for achieving the highest decoding accuracy by sweeping different static thresholding levels, finding it to be lower than the threshold typically used in BMI applications. However, it is important to note that the threshold sweep was conducted offline, assuming perfect knowledge of the neural signal, whereas in practice spike detection is performed in real-time with only local information available.
Though great progress has been made in neural decoding, most of the input features (spikes) are extracted using a relatively simple thresholding method via manually selected multipliers to signal statistics. Such an approach could be practical when enough memory is available to allow for statistically significant signal dynamics to be estimated. However, this can be unreliable in low-power design with constrained resources available on implants. It is commonly accepted that data quality is the essential factor in the success of machine learning or deep learning. This is especially so since the detection is effectively deciding what data is discarded. Therefore, it is again crucial to understand the relationship between spike detection and neural decoding performance and apply more advanced spike detection techniques to enhance the feature quality.
Long-term stability is also a challenge in neural decoding, as mentioned before. With electrode ageing, scar tissue growth, and the foreign body response, the SNR can drop over time. The spike waveform can be weakened and distorted, resulting in fewer spikes being detected under a static detection setting and in wrongly sorted units [62]. Such drift limits the model's capability in the long term. There are major efforts investigating the underlying mechanisms with the aim of improving long-term stability [9-12]. Among the different types of neural features, SUA and spike waveforms are the least robust in the long term, while multi-unit activity and local field potentials (LFPs) are more robust [12,50]. However, there is still no commonly accepted way to adapt to such drift and reduce performance degradation.

Dataset
This study used two datasets, one synthetic and one of real recordings, to evaluate spike detection and neural decoding performance.
The synthetic dataset was first published in [44] and has subsequently been widely used [28, 34, 45-49] to evaluate spike detection performance. It consists of four groups of recordings, namely Easy1, Easy2, Difficult1 and Difficult2. Each group contains four noise levels from 0.05 to 0.2 in steps of 0.05, giving 16 recordings in total. The noise level in this dataset is defined as the ratio between the noise standard deviation and the average spike amplitude. The recordings are sampled at 24 kHz, and each contains three types of spikes, each firing at 20 Hz (60 Hz in total) following a Poisson distribution. More details of this dataset can be found in [44]. In our experiment, we used 60 s of each recording, downsampled to 7 kHz in fixed-point representation as suggested in [34,63].

The dataset of real recordings was collected by the Sabes Lab [64]. Utah arrays were placed into the primary motor cortex of a non-human primate. The data are sampled at 24414 Hz, downsampled by a factor of three (8138 Hz) and rounded to fixed-point representation. The 2D hand movement was recorded alongside the brain signal; the subjects could move freely and performed the tasks spontaneously, controlling a cursor to reach desired locations on the screen. Spike timing is provided in this dataset using a threshold derived from 3 to 5 times the signal root-mean-square (RMS) value. The dataset has been widely used for evaluating brain signal decoding algorithms [50, 65-67]. We used the 24 recordings of Monkey Indy spanning 27 June 2016 to 27 January 2017, 5 h in total.
In this work, instead of applying the decoding algorithm directly to the provided threshold crossings, we used the proposed spike detection algorithm to re-detect the spikes from the raw recordings. We evaluated our spike detection algorithm via the decoding performance, which reflects how much useful information is preserved after detection compared with the conventional spike detection algorithm. Moreover, this lets us observe the relationship between spike detection and decoding performance.

Spike detection and decoding system co-design
As shown in figure 1(a), the proposed system consists of four modules: signal acquisition, spike detection, spike binning and decoding. A target detection rate is set in both the spike detection and the decoding model, modulating the input feature of the decoding model across different system settings and brain environments to maintain a statistically consistent representation.

Firing-rate-based spike detection algorithm

Preprocessing
The digitised raw intracortical neural signal can be offset by LFPs and noise. To remove the LFP, we used the absolute value of the derivative following [40], as equation (1) shows:

Y_n = |X_n − X_{n−k}|    (1)

where X_n is the nth input and Y_n is the corresponding output. k is sampling-rate dependent. At a 7 kHz sampling frequency, a spike occupies around 7 samples (assuming a spike duration of 1 ms), so k = 2 is likely to take the difference between the peak and a nearby minimum, providing a large (boosted) output. This simple absolute difference filter is effective in both datasets and significantly improves the signal SNR at minor computational cost.
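The filter can be sketched in a few lines (illustrative helper name; integer arithmetic mirrors the fixed-point implementation):

```python
import numpy as np

def abs_diff_filter(x, k=2):
    """Absolute difference filter: y[n] = |x[n] - x[n-k]|.

    The lag-k first difference suppresses the slowly varying LFP
    component, and the absolute value emphasises spike edges.
    k = 2 corresponds to the 7 kHz sampling case in the text.
    """
    x = np.asarray(x, dtype=np.int32)  # fixed-point friendly: integers only
    y = np.zeros_like(x)
    y[k:] = np.abs(x[k:] - x[:-k])     # first k outputs default to zero
    return y
```

For a lone peak of amplitude 5, the filter produces a boosted response on both the rising and falling edge, e.g. `abs_diff_filter([0, 0, 5, 0, 0], k=2)` gives `[0, 0, 5, 0, 5]`.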

Firing-rate-based adaptive threshold
The proposed adaptive threshold is based on the fact that the firing rate of a given brain region is on average stable. In [68], researchers monitored twelve human motor neurons while the subjects performed tasks with minimal effort, observing a firing rate of about 10 Hz per neuron, rising to over 20 Hz per neuron for maximum-effort tasks.
Though the firing rate can fluctuate between active and silent periods, it is relatively stable in the long term. Therefore, by setting a reasonable target detection rate and threshold update strategy, we can drive the threshold to saturate at a level that automatically detects spikes at the desired rate. Figure 1(b) shows the proposed spike detection algorithm. The target detection count interval [R_2, R_1], initial threshold T_0 and updating period P are set at the start. Within one updating period, if Y_n > T_n, a spike is detected, and 5 samples (for the 7 kHz sampling frequency case) are skipped to avoid multiple detections of one spike. The detection count C is increased by one. If C > R_1, the threshold is increased by p% and a new updating period is started. At the end of an updating period, if C < R_2, the threshold is decreased by p%; otherwise, the threshold is unchanged. In short, we set an acceptable detection number range [R_2, R_1], and the threshold is decreased or increased if the detection rate is too low or too high. Even if the initial threshold is non-ideal, it will eventually saturate at a level that fulfils the target. Here we set R_1 = 2R_2. The value of R_1 can vary and will be specified in section 5.
p% determines the saturation speed and threshold resolution. A larger p means faster saturation but a coarser threshold. Adaptively changing p could mitigate this trade-off but was not used, considering the computational overhead. In both datasets, p% is set to 6.25%, i.e. 1/16, which can be calculated with a 4-bit right shift as all numbers are in fixed-point representation.
P determines the adaptiveness of the system to varying signal conditions. The threshold is less sensitive to signal variation if P is large, while it can be disturbed by burst firing if P is too small. Frequent threshold updating also leads to more computation and higher power consumption. P is empirically set to 3 s in both datasets.
The target detection count interval [R_2, R_1] is the most important parameter: it determines the acceptable detection rate range and eventually the saturated threshold level. There are several benefits to setting a target detection rate rather than conventionally deriving a threshold from a multiple of the noise mean/median/RMS/standard deviation.
(1) Compared to finding a multiplier, selecting a detection count interval can be more heuristic, building on previous neuroscience observations such as [68] rather than trial and error. (2) It guarantees the detection of useful information. With a multiple of signal statistics, the threshold can grow too high or too low on some channels, leading to continuous detection or zero detection and resulting in system failure. (3) It benefits detection in long-term implantation. Scar tissue around the probe can push neurons further away and weaken neural activity. Conventional statistics-based approaches can miss these spikes, while the proposed algorithm can lower the threshold adaptively to detect them and fulfil the target detection rate. (4) The target detection rate range can potentially be used to improve the decoding performance. In more detail, if the resultant detection rate of a certain period approaches or exceeds R_1, that period is more likely to be active, while if the resultant detection rate of another period is under R_2, that period is more likely to be silent. Traditional methods cannot provide such extra information (at least not in real-time). We have proposed a rate-modulated LSTM (RM-LSTM) based on this information, introduced in section 4.3.3, showing improved performance compared to the baseline models. Note that the target detection rate mentioned later refers to the count interval upper limit R_1 unless specified.
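The duty-cycled update loop of figure 1(b) can be sketched as follows (a sketch only: variable names are illustrative, integer fixed-point samples are assumed, and p% = 1/16 is realised as a 4-bit right shift):

```python
def detect_spikes(y, t0, r1, r2, period, refractory=5):
    """Firing-rate-based adaptive threshold (sketch).

    y        : filtered signal (output of the absolute difference filter)
    t0       : initial threshold (integer, fixed-point)
    [r2, r1] : acceptable detection-count interval per update period
    period   : update period in samples (3 s worth in the text)
    """
    t = t0
    spikes, count, since_update, skip = [], 0, 0, 0
    for n, v in enumerate(y):
        if skip:
            skip -= 1                  # count repeated detections only once
        elif v > t:
            spikes.append(n)
            count += 1
            skip = refractory
            if count > r1:             # too many detections: raise threshold
                t += t >> 4            # +6.25% via 4-bit right shift
                count, since_update = 0, 0  # start a new updating period
        since_update += 1
        if since_update >= period:     # end of an updating period
            if count < r2:             # too few detections: lower threshold
                t -= t >> 4
            count, since_update = 0, 0
    return spikes, t
```

Even with an unreasonably high initial threshold, the per-period decrease steps the threshold down until the detection count falls inside [R_2, R_1].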

Spike count binning
The output of the spike detection is a binary stream. It has been found in [22,50] that binning the binary threshold crossings into MUA counts at a certain period can improve the decoding accuracy. However, a trade-off between decoding accuracy and temporal resolution has to be made. In this work, a 50 ms bin period is used as in [22,50].
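The binning step can be sketched as follows (illustrative helper; the 8138 Hz rate of the real recordings is used as the default):

```python
import numpy as np

def bin_spike_counts(spike_samples, n_samples, fs=8138, bin_ms=50):
    """Bin a binary spike stream into MUA counts (50 ms bins as in the text).

    spike_samples : sample indices of detections on one channel
    n_samples     : total recording length in samples
    """
    bin_len = int(fs * bin_ms / 1000)      # samples per bin
    n_bins = n_samples // bin_len          # drop the trailing partial bin
    counts = np.zeros(n_bins, dtype=np.int32)
    for s in spike_samples:
        b = s // bin_len
        if b < n_bins:
            counts[b] += 1
    return counts
```

Shorter bins preserve temporal resolution but make each count noisier; 50 ms is the compromise adopted from [22,50].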

Neural decoding
The binned MUA counts from different days are split into ten sets for decoding model cross-validation. For each validation cycle, one set is used for testing, one set is used for validation to select the best model, and the remaining sets are used for training. The average score across the 10-fold validation is used as the final score for that day. Before being fed into the model, the binned MUA signals are normalised using the mean and standard deviation of the training set unless a special design is applied.
Two baseline decoding algorithms are implemented, namely Wiener cascade filter (WCF) and LSTM, to decode the subject behaviour (finger velocity) from the spike events detected by the proposed spike detection algorithm.

Wiener cascade filter
This adds non-linearity to the conventional Wiener filter [50,[69][70][71]. It consists of a linear Wiener filter and a static nonlinear system. The Wiener filter is trained by fitting the Lasso regression between the binned features of N recent time stamps and kinematic data with a regularisation factor of 0.01. N was tuned from 2 to 15, and set at 2. The performance variation is minor (approximately 1%) across the different N values. A 3rd order polynomial between the Wiener filter output and the kinematic data is then fitted and cascaded after the Wiener filter to add nonlinearity to the system.
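This two-stage decoder can be sketched as follows (a sketch only: ridge regularisation stands in for the Lasso fit so the linear stage stays closed-form, and function names are illustrative):

```python
import numpy as np

def fit_wiener_cascade(X, y, n_lags=2, alpha=0.01, poly_order=3):
    """Fit a Wiener cascade: regularised linear filter over the last
    n_lags bins, followed by a static polynomial nonlinearity."""
    T = X.shape[0]
    # stack lagged features: row t = [X[t], X[t-1], ..., X[t-n_lags+1]]
    Xl = np.hstack([X[n_lags - 1 - k: T - k] for k in range(n_lags)])
    Xl = np.hstack([Xl, np.ones((Xl.shape[0], 1))])   # bias term
    yt = y[n_lags - 1:]
    A = Xl.T @ Xl + alpha * np.eye(Xl.shape[1])
    w = np.linalg.solve(A, Xl.T @ yt)                 # linear (Wiener) stage
    poly = np.polyfit(Xl @ w, yt, poly_order)         # static nonlinearity
    return w, poly

def predict_wiener_cascade(X, w, poly, n_lags=2):
    T = X.shape[0]
    Xl = np.hstack([X[n_lags - 1 - k: T - k] for k in range(n_lags)])
    Xl = np.hstack([Xl, np.ones((Xl.shape[0], 1))])
    return np.polyval(poly, Xl @ w)
```

The cascade keeps the linear stage cheap to fit while the fitted polynomial absorbs any static nonlinearity between filter output and kinematics.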

Long short-term memory
LSTM has been widely used in various machine learning problems, especially for time series analysis. Recently, it has also been used in brain signal decoding, achieving promising performance [6,50,72]. After model tuning, 150 LSTM cells are used with the Adam optimiser at a learning rate of 0.0035 and a batch size of 64. We introduce several variants of the LSTM later; the LSTM mentioned below refers to this conventional architecture unless specified.

Spike detection and neural decoding co-design
As mentioned before, the proposed spike detection algorithm can provide extra information about potential silent and active periods. We designed a special input layer before the LSTM that utilises this extra information to modulate the binned spike count. The RM-LSTM is proposed to keep a consistent feature representation even when the input feature statistics change over time, as shown at the top of figure 1(c). The input layer consists of a firing rate modulation step and a dense neural network layer, formulated as equation (2).
Y_in = ϕ(W_in (X_in − R_2 · BP) / STD_T)    (2)

where X_in ∈ N^N is the binned MUA signal and N is the channel number. STD_T is the standard deviation of the training set, R_2 is the target detection count lower limit, and BP is the bin period in seconds used for firing rate alignment. W_in ∈ R^{N×N}, ϕ(·) and Y_in ∈ R^N are the weights, the Sigmoid nonlinear activation, and the output of the dense layer. Consistent feature representation means the input of the LSTM stays at a statistically similar scale regardless of variations in the binned MUA count statistics. This is achieved through the firing rate modulation. Input feature consistency is critical to maintaining the long-term stability of the decoding model, as the SNR can degrade with implant ageing. Shifting the data via X_in − R_2 · BP makes potentially active periods positive and less active periods negative. The input-output relationship of the model is correctly mapped after shifting, as shown in the firing rate modulation of figure 1(a). This correctly mapped relationship is expected to be easier for the decoding model to learn than the normalised feature. At the same time, such mapping is input-feature-scale invariant: even if the recording quality degrades, the mapped feature statistics stay unchanged. Moreover, the mapping is consistent across different target detection rates, so there is no need to retrain the decoding model after changing the target detection rate of the spike detection algorithm.
The nonlinear response gives the model the capability to capture the nonlinear relationship between the input firing rate and the output behaviour estimate. We assume the effect of the input firing rate on the behaviour (finger movement velocity) saturates when the input firing rate is very high or very low. Such a relationship can be well approximated by the Sigmoid function, a widely used activation function in neural networks.
Compared to the conventional LSTM, the RM-LSTM uses a different input (X_in − R_2 · BP) and more parameters (W_in). We built another two models: the rate-subtracted LSTM (RS-LSTM), which only subtracts the rate from the input, and the input dense LSTM (ID-LSTM), which only has the input dense layer and nonlinear activation. RS-LSTM controls for the number of parameters to identify the improvement made by firing rate modulation over standard normalisation, and ID-LSTM controls for the input feature statistics to identify the contribution of the increased model size. The architectures of these models are shown in figure 1(c).
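The rate-modulated input layer can be sketched as follows (a sketch of equation (2), assuming the training-set standard deviation still scales the shifted counts; names are illustrative):

```python
import numpy as np

def rm_input_layer(x_bins, w_in, r2, bin_period, std_t):
    """Rate-modulated input layer of RM-LSTM (sketch of equation (2)).

    x_bins is shifted by the target-rate count r2 * bin_period instead
    of the training-set mean, so active bins map positive and quiet bins
    negative regardless of how the recording statistics drift.
    """
    shifted = (x_bins - r2 * bin_period) / std_t    # firing rate modulation
    return 1.0 / (1.0 + np.exp(-(shifted @ w_in)))  # dense layer + Sigmoid
```

Replacing r2 * bin_period with the training-set mean recovers the conventional normalisation input, which is exactly the contrast the RS-LSTM/ID-LSTM controls isolate.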

Evaluation metrics
Several metrics have been used to evaluate the performance of the proposed spike detection and decoding algorithms.
For spike detection, sensitivity (Sens), false detection rate (FDR) and accuracy (Acc) are used, defined as:

Sens = TP / (TP + FN)    (3)

FDR = FP / (TP + FP)    (4)

Acc = TP / (TP + FP + FN)    (5)

where TP, true positive, is the number of spikes that have been correctly detected; FP, false positive, is the number of periods that have been incorrectly detected as spikes; and FN, false negative, is the number of true spikes that the algorithm has not detected. Sens and FDR are two commonly used metrics for evaluating the performance of spike detection algorithms. Sens describes the ability of the algorithm to detect true spikes and refine their timing, while FDR evaluates the algorithm's ability to reject noise. However, it is difficult for an algorithm to achieve the best performance on both metrics simultaneously. Generally, a higher Sens is preferred over a lower FDR, as a higher FDR means detecting more noise, which can be recovered, while a lower Sens results in a loss of information that cannot be retrieved. Acc is a metric that balances the trade-off between Sens and FDR. These metrics are used to evaluate the spike detection performance on the synthetic dataset, and the average scores across recordings are used as the final score.

Figure 2. (a) Average spike detection accuracy, sensitivity and FDR across the synthetic datasets for the proposed algorithm compared to our previous work [40] and exemplar methods, including a high-complexity algorithm [73], an offline algorithm [44], a low-complexity floating-point algorithm [45] and a low-complexity fixed-point algorithm [34], at different noise levels. (b) Spike detection performance at different sampling rates. There is nearly no degradation when the sampling frequency is above 8 kHz and only minor degradation at 6 kHz.
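A minimal implementation of these three metrics (illustrative helper name):

```python
def detection_metrics(tp, fp, fn):
    """Spike detection metrics as used in the text: Sens = TP/(TP+FN),
    FDR = FP/(TP+FP), and the balanced Acc = TP/(TP+FP+FN)."""
    sens = tp / (tp + fn)       # ability to detect true spikes
    fdr = fp / (tp + fp)        # fraction of detections that are noise
    acc = tp / (tp + fp + fn)   # balances the Sens/FDR trade-off
    return sens, fdr, acc
```

Note how Acc penalises both misses and false alarms: a detector with TP = 90, FP = 10, FN = 10 has Sens 0.9 and FDR 0.1, but Acc below both at roughly 0.82.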
For neural decoding, the Pearson correlation coefficient (CC) between the behaviour signal and the predicted output is used, formulated as equation (6):

CC = Σ_{t=1}^{N} (Y_t − Ȳ)(Ŷ_t − Ŷ̄) / √( Σ_{t=1}^{N} (Y_t − Ȳ)² · Σ_{t=1}^{N} (Ŷ_t − Ŷ̄)² ),    (6)

where Y_t and Ȳ are the true target velocity at timestep t and its average, Ŷ_t and Ŷ̄ are the predicted velocity and its average, and N is the total number of samples in the trial. CC is measured on both the x and y axes and averaged across 10-fold validation.
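For reference, the Pearson CC computation for one velocity axis can be sketched as below (variable names are illustrative; in the evaluation, CC is averaged over the x and y axes and across the 10 folds).

```python
import numpy as np

def pearson_cc(y_true, y_pred):
    """Pearson correlation coefficient between true and predicted velocity."""
    yt = y_true - y_true.mean()
    yp = y_pred - y_pred.mean()
    return float((yt * yp).sum() / np.sqrt((yt ** 2).sum() * (yp ** 2).sum()))

# Toy single-axis example: a decoder tracking the true velocity closely.
vx_true = np.array([0.0, 1.0, 2.0, 3.0])
vx_pred = np.array([0.1, 0.9, 2.2, 2.8])
cc = pearson_cc(vx_true, vx_pred)  # close to 1 for a good decoder
```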

Spike detection performance in synthetic dataset
We first tested the proposed spike detection algorithm on the synthetic dataset to quantitatively compare it against other state-of-the-art spike detection algorithms.
The spike detection performance at different noise levels is plotted in figure 2(a). The proposed real-time, low-complexity, fixed-point, multiplication-free algorithm shows the highest detection accuracy at all noise levels among our previous work [40], offline spike detection [44], a real-time floating-point algorithm [45] and a fixed-point algorithm [34]. It also performs better than the high-complexity algorithm [73] at noise levels from 0.05 to 0.15.
The proposed algorithm performs only comparably to the high-complexity algorithm at noise levels above 0.15, with the main performance degradation caused by an increased false detection rate. The FDR could be reduced by introducing more complex spike-enhancement operations, but a higher FDR setting can sometimes be beneficial for neural decoding. Such a finding has been shown in [18] and will also be demonstrated in section 5.4.
The proposed algorithm operates at a relatively low sampling frequency, which has the potential to significantly reduce the power of the hardware implementation. However, reducing the sampling frequency can come at the cost of degraded detection performance. We therefore conducted an experiment to evaluate the impact of downsampling on spike detection performance, in which the raw 24 kHz signal was downsampled by factors of 2 to 4. The results are presented in figure 2(b). Degradation in detection accuracy is nearly non-existent when the sampling frequency is above 8 kHz and only starts to become noticeable at 6 kHz, although detection accuracy remains above 0.9. It is worth noting that the energy of the spikes is mostly concentrated below 3 kHz; hence, 7 kHz is the lowest sampling frequency for accurate spike detection, in accordance with the Nyquist theorem plus 1 kHz of oversampling. These findings are consistent with prior research [36, 63] suggesting 7 kHz to be the minimum acceptable sampling frequency for spike detection. Spike sorting, however, can require a higher sampling frequency because the spike waveform is critical for clustering.

To quantitatively evaluate the performance of the proposed algorithm in practice, we compared the decoding results obtained from threshold crossings derived from signal standard deviation (STD) values with those of the proposed algorithm. The average CC over 24 days of recordings is presented in table 1. For the STD threshold, calibration is performed for each channel on each day. The proposed spike detection algorithm, with a fixed threshold setting at a detection rate of 46 Hz, achieves an even higher decoding CC than the calibrated STD thresholding. Furthermore, we observed that fine-tuning the proposed algorithm resulted in only a marginal 1% improvement, indicating that no further calibration is needed.
It is worth noting that the STD-based results are detected at the raw sampling rate (24 kHz), whereas the proposed algorithm operates on the signal downsampled by a factor of 3. This, however, does not impact the decoding performance of our algorithm: we observe only a 0.1% difference in CC between using the raw signal and the signal downsampled by a factor of 3.
Compared to our previous work in [40], where we achieved ultra-low resource usage and power consumption, we have now further reduced the resources used to less than one-third of the previous design. This translates to a significant reduction in area usage in the ultimate ASIC design. Table 2 summarizes the hardware cost, spike detection accuracy, and other specifications of the proposed firing-rate-based spike detection algorithm, along with those of other state-of-the-art algorithms. The proposed algorithm achieves significantly lower resource usage (logic cells and memory) and processing power while achieving the highest spike detection accuracy compared to other hardware-efficient spike detection algorithms.

Target spike detection rate and spike detection performance
To find the relationship between spike detection performance and neural decoding performance, we need a measure of spike detection performance on the realistic dataset. This is challenging as there is no ground truth. However, with the proposed spike detection algorithm, we can infer the spike detection performance from the target detection rate. More specifically, if we set the target spike detection rate close to the real, local firing rate, the algorithm should provide the highest spike detection performance. Using the Quiroga dataset (with an average spike rate of 60 Hz), we tested different target detection rates R_1 by sweeping from 40 to 100 Hz. The detection accuracy, FDR and sensitivity are shown in figure 3(a). The spike detection performance peaks at R_1 = 56 Hz, close to the 60 Hz real spike firing rate. This demonstrates that the proposed spike detection algorithm provides the highest detection performance when the target detection rate is close to the actual firing rate. It can also be observed that when the target detection rate is above the actual firing rate, both sensitivity and FDR increase, meaning more spikes are detected but more noise peaks are also recognised as spikes, and vice versa.
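The core idea of steering a threshold towards a target detection rate can be conveyed with a simple feedback loop: every detection nudges the threshold up, and every quiet sample relaxes it slightly, so the long-run crossing rate converges to the target. This is a hypothetical floating-point sketch only; the names and update constants are illustrative and do not reproduce the published multiplication-free fixed-point algorithm.

```python
import numpy as np

def track_target_rate(signal, fs_hz, target_rate_hz, step=1.0):
    """Adapt a detection threshold so the long-run threshold-crossing rate
    converges towards target_rate_hz (illustrative sketch only)."""
    # At equilibrium n_det * step == n_quiet * leak, so the detected
    # fraction of samples settles at target_rate_hz / fs_hz.
    leak = step * target_rate_hz / (fs_hz - target_rate_hz)
    threshold, detections = 0.0, []
    for i, sample in enumerate(signal):
        if sample > threshold:
            detections.append(i)
            threshold += step   # detected: make the next detection harder
        else:
            threshold -= leak   # quiet: slowly relax the threshold
    return detections, threshold

rng = np.random.default_rng(0)
noise = rng.normal(size=10 * 1000)  # 10 s of noise sampled at 1 kHz
det, thr = track_target_rate(noise, fs_hz=1000, target_rate_hz=50)
rate_hz = len(det) / 10.0           # settles near the 50 Hz target
```

Because the threshold self-regulates, the achieved rate stays near the target even as the input statistics drift, which is the property the co-design relies on.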

The relationship between the spike detection and neural decoding performance
Data collected from the Utah array, which has been spike sorted [64], shows 3 to 5 clusters per channel. Meanwhile, [14] suggests that, on average, there are 3.8 to 4 clusters per channel when no spikes are discarded. Therefore, a reasonable assumption for the target detection rate is 30-40 Hz for a cursor control task (as motor neurons can fire at below 10 Hz during low-effort tasks). The actual interval could be even lower, as the number of neurons can be fewer than the number of observed clusters and the subject did not operate the cursor continuously. With this assumption, we can set a target detection rate interval expected to give the most accurate spike detection on a realistic dataset for motor tasks. Varying the target detection rate allows us to change the spike detection performance and thus relate detection to decoding. Figure 3(b) is obtained by decoding the binned spike detection results with the WCF and LSTM. A detection rate of 30-40 Hz is expected to provide the best spike detection performance. However, the decoding performance of both methods peaks at 46 Hz. This mismatch suggests a nonlinear relationship between detection and decoding.
This expected detection rate interval might be inaccurate, so, to be more rigorous, we also visually inspected the spike detection results in figures 3(c)-(e). They show a recording snapshot with the spike detection thresholds generated with R_1 = 23 Hz, 35 Hz and 46 Hz in red, yellow and green, respectively. The threshold at 46 Hz is clearly adapted to a level only slightly above the noise floor. Compared to the threshold at 35 Hz, which by visual inspection is the best threshold for discriminating spikes from noise, this lower threshold detects more noise-level peaks. Both theoretically and by inspection, detection at 46 Hz is not an ideal spike detection setting, yet it achieves the highest decoding performance. This finding does not match the intuitive expectation that better spike detection performance yields better decoding performance. It also conflicts with the usual spike detection design objective of increasing sensitivity while reducing the FDR, because detecting these noise-level peaks can increase the FDR significantly. It suggests that some near-noise peaks are in fact spikes from neurons firing further away from the electrode. Detecting these spikes improves decoding performance despite more noise also being detected. This finding is also supported by the results in [13, 14, 18], which suggest that discarding the noise cluster is detrimental to decoding performance when decoding neural signals using SUAs.
Another finding is that if the detection rate is doubled to, e.g. 100 Hz, meaning about half of the detected spikes could be noise peaks, the performance is degraded by only about 1.4%. In addition to detecting noise, the low detection threshold broadens the spatial detection region around the electrodes, so activity from regions further away with different functional tuning is more likely to be detected. This can also degrade decoding performance [18]. However, this degradation is minor, suggesting that the decoding models, even the simple filter-based model (WCF), are robust to noise. We only investigated a simple LSTM neural network; if more advanced deep learning models were used, even better noise resistance could be expected.
Finally, if the detection rate is halved to 23 Hz, many spikes with lower peak amplitudes can be missed. However, the decoding performance is degraded by less than 1%. This finding is significant for on-implant spike detection, allowing us to send less data without sacrificing much performance. The threshold crossings can be represented as a binary stream at 1 kbps. A lower detection rate means a sparser signal, and the sparser the signal, the easier it is to compress. The data bandwidth can be further reduced using compression techniques [22], leading to lower wireless transmission power.
Spike detection has previously been an unsupervised problem in practice, relying heavily on results from synthetic data. Our finding is significant in providing a guide for setting the threshold. Especially for BMIs with a limited power budget, the goal of spike detection is no longer to find the best threshold level for the highest detection accuracy, but to trade off data bandwidth against decoding performance. With firing-rate-based spike detection, the threshold level can be selected to fulfil the system requirements on bandwidth and decoding accuracy without needing to consider the performance of the spike detection algorithm itself.

The performance of the co-design system
We used the target detection rate to bridge neural decoding with spike detection. The improvement is shown in figure 4(a). Compared to the conventional LSTM (blue), an improvement of over 1% is achieved by RM-LSTM (red) at all detection rates.
When the detection rate is above 40 Hz, more noise is likely to be detected. The detected noise is added to the MUA counts fed to the decoding model (activity from further cells irrelevant to the targeted behaviour can also be regarded as noise). Because of such noise, we observe a decaying trend in the tail of the conventional LSTM curve. RM-LSTM, however, suppresses this decay, suggesting that firing-rate modulation (keeping a consistent feature representation) not only improves model capability but also improves the model's robustness to noise.
To further understand this improvement in decoding performance, we designed the ID-LSTM model, which has the same number of parameters as RM-LSTM but no firing-rate modulation, as figure 1(c) shows. The yellow curve in figure 4(a) shows its performance. ID-LSTM provides better decoding performance than the conventional LSTM but is not as good as RM-LSTM. This observation suggests that the improvement of RM-LSTM does not come solely from the increased number of neural network parameters; the firing rate information also matters.
We mentioned that firing-rate modulation can improve decoding performance because it keeps a consistent feature representation and zero-centres the mapping, which is easier to learn. However, data normalisation can have a similar effect. To show that using the rate information of the proposed spike detection is better than normalisation, we designed the RS-LSTM shown in figure 1(c). It subtracts the target rate instead of the mean calculated from the training set. It can be observed in figure 4(a) that, with this simple subtraction, RS-LSTM (purple) achieves better performance than the ID-LSTM model, which has 48 190 more parameters. The decaying trend with increased detection rate is also suppressed, similar to RM-LSTM. This observation suggests that firing-rate modulation maps high firing rates to active behaviour and low firing rates to inactive behaviour more accurately than standard normalisation does, enhancing the learning procedure of the deep learning models.
We also tested how the performance of RM-LSTM is affected when fewer or more spikes are detected. We trained the model with the detection results at one target detection rate (50 Hz) and tested at different detection rates, simulating more or fewer spikes being detected as the environment changes. We observe a significant improvement in figure 4(b) when the rate information is used, especially when fewer spikes are detected, demonstrating the robustness of the system co-design.
The results above not only demonstrate the improved performance of the proposed spike detection and neural decoding system co-design compared to the baseline models, but also explain where the improvement comes from, highlighting the importance of a consistent feature representation. It is essential that AI is not blindly applied, but rather that its application is informed by some knowledge of the 'system'. We believe such a co-design can thus provide better intuition for designing and optimising decoding performance than simply applying complex deep learning architectures.

Long-term neural decoding performance
Maintaining long-term neural decoding performance is a challenging task in BMIs because the recording quality degrades with implant ageing. The growth of scar tissue can push neurons further away from the electrodes, weakening the recorded neural activity. In figure 5(a), the black/grey curves show this decay of spike amplitude over 200 days (spikes are detected using a threshold of 3.5 to 4 times the local STD values [67]). As the signal weakens, traditional statistics-based spike detection results in fewer spikes being detected over time. As the yellow curves show, the average threshold crossing rate can drop from 18 Hz to below 10 Hz. Such information loss can potentially reduce long-term decoding performance.

The proposed co-design methodology can address this issue. The firing-rate-based spike detection updates the threshold to lower levels over time, allowing the detection of neural activity further from the electrodes and stabilising the detection rate. Meanwhile, firing-rate-modulated decoding maintains zero-centred binned spike rate features, providing statistically consistent features in the long term regardless of the noise levels and spike amplitude. A statistics-based approach cannot provide such features because the optimal threshold level is nonlinear in the noise statistics [39]. While setting a new multiplier on the noise statistics can resolve this issue to some extent, manually tuning all channels to maintain a stable output stream is impractical.
Three different models were implemented to evaluate the improvement of the proposed co-design. In addition to the RM-LSTM and the conventional LSTM, we implemented another model, the mean-tracked LSTM (MT-LSTM). Its only difference from the conventional LSTM is that the test data are normalised by the test data statistics rather than by values from the training set. This model simulates tracking of the input mean values, hence 'mean-tracked'.
The conventional LSTM and MT-LSTM use the threshold crossings of the adaptive STD threshold (decoding the adaptive STD threshold crossings was found to perform better than using the static STD threshold), while RM-LSTM uses the detection results of the proposed algorithm at 20 Hz and 50 Hz. The 20 Hz model keeps the threshold crossing rate close to the STD-detected one, while the 50 Hz model allows activity from further cells to be detected to improve decoding performance. Comparing the decoding performance of the conventional LSTM and MT-LSTM to RM-LSTM demonstrates the need to use firing rate information in both spike detection and decoding to stabilise the features and improve long-term decoding performance.
The decoding performance of the different models is shown at the bottom of figure 5(a). We can clearly observe decaying trends following the threshold crossing rate. The conventional LSTM cannot provide any long-term decoding ability because of the inconsistent threshold crossing rate over time; its CC is consistently below 0.1 and is therefore not plotted. RM-LSTM-50 Hz (red), RM-LSTM-20 Hz (blue) and MT-LSTM (green) are trained with data from two different dates. The model trained on recordings collected on 15 September 2016 simulates the scenario where electrode ageing is mild (dashed curves), while the one trained on 27 June 2016 simulates severe electrode ageing (solid curves). We also provide the results of models trained and tested on same-day data as a reference for observing how much the performance degrades (dotted curves). When electrode ageing is mild, the nearly overlapping red and blue dashed curves indicate no significant difference from using activity further away. The two RM-LSTMs outperform the MT-LSTM by about 2%. This improvement is mainly contributed by model capability rather than by firing-rate modulation's contribution to long-term stability, because the RM-LSTM reference already outperforms the MT-LSTM reference by about 2%, as seen from the dotted reference curves. However, the improved performance of RM-LSTM becomes evident when the ageing is severe. After 80 to 100 days of implantation, compared to the reference model, 12% performance degradation occurs for MT-LSTM and 10% for RM-LSTM-20 Hz, but only 8% for RM-LSTM-50 Hz.
Moreover, comparing the severe electrode ageing case to the mild one, there is about a 5% difference for MT-LSTM, but only a 2% difference for RM-LSTM-20 Hz and even better performance for RM-LSTM-50 Hz (because the data quality on 27 June 2016 is better than that on 15 September 2016). Compared to the performance on the first day, after 100 days of implantation MT-LSTM degrades by more than 20% while both RM-LSTMs degrade by only around 10%. The co-design thus improves long-term decoding performance by nearly 10%.
To better understand the reason for the improved long-term stability, we plotted the input feature distribution before and after normalisation/firing-rate modulation for the different models in figure 5(b). The distribution of the firing rate feature obtained using STD thresholding changes significantly: after 200 days of implantation, the relationship between behaviour and the firing rate feature is completely different. Applying standard normalisation (the grey row) to the test data shifts the test feature distribution mostly to the negative side, which is why the conventional LSTM fails over time. Although tracking the mean value of the features (the green row) can address this issue to some extent, the decoding model can still suffer from the large difference between the training and test feature distributions (compare the two * distributions). When using firing-rate-based spike detection with firing-rate modulation, however, there is no significant difference in the distribution (compare the two # distributions). Shifting the distribution by the target detection rate effectively maps high/low firing rates to active/inactive behaviour, a mapping that remains valid throughout these 200 days, unlike with STD thresholding. It is worth noting that using standard normalisation on the firing-rate-based spike detection feature can lead to a similar mapping as firing-rate modulation does. However, we have already shown that introducing firing-rate modulation into the LSTM decoding models improves decoding performance, as shown in figure 4(a). Additionally, tracking the feature mean value in real time requires more computation and memory.
Based on the observations above, both consistent features and activity from further regions are beneficial for improving long-term stability, especially when electrode ageing is severe. The proposed co-design takes advantage of both, delivering improved long-term stability compared to the other LSTM implementations with standard statistics-based spike detection.

Discussions
Our results suggest that the relationship between spike detection and neural decoding performance is nonlinear. Spike peaks of high amplitude contain substantial information, enough to achieve acceptable decoding performance, while detecting spike peaks with amplitudes comparable to the noise can further improve decoding performance despite more noise being detected. This finding brings opportunities for both BMI engineering and neuroscience. Through the proposed spike detection and neural decoding system co-design, we identify that maintaining input data consistency is one key to improving decoding performance and robustness, especially in the long term, when recording quality degrades gradually with electrode ageing.

Opportunity for designing low power BMIs
In [20], it is suggested that relaxing some of the hardware requirements, for example, the sampling frequency, amplifier performance and ADC precision, can reduce the system power consumption by an order of magnitude without substantially degrading decoding performance. It also shows that randomly missed spikes degrade decoding performance by only a small amount. Prior to this work, however, no spike detection algorithm had been reported that can 'miss' spikes heuristically.
In this work, we suggest relaxing the amount of data transmitted. Transmitting only the high-level peaks can halve the data to be transmitted with less than 1% decoding performance degradation for the behaviour decoding task. Halving the bandwidth is significant in reducing transmission power, especially when wireless transmission is involved.
The threshold crossings can be encoded as a binary stream or as the number of spikes detected in a certain period. A reduced detected spike rate either creates a sparser binary stream or reduces the dynamic range of the possible spike counts.
In both cases, the bandwidth can be reduced. Taking the latter case as an example, assume the detection stream is binned at τ seconds and the detection rate is λ Hz, following a Poisson distribution. The number of arrivals N in one bin is a Poisson process, as equation (7) shows:

P(N = k) = (λτ)^k e^(−λτ) / k!.    (7)
In practice, the spike count is saturated at a value S to limit the spike count data width. The probability of the spike count reaching or exceeding S is given in equation (9),

P_τ(S) = P(N ≥ S) = 1 − Σ_{k=0}^{S−1} (λτ)^k e^(−λτ) / k!,    (9)

and L_S = ⌈log₂ S⌉ bits are needed to represent the S saturated count values. If we saturate the spike count such that P_τ(S) < 1‰, then when a 50 Hz detection stream is binned at 0.01 s, S is 6 with P_{0.01}(6) ≈ 0.12‰. When the detection rate is reduced to 25 Hz, S becomes 4 with P_{0.01}(4) ≈ 0.13‰. Therefore, 3 bits are required for the 50 Hz spike detection counts, but only 2 bits when detecting at 25 Hz.
The minimum average code length is given by the entropy of the count distribution:

H = −Σ_k P(N = k) log₂ P(N = k).    (10)

Figure 6 shows the number of bits and the minimum code length for representing the spike counts. The minimum coding length is 1.34 bits at 50 Hz but only 0.89 bits at 25 Hz. Bandwidth can thus be reduced by 33% while introducing less than 1% decoding degradation, as section 5.4 shows.

Figure 6. The number of bits and entropy encoding length for representing the spike counts.
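Under the Poisson assumption above, the saturation overflow probability and the entropy lower bound on the code length can be checked numerically. The function names below are illustrative.

```python
import math

def poisson_pmf(k, lam):
    """P(N = k) for a Poisson-distributed bin count with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def overflow_prob(rate_hz, bin_s, s):
    """P(N >= s): the fraction of bins whose count saturates at s."""
    lam = rate_hz * bin_s
    return 1.0 - sum(poisson_pmf(k, lam) for k in range(s))

def entropy_bits(rate_hz, bin_s, kmax=50):
    """Shannon entropy of the binned count: the minimum average code length."""
    lam = rate_hz * bin_s
    return -sum(p * math.log2(p)
                for p in (poisson_pmf(k, lam) for k in range(kmax)) if p > 0)

# 25 Hz detections binned at 10 ms: saturating at S = 4 overflows in roughly
# 0.13 per mille of bins, and the count entropy is about 0.89 bits per bin,
# versus about 1.34 bits per bin at a 50 Hz detection rate.
```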

Feature consistency
Inconsistent features can degrade the performance of learning models. This inconsistency can take three different forms. The first is invalid features: for example, the model is trained on motor cortex recordings for behaviour decoding but tested with visual cortex recordings, which is technically incorrect. The second is an incomplete feature set: for example, the model is trained on neural recordings of reaching tasks but tested on decoding placing tasks. This inconsistency can be overcome by designing more generalised models.
More importantly, the third form is statistical difference, which we encounter in long-term neural decoding. The recorded amplitude of action potentials can gradually decrease by over 37% within two months of implantation [12]. Some studies [12, 14, 50] suggest that threshold crossings can achieve better long-term stability than spike waveforms or spike sorting results.
To understand why, we need to consider how machine learning or deep learning models are trained. A normalisation step is applied before feeding input features into the models, intended to guarantee that the input statistics do not change. However, the test data can only be normalised using training dataset statistics, as the test dataset should be unknown to the model. Normalising the test data using training statistics becomes a problem in long-term decoding: the true mean of the input features can be lower than that of the features used in training, so normalisation shifts the centre of the test features to negative values when it is expected to be at zero. This is why long-term degradation happens (in addition to more noise potentially contaminating the signal).
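This shift can be reproduced with synthetic counts. Here we assume Poisson-distributed bin counts whose rate drops from 18 Hz to 9 Hz, mirroring the threshold-crossing decay reported earlier; the sample sizes and bin width are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
bin_s = 0.1
# Counts shortly after implantation (training) and months later (testing),
# after the crossing rate has dropped from 18 Hz to below 10 Hz.
train = rng.poisson(lam=18 * bin_s, size=10_000).astype(float)
test = rng.poisson(lam=9 * bin_s, size=10_000).astype(float)

mu, sigma = train.mean(), train.std()
test_norm = (test - mu) / sigma  # normalised with *training* statistics

# The test features are no longer zero-centred: their mean is pushed well
# below zero, breaking the mapping the decoder learned during training.
```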
Using the spike waveform as the feature is affected the most, as it is most sensitive to such a normalisation error. Using SUA from spike sorting has worse long-term stability than using MUA: besides fewer spikes appearing in each cluster, the inconsistent spike waveform negatively impacts the clustering algorithm. Such artefacts propagate to the decoding model and further lower the decoding accuracy. Spike detection is affected only through the reduced number of detections.
Knowing the reason for long-term stability degradation, one key to resolving this issue is maintaining a consistent input feature for the decoding model, which the proposed spike detection and neural decoding co-design addresses. This idea can potentially be explored in other domains, such as spike waveforms, SUA or LFPs.

Limitation of this work
The results of this work have been validated on 5 h of recordings over 7 months. However, the recordings are from a single subject operating one type of task, so the findings are currently only valid for motor cortex behaviour decoding. Cross-subject generalisation is another important area to investigate; unfortunately, we could not test it, limited by the available data. We would expect improved cross-subject decoding performance if our claim on feature consistency holds.

Conclusion
In this work, we presented a methodology for BMI signal processing system co-design that significantly improves long-term decoding performance compared to current methods. By modulating the input feature with the target detection rate, we improved decoding performance by more than 10% after 80 days of operation. We also demonstrated the nonlinear relationship between spike detection and decoding performance. Using the proposed co-design methodology, reducing the spike detection sensitivity to detect approximately half the spikes results in only a 1% decoding degradation whilst also reducing communication bandwidth by at least 30%.
For the first time, this work tightly links spike detection with decoding. We believe the significant improvement presented in this work can motivate hypothesis-led data science strategies to further develop BMIs.

Data availability statement
All data that support the findings of this study are included within the article (and any supplementary files).