An unsupervised real-time spike sorting system based on optimized OSort

Objective. The OSort algorithm, a pivotal unsupervised spike sorting method, has been implemented in dedicated hardware devices for real-time spike sorting. However, due to the inherent complexity of neural recording environments, OSort still grapples with numerous transient cluster occurrences during the practical sorting process. This leads to substantial memory usage, heavy computational load, and complex hardware architectures, especially in noisy recordings and multi-channel systems. Approach. This study introduces an optimized OSort algorithm (opt-OSort) which utilizes correlation coefficient (CC), instead of Euclidean distance as classification criterion. The CC method not only bolsters the robustness of spike classification amidst the diverse and ever-changing conditions of physiological and recording noise environments, but also can finish the entire sorting procedure within a fixed number of cluster slots, thus preventing a large number of transient clusters. Moreover, the opt-OSort incorporates two configurable validation loops to efficiently reject cluster outliers and track recording variations caused by electrode drifting in real-time. Main results. The opt-OSort significantly reduces transient cluster occurrences by two orders of magnitude and decreases memory usage by 2.5–80 times in the number of pre-allocated transient clusters compared with other hardware implementations of OSort. The opt-OSort maintains an accuracy comparable to offline OSort and other commonly-used algorithms, with a sorting time of 0.68 µs as measured by the hardware-implemented system in both simulated datasets and experimental data. The opt-OSort’s ability to handle variations in neural activity caused by electrode drifting is also demonstrated. Significance. These results present a rapid, precise, and robust spike sorting solution suitable for integration into low-power, portable, closed-loop neural control systems and brain–computer interfaces.


Introduction
The analysis of extracellularly recorded neural action potentials (i.e.spikes) relies on spike sorting techniques [1,2].These techniques facilitate the decoding of population-level neural activities and contribute to the investigation of sensory information encoding and processing in neural circuitries [3][4][5][6].In extracellular neural recordings [7][8][9][10][11], action potentials from several nearby neurons are simultaneously captured by an implanted electrode, resulting in distinct shapes of spike waveforms depending on neuronal properties, as well as the relative position and orientation to the electrode.These spike shapes are extracted and employed as classification features in spike sorting algorithms to classify and assign each spike to its originating neuron, thereby providing a more accurate representation of the underlying neural activity.With the development of multielectrode probes and advanced neural modulation techniques such as optogenetics [12,13] and closed-loop control [14], efficient and accurate spike sorting systems are highly demanded.
In recent years, several real-time spike sorting algorithms have been developed and implemented in dedicated hardware.Template-matching [25,[31][32][33][34][35][36] is a widely utilized approach, wherein real-time spike sorting is conducted based on pre-trained spike templates or matching filters [37][38][39][40].As a supervised method, the pre-training phase is essential and directly impacts system performance.However, this approach faces challenges in efficiently handling recording variations and the appearance of new clusters after the training that commonly occur in chronic neural implants due to electrode drifting [1,34].As a result, untrained spikes are either discarded or misclassified [41].Alternatively, the unsupervised K-means algorithm has been modified for realtime spike sorting and implemented into low-power application-specific integrated circuit (ASIC) designs [42,43].Nonetheless, it requires pre-defined cluster numbers and encounters similar difficulties in handling variations.Thus, a robust unsupervised real-time spike sorting system that can effectively detect and track recording variations is highly demanded.
OSort [44], a widely used unsupervised spike sorting algorithm, demonstrates performance comparable to other offline algorithms [1,44,45].Unlike conventional offline spike sorting algorithms [1,17,45,46], OSort processes spike in a streaming manner and conducts sorting spike by spike, making it inherently appropriate for hardware implementations such as accelerated offline OSort on field programmable gate array (FPGA) platforms with a 25-fold speedup [47], or online OSort on FPGA or integrated circuit (IC) designs [41,[48][49][50][51][52][53][54].However, a significant challenge in performing real-time spike sorting using OSort is the massive memory usage and heavy computation load during the initiation phase, characterized by an uncertain large number of transient clusters to save before eventual convergence, especially in multi-channel systems.Various strategies have been developed to implement hardware OSort with reduced memory usage, including utilizing shared memory between all channels [36,49,54] or limiting the number of occurring clusters [41].In one example of an IC implementation for a 16-channel system [48,49], a multiplexing approach was utilized to share memory among all channels to generate the channel templates, and a template-matching process was then conducted across all channels to perform real-time spike sorting.Despite optimizing memory usage and working voltage, memory-related power consumption still accounted for a dominant 88% [49].In addition, accurately estimating the number of transient clusters is challenging, often requiring extra considerations to avoid cluster overflow, such as inferring from experimental data [47][48][49] or weighting cluster size and eliminating small clusters [36,41].Although existing designs of hardware OSort have recognized challenges posed by transient clusters and mitigated the issue through increased architecture complexity, there remains a need for a more targeted solution to tackle the issues of transient clusters in the implementation of hardware OSort.
In this study, we propose an optimized OSort algorithm (opt-OSort) specially adapted for realtime hardware implementation, which seeks to resolve issues related to transient clusters by applying a fixed number of cluster slots while preserving the performance and advantages of the original OSort.Furthermore, opt-OSort is designed to handle recording variations such as electrode drifting in chronic recordings.The FPGA-implemented opt-OSort achieves an accuracy comparable to offline OSort and exhibits a reduction in transient cluster occurrence by two orders of magnitude with a 0.68 µs sorting time.These outcomes demonstrate a rapid, precise and robust spike sorting solution that can Y Wu et al be integrated into low-power, portable or implantable devices for further closed-loop neural control systems.

Methods
The proposed opt-OSort algorithm utilizes cluster slots to perform spike sorting, where the number of cluster slots can be determined by the count of neurons recorded in experimental data, typically ranging from 4 to 6 per channel in extracellular recording [49].The implementation of a fixed number of cluster slots leads to a marked reduction in memory usage, thus reducing the computation of spike classification and facilitating further simplification of the related hardware architecture.This section starts with an introduction to the OSort algorithm and the obstacles in implementing hardware OSort, proceeds with an explanation of the proposed opt-OSort algorithm and its hardware implementation, and ends with the description of the simulated datasets and experimental recordings used to verify the performance.

OSort and transient clusters
The conventional OSort algorithm conducts spike sorting in a streaming manner as illustrated by the flowchart in figure 1.Initially, the distances (D) between the incoming streaming spikes and all existing cluster centers are calculated.These cluster centers are computed as the means of all spikes in the corresponding clusters, with the first spike of a new cluster serving as the center.Following this, the minimum distance (min(D)) of the input spike is compared to the preset spike classification threshold (T C ).When the min(D) exceeds T C , indicating that the input spike does not belong to any existing clusters, a new cluster is created, and the corresponding spike is established as the cluster center.The algorithm then completes the sorting process and awaits the next spike.Conversely, if min(D) is less than T C , the input spike is assigned to the min(D) cluster.In this case, as a new spike is assigned, the center of the min(D) cluster needs to be updated.Subsequently, the distances (D all ) between the min(D) cluster center and all other centers are computed, and the min(D) cluster is merged with the min(D all ) cluster if they are close to each other.During this procedure, if min(D all ) is less than the preset cluster merging threshold (T M ), indicating that two clusters are not distant from one another, they are merged together and a new cluster center is generated.This process is iteratively carried out until no cluster merging occurs (or min(D all ) > T M ).Once this is achieved, the spike classification is completed and the algorithm awaits the next input spike.Two thresholds T C and T M are estimated as the same value of N times δ in real implementation, where N is the number of samples used to represent a spike, and δ is the standard deviation of recording noise [44].The cluster center generated by two merged clusters is calculated as the weighted average of each center.For example, two cluster centers C 1 (containing M spikes) and C 2 (containing P spikes) are merged, the new cluster center C new is calculated as equation (1) [49]: Since the input spikes are streamingly sent into OSort and the OSort has the process of creating new clusters, the early formed clusters are usually not the final converging clusters.There are a large number of transient clusters gradually occurring during the procedure of spike sorting.The outputted spike ID in this phase is thus not the true ID of the spike.Each input spike will slowly drive the centers of transient clusters moving towards the direction of the converging centers, and these movements will further promote the merge of transient clusters that are closed in distance.Along with this procedure, these transient clusters are gradually occurring, moving, merging, and finally converged into several clusters which are equal to the number of recorded neurons in the experiment.
The first issue is that the number of transient clusters is difficult to estimate in real recordings.It is related to the distribution of spike data and recording noise, especially when the recorded neural signal is contaminated by non-Gaussian noise like electrode-drifting or animal moving artifacts, which are unavoidable to happen in the long-term recordings.All these factors can increase the number of transient clusters in the sorting procedure.For example, figure 2 shows the maximum number of transient clusters under different noise level in a simulated dataset, which is from the OSort paper and each set of data contains three neural clusters [44].The spike amplitudes are normalized into 1 and the noise level is expressed as the standard deviation of recording noise.Figure 2 demonstrates that the maximum number of transient clusters are varying greatly, from minimal 40 to maximal 201 clusters under different noise level.However, the capability designed to handle the transient clusters is usually limited in hardware.Thus, strategies have to be taken to estimate the number of occurring clusters, but it is usually impractical in real recordings.To solve this problem, the handled clusters are estimated by experimental observations in some designs [36,47,49], or monitoring the cluster size and removing the small size clusters to handle the case of cluster overflow [36,41,49].The hardware architecture is thus complicated.
The second issue caused by the transient clusters is the memory cost.Even though only three neural clusters are constructed in the dataset [44], all transient cluster centers  are required to be saved to calculate the final converging cluster centers.The extra memory required is about 13-67 times compared to the final converged clusters in the simulated dataset.This is not a problem for running OSort algorithm in a computer which has massive memory.However, for IC or ASIC design, the memory is very precious resource and its usage has to be carefully considered.Not to mention that the previous analysis is just for single-channel recording.In a commonly-used multi-channel system, 13-67 times of the memory cost is unacceptable without any optimization.Therefore, the architecture of sharing memory between channels is typically adopted to save memory in hardware implementations [36,49,54].Along with the memory cost, power consumption is another big challenge for implementing spike sorting digital signal processor (DSP) with big memory.After optimizing the architecture of hardware implementation (to save memory) and memory working voltage (to save power), the memory-related designs still take about 88% power consumption [49].Excessive power-consumption limits the OSortbased system implemented into miniaturized, portable, and low-power devices for future closed-loop control applications.
The third issue caused by the transient clusters is the heavy computation load.As described in figure 1, each new input spike requires computing distances with transient cluster centers to perform spike classification.After assigning the spike into a cluster, the cluster merging procedure is conducted between all existing centers.It is an iteration process: the cluster merging would be continually carried out between the transient clusters until all left clusters are far away from each other in distance.The large number of transient clusters obviously leads to more computations, therefore increasing computational latency and consuming more power.Thus, the development of low-power portable devices using OSort to perform real-time spike sorting is hindered.

Optimized OSort algorithm
The nature of spike sorting is to cluster a group of spikes that have similar features in the same group, which are distinct from other groups.The original OSort uses spike waveform as the spike features, which indicates that different spikes are clustered by their temporal shapes and the spike distances are used as the only factor to classify spikes and merge clusters.Combining with the streaming processed manner, it is easy to generate a large number of transient clusters, especially when the recorded neural signal is contaminated by non-Gaussian noise, which is unavoidable in real physiological conditions (electrical drifting, cell bursting activity, and local field potential occurrence) during the long term recordings [1].
To sort the neural spikes into different clusters, it has been found in previous work [34] that both Euclidean distance (ED) and correlational matching (CM) techniques achieved high sorting accuracy.And CM can better handle spikes with non-Gaussian distributions, making it more robust for physiological conditions and recording noises.The results can be explained by the fact that mathematically the correlation coefficient is much less sensitive to amplitude fluctuation as long as the spike shape is maintained, while ED may change significantly when the amplitude varies.
Inspired by this finding, we propose an optimized OSort algorithm (opt-OSort) that uses Pearson correlation coefficients (CC) between the temporal shapes as spike classification factor to reduce the transient cluster number.Neural spikes usually remain similar temporal shape under the same recording neuron [33,34,55] and this keeps the spike in the same clusters with high CC values, while the spikes from different clusters keep pretty low CC values.Yet if CC is used to replace distance as spike classification criterion, the converged cluster number could possibly be detected at the first several spikes as long as the spike shape in the same clusters is very similar.Therefore, three groups of neural spikes could potentially be detected by the first three spikes if they are belonging to different clusters, and the rest spikes would be automatically classified into the first three clusters as long as they have similar spike shapes and keep a high CC value with their respective cluster centers.This method could find the converged cluster number using a few numbers of spikes with high CC values.Hence, the process that a large number of transient clusters are merging and converging in the original OSort is avoided, and all the issues caused by transient clusters are thus addressed.
The equation of Pearson correlation coefficient used in spike sorting is shown as equation ( 2), ⃗ w i is the waveform of spike i arranged in temporal sequence.k is the number of samples in a spike.⃗ w a is the center of cluster a. ρ i a is the CC value of spike i with cluster center a. wi and wa are the averages of spike ⃗ w i and cluster center ⃗ w a .The CC has a value range of [−1,1].The value 1 indicates a very strong similarity, while −1 shows pretty poor similarity, where In order to demonstrate the characteristic of CC in the spike groups, the OSort dataset with typical noise level of 0.15 is sorted with CC as shown in figure 3. The temporal shapes of sorted all three spike groups are shown in figure 3  The flow chart of the opt-OSort algorithm is shown in figure 4. The first step of the algorithm (S1) is to set running related parameters.The definition of these parameters is listed as follows: • C number : the configurable maximum neural cluster number (4-6 for single-channel extracellularly recording).• ρ thr : CC threshold used to classify the spike into a specific cluster.• N loop1 : checking point at each N loop1 spikes.
• G size1 : the minimal cluster size at N loop1 checking point.
• N loop2 : checking point at each N loop2 spikes.
• G size2 : the minimal cluster size at N loop2 checking point.• D size : the threshold of discarded spikes.
The typical values of these parameters used in the work are introduced in table 1.After setting the parameters, the spike sorting procedure starts to work.The Input spike (S2) is sent to the module to count the total input spike number N total , and calculate the CC value ρ between the input spike and the C number of cluster centers (S3).Then, the obtained Max(ρ) value is compared with the classifying threshold ρ thr (S4).If Max(ρ) exceeds the ρ thr , the algorithm runs assigning cluster procedure (S5-S10); else the algorithm runs creating new cluster procedure (S11-S15).The two procedures are separately introduced as follows.
Assigning cluster procedure (S5-S10): as the Max(ρ) is larger than the classification threshold ρ thr , indicating that the input spike is strongly related to Max(ρ) cluster and is thus assigned to this cluster.The spike center of Max(ρ) cluster is then updated by re-calculating the spike mean, and the number of total spike N in this cluster adds one (S5).
Next, the algorithm checks if the N total is the multiple of N loop1 (S6).If not, the spike classification is completed and the algorithm will wait for another input spike (S2).If yes, the algorithm checks cluster size N in each of the sorted clusters (S7).If N is less than G size1 , the corresponding cluster is discarded as an interference cluster.In this case, the interference cluster center and spike number N are reset to zero (S8).After that, the algorithm checks if another checking point is coming (S9), that N total is multiple of N loop2 (N loop2 is much larger than N loop1 ).If yes, the algorithm checks if existing N in all sorted clusters is less than G size2 (S10).If yes, the corresponding cluster center and spike number N are also reset to zero (S8); otherwise the algorithm waits for another input spike (S2).
Creating new cluster procedure (S11-S15): If the Max(ρ) value is smaller than ρ thr , indicating that the input spike is not similar to all the existing clusters and a new cluster is created.Then, the algorithm checks if empty clusters are existing in all the C number clusters (S11).Here, the empty cluster is defined as the cluster size N is zero.If yes, the input spike is set as the new cluster center, and the new cluster size N is set to 1 (S12); if not, the spike is discarded and the discarded spike number D is added one (S13).After that, the algorithm checks if discarded spike number D is larger than D size (S14).If yes, indicating that too many spikes are discarded and the whole spike sorting procedure needs to be restarted.The total input spike counter N total , the C number of cluster centers, and the corresponding spike counter N are all reset (S15).If not, the algorithm waits for another input spike (S2).
The proposed opt-OSort algorithm provides two loop-checking points (N loop1 and N loop2 ), which basically have the same process but with different design purposes.The first loop-checking point (at each N loop1 spikes) usually sets a small N loop1 and G size1 (in this work, checking on every 200 spikes if existing clusters size is smaller than 4), to prevent the occasionally occurring spike outliers taking the position of the C number of cluster slots.Otherwise, the neural spikes are discarded due to the fact that the interference clusters may take all the C number of slots, and there is no empty cluster slot left to create a new cluster and assign the true spikes.This could happen as the interference are occurring prior to the true spikes in the neural recordings (e.g.recording noise or moving artifacts), taking the precious C number of cluster slots and being handled as new clusters.These interference clusters could be detected and removed in the first loop-checking process due to the abnormally-small size (occasionally occurring) compared to the true neural clusters (continually firing).The proposed opt-OSort sets a small N loop1 value to make a quick check and remove the interference clusters as soon as possible, thereby leaving the cluster positions to the true neural spikes.
The second loop-checking point (at each N loop2 spikes) is designed to prevent the case that the interference clusters fortunately exceed the first loopchecking threshold G size1 (e.g. a long period of recording noise caused spikes), which can be detected and removed at G size2 threshold checking.Therefore, the setting of N loop2 is larger than N loop1 , and G size2 is larger than G size1 .More than that, the second checking point could handle electrode-drifting caused recording clusters varying in the long-term recordings, in which the recording electrode is slowly drifting As described in the figure 4, the proposed opt-OSort algorithm only uses C number of cluster slots to perform the whole spike sorting procedure, therefore addressing all the issues caused by the large number of transient clusters when running the original OSort.C number is typically less than 6 in single-channel extracellular recordings configuration and is implemented into 4 in most hardware designs [31,33,36,49].In addition, the opt-OSort could automatically reject occasionally occurring spike outliers and only those that have strongly similar shapes, with steadily firing clusters are kept.Moreover, the opt-OSort could handle electrode drifting caused neural clusters varying in the long-term recordings.Compared with the conventional OSort algorithm, the present opt-OSort could achieve not just more efficient memory usage, less computational load, and less power consumption, but also more robust for various physiological conditions and recording noises.

Hardware implementation of opt-OSort algorithm
The hardware implementation of opt-OSort is shown in figure 5, which follows the flowchart of figure 4 but with hardware logic.In this design, we assume the spikes have been detected and only consider the real-time sorting function.The designed hardware contains four cluster centers: Spike Center 1-4 (C number = 4).Counter 1-4 are designed to count the spike number of the four clusters.The other running related parameters ρ thr , N loop1 , N loop2 , G size1 , G size2 , D size , are implemented as registers and configured by Uart module, which is designed to communicate with the computer.The value of these If not, the input spike will be discarded and the Counter 5 will add one to count the total discarded spikes.In the meantime, Comparator 4 module would compare the Counter 5 with D size (the threshold of discarded spikes) to decide if too many spikes are discarded and the whole system needs to be reset.If the system is reset, the Spike Center 1-4 and related Counter1-4 are reset, and the input spike counter N total is reset too.From the above description we could conclude that only simple logic is required to implement the opt-OSort.By contrast, special designs are usually required in original OSort to process a large number of transient clusters, and the complexity of designed architecture and resource usage are thus compromised.

Datasets for evaluation
In this paper, two synthetic and two in-vivo prerecorded datasets were used to benchmark the performance of the proposed opt-OSort algorithm.The first synthetic dataset, introduced in the original OSort paper [44], is used to compare the accuracy between the original OSort and opt-OSort.The second synthetic simulated dataset [1] is tested with both opt-OSort and common offline sorting algorithms.The in-vivo pre-recorded datasets, one publicly available extracellular recorded dataset [56,57] and the other dataset with spikes recorded in the olfactory bulb from freely-moving mice [34,58], are used to evaluate the performance of the opt-OSort.

OSort dataset
OSort dataset were constructed with spikes that are fired by neurons following independent Poisson distribution.The dataset contains three sets of data, and each set includes four neural traces with different levels of noise, which are scaled to have a standard deviation of 0.05, 0.10, 0.15, 0.20 and correspond to an SNR of 6.7, 3.4, 2.2 and 1.2, respectively [44].In order to construct the simulated data as closely as possible, the recordings were generated by randomly mixed with unidentifiable neural waveform as noise and superimposed with Poisson firing neural spikes.Therefore, the background noise was not Gaussian distributed and the waveform of identified neural spikes would be corrupted to enhance the sorting difficulty.

Wave_Clus dataset
The dataset was constructed from a spike library including 594 different averaging spike shapes recorded from the neocortex and basal [1].The dataset contain 23 sets of data, each set had three groups of neural spikes and was random selected from the spike library.The spike amplitudes were normalized into 1 and the noise level was expressed as the standard deviation of recording noise (from 0.05 to 0.2).The recording noise was generated from real recording superimposed with a serial of real spikes with small amplitude to mimic the noise of actual recordings generated by distant neurons.To further increase the realism of simulation conditions, three data with non-Gaussian spikes were built to simulate the real recording situation of burst firing, the correlation between spike amplitude and local field potential (LFP), electrode drifting.The dataset provide spike timestamps labels, therefore could be used to evaluate the performance of spike sorting algorithms.

Pre-recorded public dataset
This public dataset (hc-1) is an in-vivo recording dataset containing hippocampal neural activities of anesthetized rats from Buzsáki's Lab [56,57].This dataset contains both intracellular and extracellular recordings.The extracellular recordings have been adopted by various laboratories as a benchmark for clustering algorithms and were used for assessing the performance of the opt-OSort algorithm.Due to the abundant size of the original dataset, which consists of 39 experiments each with multiple recording channels, five example traces were randomly chosen to verify the sorting agreement between the opt-OSort and the KlustaKwik.

Pre-recorded experimental dataset
A set of pre-recorded neural spikes measured in the olfactory bulb from freely-moving mice were used to test proposed opt-OSort.All experiments were performed according to protocols approved by the Xuzhou Medical University Institutional Animal Care and Use Committee.The detailed procedure employed to record the neural voltages was previously described [34,58].In brief, fine wire electrodes were implanted into the olfactory bulb of the mice by the surgery for the recording.After recovery from the surgery, the head of the animal was fixed by the aluminum head plate to obtain stable neural recording.The recorded neural spikes were separated from the background noise and then sent into opt-OSort to perform real-time spike sorting.

Results
In this section, the proposed opt-OSort is initially compared with the offline OSort in the metric of spike sorting agreement using dataset from the original OSort paper.Following this, an additional publicly available Wave_Clus dataset is used to compare the sorting accuracy against six commonly used offline sorting algorithms.To demonstrate the robustness in handling in-vivo recordings, publicly available hc-1 dataset and pre-recorded experimental data from freely moving mice are employed to evaluate the accuracy of the opt-OSort.Lastly, the ability of opt-OSort to handle recording variations due to electrode drifting is demonstrated, and the real-time characteristics of the FPGA-implemented opt-OSort are assessed.

OSort dataset
Using the OSort V4.1 [44] as the reference, the sorting agreement and accuracy are compared against the opt-OSort.Figure 6 is used to show the sorting result of opt-OSort and OSort using data set of 0.1 level noise (the data set contains three neural clusters).Figures 6(A) and (B) are the spike waveforms sorted by opt-OSort and OSort, in which different neural clusters are indicated by different colors.As shown in figure 6(A), some big amplitude spikes (in red cluster) are correctly classified by the opt-OSort due to the opt-OSort using spike shape, not distance (original OSort does) to classify spikes.By contrast, the bigger amplitude spikes could result in large variations in spike distances and causing these spikes to be discarded by the original OSort.Therefore, the number of discarded spikes is smaller in opt-OSort comparing to OSort as shown in figures 6(C) and (D).Figures 6(E) and (F) are the projection of sorted spikes by the two sorters using the first two principal components (PCs).The number of discarded spikes (indicated as black dots) is smaller in opt-OSort comparing to OSort (especially in blue cluster).Figure 6(G) is the sorting agreement between opt-OSort and OSort, in which the spikes in the original OSort are used as base when calculating the agreement; the two sorters exhibit almost the identical sorting result (at least 99.1% agreement).Figure 6(H) is sorting results comparison with the ground truth.The opt-OSort demonstrates the same level of sorting accuracy as the original OSort.
The comparisons of accuracy between opt-OSort and OSort on all data from OSort paper are shown in figure 7(A).The dataset of four levels of noise (from 0.05 to 0.2) are respectively compared.The results show that: (1) both sorters demonstrate very high accuracy in all data (at least 87%); (2) two sorters are at the same level of sorting accuracy (the difference is within 2%).In fact, the opt-OSort obtains a little higher average accuracy on all noise levels than OSort.

Wave_Clus dataset
In order to further verify that the proposed opt-OSort has the same level of performance as the original OSort does, another publicly available dataset is used to test the accuracy of both sorters [1].In fact, this dataset has been widely used as benchmark to demonstrate the performance of spike sorting algorithms.The dataset contains 23 sets of data and each set of data has three neural spike groups.All 23 sets of data are plotted as box chart with the sorting accuracy as shown in figure 7(B).In average, the opt-OSort has a little higher accuracy than OSort indicated as the star.Figure 7(C) is the comparison of the transient clusters between two sorters during performing spike sorting.The proposed opt-OSort uses fixed 3 clusters to perform the whole spike sorting procedure, which is calculated as base (1 time).While, the OSort generates much higher number of transient clusters on all data sets: from 291 to 397 times of clusters are generated during the spike sorting procedure.Figure 7(D) shows the comparison against six commonly used software spike sorting algorithms.The opt-OSort obtains 79.8% average sorting accuracy, which is consistent with other hardware sorters in handling this   dataset [41,49].Although the accuracy of opt-OSort is not as good as other offline algorithms, the other algorithms have their own limitations.Support vector machine (SVM), Bayes, artificial neural network (ANN), and KlustaKwik are supervised algorithms and a training process is required.K-means is an unsupervised offline data clustering algorithm.However, K-means requires pre-assigning the cluster number, which is usually unknown information before the recording.Wave_Clus is another unsupervised, parameter-free data clustering algorithm but is computation-expensive, and the performance will be compromised when handling a big number of spikes.Not to mention that all introduced algorithms are offline methods, which could not be used to process real-time recorded screaming spikes.The proposed opt-OSort is an unsupervised, online, spike sorting algorithm, and is specially optimized for handling streamingly captured neural spikes in real-time.

Pre-recorded public dataset
The extracellular recordings from hc-1 dataset are utilized to access the performance of the proposed opt-OSort.The sorting result from KlustaKwik is served as a base and the sorting agreement between KlustaKwik and the proposed opt-OSort is demonstrated in figure 8. Spikes were extracted by KlustaKwik then sent to opt-OSort for classification.Meanwhile, those spikes were also being classified by KlustaKwik with the default settings.A total of 5 randomly selected traces are utilized.Figures 8(A)-(C) illustrate three example clusters detected in a recording trace sorted by the opt-OSort algorithm with color coding corresponds to trace 5 in figure 8(D).Figure 8(D) demonstrates the sorting agreement of five multi-cluster traces from this dataset.The maximum sorting agreement is 97.9%, while the minimum is 90.0%.The average sorting agreement is 95.7% indicating the proposed algorithm has similar accuracy with KlustaKwik when processing in-vivo neural spikes.

Pre-recorded experimental dataset
Pre-recorded neural voltages measured in the olfactory bulb from freely-moving mice are also used to test the accuracy of opt-OSort in handling real neural spikes [34].Since there is no ground truth in this dataset, the sorting agreement with widely-used commercial spike sorting software (offline sorter, Plexon Inc., Dallas TX) is employed to demonstrate the capability of opt-OSort.The sorting results from offline sorter are used as base, and the sorting agreement is calculated as the ratio of spikes in the opt-OSort divided spikes of the same clusters in the offline sorter.The temporal shapes of four neural clusters sorted by the two sorters are demonstrated in figure 9(A).The size of each cluster obtained from two sorters is also illustrated in the figure.The results show that the number of spikes is similar in the same cluster.Figure 9(B) is the sorting agreement of two sorters in all four clusters.In average, there is as high as 92.1% sorting agreement achieved, which demonstrates the performance of opt-OSort in handling real neural spikes.

Handling recording variation
In the real electrophysiological recording especially in long-term recording experiments, the recording electrode, which is typically implanted into the head of the animal by surgeries, will slowly drift away from the original recording positions.The spike amplitude of recorded neurons would change.Even, spikes from a recorded neuron may not be measured again and spikes from distant neurons are gradually captured.The proposed opt-OSort could handle this situation well and be adaptive to the variation of recording neurons.We construct a dataset using spikes from OSort data at the noise level of 0.15 to simulate the situation of electrode drifting, in which the electrode is slowly drifting away from blue neuron and being closed to another previously far away green neuron as shown in figure 10.
In figure 10, three groups of neural spikes are streamingly sent into the opt-OSort.The sorting procedure is illustrated following the timelines (T1-T4) and the recorded neurons in each specific time are expressed with color stars shown at the top of figure 10.The spikes of each neuron are projected as neural points with the same color as the stars using the first two PCs in the following four subfigures.At the beginning of time T1, the spikes from two recorded neurons are classified and shown as blue and red neural groups.As the time goes to T2, more streaming spikes from red and blue neurons are sorted by the opt-OSort and the discarded spike outliers are expressed as black points.At the time of T3, a new group of spikes from the green neuron are gradually occurring and the blue neural spikes is gradually stopping update, which is used to simulate the electrode drifting away from the blue neuron and being close to the green neuron.At the time of T4, the spikes from green and red neurons are continually increasing and the blue neural cluster is removed by the opt-OSort in the second loop-checking point, where the spikes in the blue cluster are counted to be smaller than G size2 threshold.

Real-time characteristics of opt-OSort in hardware implementation
The proposed opt-OSort has been designed as digital logic and implemented into a ZYNQ XC7Z020 FPGA development broad (PYNQ Z1, Xilinx CA) to exhibit the performance of real-time sorting.The logic is designed by Verilog hardware description language synthesized on FPGA using Xilinx development Tools (Vivado design suite, Xilinx, CA).The hardware architecture of the opt-OSort is introduced in figure 5 and the whole design is running at a 100 MHz system.The hardware implementation supports four clusters, and the clock cycle usage of each module is listed in table 2.
The hardware opt-OSort includes four major functions: (1) spike classification.The input spike is calculated CC value with cluster centers by the Computing Correlation Coefficient module, and 66 cycles are taken to complete the computation.In our implementation, 4 Computing Correlation Coefficient modules are parallelly designed to get the four CC values at the same time.Then, the maximum CC value (Max(ρ)) is obtained at Comparator 1. Following, the Comparator 2 takes 1 cycle to decide if Max(ρ)   From table 2 we conclude that the opt-OSort only takes 68 cycles, 0.68 µs to finish the Spike Classification function, which indicates that about 1470 588 spikes could be sorted in one second by implemented opt-OSort.The other three functions are pipelining design and working with the Spike Classification function at the same time during the whole sorting procedure.Therefore, as a new spike is calculating CC values for classification in the Spike Classification function, the Spike Update function is updating the cluster center of the last spike assigned.In the meantime, the Two Loop-checking and Discarded Spikes functions are monitoring the four cluster states and related control signals.Since the other three functions is using fewer cycles (56 cycles) than Spike Classification function (68 cycles), there are no extra waiting cycles required for finishing the three functions when the Spike Classification function is in proceeding.The whole system latency of the opt-OSort measured to be 68 cycles.By contrast, the cycles used in the original OSort to classify spikes are usually varied.Because the spike is classified by comparing the distances between the spike and all the existing cluster centers, and the occurring cluster centers are changed in the sorting procedure.Next, after assigning the spike into a cluster, the clusters merging are iteratively conducted and the time usage is usually uncertain.

Discussion
OSort, an unsupervised, online spike sorting algorithm, has demonstrated its performance and robustness in several studies, leading to its implementation in FPGA or IC design for real-time spike sorting [36,41,47,49,53,54].Because spikes are sequentially fed into OSort, the initial clusters are typically provisional, necessitating a cluster-merging process among numerous transient clusters before cluster convergence.These transient clusters lead to challenges for hardware implementations: (1) a substantial memory is required to store transient cluster centers, as all occurring cluster centers must be retained to accurately calculate the final converging ones through cluster merging; (2) spike classification and cluster merging within transient clusters result in increased computation; (3) system latency and power consumption escalate due to heavier computation; (4) additional design are essential to manage cluster overflow; (5) there's increased complexity in the architecture to finalize the sorting function.These limitations are even more severe in multi-channel systems as memory usage and computational scale proportionally increase with the channel number, making the original OSort not suitable for low-power, portable applications conducted on small-size animals without optimization.Although specific hardware OSort architectures have been proposed to mitigate these issues [36,41,49,54], a more efficient solution remains wanting.
The presented opt-OSort utilizes CC as the spike classification criterion, in which the spike shapes are employed to differentiate spikes.Highly similar shapes are classified into the same groups and maintain high CC values within the clusters, which are reasonable assumptions as given that the same neuron typically fires similar temporal profiles [25,26,33,34,36,54,55].Using CC as the criterion to evaluate spike similarity, makes the entire sorting procedure complete within the C number of cluster slots, and eliminates the need for a cluster-merging process between transient clusters as in the original OSort algorithm.This optimization significantly reduces computations required by OSort within the spike classification and cluster merging before the cluster converging.Especially in cluster merging calculations, which is an iterative process continually running among all transient clusters until no cluster distances trigger the merging threshold (T M ).Notably, this process would frequently run once a spike is assigned to an existing cluster, whereas are avoided in the proposed opt-OSort.All these advantages allow opt-OSort to be implemented into a computationally lightweight system.Moreover, considering that the number of transient clusters generated in OSort is indeterminate, the initial step to implement hardware OSort involves estimating or limiting the occurring clusters.Extra designs have to be developed to prevent cluster overflow during the operation of OSort [31,36,42,44].Consequently, this leads to a more complex architecture and increased resource usage.In contrast, the proposed opt-OSort only conducts operations within cluster slots, thereby permitting implementation with a simpler architecture.All these benefits make the proposed opt-OSort particularly suited for hardware implementation of miniature systems.
Thanks to the quick converging feature of CC, both the computational load and power consumption of the proposed opt-OSort can be further reduced.Utilizing CC as the classification criterion, the opt-OSort does not require the cluster centers to converge into the 'true' position as long as the shapes of the calculated centers have 'converged' .This is demonstrated in figure 11 using a simulated dataset.After approximately 20 spikes, the calculated cluster centers bear a strong resemblance to the converged centers, despite potential small discrepancies between the calculated centers and the final converged centers.Therefore, for hardware implementation of opt-OSort, an additional counter could be integrated into the design to track the spikes of each cluster.Once the cluster size exceeds a configurable threshold (i.e.20), the sorting process could cease updating the spike centers.This operation could further decrease the computation with negligible impact on sorting results.Even the two loop-checking functions could be skipped once the calculated centers are obtained.The opt-OSort could be employed as a CC-based template-matching classifier, where the templates are those centers that have counted to cease updating.Through these operations, the computations and power consumption of the proposed opt-OSort can be further optimized.
The presented opt-OSort, owing to its use of a fixed number of cluster slots, remarkably decreases the number of occurring clusters during the sorting procedure-by up to 397 times in the simulated dataset compared to the original OSort, as depicted in figure 8. Therefore, memory usage is significantly reduced in the proposed opt-OSort.Table 3 compares the memory usage of our research with other OSort-based hardware implementations [35,36,47,49,53,54].Considering that these systems store a cluster center with a varying number of samples (32 or 64), each sample's resolution differs (12 bits, 14 bits or 16 bits).Also, supported channels are varied, and implementations are executed on different hardware platforms (FPGA or IC).To ensure a balanced comparison, we utilize the available memory slots to facilitate cluster convergence in single-channel designs, thereby comparing the memory usage across OSort-based hardware implementations.Our approach achieves a minimum memory-saving of 2.5 times in the number of transient clusters and 50% raw memory reduction [36,49].In some designs, memory usage is maximized, reaching up to over 80 times [41].Even though the most memory-demanding work [41] consumes considerable memory, it delivers the fastest sorting speed among all compared designs (0.25 µs), thanks to its parallelly executed sorting procedure in the parallel OSort.Our research strikes a balance between a 0.68 µs sorting time and the lowest memory usage, marking a major difference compared to other studies.Additionally, our method could reduce power consumption due to decreased memory usage.According to previous research [49], despite optimization of memory architecture and memory working voltage, memory-related design continues to account for most of the power consumption (about 88%) in the implemented spike sorting DSP.
Our opt-OSort utilizes a specific number (C number ) of cluster slots to sort spikes, while maintains a high accuracy.It operates based on two conditions: (1) the same neuron fires strongly similar temporal profiles that are unique between different neurons, and this distinctness can be quantified by CC, as depicted in figure 3. (2) Interfering clusters are eliminated through two loop-checking processes, reserving the cluster space for genuine spike groups.The first, smaller loop-checking process, is designed to remove the spike outliers as quickly as possible.The second, larger loop-checking process is particularly useful for handling electrode-drifting induced recording variations.This phenomenon, which involves alterations in the amplitude and grouping of initially recorded spikes, is unavoidable in the long-term extracellular recording experiments.Our proposed opt-OSort uses spike shape as a feature to classify spikes, making it much less sensitive to amplitude fluctuations as long as the spike shape is preserved.In contrast, distances can change significantly when the amplitude varies, as seen in the original OSort.The CC-based classifier can better manage spikes with non-Gaussian distributions, rendering it more robust for physiological conditions and recording noises [34].Regarding cluster variations caused by electrode-drifting, the opt-OSort removes those clusters that cease to update due to electrode position change, replacing them with newly detected spike groups.Thus, the proposed opt-OSort can track variations in recording neurons in realtime, a crucial factor for downstream closed-loop control systems where feedback control is conducted based on decoded neural information.Clearly, accurate neural information provided by a resilient, adaptive spike sorting system could assure the decoding system to make the correct decisions.
The presented opt-OSort is an online system, specifically optimized for real-time spike sorting.Its accuracy has been validated with two simulated datasets and two datasets with pre-recorded real neural spikes.An FPGA-based hardware implementation has also been demonstrated.real-time performance of the developed opt-OSort has been measured as well.The current design could be directly employed in a single-channel spike sorting system.Given the speed of this design, which classifies a spike in just 0.68 µs, the proposed opt-OSort could easily support hundreds channels by adding a multiplexer.For high-density recordings, additional designs are required to address spike duplication.This occurs when the same spike is detected by several closely situated electrodes.Another challenge is when multiple spikes from different neurons happen within a narrow time window and are recorded as a single composite waveform, thus requiring more robust sorting algorithms [59] and extra efforts on hardware designs.In addition, the artifact introduced by electrical stimulation and/or motion artifacts during freely moving animals also requires extra preprocessing as artifact removal [60][61][62].Nevertheless, once the 'processed' spikes have been obtained, the sorting function can still be executed by the proposed opt-OSort.Our current focus is on designing and implementing a real-time spike sorting system for high-density electrodes using the opt-OSort, with hopes that this system will be able to support downstream closed-loop neural control systems.For such systems, precise spike information is crucial, and rigorous processing time is a necessity.

Conclusion
In this study, we have presented an advanced realtime, unsupervised spike sorting system developed using the proposed opt-OSort algorithm.Unlike the original OSort, which struggles with a large number of unavoidable transient clusters, the opt-OSort utilizes correlation coefficient as the classification criterion to enhance the robustness of spike classification under diverse and ever-changing recording conditions, and confines the cluster slots to a fixed number to complete the entire sorting procedure, This effectively mitigates the challenges posed by transient clusters.Additionally, the opt-OSort incorporates two configurable validation loops not only to effectively reject spike outliers induced by recording noise, but also to monitor the neural variations caused by electrode drifting in real-time.This approach results in a remarkable 300-fold reduction in cluster occurrences compared to the original OSort, and achieves a 2.5-80 times reduction in memory usage relative to the existing hardware implementations of OSort.The computational demands associated with spike classification and cluster merging processes between transient clusters are significantly reduced in opt-OSort.Additionally, the requirement for extra operations to prevent cluster overflow is eliminated.These optimizations render opt-OSort a system that is easily implementable using low-cost hardware, while offering low memory usage, reduced computational load, and lower implementation complexity.The FPGAimplemented opt-OSort achieves a remarkable online spike classification time of 0.68 µs, ensuring sufficient speed for real-time processing in downstream applications, such as closed-loop neural control or brain-computer interfaces, which response is generally required within milliseconds.Future efforts will direct towards developing a real-time spike sorting system specifically tailored for high-density electrodes using opt-OSort.

Figure 2 .
Figure 2. The number of transient clusters varies under different levels of noise using OSort dataset [44].
(A) with different colors.The CC values of each spike with all three cluster centers are projected as shown in figure 3(B), and these three clusters are clearly separated by CC values.

Figure 3 (
C) demonstrates the statistics results of CC values of each cluster.The pie charts in the upper of figure3(C) count the density of the CC values that are larger than 0.8 (strong similarity) in the corresponding cluster.The results illustrate that at least 97.8% (the green cluster) of spikes have a CC value larger than 0.8 (the other two clusters are 98.8% and 98.2%) with their respective centers, which implies the spikes keep strong similarity in the same cluster.Inspired by this observation, a CC-based opt-OSort and its procedure are proposed in this work.

Figure 3 .
Figure 3. Characteristics of the correlation coefficient (CC) represented by three spike clusters from the OSort dataset [44].(A) The spike waveform of the sorted spike groups by CC indicated in different colors at noise level of 0.15.(B) 3D projection of CC values of all the spikes in the classified cluster.(C) Histogram of CC values of all classified cluster.

Figure 5 .
Figure 5.The hardware implementation of proposed opt-OSort algorithm.

Figure 6 .
Figure 6.Example results of spike sorting with opt-OSort and OSort using the OSort dataset [44].Figures (A) and (B) are the spike sorting result of opt-OSort and OSort.Sorted spike groups are indicated by different colors.Figures (C) and (D) show the discarded spikes by opt-OSort and OSort, which are plotted with black color.Figures (E) and (F) are the projection of the sorting spikes by the opt-OSort and OSort using the first two principal components (PCs).Figure (G) is the sorting agreement of the three neural groups between opt-OSort and OSort.Figure (H) is the correctly sorted spikes with ground truth between opt-OSort and OSort.
Figure 6.Example results of spike sorting with opt-OSort and OSort using the OSort dataset [44].Figures (A) and (B) are the spike sorting result of opt-OSort and OSort.Sorted spike groups are indicated by different colors.Figures (C) and (D) show the discarded spikes by opt-OSort and OSort, which are plotted with black color.Figures (E) and (F) are the projection of the sorting spikes by the opt-OSort and OSort using the first two principal components (PCs).Figure (G) is the sorting agreement of the three neural groups between opt-OSort and OSort.Figure (H) is the correctly sorted spikes with ground truth between opt-OSort and OSort.
Figure 6.Example results of spike sorting with opt-OSort and OSort using the OSort dataset [44].Figures (A) and (B) are the spike sorting result of opt-OSort and OSort.Sorted spike groups are indicated by different colors.Figures (C) and (D) show the discarded spikes by opt-OSort and OSort, which are plotted with black color.Figures (E) and (F) are the projection of the sorting spikes by the opt-OSort and OSort using the first two principal components (PCs).Figure (G) is the sorting agreement of the three neural groups between opt-OSort and OSort.Figure (H) is the correctly sorted spikes with ground truth between opt-OSort and OSort.

Figure 7 .
Figure 7. Performance comparisons opt-OSort and other spike sorting algorithms.(A) Comparison of spike sorting accuracy against OSort with the OSort dataset [44].(B) Comparison of accuracy against OSort using Wave_Clus dataset [1].(C) Comparison of transient cluster number between opt-OSort and OSort with Wave_Clus dataset [1] under different noise level (normalized).(D) Comparison of accuracy between opt-OSort and other commonly used software sorting algorithms.

Figure 8 .
Figure 8. Spike sorting results using public hc-1 dataset from lab [56, 57].(A)-(C) are spike clusters found by opt-OSort from trace 5 in (D) with corresponding color coding; (D) 5 randomly chosen traces are used to compare the sorting agreement between KlustaKwik and opt-OSort.

Figure 10 .
Figure 10.The illustration of adaptivity of opt-OSort in handling of recording neurons.Three groups of neural spikes are used with different color, discarded spikes are plotted as black color.The sorting procedure of opt-OSort is demonstrated following the timeline (T1-T4) shown at top of the figure.The neurons occurring at specific time are illustrated with color stars.The sorting results at different time (T1-T4) are shown in the following 4 figures.Not updated neural groups are exhibited as gray color.

Figure 11 .
Figure 11.The CC value between cluster centers calculated by the varying number of spikes and the final cluster centers.

Table 1 .
The parameters used in the proposed opt-OSort algorithm.
size2 .The new occurring clusters are gradually taking the positions of removed clusters.This feature is demonstrated in the result section.After the two rounds of loop-checking, the left clusters are belonging to those steadily firing neurons, and all those occasionally appearing spike outliers are detected and removed.

Table 2 .
Timing characteristics of the opt-OSort.

Table 3 .
Comparison with other real-time OSort-based spike sorting systems.