Efficient spatio-temporal feature clustering for large event-based datasets

Event-based cameras encode changes in a visual scene with high temporal precision and low power consumption, generating millions of events per second in the process. Current event-based processing algorithms do not scale well in terms of runtime and computational resources when applied to a large amount of data. This problem is further exacerbated by the development of high spatial resolution vision sensors. We introduce a fast and computationally efficient clustering algorithm that is particularly designed for dealing with large event-based datasets. The approach is based on the expectation-maximization (EM) algorithm and relies on a stochastic approximation of the E-step over a truncated space to reduce the computational burden and speed up the learning process. We evaluate the quality, complexity, and stability of the clustering algorithm on a variety of large event-based datasets, and then validate our approach with a classification task. The proposed algorithm is significantly faster than standard k-means and reduces computational demands by two to three orders of magnitude while being more stable, interpretable, and close to the state of the art in terms of classification accuracy.


Introduction
Event-based vision sensors such as the DVS [1] and ATIS [2] promise several advantages over conventional cameras, including a high dynamic range and independent pixels that are directly driven by scene activity at microsecond temporal resolution. These sensors take inspiration from the biological retina by replicating the low redundancy acquisition mechanism while consuming little power. Artificial intelligence and robotics applications might benefit from algorithms developed around such a sensory modality, especially when computing resources are limited.
A large spectrum of techniques is associated with the processing of events. On one end, we encounter algorithms that integrate events over a long time interval to produce frames which can then be processed with classical computer vision and machine learning techniques [3][4][5]. Naively reducing a large number of events to a frame is likely to remove information that is relevant to computing tasks such as object recognition or sensory-motor integration. Therefore, we choose to focus on algorithms that process individual events as they would maintain the benefits associated with event-based sensors [6].
In this work, we tackle the problem of learning from events through a probabilistic learning approach based on Gaussian mixture models (GMMs). To keep track of the scene dynamics, we use time-surfaces which provide us with the spatio-temporal context surrounding an event [7]. The time-surface, as further discussed in section 2.1, allows us to extend probabilistic learning techniques to events without breaking the asynchronous nature of the events.

GMMs and events
The GMM is a probabilistic machine learning model that spans a wide range of applications from astrophysics [35] to computer vision [36,37] and speech recognition [38]. For computer vision especially, identifying patterns of local image features with GMMs is arguably one of the most well-studied approaches [39]. Parameters of the model are typically optimized using expectation-maximization (EM), which is an algorithm for maximum likelihood estimation. The complexity of a GMM learning model was until recently bound to be a function of the number of data points and the number of mixture components. As datasets grow larger, it is increasingly difficult to apply traditional clustering methods and improvements in computational complexity are all the more necessary. Novel algorithms for training mixture models with increased efficiency [40,41] are particularly promising for learning from large datasets.
In this work, we present a method for fast and efficient EM-based learning of Gaussian mixtures and we apply it to large event-based datasets. Probabilistic event generation models that account for sensor noise are typically represented by a Gaussian distribution, as event camera noise statistics are still not completely understood [8,42]. The proposed algorithm is sub-linear in computational complexity with respect to the number of Gaussian components in the mixture. As GMMs are highly sensitive to initialization [43], the latest advances in seeding algorithms allow us to produce good initial clusters without compromising scalability to very large datasets [44,45]. Lightweight coresets can optionally be used to further reduce algorithmic complexity [46,47].
Our approach introduces a stochastic method for identifying a truncated approximation of the posterior during learning. Truncated approximations [48] have been previously employed on latent variable models to identify subspaces of the posterior that contain most of the posterior mass [49][50][51][52]. Empirical evidence has shown an improved performance in avoiding local optima during learning [53,54], however, truncation is typically carried out using deterministic methods which are inherently more sensitive to the local optimum problem. To our knowledge, no work explores stochastic variants for truncation. The proposed algorithm maintains a record of cluster similarities to identify clusters with a high probability of being near a datapoint without having to compute the posterior over all dimensions. The extension to event-based vision is ensured by an approach that uses time-surfaces [7], a feature representation that relates events that are close to each other in space and time. The combination of these two objectives leads to the proposed event-based GMM model.
We first evaluate the performance, quality, and stability of the algorithm on large event-based datasets that have been extensively used in the relevant literature such as POKER-DVS, a 4-class playing card pips dataset [55]; N-MNIST, a handwriting recognition dataset [56]; N-CARS, a car classification dataset [32]; and DvsGesture, a gesture recognition dataset [57]. The proposed algorithm is two to three orders of magnitude faster and more efficient than standard k-means and can be applied to a wide variety of tasks without compromising cluster quality and stability which is important for reproducibility and consistent inference. We then run a classification task on the same datasets and compare our method to state-of-the-art efficient clustering techniques [58,59] as well as some of the event-based feature extraction methods discussed in the previous section. Our algorithm manages to reach near state-of-the-art accuracy and is competitive against most current event-based learning methods, considering particularly the significant speedups in feature extraction.

Time-surfaces as spatio-temporal features
Event-based sensors register a change in visual information as an event e = {x, y, t, p} characterized by spatial coordinates x, y, a timestamp t, and a polarity p ∈ {−1, 1} indicating whether the luminance increased or decreased. As events are generated, they can be used to build and update time-surfaces (see figure 1). The polarity value does not carry useful information for this study and is therefore not taken into account.
Time-surfaces are 2D structures that make it possible to apply well-known machine learning algorithms on event streams by taking into consideration the spatio-temporal context around each event as formally introduced in [7]. They can encode both shape and motion by taking advantage of the temporal relation between neighboring events. For an event e_i occurring on pixel (x_i, y_i) at time t_i, we can create a time-surface s_i by applying an exponential decay kernel to the timestamps of the most recent events within a (2r + 1)² spatial neighborhood M_i of radius r such that:

s_i(k) = exp(−(t_i − t_k)/τ), ∀k ∈ M_i,   (1)

where τ is the decay constant reflecting the scene dynamics. Since t_i is the time of the current and most recent event, we always have t_k ≤ t_i. The exponential decay is a costly function to compute. As an alternative, we can use a piece-wise linear function to speed up the processing of events into time-surfaces such that:

s_i(k) = max(0, 1 − (t_i − t_k)/τ′), ∀k ∈ M_i,   (2)

where τ′ is a decay constant similar to τ. In practice, we generate the discretized form, s^(n), of the time-surface using the Tonic Python package [60]. s^(n) is a D-dimensional vector which can be rasterized into a grid of pixels centered around the pixel that recorded the event e_i. The neighborhood of radius r therefore becomes a ρ × ρ grid of side ρ = 2r + 1, with D = ρ². This gives the time-surface a layout similar to an image patch, with the only difference being that pixel intensities encode the time difference with earlier events.
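As an illustration, the construction above can be sketched in a few lines of Python. This is a hypothetical helper for a single event, not the Tonic implementation; the function name, arguments, and defaults are ours:

```python
import numpy as np

def time_surface(events, x, y, t, radius=2, tau=5e4, decay="exp"):
    """Build a time-surface around the event at (x, y, t).

    events: array of shape (K, 3) with columns (x, y, timestamp).
    Returns a (2*radius+1, 2*radius+1) array of decayed values in [0, 1].
    """
    side = 2 * radius + 1
    # Most recent timestamp per neighbourhood pixel (-inf if none fired).
    latest = np.full((side, side), -np.inf)
    for ex, ey, et in events:
        dx, dy = int(ex - x), int(ey - y)
        if abs(dx) <= radius and abs(dy) <= radius and et <= t:
            iy, ix = dy + radius, dx + radius
            latest[iy, ix] = max(latest[iy, ix], et)
    dt = t - latest                          # time since last event, per pixel
    if decay == "exp":
        s = np.exp(-dt / tau)                # exponential kernel, eq. (1)
    else:
        s = np.maximum(0.0, 1.0 - dt / tau)  # piece-wise linear, eq. (2)
    s[np.isinf(dt)] = 0.0                    # pixels that never fired
    return s
```

Rasterizing the returned grid with `.ravel()` yields the D-dimensional vector used as input to the clustering algorithm.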

Stochastic EM approximation on GMMs
Given a dataset of N events, a set S = {s (1) , . . . , s (N) } of time-surfaces is generated from all events according to equations (1) or (2) and is used as input to train the proposed clustering algorithm. Once the model converges, the cluster centers represent the elementary time-surfaces encountered in the dataset and can be used as features for classification. To classify an event stream into a specific group, the time-surfaces are assigned to the nearest cluster centers and a histogram of features is generated. Both non-linear and linear classifiers can be used at this stage, but we focus on the latter for increased efficiency and easily reproducible results. Figure 2 illustrates the GMM-based clustering pipeline from the input to the labeled events.
The clustering algorithm proposed in this work is based on GMMs and it assumes that each data point is distributed according to an observed random variable Y that follows one of a finite set of M multivariate Gaussian probability densities:

p(s^(n) | θ) = Σ_{c=1}^{M} p(C^(n) = c) N(s^(n); μ_c, σ²),   (3)

with a uniform prior probability distribution:

p(C^(n) = c) = α_c = 1/M,   (4)

where θ = {μ_{1:M}, σ, α_{1:M}} denotes the model parameters and C is a set of latent variables that take values in {1, . . . , M}.
A preliminary statistical analysis of event-based data shows that the typical covariance matrix of time-surfaces is a near-identity matrix compared to standard image features, as shown in figure 3. This primes our choice of using a single isotropic covariance σ²𝟙 for the GMM, where 𝟙 denotes a D × D identity matrix. Each Gaussian distribution is therefore defined as:

N(s^(n); μ_c, σ²) = (2πσ²)^(−D/2) exp(−d_c^(n)/(2σ²)),   (5)

where d_c^(n) = ‖s^(n) − μ_c‖² is the squared Euclidean distance between the data point s^(n) and the mean of the Gaussian indexed by c. The typical approach to fitting a GMM to a dataset is through the use of the EM algorithm [61], a two-step iterative process that maximizes the likelihood of the model with respect to θ:

L(θ) = Σ_{n=1}^{N} log p(s^(n) | θ).   (6)

Explicitly maximizing L(θ) for models with latent variables can be difficult [62]. Instead, the EM algorithm starts by computing a lower-bound to the log-likelihood in the E-step. This consists in identifying the posterior probability p_c^(n) of each Gaussian generating each data point:

p_c^(n) = exp(−d_c^(n)/(2σ²)) / Σ_{c′=1}^{M} exp(−d_{c′}^(n)/(2σ²)).   (7)

The posterior probabilities are then used in the M-step to update the model parameters and maximize the lower-bound. These two steps are repeated until convergence of the parameter estimates. The complexity of EM on GMMs is O(NMD), making it already a very efficient algorithm. Nevertheless, estimating the posterior is computationally intensive, particularly with large event-based datasets. To achieve better scaling behavior and increase computational efficiency, we rely on a truncated optimization scheme [48,52]. Instead of calculating p_c^(n), we propose to estimate the posterior over a truncated subspace K^(n) ⊂ {1, . . . , M} of clusters:

q_c^(n) = exp(−d_c^(n)/(2σ²)) / Σ_{c′∈K^(n)} exp(−d_{c′}^(n)/(2σ²)) if c ∈ K^(n), and q_c^(n) = 0 otherwise.   (8)

The posterior for clusters c ∉ K^(n) is never estimated, and therefore q_c^(n) allows for a much faster implementation than p_c^(n). In terms of total variation between p_c^(n) and q_c^(n), the approximation is optimal when K^(n) holds the clusters that account for most of the posterior mass.
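The truncated posterior is cheap to evaluate because only the candidate set contributes to the normalization. A minimal sketch with hypothetical names; the log-sum-exp shift is a standard numerical safeguard, not part of the paper's derivation:

```python
import numpy as np

def truncated_posterior(s, mu, sigma2, K):
    """Truncated posterior q_c estimated over the candidate set K only.

    s: data point of shape (D,); mu: (M, D) cluster centers;
    sigma2: shared isotropic variance; K: list of candidate cluster indices.
    Returns a length-M vector that is zero outside K and sums to one over K.
    """
    M = mu.shape[0]
    q = np.zeros(M)
    d = np.sum((mu[K] - s) ** 2, axis=1)   # squared distances, K only
    log_w = -d / (2.0 * sigma2)
    log_w -= log_w.max()                   # shift for numerical stability
    w = np.exp(log_w)
    q[K] = w / w.sum()
    return q
```

For |K| clusters the cost is O(|K| D) per point instead of O(MD) for the exact posterior.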
In order to identify the set K^(n) (equation (10)), we start by initializing it with H clusters drawn uniformly at random. At each iteration t, we update K^(n) by comparing it with R clusters, randomly sampled from a proposal distribution p(C_t | C_{t−1} = ĉ_n; S) according to equation (9), and maintaining the H best clusters:

p(C_t = c | C_{t−1} = ĉ_n; S) = S_{ĉ_n,c} / Σ_{c′} S_{ĉ_n,c′},   (9)

where ĉ_n = arg min_c {d_c^(n) | c ∈ K^(n)} and S ∈ ℝ^{M×M} is a cluster similarity matrix that assigns an increasing importance S_{i,j} to cluster pairs {i, j} that are near the same time-surfaces. In other words, the proposal distribution at iteration t represents the neighborhood of the cluster ĉ_n that was closest to the time-surface s^(n) in the previous iteration. The candidate set is then updated as:

K^(n) ← best_H(K^(n) ∪ {c_1, . . . , c_R}),   (10)

where best_H is an operator that keeps the H smallest elements of the set ordered by the distance d_c^(n). The hyperparameters {H, R} fully specify the approximation and enable direct control of the complexity of the algorithm, as opposed to a precision estimate in typical approximations. Similarly to the exact EM algorithm, we can estimate the parameter updates using the approximate posterior q_c^(n) from equation (8) and still maintain convergence [63]:

α_c = (1/N) Σ_{n=1}^{N} q_c^(n),   (11)

μ_c = Σ_{n=1}^{N} q_c^(n) s^(n) / Σ_{n=1}^{N} q_c^(n),   (12)

σ² = (1/(ND)) Σ_{n=1}^{N} Σ_{c∈K^(n)} q_c^(n) d_c^(n).   (13)

Iterating between estimating q_c^(n) and the parameter updates (equations (11)-(13)) defines an algorithm we refer to as the data similarity GMM (D-GMM).
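A simplified Python rendition of one D-GMM iteration follows, under our own naming. This is a sketch, not the paper's C++ implementation: the similarity matrix update and the σ² update are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def dgmm_estep(X, mu, sigma2, K_sets, Sim, H, R):
    """Truncated E-step sketch. Sim is the cluster similarity matrix S,
    K_sets[n] the candidate set for point n. For each point: sample R
    clusters from the similarity row of the previous winner, merge with
    the current set, keep the H nearest, and normalize over those only."""
    N = X.shape[0]
    M = mu.shape[0]
    Q = np.zeros((N, M))
    for n in range(N):
        prev = np.asarray(K_sets[n])
        d_prev = np.sum((mu[prev] - X[n]) ** 2, axis=1)
        c_hat = prev[int(np.argmin(d_prev))]     # closest previous cluster
        p = Sim[c_hat] / Sim[c_hat].sum()        # proposal distribution
        cand = rng.choice(M, size=R, p=p)
        pool = np.unique(np.concatenate([prev, cand]))
        d = np.sum((mu[pool] - X[n]) ** 2, axis=1)
        order = np.argsort(d)[:H]                # keep the H nearest
        keep = pool[order]
        K_sets[n] = keep
        w = np.exp(-d[order] / (2.0 * sigma2))
        Q[n, keep] = w / w.sum()                 # truncated posterior q
    return Q

def dgmm_mstep(X, Q, mu):
    """Mean update from the truncated posterior; clusters that received
    no responsibility keep their previous centers."""
    resp = Q.sum(axis=0)
    mu_new = mu.copy()
    active = resp > 0
    mu_new[active] = (Q.T @ X)[active] / resp[active, None]
    return mu_new
```

Only R + H distance evaluations are needed per point in the E-step, which is where the sub-linear scaling in M comes from.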
D-GMM starts with a randomly initialized K^(n) that is intended to store the centers most likely to represent each data point. The set K^(n) is updated with R new centers sampled from the proposal distribution at the beginning of each M-step. After estimating the distances between the centers in K^(n) and the newly sampled centers, we only maintain the H centers nearest to the data point. The approximate posterior q_c^(n) is estimated over the set K^(n) and used to update the centers in the M-step. Since q_c^(n) is inversely proportional to the centers' distance from the data point s^(n), far-away centers would have a very low (exponentially decreasing) value of q_c^(n). The complete procedure is detailed in algorithm 1. Compared to the exact EM algorithm for GMMs, the complexity of the D-GMM method in the E-step reduces from O(NMD) to O(N(R + H)D), where typically R + H ≪ M. The complexity of the M-step becomes O(NHD + NH²). In section 3, we investigate applications of the algorithm where H² is orders of magnitude smaller than M and the model is successfully fitted to the data.
In order to highlight the importance of stochasticity in avoiding well-known local optima issues [54], we also constrain the algorithm to update K^(n) using R samples drawn uniformly at random throughout the learning process instead of the proposal distribution, i.e. after the first iteration, K^(n) is updated using the following equation instead of equation (10):

K^(n) ← best_H(K^(n) ∪ {c_1, . . . , c_R}),   (14)

where each c_r is drawn from Uniform({1, . . . , M}). The uniform variant of D-GMM will be referred to as the stochastic GMM (S-GMM).
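The only difference from D-GMM is the source of the R candidates, which a small sketch makes explicit. The helper is hypothetical; the candidates are passed in so that the uniform draw (e.g. `rng.integers(0, M, size=R)`) stays outside the function:

```python
import numpy as np

def update_candidates_uniform(s, mu, prev, cand, H):
    """S-GMM candidate update sketch: cand holds R cluster indices drawn
    uniformly at random rather than from the similarity-based proposal
    used by D-GMM; only the H clusters nearest to s survive."""
    pool = np.unique(np.concatenate([np.asarray(prev), np.asarray(cand)]))
    d = np.sum((mu[pool] - s) ** 2, axis=1)   # squared distances over pool
    return pool[np.argsort(d)[:H]]            # H nearest clusters
```

Dropping the proposal distribution removes the sampling overhead at the cost of less informed candidates.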

Implementation details

Initialization
During the first epoch of the proposed algorithm, we initialize the Gaussian centers μ_{1:M} with the AFK-MC² method, which is an efficient approximation of the k-means++ seeding algorithm [45]. This algorithm samples an initial center μ_1 ∈ S uniformly at random and then uses it to derive the proposal distribution g(s|μ_1). A Markov chain of length m is used to iteratively sample sufficiently distinct new centers μ_{2:M} from the data. AFK-MC² requires one pass of complexity O(ND) through the data to define the proposal distribution g(s), and the centers are sampled with a complexity of O(mM²D).
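A compact sketch of AFK-MC²-style seeding, assuming the standard regularized proposal from [45] (half D²-weighting, half uniform); function and variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(2)

def afkmc2_seeding(X, M, m=5):
    """AFK-MC^2-style seeding sketch: approximate k-means++ with short
    Markov chains. One O(ND) pass builds the proposal g from the first
    (uniform) center; each subsequent center is the end state of an
    m-step Metropolis-Hastings chain targeting the D^2 weighting."""
    N = X.shape[0]
    centers = [X[rng.integers(N)]]
    d0 = np.sum((X - centers[0]) ** 2, axis=1)
    g = 0.5 * d0 / d0.sum() + 0.5 / N        # regularized proposal
    for _ in range(1, M):
        def d2(j):                            # squared distance to nearest center
            return min(np.sum((X[j] - c) ** 2) for c in centers)
        idx = rng.choice(N, p=g)
        d_cur = d2(idx)
        for _ in range(m - 1):
            j = rng.choice(N, p=g)
            d_new = d2(j)
            # Metropolis-Hastings acceptance for the importance ratio
            if d_cur == 0 or rng.random() < (d_new * g[idx]) / (d_cur * g[j]):
                idx, d_cur = j, d_new
        centers.append(X[idx])
    return np.array(centers)
```

The uniform term in g keeps the chain well-defined even when a candidate lies exactly on an existing center.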

Lightweight coresets
To further improve computational efficiency, data summarization strategies such as lightweight coresets can be used to obtain smaller training sets [47, 64-67]. A coreset S′ = {(s^(1), w_1), . . . , (s^(N′), w_{N′})} is a representative subset of the full dataset S in which each data point is weighted according to its significance in the original dataset. The weights w_{1:N′} act as a multiplicative constant on the parameters and therefore the parameter updates become:

α_c = Σ_{n=1}^{N′} w_n q_c^(n) / Σ_{n=1}^{N′} w_n,   (15)

μ_c = Σ_{n=1}^{N′} w_n q_c^(n) s^(n) / Σ_{n=1}^{N′} w_n q_c^(n),   (16)

σ² = (1/D) Σ_{n=1}^{N′} Σ_{c∈K^(n)} w_n q_c^(n) d_c^(n) / Σ_{n=1}^{N′} w_n.   (17)

These updates can replace equations (11)-(13) in algorithm 1 to allow applications on a coreset S′. Working on coresets introduces an approximation error that has been analyzed rigorously in earlier work [47].
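The lightweight-coreset construction of [47] importance-samples points around the dataset mean; a minimal sketch under our own naming:

```python
import numpy as np

rng = np.random.default_rng(3)

def lightweight_coreset(X, size):
    """Lightweight-coreset sketch: sample `size` points with probability
    q mixing a uniform term with the squared distance to the dataset
    mean; weights 1/(size * q) keep weighted sums unbiased."""
    N = X.shape[0]
    d = np.sum((X - X.mean(axis=0)) ** 2, axis=1)
    q = 0.5 / N + 0.5 * d / d.sum()      # sampling distribution
    idx = rng.choice(N, size=size, p=q)
    w = 1.0 / (size * q[idx])            # importance weights
    return X[idx], w
```

The weighted pairs can be fed directly into the weighted parameter updates above in place of the full dataset.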

Experiments and results
In this section, the performance and quality of D-GMM (algorithm 1) and the uniform variant, S-GMM, are evaluated in comparison to standard k-means initialized with the k-means++ method [43]. We first assess convergence on an artificial dataset with known ground truth [68] and then move on to an evaluation of the proposed clustering algorithms on a series of event-based vision datasets including POKER-DVS [55], N-MNIST [56], N-CARS [32], and DvsGesture [57]. The last three mentioned datasets are substantially large; to put this into perspective, a single recording of N-MNIST has an average of 4176 events and a total of more than 250 million events in the training set alone. As such, the clustering evaluation for these three datasets was done using a random subset to run the k-means baseline in a reasonable amount of time. Finally, we test the proposed algorithms on a real-world classification task, this time using the full datasets, and we compare them to feature extraction methods specifically designed for dealing with events.
The cluster centers are initialized using the AFK-MC² method with a Markov chain length m = 5. As a stopping criterion, the iterative EM process is terminated when the increase in the lower-bound of the log-likelihood is less than ε = 10⁻⁴. Our algorithms were implemented in C++ with the help of a vectorization library, and numerical results are reported as an average of 5 trials unless otherwise stated.
As a reminder, the notations used in this work are listed below:
• N: number of data points (time-surfaces)
• M: number of clusters (Gaussian mixture components)
• D: dimensionality of a time-surface
• H: number of clusters maintained in the truncated set K^(n)
• R: number of clusters sampled from the proposal distribution at each iteration
• m: length of the Markov chain used for AFK-MC² seeding
• τ: time-surface decay constant

Convergence analysis on artificial data
We explore the convergence properties of our truncated stochastic approach on an artificial dataset with N = 5000 data points and 15 cluster centers [68]. In figure 4 we measure the root mean squared error (RMSE) between learned centers and the ground truth across 100 trials, setting the hyperparameters to M = 15, H = 3 and R = 6. D-GMM and S-GMM are better at recovering the actual cluster centers compared to standard k-means. Results suggest the stochastic approach is more effective at avoiding locally optimal solutions. Additionally, working with a truncated set can improve numerical stability by ignoring low-probability clusters.

Clustering evaluation on event-based data
For a more descriptive analysis, we evaluate the complexity and stability of our clustering algorithms on a series of large event-based vision datasets. We avoid using coresets initially to make a fair comparison with the k-means baseline. Table 1 gathers, upon availability from the sources, the E-step time complexities of state-of-the-art efficient clustering algorithms to which D-GMM and S-GMM are compared:

Table 1. E-step time complexities of efficient clustering algorithms.
HGMR [58]: O(N log(M)D)
JRMPC [59]: O(NMD)
Greedy [69]: O(NM²D)
D-GMM / S-GMM (ours): O(N(R + H)D)

As previously mentioned, we only consider cases where R + H ≪ M, as this appears to be sufficient for most applications. We also experimentally assess the complexity of the proposed algorithms. Figure 5 shows the scaling behavior of our algorithms with an increasing number of clusters M on POKER-DVS (top) and N-CARS (bottom), with M ranging from 200 to 1400 clusters and hyperparameters set to H = 5 and R = 10. On the left, we show the operations complexity, measured according to the number of distance evaluations d_c^(n), and on the right, the time complexity in seconds. Runtimes were measured on an Intel Core i7-9700K CPU. Results suggest k-means is linear in both time and operations complexity while the proposed algorithms scale sub-linearly. POKER-DVS is a relatively small dataset, so the overhead associated with sampling from the proposal distribution (D-GMM) compared to uniform sampling (S-GMM) does not appear in the runtime. For larger datasets such as N-CARS, the overhead becomes non-negligible, reducing the difference in time complexity between D-GMM and S-GMM compared to the operations complexity.

Stability
We assess the proposed algorithms' ability to recover the same clusters using different initializations. We cluster N-CARS and POKER-DVS with hyperparameters set to M = 500, H = 5, and R = 10, and compare the learned centers of each pair of runs using a normalized RMSE. Figure 6 shows the mean and standard deviation of the errors. On both datasets, both algorithms show improvements in stability over k-means, which is important for reproducibility. The increased stability of S-GMM compared to D-GMM is likely due to better local optima avoidance when drawing R clusters from the uniform distribution instead of limiting them to cluster neighborhoods [70]. These results lend further credibility to our rationale for exploring stochastic truncated approximations.

Hyperparameter selection
On POKER-DVS and N-CARS with M = 500, when selecting optimal values for the hyperparameters, we set H = 5 under the assumption that cluster centers with lower probabilities have a negligible effect because the probability values of the exact posterior decay exponentially. As seen in figure A1 of the appendix, on both datasets, D-GMM is faster and more computationally efficient than both S-GMM and the k-means baseline for values of R in the range of [10, 30]. Beyond that, the overhead associated with sampling values from a proposal distribution impacts the runtime. However, this is purely implementation-dependent and can be optimized further. Our choice of R = 10 throughout these experiments is effectively a compromise between runtime and computational efficiency.

Performance with coresets
Lightweight coresets can also be used to further improve performance by summarizing the training data [47]. Figure 7 investigates the cluster quality for different coreset sizes on POKER-DVS and N-CARS. Quality is measured using the relative quantization error, with k-means on the full dataset as our baseline, according to:

η = (Q_algorithm − Q_k-means) / Q_k-means,

where Q_algorithm = Σ_{n=1}^{N} min_c ‖μ_c − s^(n)‖² is the quantization error of the GMM trained with the respective algorithm. Marker size on these plots represents the number of distance evaluations, and results suggest an increase in performance associated with larger coresets. In all cases, the proposed algorithms cluster data with the lowest relative quantization error to the baseline for the least amount of distance evaluations, which translates into higher quality clusters than k-means.
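The relative quantization error can be computed directly from two sets of centers; a small sketch with hypothetical helper names:

```python
import numpy as np

def quantization_error(X, centers):
    """Q: sum over data points of the squared distance to the nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def relative_quantization_error(X, centers, baseline_centers):
    """eta = (Q_algorithm - Q_baseline) / Q_baseline, with the k-means
    solution on the full dataset playing the role of the baseline."""
    qb = quantization_error(X, baseline_centers)
    return (quantization_error(X, centers) - qb) / qb
```

A value of η near zero means the coreset-trained model matches the full-data baseline; negative values indicate lower quantization error than the baseline.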
We also expand the analysis with coresets on larger event-based datasets. Results are summarized in table 2. The D-GMM and S-GMM algorithms are parameterized with H = 5 and R = 10 and are allowed to execute a different number of iterations until they satisfy the stopping criterion. Since the truncated algorithms go through fewer clusters in each iteration, convergence requires more iterations than k-means. Nevertheless, iterations are significantly cheaper in terms of distance evaluations d_c^(n), leading to the observed increase in speed and computational efficiency. With coresets, D-GMM consistently has the lowest relative quantization error η across all datasets and reaches the highest average speedups in terms of runtime and distance evaluations.

Classification on event-based data
As an unsupervised learning method, clustering is useful for data exploration and to find structure in unlabeled data. For benchmarking purposes, we run a classification task on the same four event-based datasets used in the previous section.
To create features for training D-GMM and S-GMM, we generate time-surfaces for each event according to equation (1). Parameters are set to a grid side length ρ = 5 and a decay constant τ = 80 000 μs, with both polarities merged into p = 1, resulting in a patch of dimension D = ρ × ρ = 25. After training, each time-surface is assigned to a cluster center. In order to take into account the spatial component of a visual scene, we split the pixel array into sub-regions using a 4 × 4 grid. We use the cluster centers in each sub-region to create spatial feature histograms [71] describing the local statistics of the visual scene. From there, we concatenate these spatial histograms into a global histogram used to train a logistic regression classifier. The global histogram is normalized so that each dimension has zero mean and unit variance. This pre-processing step is important for faster convergence.
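The spatial histogram step can be sketched as follows. This is a hypothetical helper that assumes each event already carries the cluster index of its time-surface:

```python
import numpy as np

def spatial_histograms(xs, ys, labels, sensor_size, M, grid=4):
    """Concatenated per-cell cluster histograms (bag-of-features sketch).

    xs, ys: event coordinates; labels: cluster index assigned to each
    event's time-surface; sensor_size: (width, height). Returns a vector
    of length grid*grid*M: one M-bin histogram per spatial cell.
    """
    W, Hh = sensor_size
    hist = np.zeros((grid, grid, M))
    # Map pixel coordinates to grid cells, clamping the right/bottom edge.
    cx = np.minimum((np.asarray(xs) * grid) // W, grid - 1)
    cy = np.minimum((np.asarray(ys) * grid) // Hh, grid - 1)
    for gx, gy, c in zip(cx, cy, labels):
        hist[int(gy), int(gx), int(c)] += 1
    return hist.ravel()
```

Per-dimension standardization of these vectors (zero mean, unit variance across the training set) would then precede the logistic regression.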
In table 3, we provide an overview of the classification results compared to supervised and unsupervised methods specifically designed to deal with event-based datasets such as HOTS [7], HATS [32] and BLS [72]. To better highlight the significance of our event-based clustering pipeline, we also run the classification task on web-scale k-means [73], a popular k-means approximation, and other efficient clustering algorithms optimized for three-dimensional point clouds such as HGMR [58] and JRMPC [59]. For the last two methods, instead of generating time-surfaces, we extract xyt point clouds from our data. Classification results for our algorithms are reported as an average over 5 trials; the rest are taken from the literature [7,20,32,72,74].
We set the hyperparameters to M = 500, N = 2¹⁶, H = 5 and R = 10, effectively avoiding any data-specific parameter tuning to showcase the low parametrization effort required to get close to the state of the art in event-based classification. Both proposed algorithms have a similar accuracy on most datasets and show improvements over web-scale k-means. These results are in agreement with the clustering analysis which showed better local optima avoidance when working with a stochastic approximation to the EM algorithm over a truncated subspace. The classification accuracy of HGMR and JRMPC is mainly impacted by the temporal component in the data. Treating events as point clouds instead of simply preprocessing them into time-surfaces is a complex problem that is addressed by a separate body of work [15][16][17]. The proposed clustering algorithm is better at classifying event-based datasets than most unsupervised feature extraction methods except for HATS, which needs to have access to all past events to build average time-surfaces as evidenced by the use of a large decay constant. This process is costly and not scalable to large streams of events. Furthermore, HATS requires significant parametrization effort for each new dataset compared to our method and makes use of a stronger and less interpretable classifier. State-of-the-art methods such as SLAYER [20] rely on multilayer networks to perform better on classification tasks. Using similar constraints, we also benchmark our method with a stronger classifier, a multi-layer perceptron (MLP) with one hidden layer of 256 neurons. Compared to the logistic regression, classification accuracy is improved across all datasets. Our model reaches an accuracy of 91.6% on DvsGesture and 90.38% on N-CARS, which is the best score for a clustering-based approach. Our results are noteworthy considering the speed and efficiency of our pipeline, which is more suitable for resource-constrained environments.

Discussion and conclusion
This paper introduced a novel clustering algorithm based on GMMs and tailored for large streams of events. The speed and computational efficiency of the proposed algorithm are significantly increased without adding complex extraneous computation unlike other efficient approximations found in the literature [41]. The algorithm is also more stable than k-means in recovering cluster centers which is important for reproducibility. A fast and user-friendly C++ implementation is publicly available. The simplicity of the algorithm also makes it easily transferable to widely used machine learning frameworks.
The speedup in operations and time complexity is particularly interesting when dealing with high-definition event-based cameras or in the context of neuromorphic computing applications with very large datasets. The time complexity of current event-by-event algorithms quickly becomes prohibitive after a certain number of events [7,20,32], an issue that is not well addressed in the literature due to the reliance on subsets, small datasets, and event-based cameras with low spatial resolutions.
The proposed clustering algorithm exploits the extremely sparse nature of event-based data by building time-surfaces that extract the spatio-temporal neighborhood around each event. These features can then be combined in a bag-of-words approach for classification with a simple linear model. At this stage, non-linear classifiers can be explored for even better classification accuracy at the cost of efficiency and interpretability. This embodies the current state of neuromorphic computing, which is associated with a tradeoff between efficiency and accuracy. Pre-processing events into time-surfaces allows retaining fine-grained temporal information about scene dynamics that is typically lost when binning events to frames. This step turns out to be essential for clustering event-based data due to the non-homogeneous dimensions of raw events.
The constraints introduced to the covariance structure can affect the expressive power of our algorithm; however, event-based pixels are not strongly correlated and can be described using simpler statistics. Less constrained models were briefly explored, and as expected, non-isotropic covariance matrices significantly increased complexity with limited returns.
To maximize the potential of our algorithm, we can combine it with lightweight coresets and AFK-MC² initialization [45,47], which are state-of-the-art methods for data summarization and seeding cluster centers, respectively. Future work aims at exploring an online learning variant of the proposed algorithm through incremental parameter updates.

Appendix

Figure A1 shows the results of a hyperparameter sweep to determine the optimal value for the number of new samples R in each EM iteration. The distance evaluation speedup and runtime speedup are measured relative to the k-means baseline according to:

S_D = D_k-means / D_algorithm,   S_R = R_k-means / R_algorithm,

where S is the speedup, D is the number of distance evaluations, and R denotes the algorithm runtime. Values of R ranging from 10 to 40 appear to be optimal in terms of runtime and distance evaluations.