General spiking neural network framework for the learning trajectory from a noisy mmWave radar

Xin Liu; Mingyu Yan; Lei Deng; Yujie Wu; De Han; Guoqi Li; Xiaochun Ye; Dongrui Fan

doi:10.1088/2634-4386/ac889b

1. Introduction

In recent years, diverse applications in measurement and estimation have been rapidly emerging in industrial scenarios. As a promising detection and ranging system, millimeter wave (mmWave) radars have attracted substantial attention in solving various tasks [3, 8, 12, 26, 27, 31, 39]. Generally, a mmWave radar operates in the frequency spectrum from 30 to 300 GHz. By transmitting an electromagnetic signal and receiving the return signal, it measures the surrounding information of the target object that contains relative speed, range, and angle, even if in a harsh environment [12]. However, processing and learning from mmWave radars is non-trivial. The point cloud generated by a mmWave radar inevitably contains outliers. Such outliers are irregularly distributed in the point cloud space and are located far away from the target object, which contributes nothing but noise to the following data processing [39]. Moreover, the generated point cloud is usually sparse, making the processing more difficult. In addition, considering a common combination of the mmWave radar and other systems, such as a visual sensing system, in a complex task [3], the temporal dependency between multiple systems needs to be handled well during modeling.

Typically, it is common to process mmWave radar data using signal processing approaches [2, 36]. Nevertheless, due to the intrinsic properties of mmWave radars and the increasing data volume, conventional approaches gradually present limitations. As a growing trend, deep learning based approaches are proposed to process mmWave radar data in recent literature [8, 12, 26, 31]. For example, milliMap [26] leverages a generative adversarial network [11] to address the noise and sparsity issues, in which a conditional generator is designed to denoise and densify image patches, and a discriminator is designed to distinguish the real and generated images. mm-Pose [31] uses convolutional neural networks (CNNs) [24] to map the encoded RGB images obtained from the mmWave radar into a high-dimensional space. Unfortunately, existing deep learning based methods for processing mmWave radar data are often customized for specific scenarios [8]. In addition, the temporal dependency in mmWave radar data is challenging to model. mmMesh [39] introduces a long short-term memory (LSTM) [17] model and adopts global/local LSTM layers to encode the temporal information, which brings huge parameters and a high computation cost. Moreover, extra costs and modules are required for dealing with the noise and sparsity issues in mmWave radar data. Therefore, a generic approach to solve all the above challenges is highly expected.

Spiking neural networks (SNNs), known as the third generation of neural networks [28], exploit the rich spatial-temporal dynamics of neuromorphic neurons to represent the complex information in spike patterns. Specifically, each spiking neuron in an SNN model continuously integrates weighted spikes from afferent neurons along the temporal dimension onto the membrane potential, and fires a spike event to efferent neurons once its membrane potential crosses a threshold. In this way, SNNs have natural advantages in capturing the temporal dependency of input data [6]. Moreover, SNNs are superior in handling noisy and sparse features, owing to the robust leakage and firing mechanisms of spiking neurons [4, 6, 16]. The above unique features of SNNs imply that it is promising to build a general framework with SNNs for processing the temporal, noisy, and sparse mmWave radar data, which is the goal and focus of this work.

To this end, we first propose mm-SNN, a general neuromorphic framework for processing mmWave radar data with SNN models, which resists the defects of noise and sparsity while effectively capturing the temporal dependency. Then, we introduce multiple attention-based mechanisms into the proposed framework to further improve the data representation and model accuracy. At last, we conduct extensive experiments to evaluate the proposed framework by processing real-world mmWave radar data in a trajectory estimation scenario. Specifically, the contributions of this paper can be summarized as follows:

We propose mm-SNN, a general neuromorphic framework for processing mmWave radar data with SNN models. mm-SNN is the first work that processes mmWave radar data without using extra modules for alleviating noise and sparsity issues.
To demonstrate the effectiveness and robustness of mm-SNN, we artificially add noises in the inputs of both mm-SNN and non-spiking models. The experimental results reveal the robustness of mm-SNN.
We apply mm-SNN to the scenario of trajectory estimation and introduce multiple attention-based mechanisms to improve the overall performance of mm-SNN. We find that the general mechanisms used in deep learning, such as attention, are also able to benefit the performance of SNNs by enhancing the data representation.

The rest of this paper is organized as follows. In section 2, we first introduce the background of the mmWave radar. Then, we indicate the present obstacles in learning mmWave radar data caused by the noise and sparsity issues. To solve the challenges, we further highlight the great potential of SNNs that can be fully leveraged to process mmWave radar data. In section 3, we first introduce the foundation of SNN models. Then, we propose the design of our general neuromorphic framework, i.e., mm-SNN, for processing mmWave radar data. Based on the proposed framework, we introduce attention-based mechanisms to improve the data representation and model accuracy. In section 4, we conduct extensive experiments along with comprehensive comparisons and analyses to demonstrate the effectiveness and robustness of mm-SNN. Finally, we conclude this paper in section 5.

2. Preliminary

2.1. Backgrounds of mmWave radar

mmWave is a particular electromagnetic wave whose wavelength lies in 1 to 10 mm. The property of short-wavelength increases the potential of frequency reuse and makes the mmWave a wide use in military projects. A mmWave radar transmits signals that are in the short-wavelength range and operates in the high-frequency bands. It generally utilizes techniques, such as frequency modulated continuous wave [33], to emit and receive signals for measuring the surrounding information of the target object in a 3D environment. Using short-wavelength signals, mmWave radars lower the size requirement of system components (e.g., antennas) and enhance the ability to detect highly tiny movements [22]. Recently, mmWave radars have been actively applied to many scenarios, owing to their inherent advantage of propagating in a harsh environment. Prior work have explored employing the usage of mmWave radars in diverse applications such as human sensing [31, 39] and environment sensing [8, 26]. Moreover, in some industrial scenarios, a mmWave radar can be tentatively utilized as a critical component for determining the distance and angle in the automotive industry [9, 21], since it works well at night and in environments with airborne floatage (e.g., smoke, fog, and dust) [10, 12, 26]. Achievements of adopting mmWave radars in both academia and industry thus prove their popularity.

2.2. Challenges in processing mmWave radar data

mmWave radars show promising performance in tasks of measurement and estimation. However, due to the intrinsic properties, processing and learning from mmWave radar data is still a challenge. Herein, we list two typical properties, i.e., noise and sparsity in mmWave data. Moreover, problems caused by these properties and some possible solutions are also highlighted as follows.

Noise. As a long-standing issue, noise in data can be generated for a variety of factors. For instance, multi-path noise usually occurs in the process of light spread and reflection [26, 27]. Unfortunately, the noise, especially some outliers in data, can lead to non-negligible problems. As the noisy data is fed into a neural network based framework for learning, outliers in the input data can deteriorate the model training by causing gradient explosion. Existing works have spent considerable efforts to deal with the noise or remove outliers in data. mmMesh [39] proposes to remove the noisy mmWave signals reflected by static surrounding objects via a pre-processing component. SuperRF [8] proposes to design a compressed sensing based approach to reduce the influence of the noise on model output accuracy. However, the denoising process usually brings about an extra cost that cannot always be neglected, especially when the model size is not large enough to cover the preprocessing cost.

Sparsity. Distinct from other techniques, such as Light Detection and Ranging (LiDAR), a mmWave radar generally generates highly sparse data. For instance, a typical LiDAR scan has 100× more points than a mmWave radar in the same situation, which means the point cloud generated by a mmWave radar is 100× sparser than a LiDAR [26]. Such a sparsity property can cause several problems. First, processing a large size point cloud where only a few points are valid will bring about a unnecessary computational cost [31]. Second, sparsity can degrade the quality of data, thus invalidating conventional methods that are designed for processing LiDAR data [26]. Previous work have explored solving the sparsity issue via various approaches such as point cloud upsampling [42] and attention mechanisms [39]. Despite these existing solutions, dealing with the sparsity issue of mmWave radar data still needs a case-by-case analysis.

2.3. Potential of SNNs

The aforementioned properties, i.e., the noise and sparsity in data, cause challenges in the downstream processing and learning. Moreover, many existing approaches proposed to solve the noise and sparsity issues are generally tailored for specific scenarios. For example, SuperRF [8] customizes its method for the radio frequency (RF) sensing scenario considering the distinction between RF signals and the ordinary point clouds. Thereby, given a case where multiple types of sensory data (more than only mmWave radar data) can be flexibly introduced for modelling, it is expected to have all data processed by a general framework, instead of designing case-by-case modules for each type of data.

Brain-inspired SNNs have drawn increasing attention, owing to their biological plausibility and model robustness. As the third generation of neural network models, SNNs work in a neuro-biomimetic approach, in which each neuron decides to update the membrane potential or fire a spike according to the received input, the memorized state, and the difference between the membrane potential and the threshold. By leveraging the strengths of SNNs, some existing works have achieved significant success in several applications [5, 30, 40, 43]. Recently, there has been emerging research interest in exploring the intrinsic properties of SNNs, especially the robustness of SNNs. LISNN [4] suppresses the noisy area in the input and improves the anti-noise ability of the SNN model by adding a lateral interaction mechanism. HIRE-SNN [23] leverages a timing dependent backpropagation approach to train an SNN model and enhance its intrinsic robustness. It is also discussed in prior literature [6, 16] that SNNs can use a self-neuron recurrence restriction mechanism to restrict the weight update for resisting the impact from noise and stabilizing the SNN model.

Since the point cloud generated by a mmWave radar is sparse and generally contains noise, learning from such data will inevitably face two challenges: extracting and learning features from the sparse input, and dealing with noise in the input data. As aforementioned, SNNs have natural advantages in processing noisy data, which is benefited from the leakage and firing mechanisms. Moreover, SNNs use sparse spike activities to represent information and show promising potential in processing sparse features [16]. The above characteristics provide opportunities for learning mmWave radar data via an SNN-based framework.

3. Methodology

3.1. Fundamentals of SNN

The leaky integrate and fire (LIF) model is one of the commonly used neural models for describing the behaviors of a spiking neuron, such as input integration, membrane potential update, and spike generation. In artificial neural networks (ANNs), the ReLU function (and the variant leaky ReLU) is used for activation. Similarly, SNNs use the LIF model to determine whether to update the membrane potential or reset the potential and fire a spike simultaneously. Applying the LIF model to SNNs has the following advantages: (1) it is tractable in the aspect of mathematical calculation. (2) It preserves the biological characteristics of neurons in the brain to a certain extent (3) it improves the robustness of SNN models against external noise with the help of leakage and firing mechanisms. Typically, we describe a differential format of the LIF model as follows:

$\begin{equation}\begin{aligned}\hfill \tau \frac{\mathrm{d}u(t)}{\mathrm{d}t}=-u(t)+I(t),\quad u(t)< {u}_{\text{threshold}},\\ \hfill \mathrm{F}\mathrm{i}\mathrm{r}\mathrm{e}\ \mathrm{a}\ \mathrm{s}\mathrm{p}\mathrm{i}\mathrm{k}\mathrm{e}\ \& \ \mathrm{R}\mathrm{e}\mathrm{s}\mathrm{e}\mathrm{t}\ u(t),\quad u(t)\geqslant {u}_{\text{threshold}},\end{aligned}\end{equation} \tag{ 1 }$

where u(t) denotes the postsynaptic membrane potential that varies with t, I(t) denotes the presynaptic input. τ is a time constant and u_threshold is the preset threshold for firing spikes. Apparently, according to equation (1), the membrane potential u(t) is updated based on the input I(t) and the historic state of u(t). In another case, the target neuron fires a spike, and the membrane potential u(t) is reset to a preset value. Moreover, to make the LIF model more explicit and easier for implementation on existing deep learning frameworks (e.g., Tensorflow [1], Pytorch [29]), we adopt an iterative version proposed in the literature [38]. The equations we used to implement the LIF model in SNNs are governed by

$\begin{equation}{u}^{n+1,i}(t+1)={k}_{\text{decay}}{u}^{n+1,i}(t)(1-{o}^{n+1,i}(t))+\sum\limits _{j=1}{w}^{n,ij}{o}^{n,j}(t+1),\end{equation} \tag{ 2 }$

$\begin{equation} {o}^{n+1,i}(t+1)=H({u}^{n+1,i}(t+1)-{u}_{\text{threshold}}),\end{equation} \tag{ 3 }$

where more specific representations are added to the original LIF model, that is, u^n,i(t) denotes the membrane potential of the ith neuron in the nth layer, o^n,i(t) denotes the spiking activity (i.e., 1 for firing and 0 for not) of the ith neuron in the nth layer. Additionally, k_decay is a decay factor and w^n,ij is the synaptic weight. The function H(⋅) is a Heaviside function that determines whether firing a spike event or not by evaluating the relationship of amplitude between the membrane potential u^n+1,i(t + 1) and the threshold u_threshold. In the case of u^n+1,i(t + 1) ⩾ u_threshold, the output of H(⋅) equals 1, i.e., firing a spike event.

With respect to the training scheme, we adopt the spatio-temporal backpropagation (STBP) [37] for SNNs. In STBP, the gradients, i.e., $\frac{\mathrm{d}L}{\mathrm{d}{u}_{t}^{n,i}}$ and $\frac{\mathrm{d}L}{\mathrm{d}{o}_{t}^{n,i}}$ (L denotes the loss function) are needed to compute for the model update. However, $\frac{\mathrm{d}{o}_{t}^{n,i}}{\mathrm{d}{u}_{t}^{n,i}}$ does not exist due to the nondifferentiable nature of the spiking activity. A surrogate function is introduced to approximate the derivative of H(⋅):

$\begin{equation}\frac{\mathrm{d}o}{\mathrm{d}u}\approx \begin{cases}\frac{1}{a},\quad \hfill & \vert u-{u}_{\text{threshold}}\vert \leqslant \frac{a}{2}\hfill \\ 0,\quad \hfill & \mathrm{o}\mathrm{t}\mathrm{h}\mathrm{e}\mathrm{r}\mathrm{w}\mathrm{i}\mathrm{s}\mathrm{e}\hfill \end{cases}\end{equation} \tag{ 4 }$

where u denotes the membrane potential, and a is an adjustment coefficient for controlling the membrane potential window. Equation (4) indicates that the gradient is allowed to pass through only the membrane potential is close the threshold.

3.2. General SNN framework

(1) Overall framework: figure 1 illustrates the overall framework of mm-SNN. Taking raw data as the input, mm-SNN outputs the learned representation that can be directly used in downstream tasks. Systematically, we divide the overall process into the following five substeps:

❶ Processing data: the raw data is preprocessed via approaches such as projection and discretization [25]. After preprocessing, the data can be directly processed by SNN models.

❷ Learning data: in mm-SNN, we choose suitable SNN models for diverse types of processed data. For instance, a 3D point cloud is converted into the 2D structured format that includes abundant information in the spatial dimension, just like images. Inspired by the successful attempt of estimating optical flow through an SNN model [14], we explicitly construct a temporal relationship between two continuous batches of the 2D data (i.e., t = T and t = T + 1) to capture the temporal dependency.

❸ Merging features: given the features of diverse data that have been extracted by SNN models, to fit the design of a multi-sensor system, mm-SNN synthetically utilizes multiple features to generate a global feature for modeling temporal dependency. The detailed process will be introduced in section 3.

❹ Modeling temporal dependency: temporal dependency lies in two situations. In one sensory data, such as mmWave radar data, the relationship in the temporal dimension of two continuous batches has been captured by a spiking CNN model. Furthermore, since multiple features are merged into a global one, the temporal dependency between these features needs to be modeled via a spiking multi-layer perceptron (MLP).

❺ Transformation: the use of fully connected (FC) layers in mm-SNN follows the general idea of using FC in CNNs for feature transformation and classification. The features generated by the spiking MLP are transformed into a low-dimensional space for output.

The information from a mmWave radar can be supplemented by the information from other sensors to form high-quality data for downstream tasks [13, 31]. As an example, two sensory data, i.e., a point cloud generated by a mmWave radar and an inertial sequence generated by an inertial sensor, are fed to the SNN model for learning. Please note that, as a general framework, the architecture of mm-SNN is flexible to extend to handle extra sensors by simply introducing SNN modules for processing the data from those sensors.

(2) Dependency in temporal dimension: temporal dependency is critical in learning information among multiple relative objects in the temporal dimension. We explain the detailed approaches and the significance of modeling the temporal dependency in two situations as follows. First, as aforementioned, to resist the impact of the sparsity issue in the point cloud, we utilize a projection function to convert the 3D point coordinates into the corresponding 2D positions to form the 2D structured data. Since point cloud data is widely used in object detection and motion estimation tasks, we thereby start with some attempts in the computer vision domain to help extract useful representation from the 2D structured data. Inspired by some CNN-based approaches [7, 20] that take two images as the respective inputs of two identical branch networks, we take the concept of optical flow [18] and construct the relationship in the temporal dimension of two continuous 2D data (i.e., t = T and t = T + 1). Instead of processing two images via a branched CNN model, we combine two continuous 2D data as an integrated input and feed it into a spiking CNN. The spiking CNN is qualified to extract spatial-temporal features from the structured 2D data, owing to the excellent ability of capturing temporal dependency. Second, to promote generality, we design the mm-SNN to fit a multi-sensor system, in which more than one sensors are utilized to collect data from one target object. For example, a mmWave radar detects the motion attitude and orientation, while an inertial sensor measures the rotation and acceleration. The recorded point cloud and sequences are simultaneously used to estimate the overall motion situation. In this case, it is critical to capture the temporal dependency among data generated by multiple sensors, especially for the merged features. Therefore, we adopt a spiking MLP to model the temporal dependency of the features after merging. A spiking MLP generally has a shallow network architecture, fewer parameters, and enough ability of capturing temporal dependency, helping achieve comparable performance in temporal modeling with a lower cost.

3.3. Improvements: leveraging attention mechanisms

Targeting processing mmWave radar data and other sensory data in a general manner is the fundamental design purpose of mm-SNN. By adopting diverse SNN models, mm-SNN is able to effectively learn from inputs generated by a multi-sensor system and yields outputs for downstream tasks. Moreover, we argue that the performance of mm-SNN can be further improved by adding extra mechanisms. Herein, we choose the attention mechanism as a case study owing to its wide usage, considerable performance, and 'plug and play' convenience. To be universal, we introduce general and straightforward attention-based improvements into mm-SNN to enhance the data representation during learning. Please note that, we merely introduce some typical attention-based mechanisms, while it is of great potential to leverage more advanced attention modules for further improvement. We demonstrate the performance of adding attention-based mechanisms in section 2.

(1) Channel-wise attention module for 2D structured data: SENet [19] was first proposed to adaptively recalibrate weights among channels and capture the channel-wise relationship. The core mechanism of SENet is the ingenuity of a squeeze-and-excitation (SE) block. Specifically, the squeeze module first applies a global average pooling operation to gather spatial information. Then, the excitation module captures the channel-wise dependency by utilizing a stacked structure of FC layers and non-linear activation functions (ReLU and Sigmoid). The above process can be formulated as

$\begin{equation}{z}_{c}=\mathtt{S}\mathtt{q}\mathtt{u}\mathtt{e}\mathtt{e}\mathtt{z}\mathtt{e}({\mathbf{x}}_{c})=\frac{1}{H\times W}\sum\limits _{j=1}^{H}\sum\limits _{k=1}^{W}{x}_{c}(j,k),\end{equation} \tag{ 5 }$

$\begin{equation}\ \mathbf{s}=\mathtt{E}\mathtt{x}\mathtt{c}\mathtt{i}\mathtt{t}\mathtt{a}\mathtt{t}\mathtt{i}\mathtt{o}\mathtt{n}(\mathbf{W},\mathbf{z})=\sigma ({\mathbf{W}}_{2}\delta ({\mathbf{W}}_{1}\mathbf{z})),\end{equation} \tag{ 6 }$

$\begin{equation} {\mathbf{y}}_{c}=\mathtt{S}\mathtt{c}\mathtt{a}\mathtt{l}\mathtt{e}({s}_{c},{\mathbf{x}}_{c})={s}_{c}\cdot {\mathbf{x}}_{c},\end{equation} \tag{ 7 }$

where x_c denotes the input feature map in the shape of ${\mathbb{R}}^{H\times W}$ , H × W is the spatial resolution. W₁ and W₂ denote learnable parameters for reducing computational complexity. σ(⋅) and δ(⋅) denote non-linear activation functions, i.e., ReLU and Sigmoid. The final output y_c is acquired by adopting the channel-wise multiplication between x_c and the scalar s_c. As illustrated in figure 2(a), the module can be plugged between two continuous layers in a spiking CNN. The reason for choosing this attention module is two-fold: (1) SENet is a classic mechanism that pioneers the channel attention for improving the representation ability; (2) since the preprocessing operation projects the unstructured point cloud to the 2D structured data for further learning, we adopt a spiking CNN to extract features from such data, which imitates the case of processing images via a typical non-spiking CNN. Therefore, we argue that the typical attention mechanism, i.e., SENet, can also enhance the representation in spiking models, and refer to such module as SELayer2D in the rest of this paper.

**Figure 2.** Improvements via attention mechanisms: (a) SELayer2D, an attention module that focuses on the channel-wise information and can be directly plugged into a spiking CNN; (b) SELayer1D, an attention module that focuses on the temporal-wise information and can be directly plugged after a spiking MLP; (c) (upper) a naive approach for handling multiple features, and (lower) a preferable approach that leverages hybrid attention for fusing multiple features.
Download figure:
Standard image High-resolution image

(2) Temporal-wise attention module for sequences: a temporal-wise attention module for SNNs is recently proposed in TA-SNN [41] to improve the classification performance of event streams. Execution of the module can also be divided into two steps, i.e., the squeeze step and the excitation step, taking inspiration from the SENet [19]. The squeeze step calculates a statistical vector sⁿ⁻¹ of event numbers for each timestep t in the nth layer by using a global average pooling operation. The excitation step feds the statistical vector into a two-layer FC model with nonlinear activation functions to obtain the correlation vector dⁿ⁻¹ between different frames. The above process can be formulated as

$\begin{equation}{s}_{t}^{n-1}=\frac{1}{C\times H\times W}\sum\limits _{c=1}^{C}\sum\limits _{i=1}^{H}\sum\limits _{j=1}^{W}{\mathbf{X}}_{t}^{n-1}(c,i,j),\end{equation} \tag{ 8 }$

$\begin{equation}\qquad {\mathbf{d}}^{n-1}=\begin{cases}\sigma ({\mathbf{W}}_{2}^{n}\delta ({\mathbf{W}}_{1}^{n}{\mathbf{s}}^{n-1})),\quad \hfill & \mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}\mathrm{i}\mathrm{n}\mathrm{g},\hfill \\ \,f(\sigma ({\mathbf{W}}_{2}^{n}\delta ({\mathbf{W}}_{1}^{n}{\mathbf{s}}^{n-1}))-{d}_{\mathrm{t}\mathrm{h}}),\quad \hfill & \mathrm{i}\mathrm{n}\mathrm{f}\mathrm{e}\mathrm{r}\mathrm{e}\mathrm{n}\mathrm{c}\mathrm{e},\hfill \end{cases}\end{equation} \tag{ 9 }$

where ${\mathbf{X}}_{t}^{n-1}$ denotes the spatial input tensor in the nth layer at the tth timestep, and C denotes the channel size. ${\mathbf{W}}_{1}^{n}$ and ${\mathbf{W}}_{2}^{n}$ denote learnable parameters in the nth layer. f(⋅) denotes a unit step function, σ(⋅) and δ(⋅) are the same as those in equation (6). d_th is a threshold parameter used to adjust dⁿ⁻¹ in the inference phase. In the situation of processing 1D sequences, only the training phase is focused. As illustrated in figure 2(b), we modify the above process by performing an adaptive global average pooling on the sequence data, in which the temporal dimension is preserved for modeling interdependency. Specifically, we implement the module using a similar structure of SELayer2D, and we refer to the temporal-wise attention module as SELayer1D in the rest of this paper.

(3) Hybrid attention module for merged features: hybrid attention is inspired by recent literature [27, 34], in which multiple attention modules are combined to acquire focused areas in merged features. Nevertheless, hybrid attention is not the only approach to handle features captured from multiple sensors. As illustrated in figure 2(c), a naive approach is that features can be directly concatenated before being fed into an attention module. Since features extracted from multiple sensors are concatenated as a whole, using a single attention module has limitations when multi-dimension (i.e., spatial and temporal dimensions) dependencies abundantly exist between various types of features. Therefore, we use the hybrid attention, a combination of two single attention (function named $\mathtt{A}\mathtt{t}\mathtt{t}(\cdot )$ ) and two cross attention (function named $\mathtt{C}\mathtt{r}\mathtt{o}\mathtt{s}\mathtt{s}\mathtt{A}\mathtt{t}\mathtt{t}(\cdot )$ ) modules, to establish a relationship between various types of features. We formulate the procedures of performing the hybrid attention as

$\begin{equation}{\mathbf{z}}_{{A}^{\prime }}=\mathtt{A}\mathtt{t}\mathtt{t}({\mathbf{z}}_{A})=\sigma ({\mathbf{W}}_{2}\delta ({\mathbf{W}}_{1}{\mathbf{z}}_{A}))\otimes {\mathbf{z}}_{A},\end{equation} \tag{ 10 }$

$\begin{equation}{\mathbf{z}}_{{B}^{\prime }}=\mathtt{A}\mathtt{t}\mathtt{t}({\mathbf{z}}_{B})=\sigma ({\mathbf{W}}_{4}\delta ({\mathbf{W}}_{3}{\mathbf{z}}_{B}))\otimes {\mathbf{z}}_{B},\end{equation} \tag{ 11 }$

$\begin{equation}\qquad {\mathbf{s}}_{AB}=\mathtt{C}\mathtt{r}\mathtt{o}\mathtt{s}\mathtt{s}\mathtt{A}\mathtt{t}\mathtt{t}({\mathbf{z}}_{{A}^{\prime }},{\mathbf{z}}_{{B}^{\prime }})=\sigma ({\mathbf{W}}_{6}\delta ({\mathbf{W}}_{5}{\mathbf{z}}_{{A}^{\prime }}))\otimes {\mathbf{z}}_{{B}^{\prime }},\end{equation} \tag{ 12 }$

$\begin{equation}\qquad {\mathbf{s}}_{BA}=\mathtt{C}\mathtt{r}\mathtt{o}\mathtt{s}\mathtt{s}\mathtt{A}\mathtt{t}\mathtt{t}({\mathbf{z}}_{{B}^{\prime }},{\mathbf{z}}_{{A}^{\prime }})=\sigma ({\mathbf{W}}_{8}\delta ({\mathbf{W}}_{7}{\mathbf{z}}_{{B}^{\prime }}))\otimes {\mathbf{z}}_{{A}^{\prime }},\end{equation} \tag{ 13 }$

$\begin{equation} \mathbf{y}=\mathtt{C}\mathtt{o}\mathtt{n}\mathtt{c}\mathtt{a}\mathtt{t}({\mathbf{s}}_{AB},{\mathbf{s}}_{BA})={\mathbf{s}}_{AB}\oplus {\mathbf{s}}_{BA},\end{equation} \tag{ 14 }$

where z_A and z_B denote the feature A and B extracted by a spiking CNN and a spiking MLP, respectively. ⊗ denotes the element-wise multiplication between two matrices, and ⊕ denotes the feature-wise concatenation between two matrices. Notably, definitions of W_n, σ(⋅), and δ(⋅) are the same as those in equation (6). The output y is used to model the temporal dependencies between various types of features in a spiking MLP.

4. Experiments

4.1. Data preparation

To verify the effectiveness and robustness of mm-SNN, we conduct experiments to estimate the trajectory using a real-world dataset [27]. The dataset consists of 49 sequences for training and 8 sequences for testing. The raw data is collected by a single-chip mmWave radar (TI AWR1843) and a commercial-grade inertial measurement unit (IMU). As for one sequence, it is filled with the preprocessed data from the mmWave radar and IMU. Specifically, the mmWave radar data in the dataset has been preprocessed as follows:

$\begin{equation}\theta =\mathrm{a}\mathrm{t}\mathrm{a}\mathrm{n}\mathrm{2}(y,x),\end{equation} \tag{ 15 }$

$\begin{equation} \phi =\mathrm{arcsin}(z/\sqrt{{x}^{2}+{y}^{2}+{z}^{2}}),\end{equation} \tag{ 16 }$

$\begin{equation} r=\lfloor \theta /{\Delta}\theta \rfloor ,\end{equation} \tag{ 17 }$

$\begin{equation} c=\lfloor \phi /{\Delta}\phi \rfloor ,\end{equation} \tag{ 18 }$

where the 2D map position (r, c) is projected from (x, y, z) points in the 3D space. θ and ϕ are the azimuth and elevation angle through point observation. Δθ and Δϕ are the average horizontal and vertical angle resolution between consecutive beam emitters. More details please refer to [25]. Following [27], a normalization technique is applied to the 2D data after projection.

Despite the preprocessing, we discover that the properties of mmWave radar data are generally unchanged by analyzing the processed data. We load the mmWave radar data that have been converted to the 2D data and analyze such data in an element-wise manner. By adopting normalization, values of (pixel-level) elements in the 2D data are standardized into the range of (−1, 1). However, it is observed that a value v ≈ −0.0340 repeatedly exists in each sequence. We count the number of the highly repeated values, i.e., v, and quantify the proportion of v among all elements of the 2D data in all sequences. Surprisingly, v takes an 84.19% proportion of all training sequences. Similar to the sparsity issue in the 3D cloud point, we argue that a mass of repeated values in the 2D data will generally weaken the ability of feature representation and pattern learning, which brings about challenges in processing such data.

Moreover, to demonstrate the performance of handling noisy data, two types of noises are artificially added into the input for testing. Specifically, as illustrated in figure 3, we add the impulse noise and the Gaussian noise into the projected 2D structured data. The raw data is regarded as the variable x, and the noise is defined as Δx. The noisy input data y can be obtained by adding x and Δx. Descriptions of the above noises are formulated as follows:

$\begin{equation} P({\Delta}x)=\frac{1}{\sqrt{2\pi }\sigma }\,{\mathrm{e}}^{-{({\Delta}x-\mu )}^{2}/2{\sigma }^{2}},\end{equation} \tag{ 19 }$

$\begin{equation}P({\Delta}x)=\begin{cases}{P}_{a},\quad \hfill & {\Delta}x=a-x\enspace (\text{i.e.},x+{\Delta}x=a)\hfill \\ {P}_{b},\quad \hfill & {\Delta}x=b-x\enspace (\text{i.e.},x+{\Delta}x=b)\ \hfill \\ 1-{P}_{a}-{P}_{b},\quad \hfill & {\Delta}x=0\enspace (\text{i.e.,}\;\text{without}\;\text{noise})\hfill \end{cases}.\end{equation} \tag{ 20 }$

Herein, since values in the 2D space have been normalized into the range of (−1, 1), we modify the setting of parameters to a reasonable range when generating the noises. For the impulse noise, we set the parameters a and b to 0 and 1 respectively. The probability parameters P_a and P_b are flexible to adjust. In this case, we define an ever-changing random value p between 0 and 1. For each (pixel-level) element in the 2D data, we set the corresponding element to zero when p is less than P_a, and set the corresponding element to one when p is larger than (1 − P_b). For other cases (P_a ⩽ p ⩽ 1 − P_b), the corresponding element remains unchanged. For the Gaussian noise, we set the parameter σ to 0.01. μ is adjustable and is set to be an increasing parameter from 0 to reflect the growing intensity of Gaussian noise. We directly add the generated Gaussian noise in the 2D structured data. For convenience, we refer to the input with impulse noise and Gaussian noise as the impulse-noisy input and the Gaussian-noisy input to distinguish the original input in the rest of the paper.

**Figure 3.** Illustration of artificially modified inputs with impulse noise and Gaussian noise added. We add noises into the 2D structured data respectively.
Download figure:
Standard image High-resolution image

4.2. Experimental setting

(1) Benchmark and setting: to validate the effectiveness of mm-SNN, we conduct comparisons between the predicted trajectories and the ground truth. We introduce the absolute trajectory error (ATE) to quantify the accuracy of an entire trajectory [15]. The quality of the trajectory estimation can be quantified by statistic metrics of ATE, e.g., root mean square error (RMSE) [32], mean, median, standard deviation (Std.), and Max error. Specifically, the experiments are organized as follows: in section 1, we focus on the entire process of the model learning of mm-SNN and a baseline inspired by [27], i.e., a non-spiking framework, and plot the curves of the test loss evolving with the training epoch; in section 2, we conduct an ablation study to explore how attention-based modules benefit the vanilla mm-SNN; in section 3, we utilize ATE to quantify the accuracy of prediction between mm-SNN and the non-spiking framework; in section 4, we conduct experiments with different noise levels to compare the capability of noise resistance between mm-SNN and the non-spiking framework. Notice that the non-spiking framework consists of non-spiking models which share the same training configurations (see section 3) and basic network structures with those in mm-SNN.

(2) Structures of models: to perform the task of trajectory estimation, we specify some simple yet efficient structures for SNN models. Specifically, we utilize the basic configurations (e.g., number of channels, receptive field sizes) of a CNN model proposed in [35]. Our spiking CNN model mainly distinguishes from a typical CNN model at the mechanism of neuronal computation. Instead of direct propagating after a nonlinear activation function, each neuron in the spiking CNN model performs an update based on the LIF behaviors. Moreover, the operation of batch normalization is used at the end of each layer to benefit the training. The spiking CNN model is leveraged in mm-SNN to extract features and learn representation from the mmWave radar data. In addition to the spiking CNN model, we utilize a one-layer spiking MLP model (named Spiking MLP_1) to process the sequence data of IMU, and a two-layer spiking MLP model (named Spiking MLP_2) to model the temporal dependency of the merged features. Note that SNNs' structures are specified for our targeted task and are flexible to change according to the scenario where mm-SNN is adopted. Detailed information on model structures can be found in table 1. To be clear, 'C64_RF7' denotes a convolutional layer whose number of channels is 64 and receptive field size is 7 × 7, and 'C128_Linear' denotes a linear layer whose number of channels is 128.

Table 1. Model structures used in mm-SNN.

Model	Network structure
Spiking CNN	C64_RF7-C128_RF5-C256_RF5-C256_RF3-C512_RF3-C512_RF3-C512_RF3-C512_RF3-C1024_RF3
Spiking MLP_1	C128_Linear
Spiking MLP_2	C1024_Linear-C512_Linear

(3) Training configurations: the training of the overall models is performed under the following configurations. The batch size and the maximum number of epochs are set to 64 and 100, respectively. In the condition that the performance promotion is slight, the general training permits an early stop to ensure a preferable generalization ability and a lower time cost. We use the root mean square prop optimizer with a dynamically adjusted learning rate to tune the training. The learning rate is initialized to 1 × 10⁻⁵ with a regular decay by 75% per 25 epochs. Moreover, we highlight the setting of hyper-parameters used in SNN models. The threshold u_threshold used in the LIF model in equation (1) is set to 0.4. The decay factor used in the membrane update in equation (2) is set to 0.2. The coefficient a used for approximating the derivative of spike activity in equation (4) is set to 1. We implement all models in mm-SNN in Pytorch [29]. All experiments are conducted on a Linux server equipped with dual 14-core Intel Xeon E5-2683 v3 CPUs and an NVIDIA Tesla V100 GPU (16 GB memory).

4.3. Result and analysis

(1) Robustness to noise: regular data generally contains abundant spatial information and is easy to process. The mmWave radar data, however, is more complicated even after projection. As aforementioned, two obstacles exist in the learning of preprocessed mmWave radar data: (1) a specific value repeatedly exists in all sequences; (2) noise is artificially added into the input to estimate the ability of resisting noise. Without the intrinsic property of robustness, we argue that the above obstacles could deteriorate the model learning of conventional ANNs. We thereby conduct experiments between a non-spiking framework and mm-SNN. For both frameworks, all models are trained with the original input and are tested with three types of inputs, i.e., original input, the impulse-noisy input, and the Gaussian-noisy input. For fairness, models in the non-spiking framework and mm-SNN share the same network structures. For example, mm-SNN uses a spiking CNN to extract features from the mmWave radar data, while the non-spiking framework leverages a typical non-spiking CNN with the same structure as the spiking CNN in mm-SNN for identical use. Moreover, the capability of modeling temporal dependency is unique to spiking neurons, but not to non-spiking ones. Therefore, we introduce an LSTM in the non-spiking framework. With respect to the learning schemes, the non-spiking framework and mm-SNN are trained by the ANN-oriented backpropagation through time algorithm and the SNN-oriented STBP algorithm, respectively.

We visualize the model learning process by recording the test loss evolving by epoch. As illustrated in figure 4, given three types of inputs, the curves of the test loss of mm-SNN consistently show a stable and quick downtrend. However, for the non-spiking framework, the curves of the test loss are obviously unstable and fluctuant, revealing an unsatisfied performance on model generalization. Notably, it is observed that many leap points exist in the curves of the non-spiking framework, especially for models tested with noisy inputs (see figures 4(b) and (c)). Herein, the leap point means a loss point with a nontrivial leap compared to the overall tendency. We argue that the unstable curves of the non-spiking framework are blamed for the disturbance in data, e.g., the two obstacles in the mmWave radar data. In addition, we discover that different noisy input affects models in the non-spiking framework to different degrees. The impulse-noisy input generally has a greater effect on models in the non-spiking framework than the Gaussian-noisy input by comparing the overall trend of curves of the test loss. On the other hand, with the intrinsic advantages in resisting noises, SNN models in mm-SNN are more robust to handle the noisy inputs, making the model learning stable. Specifically, the robustness of mm-SNN might be benefited from spiking neurons with leakage, firing, and reset mechanisms, which can naturally act as a noise filter. In addition, it is observed that injecting the Gaussian noise into the sequence can help the feature learning (see figures 4(a) and (c)). As aforementioned, a value v ≈ −0.0340 repeatedly appears in each sequence. Injecting the Gaussian noise would change the imbalanced data distribution in a sequence to a more stochastic distribution, thus yielding a slight improvement in accuracy.

(2) Ablation study on attention-based improvements: we visualize the predicted trajectories and the corresponding trajectories of ground truth to highlight the gap using four distinct models, i.e., 'mm-SNN', the vanilla mm-SNN framework equipped with a channel-wise attention module ('W. SELayer2D'), the vanilla mm-SNN framework equipped with a temporal-wise module ('W. SELayer1D'), the vanilla mm-SNN framework equipped with a hybrid attention module ('W. HybridAtt.') in figure 5. In addition to attention modules, all models are trained under the same configuration. We choose the best models within 100 training epochs for visualization. From an overall perspective, 'mm-SNN' achieves superior prediction performance since the predicted trajectories precisely fit the ground truth in most cases. From the perspective of the ablation study, the prediction performance of the vanilla mm-SNN equipped with a single attention module varies by test sequences. It is observed that 'W. SELayer2D' achieves a sub-optimal performance compared to 'mm-SNN' on three test sequences (see figures 5(a), (c) and (d)). In addition, 'W. SELayer1D' performs well in the other three test sequences (see figures 5(b), (c) and (e)). Unfortunately, the prediction performance of 'W. HybridAtt.' is not comparable on any test sequence. However, only 'mm-SNN' shows good prediction performance on all test sequences, which is difficult to achieve for all vanilla mm-SNN frameworks equipped with one attention module merely.

**Figure 5.** Comparisons among predicted trajectories and the ground truth. Subplots (a)–(e) denote five distinct test sequences. In all subplots, 'mm-SNN' denotes the proposed framework with all attention modules equipped. 'W. SELayer2D' denotes the vanilla mm-SNN framework with merely the channel-wise attention module, i.e., SELayer2D, equipped. Similarly, 'W. SELayer1D' and 'W. HybridAtt.' denote the vanilla mm-SNN framework with temporal-wise attention and hybrid attention modules equipped, respectively.
Download figure:
Standard image High-resolution image

Next, we discuss the reason for distinguishing prediction performance among models with different attention modules equipped. As mentioned in section 3, the channel-wise attention module (SELayer2D) and the temporal-wise attention module (SELayer1D) are plugged into SNN models, i.e., the spiking CNN and the spiking MLP, for extracting features, while the hybrid attention module is utilized in the process of feature fusion. We argue that an attention module plugged into a feature extractor is better for enhancing the data representation of features. In consequence, the analysis on the ablation study reveals three facts for us to design a framework with multiple SNN models integrated: (1) Vanilla SNN models can be improved via attention-based modules. (2) A framework equipped with more applicable attention-based modules can gain better performance overall. (3) The optimization effect of an attention-based module is distinct by the plugged location.

(3) Quantified comparison and analysis: we adopt ATE to quantify the accuracy of prediction between mm-SNN and the non-spiking framework with five statistic metrics: RMSE, mean, median, Std., and Max. The performance on different test sequences and the averaged results are given in table 2. To reveal the effect of noisy inputs on the model performance, we test models in the non-spiking framework and mm-SNN with two types of inputs, i.e., the impulse-noisy input and the Gaussian-noisy input. For the impulse-noisy input, the parameters P_a and P_b are set to 0.01 and 0.01. For the Gaussian-noisy input, the parameters μ and σ are set to 0.01 and 0.01. We choose the best models within 100 training epochs for quantifying the prediction performance. It is first observed that mm-SNN achieves the lowest error on all metrics from an overall perspective (see rows named 'Average' marked with ★), revealing the superior performance of mm-SNN against noisy inputs on most test sequences. For instance, using the impulse-noisy input, the RMSE metric of mm-SNN is about 2 (on 'Test Seq. 2') to 11 (on 'Test Seq. 1') times lower than that of the non-spiking framework. Although the gap is slight, we also find situations that sometimes mm-SNN cannot significantly outperform the non-spiking framework (e.g., prediction on 'Test Seq. 4'). The reason that mm-SNN cannot always perform superiorly on all test sequences might be that each test sequence is independently collected and generated with distinct characteristics. In consequence, quantified results undoubtedly demonstrate a superior performance of mm-SNN from an overall perspective (see rows named 'Average' marked with ★) and better performance on most test sequences.

Table 2. Quantified analysis on statistic metrics of ATE. The lower these values, the better performance in general. The row named 'average' denotes the averaged result among five test sequences using the same models (in the non-spiking framework or mm-SNN) for quantification. The lowest error quantified by the five metrics on each test sequence is set in bold for convenience.

Test data	Model	Sequence	RMSE	Mean	Median	Std.	Max
Impulse-noisy input	Non-spiking framework	Test Seq. 1	25.7434	20.1859	16.3460	15.9766	55.5138
		Test Seq. 2	6.8999	5.0490	3.2079	4.7028	16.3836
		Test Seq. 3	3.1194	2.4471	1.4886	1.9345	5.9314
		Test Seq. 4	0.6401	0.5744	0.4891	0.2825	1.3621
		Test Seq. 5	3.2286	2.8724	2.9408	1.4743	5.3434
		Average	7.9263	6.2257	4.8945	4.8741	16.9069
	mm-SNN	Test Seq. 1	2.3166	1.6939	1.3617	1.5803	5.9049
		Test Seq. 2	2.7535	2.3125	2.3321	1.4946	4.8256
		Test Seq. 3	0.6855	0.6515	0.6001	0.2131	1.1990
		Test Seq. 4	0.6954	0.6499	0.6270	0.2474	1.6544
		Test Seq. 5	1.0635	0.9779	1.1652	0.4180	1.5884
		Average ★	1.5029	1.2571	1.2172	0.7907	3.0344
Gaussian-noisy input	Non-spiking framework	Test Seq. 1	26.0004	20.3956	16.7588	16.1258	55.9151
		Test Seq. 2	6.4851	4.6652	2.7420	4.5048	15.6540
		Test Seq. 3	2.6071	1.9871	1.3903	1.6877	4.6410
		Test Seq. 4	0.7916	0.6863	0.5219	0.3945	1.7236
		Test Seq. 5	3.2586	2.9492	3.2516	1.3859	5.1572
		Average	7.8286	6.1367	4.9329	4.8197	16.6182
	mm-SNN	Test Seq. 1	3.9502	3.2775	3.1745	2.2050	8.0452
		Test Seq. 2	0.5780	0.4657	0.4580	0.3424	1.5530
		Test Seq. 3	0.6459	0.6215	0.5480	0.1760	0.9336
		Test Seq. 4	0.9181	0.8597	0.8352	0.3221	1.6290
		Test Seq. 5	0.7653	0.7168	0.7436	0.2680	1.1705
		Average ★	1.3715	1.1882	1.1519	0.6627	2.6663

(4) Study on noise resistance: we further conduct experiments to compare the capability of noise resistance between mm-SNN and the non-spiking framework. Specifically, we first add a Gaussian noise to one test sequence and gradually increase the noise level by changing mean (the parameter μ) of the Gaussian noise. Then, we quantify the statistic metrics on the same test sequence when running mm-SNN and the non-spiking framework. We reminder that an increase in any of the metrics indicates a decrease in the prediction performance of a model.

As depicted in figure 6, for both mm-SNN and the non-spiking framework, the model performance degrades as the noise level grows. Different metrics generally show different increments under the same noise increment. We discover that the Max and Std. metrics in the non-spiking framework are larger than those in mm-SNN, given the noise level is lower than 0.09 (figures 6(a) and (b)). When the noise level further increases, the Max and Std. metrics in mm-SNN surpass those in the non-spiking framework, indicating a significant degradation in the prediction performance. The RMSE metric presents a similar trend although the metrics are closer before the noise level approaches 0.06. The above observations inspire the following interesting conclusion. Compared to the smooth degradation of the non-spiking framework, mm-SNN is more robust until the noise increases to a certain level. Since the state space of SNNs is discrete and each LIF neuron is a natural noise filter, the states are harder to be changed by injecting a small noise but easier to collapse once the noise level becomes high enough.

5. Conclusion

This paper presents a general neuromorphic framework, named mm-SNN, to process the mmWave radar data via SNN-based models. The proposed framework is robust to noisy inputs, and is scalable for multi-sensor systems. Moreover, we find that the generic mechanisms for data enhancement used in deep learning can also provide considerable improvement in performance of mm-SNN. Extensive experiments are conducted to demonstrate the effectiveness and robustness of mm-SNN. mm-SNN is not only a paradigm to learn representation from the mmWave radar data but also provides an end-to-end solution to flexibly handle data with temporal dependency acquired by a complex system.

Acknowledgments

This work was partly supported by National Natural Science Foundation of China (Grant Nos. 62106119, 61732018 and 61872335), Austrian-Chinese Cooperative R&D Project (FFG and CAS) (Grant No. 171111KYSB20200002), CAS Project for Young Scientists in Basic Research (Grant No. YSBR-029), and CAS Project for Youth Innovation Promotion Association.

Data availability statement

All data that support the findings of this study are included within the article (and any supplementary files).

General spiking neural network framework for the learning trajectory from a noisy mmWave radar

Article metrics

Submit

Author e-mails

Author affiliations

Author notes

ORCID iDs

Dates

Peer review information

Abstract

1. Introduction