The Intel neuromorphic DNS challenge

Jonathan Timcheck; Sumit Bam Shrestha; Daniel Ben Dayan Rubin; Adam Kupryjanow; Garrick Orchard; Lukasz Pindor; Timothy Shea; Mike Davies

doi:10.1088/2634-4386/ace737

1. Introduction

Neuromorphic computing achieves excellent performance with power and latency savings for certain algorithms [1], and the field stands to greatly benefit from focusing on well-defined neuromorphic challenge problems motivated by recent progress. Challenge problems facilitate the consistent evaluation and comparison of different approaches to solving important classes of problems and can help align researchers toward the most promising directions, thus accelerating progress. Historically, challenge problems have often spurred breakthroughs in the field of machine learning, e.g. MNIST [2], CIFAR-10 [3], and ImageNet [4]. However, the less-mature field of neuromorphic computing lacks unifying challenge problems. Most results of benchmarking neuromorphic systems are bespoke, where custom tasks are conceived chiefly to highlight the capabilities of a given neuromorphic system, making it difficult to compare across different systems and solutions, whether neuromorphic or conventional [5].

Any neuromorphic challenge problem must be chosen and structured carefully. A poorly-selected problem could direct focus in the wrong direction, on tasks for which neuromorphic hardware is unlikely to provide advantages over conventional hardware. This includes many existing popular machine learning tasks, such as those involving static image processing. Similarly, defining a challenge problem without an accompanying methodology for comprehensively evaluating neuromorphic compute cost makes it difficult to rigorously compare different solutions.

Researchers have discussed at length what makes for good neuromorphic challenge problems and benchmarks [6], identifying qualities such as ease of access and use, freely available data, modest computational requirements, representativeness of an important real-world task, and unsaturated performance [7]. Existing neuromorphic benchmarks support these goals, but they are few in number and have key shortcomings. We briefly discuss several existing neuromorphic challenge problems in the following section.

1.1. Past neuromorphic challenge problems

One of the first prominent neuromorphic challenge problems was image classification on the N-MNIST or N-Caltech101 datasets [8]. N-MNIST and N-Caltech101 are neuromorphic versions of the classic MNIST [2] and Caltech101 [9] datasets: the neuromorphic datasets were captured using an event-based camera moving in a precise saccadic motion while pointed at a computer monitor displaying an MNIST or Caltech101 static image. While N-MNIST and N-Caltech101 were instrumental in advancing neuromorphic vision research and provided common datasets to compare various neuromorphic algorithms, the inherent source of information is a static image and lacks spatiotemporal information content [10], especially once the saccadic motion is compensated. Thus these datasets are generally not ideal for showcasing the full potential of neuromorphic computational models which aim to exploit neuronal dynamics inspired by biological neurons for efficient temporal signal processing (section 3).

Another popular neuromorphic vision challenge problem is gesture recognition on the DVS Gesture dataset [11]. The DVS Gesture dataset is naturally matched to neuromorphic computing—the sparse, event-based, and spatiotemporal nature of dynamic vision sensor data naturally lends itself to neuromorphic processors that also possess these attributes. Evaluated as a neuromorphic challenge problem, however, DVS Gesture uses specialized event-based sensor data which limits widespread applicability, and the dataset is small (1342 instances). Neuromorphic solutions on DVS Gesture achieve a latency of 104 ms [11] on TrueNorth and 12.5 ms on Loihi [1] processing at 1 ms per step. Further study shows that the accuracy on the task improves with a coarser timestep of up to 25 ms [12]. This indicates that the fine-grained temporal information in the DVS Gesture dataset may not be vital in this task; intuitively, common gestures are likely slow enough to be sufficiently captured on a slower timescale.

A popular neuromorphic audio challenge problem is keyword spotting on the Spiking Heidelberg Datasets [7]. The Spiking Heidelberg Datasets target the widely-applicable task of keyword spotting, and importantly, audio is pre-processed with a neuroscience-inspired cochlea model. This provides a consistent neuromorphic encoding to spikes upon which researchers can build their keyword spotting algorithms, thus facilitating simple and fair task performance comparisons across different spiking neural network (SNN) algorithms. However, the cochlear encoding of the Spiking Heidelberg Datasets presents some critical shortcomings when viewed from the greater context of more general and more difficult audio processing tasks. Firstly, the information preservation of the cochlear encoding is unquantified, thus this encoding could artificially bottleneck the performance of keyword spotting, and perhaps severely bottleneck performance for more sophisticated audio processing tasks. Secondly, the power cost of computing the cochlear encoding is also unquantified, yet power is an important factor in real-world low-power audio processing systems. Indeed, how to encode an audio signal efficiently and faithfully for processing in a neuromorphic system is an open research question which plays an important role in our definition of the Intel Neuromorphic Deep Noise Suppression Challenge (Intel N-DNS Challenge).

Other neuromorphic benchmarks have been proposed that target applications that are also well-matched to the spatiotemporal event-based neuromorphic computing style, such as Braille letter reading [13] and gesture recognition using electromyograph and dynamic vision sensor fusion [14]. However, these benchmarks involve niche sensors and applications, limiting their real-world impact and interest compared to more mainstream AI problems dealing with images, video, text, or audio.

1.2. Audio denoising as a neuromorphic challenge

In this work, we identify audio denoising as an excellent neuromorphic challenge task. As detailed in subsequent sections, audio denoising has ubiquitous real-world applicability and plays to the strengths of neuromorphic computing. We have developed the Intel N-DNS Challenge to make the task easily accessible, free to all, unsaturated, and designed specifically to make it easy to compare solutions over a comprehensive set of metrics.

The Intel N-DNS Challenge is inspired by the Microsoft DNS Challenge, an audio denoising challenge that has been running since 2020 [15–18]. At a basic level, the Microsoft DNS Challenge has focused on improving speech denoising solutions as measured by human perceptual audio quality metrics and the Challenge included a track with the constraint that solutions must run in real-time on an Intel i5 or equivalent processor; essentially, the goal was to obtain the highest audio quality possible under the compute architecture constraint. In contrast, in the Intel N-DNS Challenge, we are changing this architecture constraint and taking a more holistic approach to evaluating solutions by defining metrics for power and latency in addition to audio quality metrics.

The spirit of the Intel N-DNS Challenge is to achieve production-level (near-SOTA) denoising performance in a system with at least an order of magnitude reduction in power, while also reducing latency, compared to real-time denoising solutions on conventional architectures. Our belief is that the neuromorphic computing features of Intel's Loihi 2 chip—representative of future commercial neuromorphic devices—will enable the realization of such gains. Thus in the Intel N-DNS Challenge we define one track focusing on evaluating solutions on existing neuromorphic hardware (Loihi 2) and another track focusing on neuromorphic algorithm development, which may motivate new features in future neuromorphic hardware.

The Intel N-DNS Challenge has a 1 year timeline, but we invite the community to continue using the Intel N-DNS Challenge as a benchmark after the challenge ends. More broadly, we view the N-DNS challenge as a single iteration in a continuing effort to develop challenge problems that help to advance neuromorphic computing to commercial maturity.

We define the audio denoising task in section 2, discuss neuromorphic computing as it pertains to this work in section 3, give an overview of the Intel Neuromorphic DNS Challenge in section 4, describe the data in section 5, specify evaluation criteria in section 6, describe our baseline solution in section 7, address additional clarifications in section 8, and summarize our contributions in section 9. We make our code for obtaining the challenge data, the evaluation pipeline, and the example baseline solution publicly available in the Intel N-DNS Challenge GitHub Repository (https://github.com/IntelLabs/IntelNeuromorphicDNSChallenge) under a permissive MIT license.

2. Primer on audio denoising

Digital audio signal denoising, also called audio signal enhancement, is a fast-growing research area, but its origin can be traced back to the late 1970s and early 1980s when spectral subtraction [19] and Wiener filter [20] algorithms were introduced. Subsequently, beamforming techniques were successfully adopted [21, 22]. While a significant advance, beamforming was not practical due to several limitations, namely, that multiple microphones are needed to perform noise reduction, signal-to-noise ratio (SNR) improvement is highly correlated with the number of microphones, and compute complexity increases with the square of the number of microphones. Furthermore, in the last few years, there has been an increased research interest in single-microphone denoising. Single-microphone device configurations are omnipresent, and the utilization of deep neural networks has enabled very successful single-microphone denoising [23–25]. We address the single-microphone audio denoising task in the Intel N-DNS Challenge (figure 1).

**Figure 1.** The audio denoising task. Audio denoising is ubiquitous and has many attributes that are likely to reap benefits from neuromorphic hardware.

Typically, the signal captured by a microphone contains a source signal, like speech or music, and stationary or non-stationary noises. Stationary noises change amplitude and frequency profile slowly in time, whereas non-stationary noises vary quickly over time. Some examples of the former are an air conditioner, dishwasher, fan, or engine noises. Examples of the latter are a baby crying, a dog barking, or keyboard typing. Notably, reduction of stationary noises is a significantly simpler task than the removal of non-stationary noises. Noise is an additive distortion defined in the time domain according to

$$y(t) = x(t) + n(t), \tag{1}$$

where x(t) is the amplitude of the source signal for time index t, n(t) is the noise signal, and y(t) is the noisy signal captured by the microphone.

Furthermore, most recordings are conducted in a reverberant environment; e.g. in indoor conditions, the signal is contaminated by reverb. Noisy reverberant signals can be expressed as

$$y(t) = h(t) * x(t) + n(t), \tag{2}$$

where h(t) represents the impulse response, * denotes convolution, and only a single noise source is represented. Since reverb is a convolutive distortion (multiplicative in the frequency domain) rather than an additive one, most denoising algorithms focus on noise removal only [26]. There are alternative approaches that perform reverb reduction and noise removal in one shot [27] or use a cascade of processing with reverb reduction [22] in a first stage, followed by a denoising stage. Audio denoising refers specifically to the process of enhancing an audio signal by subtracting noise from it; this is the task in the Intel N-DNS Challenge (figure 1).
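As an illustration of the signal model in equations (1) and (2), the following minimal sketch builds a noisy reverberant signal from toy sequences. The function names and the toy values are assumptions made for this example only; they are not part of the challenge code.

```python
def convolve(h, x):
    """Full discrete convolution: (h * x)[t] = sum_k h[k] * x[t - k]."""
    out = [0.0] * (len(h) + len(x) - 1)
    for k, hk in enumerate(h):
        for t, xt in enumerate(x):
            out[k + t] += hk * xt
    return out


def noisy_reverberant(x, h, n):
    """Return y(t) = (h * x)(t) + n(t), truncated to len(x) samples."""
    reverberant = convolve(h, x)[:len(x)]
    return [r + nt for r, nt in zip(reverberant, n)]


# Toy values (hypothetical): a unit impulse as the source, an impulse
# response with a direct path and two decaying echoes, and constant noise.
x = [1.0, 0.0, 0.0, 0.0]
h = [1.0, 0.5, 0.25]
n = [0.1, 0.1, 0.1, 0.1]

y = noisy_reverberant(x, h, n)

# With h = [1.0] (no reverberation) the model reduces to equation (1).
y_additive_only = noisy_reverberant(x, [1.0], n)
```

Because the source here is a unit impulse, the reverberant part of `y` is simply the impulse response itself, which makes the convolutive nature of reverb easy to see next to the additive noise term.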

Audio denoising is commonly utilized in both real-time and non-real-time scenarios. An example of a real-time scenario is a voice call which is performed on an end-user device, such as a PC, phone, headset, or smart device, or inside applications like Microsoft Teams or Zoom. In this use case, algorithms must not introduce latency greater than 40 ms. Furthermore, the compute load must be light enough to fit into existing power and memory constraints without degrading user experience. Another example of a real-time application is speech enhancement in human-to-computer communication, where denoising is performed to improve the accuracy of downstream processing such as keyword spotting or automatic speech recognition. There are other use cases, such as transcribing meeting minutes, where denoising can be performed offline. These are viewed as non-real-time scenarios that impose fewer restrictions on the algorithm, e.g. permitting non-causal filtering.

2.1. Current state-of-the-art solutions

Recently, neural network (NN)-based algorithms have been extensively applied to audio denoising problems. Initial solutions focused exclusively on denoising quality and used large models to achieve breakthroughs in accuracy. However, as models have become more accurate, the focus has shifted to real-time denoising performance. In fact, the most recent Microsoft DNS Challenges have dropped the non-real-time track [15, 17, 18].

Non-real-time solutions focus purely on the quality of denoising and are typically non-causal. Non-real-time solutions from the speech enhancement and source separation literature include attention architectures [28], temporal convolutional networks (TCN) [29], the Convolutional Time-domain audio separation Network (Conv TasNet) [24], convolutional phase and amplitude processing (PHASEN) [30], and audio source separation with nested depthwise convolutional downsampling (SuDoRM-RF) [31]. For the denoising task in speech enhancement, the desired enhancement is the removal of noise, and in source separation, the desired separation is between speech and noise. Real-time solutions focus on making the network lightweight and causal while maintaining denoising performance. Some examples include causal forms of TCN, Conv TasNet [29], and recurrent topologies with stacked LSTM or GRU [32].

The most common encoding-decoding method is the STFT-ISTFT pair [28, 30, 32] or a similar spectral transformation such as the DCT [33], while methods like SuDoRM-RF [31] directly process the raw audio samples. There are different approaches for processing the complex STFT input in the literature. Some methods only make use of the magnitude information [24, 29, 32], some process the magnitude and phase separately and combine them [34], while some process the complex spectrum directly using complex convolutional filters [28].

The majority of the solutions use backpropagation-based supervised training. However, a wide variety of losses have been used in different works. The most common ones are the mean-square error of the resulting spectrum or maximization of the signal-to-noise ratio. A survey of various loss metrics used in audio denoising with their benefits is described in [35]. Some solutions even prioritize speech over suppression with an additional loss penalty term [34]. In addition, unsupervised or semi-supervised training methods have also been investigated to achieve a general solution even on out-of-distribution datasets. A particularly interesting method is the teacher-student training method proposed in RemixIT [36] where a teacher network trained on out-of-distribution data is used to bootstrap the noisy signals to multiply the variety of in-distribution data samples.

It is evident that noise suppression with deep neural networks is an active area of research with new methods being introduced regularly. Recent efforts have not only focused on the quality of denoising but also on the size of models and satisfying real-time requirements. There is a vast body of research from which to borrow for neuromorphic audio denoising.

3. Neuromorphic audio denoising

We chose the audio denoising task for this challenge because it presents an excellent opportunity for neuromorphic algorithm innovation (figure 1). Audio denoising is a ubiquitous power-constrained task with commercial relevance. It is often performed on mobile devices, and every Intel Core™ CPU in production now includes AI hardware acceleration support for it. Given the significant compute load of today's denoising solutions, lowering the power with a neuromorphic solution could not only lead to longer battery lives and smaller form factors but could bring the functionality to even more power-constrained devices such as headsets, earbuds, hearing aids, and cochlear implants. Moreover, it is a temporal signal processing task, which neuromorphic systems are expected to excel at [6]. Indeed, commercial neuromorphic vendors are already targeting speech-enhancing hearing aids, promising orders-of-magnitude gains [37]. Looking forward, the audio denoising task represents a starting point for the development of more general neuromorphic audio processing algorithms that operate in real time with imperceptible latency, such as audio environment emulation, speech separation, voice morphing, and speech-to-speech language translation.

Furthermore, audio denoising is especially timely as a neuromorphic research vector. It is a generative task unsolved in neuromorphic computing, and audio is a low data-rate signal that is well-matched to current neuromorphic chips and designs that generally target low-power edge processing. Solutions can be readily compared to recent conventional machine learning advances, including models deployed in production, and can leverage insights, methods, and datasets from those recent efforts.

3.1. Neuromorphic computing and Loihi 2

Neuromorphic computing aims to apply fundamental principles of the brain's information processing mechanisms to engineered computing devices. The brain consumes a mere 20 watts of power yet can execute remarkable feats of perception, planning, control, and learning while operating in real time processing sequential data streams. In contrast, our conventional computer systems today struggle to emulate even a narrow subset of such feats with much larger power budgets, even though they have the advantage of precisely engineered ultra-fast nanoscale transistors as a computational substrate [38]. Indeed, biological inspiration is compelling. However, when computer architects go about designing neuromorphic systems, they face a fundamental question: What biology-inspired computational strategies unlock neuromorphic performance advantages versus conventional architectures?

Neuromorphic researchers have identified several promising strategies, such as analog computation, sparse connectivity, spike-based communication, in-memory computation, local synaptic learning rules, recurrent feedback, and stateful, dynamic neuron models [39]. Subsets of these computational strategies are being implemented in hardware, e.g. novel analog devices [40], analog computation in conventional circuits [41–44], digital processing with spike-based communication [45–49], and many others.

In the Intel Neuromorphic Computing Lab, we focus on designing all-digital neuromorphic processors that can be manufactured in state-of-the-art semiconductor process technology. The SOTA process enables direct comparisons to SOTA conventional architectures, and the all-digital character allows a broad range of architectural features to be rapidly prototyped with fully deterministic and repeatable execution. While the all-digital character sacrifices some efficiency benefits of analog computation, we believe it is most important to first rapidly explore the architecture-algorithm co-design space before undertaking the more difficult, slower, and currently less area-efficient path of analog circuit design and novel device engineering. We believe the subset of neuromorphic computational principles supported by our latest chip, Loihi 2, is sufficient to show significant gains in power and latency compared to conventional computer architectures, and that this will motivate further optimizations via more nascent neuromorphic computing principles.

Loihi 2 is a state-of-the-art neuromorphic chip designed to efficiently compute temporal dynamics in sparse networks using sparse, event-based communication [50]. Like its predecessor [46], Loihi 2 consists of neuron cores that compute the temporal dynamics of stateful neural models and a communication mesh optimized for spike-based communication. Loihi 2 implements a number of generalizations and optimizations motivated by the learnings and pain points of its predecessor. These include microcode-programmed neuron models, which enable a much wider variety of neurons as seen in the brain [51] as well as in novel neuromorphic algorithms [52] and promising computational benefits in heterogeneous networks [53]. Loihi 2 also features graded spikes, i.e. spikes that carry an integer value, rather than binary spikes. While not biologically motivated, graded spikes are only marginally more costly to support than binary spikes in digital neuromorphic hardware and offer straightforward gains in algorithmic precision and processing speed. Loihi 2 also enhances Loihi 1's learning support so arbitrary local modulating factors ('third factors') may be computed by postsynaptic neuron microcode. We believe Loihi 2's rich feature set is sufficient to unveil significant performance gains in tasks well-suited to temporal dynamics processing, hence the spirit of using Loihi 2 as a model for neuromorphic processing in the Intel N-DNS Challenge.

3.2. Neuromorphic audio processing and promising directions

The computational model implemented by neuromorphic processors such as Loihi 2 is that of a discretized dynamical system. Unlike conventional artificial neurons from machine learning, the state variables of a dynamical system evolve and process inputs in time; i.e. time is a fundamental ingredient of the computation. Thus we expect neuromorphic processors to naturally excel in temporal processing tasks, such as audio processing. Indeed, precisely-timed spiking codes are well-known to underlie audio processing in the brain [54–56], and cochleas perform sophisticated transformations to encode incoming audio for effective processing [55, 56]. These insights from neuroscience provide clear hope for the feasibility and success of neuromorphic audio processing, and recent progress on tasks such as keyword spotting provides some evidence thereof [7, 52, 57, 58].

One can immediately ascertain three critical research questions when designing a neuromorphic audio processing system: (1) How to efficiently represent an audio waveform with high fidelity in the neuromorphic domain? (2) How to efficiently perform the desired audio processing (denoising) on this neuromorphic representation? and (3) How to efficiently invert the neuromorphic representation to yield an output (waveform)?

A natural place to start is the first question: how to efficiently represent a waveform in the neuromorphic domain. There exist a variety of possibilities for representing data neuromorphically (e.g. binary spikes, graded spikes, population codes, sparse distributed codes, and phase codes) and a variety of encoding algorithms (e.g. biology-inspired cochleogram models [7, 52], Short-Time Fourier Transforms (STFTs) [59], and Mel-frequency cepstral coefficients [60]). Taking inspiration from biology, in developing our baseline solution for the Intel N-DNS Challenge, we initiated our study of neuromorphic audio encodings on cochleogram models, which can provide sparse representations in binary spikes, high sensitivity, frequency selectivity, large dynamic range, pitch-shifting, and self-peak normalization [52, 56, 61, 62]. However, we quickly realized that cochleogram models such as [7, 63] are generally computationally expensive to invert with high fidelity, which is prohibitive for a low-power denoising system. As an alternative, we developed our initial baseline solution for the Intel N-DNS Challenge using a more conventional audio encoding, the STFT [59], which is easy to invert and has perfect fidelity (aside from quantization and numerical error); furthermore, the STFT encoding can take advantage of graded spikes, which are supported on Loihi 2.
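To make the invertibility claim concrete, the sketch below encodes a toy waveform with an STFT and recovers it by weighted overlap-add. It is a minimal illustration written with the Python standard library only; the frame length, the Hann window, the 50% hop, and all function names are assumptions chosen for the example and are not taken from the baseline solution.

```python
import cmath
import math


def hann(n_fft):
    """Periodic Hann window."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n_fft) for i in range(n_fft)]


def dft(frame, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(frame)
    sign = 1 if inverse else -1
    out = [sum(v * cmath.exp(sign * 2j * math.pi * k * i / n)
               for i, v in enumerate(frame)) for k in range(n)]
    return [v / n for v in out] if inverse else out


def stft(signal, n_fft=16):
    """Windowed DFT of overlapping frames (hop = n_fft // 2)."""
    hop = n_fft // 2
    window = hann(n_fft)
    # Pad one hop on each side so every sample is covered by two frames,
    # then pad the tail to a whole number of hops.
    padded = [0.0] * hop + list(signal) + [0.0] * hop
    padded += [0.0] * (-len(padded) % hop)
    return [dft([w * s for w, s in zip(window, padded[start:start + n_fft])])
            for start in range(0, len(padded) - n_fft + 1, hop)]


def istft(frames, length, n_fft=16):
    """Weighted overlap-add inverse; divides by the summed squared window."""
    hop = n_fft // 2
    window = hann(n_fft)
    total = hop * (len(frames) - 1) + n_fft
    out = [0.0] * total
    norm = [0.0] * total
    for f, spectrum in enumerate(frames):
        frame = dft(spectrum, inverse=True)
        for i in range(n_fft):
            out[f * hop + i] += window[i] * frame[i].real
            norm[f * hop + i] += window[i] ** 2
    return [out[hop + t] / norm[hop + t] for t in range(length)]


signal = [math.sin(2 * math.pi * 3 * t / 40) for t in range(40)]
frames = stft(signal)
restored = istft(frames, len(signal))
max_error = max(abs(a - b) for a, b in zip(signal, restored))
```

In this sketch the reconstruction error is limited to floating-point round-off. A denoiser would modify `frames` (for example by applying a mask) before the inverse transform, and quantizing the coefficients for transmission as graded spikes would add quantization error on top.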

While we select an STFT encoding for our baseline, we emphasize that new solutions to the Intel N-DNS Challenge have a wide range of encoding strategies to explore, e.g. designing invertible bio-inspired cochleogram models, utilizing sparse STFTs [52], or even encoding schemes that depend on feedback from other portions of the neuromorphic denoising system, much like the recurrent feedback connections from deeper areas of the brain to more low-level sensory encoding areas. Importantly, the encoding used in a neuromorphic audio processing system must be co-designed with the task for efficient operation; indeed, such synergistic design is observed in biology [64, 65].

Secondly, after audio is encoded, the actual execution of the audio processing in the neuromorphic domain is a very open research opportunity. Neuromorphic audio processing systems can employ a wide variety of strategies to perform processing in the neuromorphic domain, such as simplistic DNN conversion [66], using a network of feedforward or recurrent leaky integrate-and-fire neurons [7, 67, 68], a network of complex resonate-and-fire neurons [52], or a sigma-delta neural network (SDNN) as we describe in the following subsection for our baseline solution. Methodologies inspired by conventional deep learning, e.g. multi-timescale networks [29, 31, 36] or attention [28], if mapped efficiently to the neuromorphic domain, could be promising directions as well. And finally for completeness—to address the third question posed above—decoding the output of the neuromorphically-processed audio again depends on the processing used and must be tailored appropriately to operate in an efficient manner.

Thus we see much opportunity for innovation throughout a neuromorphic processing pipeline—encoding, processing, and decoding. Furthermore, the audio denoising task represents just one potential audio processing task that opens the door to tackling many others with methods that are transferable to other signal processing domains such as wireless, biosensors, and control.

3.3. Baseline neuromorphic solution

We have developed a simple baseline neuromorphic solution to the audio denoising task, and we already begin to see evidence of significant energy efficiency gains from using neuromorphic features. The baseline solution uses an SDNN, an adaptation of a conventional feedforward ReLU neural network architecture that exploits sparse message passing with graded spikes and stateful neurons, computational strategies that can be implemented efficiently in neuromorphic architectures and that are supported by Loihi 2 in particular. The SDNN baseline solution achieves similar audio quality to a conventional baseline solution, NsNet2 from the Microsoft DNS Challenge 2022, but with an order of magnitude fewer operations and less than half its latency. We provide a more detailed overview of the baseline solution architecture and its performance in section 7.

Importantly, our SDNN baseline solution is a very basic feedforward architecture, and does not exploit several of the aforementioned neuromorphic features that perform well on Loihi 2 (table 1). As new solutions incorporate more of these features, such as recurrent and sparse connectivity, we anticipate further significant improvements in power and model size.
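The idea of sparse message passing with graded spikes and stateful neurons can be illustrated with a generic sigma-delta scheme. The sketch below is not the baseline's SDNN implementation: the function names, the threshold, and the toy activations are assumptions for illustration. A delta encoder transmits an integer-valued message only when its input has moved by at least one threshold step since the last transmission, and a sigma decoder accumulates the messages to track the input.

```python
def delta_encode(values, threshold):
    """Send integer (graded) messages encoding changes in the input.

    `reference` is the encoder state: the value the receiver currently
    holds. A message of 0 means nothing is transmitted at that step.
    """
    reference = 0.0
    messages = []
    for v in values:
        steps = int((v - reference) / threshold)  # truncates toward zero
        messages.append(steps)
        reference += steps * threshold
    return messages


def sigma_decode(messages, threshold):
    """Accumulate graded messages to reconstruct the signal."""
    total = 0.0
    out = []
    for m in messages:
        total += m * threshold
        out.append(total)
    return out


# Toy activations (hypothetical) that change only occasionally over time.
activations = [0.0, 0.1, 0.2, 1.3, 1.35, 1.3, 1.25, 0.3]
threshold = 0.25

messages = delta_encode(activations, threshold)
reconstructed = sigma_decode(messages, threshold)
sent = sum(1 for m in messages if m != 0)
```

For this toy input only two of the eight time steps produce a nonzero message, and the reconstruction stays within one threshold of the input. This is the trade-off such schemes expose: a larger threshold gives sparser communication at the cost of a coarser representation.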

Table 1. Neuromorphic features that are performant on Loihi 2 and their utilization in our N-DNS baseline solution.

Neuromorphic feature	In baseline solution
Sparse activity	✔
Sparse connectivity	✘
Recurrence	✘
Stateful neurons	✔
Neuron temporal dynamics	✘
Synaptic plasticity	✘
Graded spikes	✔
Delay as computational element	✔

4. Intel neuromorphic DNS challenge

As in the Microsoft DNS Challenge, the objective of the Intel Neuromorphic DNS Challenge is to create a system that removes the noise from noisy human speech in real time. However, in contrast to the denoising system that runs on a conventional CPU in the Microsoft DNS Challenge, the Intel N-DNS Challenge targets the Loihi 2 neuromorphic processor, aiming to realize the neuromorphic system's potential for power and latency improvements. To this end, the Intel N-DNS Challenge hosts two tracks:

1) Algorithmic. The objective in Track 1 is to develop a high-quality audio denoising solution that operates efficiently on a neuromorphic system. The algorithm is not required to run on actual neuromorphic hardware, but rather will be simulated on conventional hardware. Latency and a neuromorphic proxy power are estimated.
2) Loihi 2. The objective in Track 2 is to develop a high-quality audio denoising system that operates efficiently on Loihi 2 [50]. The power and latency of the denoising solution will be measured by running it on actual Loihi 2 hardware.

Track 1 provides freedom to explore a wide range of neuromorphic denoising solutions, without the need to demonstrate the solutions on actual neuromorphic hardware; this track is intended for rapid development and potentially to inspire future neuromorphic hardware features. Track 2 guarantees that neuromorphic denoising solutions can indeed run on actual neuromorphic hardware. This track provides a rigorous demonstration of power and latency benefits realized by neuromorphic hardware.

Both tracks follow the same structure: noisy audio is encoded into a form suitable for processing on a neuromorphic system, processed on a neuromorphic system (simulated for Track 1, or real hardware system for Track 2), and decoded into a clean output audio waveform (figure 2). Solutions are evaluated by an audio quality metric and a computational resource usage metric and are subject to a minimum audio quality and maximum latency (real-time) requirement.

**Figure 2.** Intel neuromorphic DNS challenge solution structure. Input noisy audio is encoded before it enters the neuromorphic denoiser (N-DNS). The neuromorphic denoiser processes its input, and the output of the neuromorphic denoiser is decoded to produce the final output clean audio. The encoder, decoder, and neuromorphic denoiser are the constituents of a solution to the Intel N-DNS Challenge and their power and latency are evaluated, in addition to the output audio quality. In Track 1, all components run on CPU, while in Track 2, the neuromorphic denoiser runs on Loihi 2.

The selection procedure for the winner of each track is described in the Intel N-DNS Challenge Github Repository, along with challenge logistics and timeline. Solutions will be judged not only on the measured or estimated computational metrics, but also on commercial relevance, broader research impact, and quality of solution write-up. We describe the dataset, evaluation metrics, and an example baseline solution in the following sections.

5. Dataset

The Intel N-DNS dataset is derived from the Microsoft DNS Challenge dataset, which is a corpus of human speech audio samples in various languages, including but not limited to English, German, French, Spanish, and Russian, together with various categories of noise (DNS Challenge GitHub Repository). We provide a synthesizer script in the challenge repository that generates 30 s segments of clean (ground truth), noise (additive), and noisy (ground truth + noise) audio data for both the training and validation datasets. For training the network, participants are free to choose and/or tweak the data synthesis parameters, to choose only a subset of the Microsoft DNS Dataset language and noise categories, or even to include additional speech and noise corpora for synthesis. The default is 500 total hours (60 000 samples) of audio data with the synthesized SNR between 20 dB and −5 dB, sampled at 16 kHz with a bit depth of 16 bits. The validation set, on the other hand, is generated using the default settings in the audio synthesis script.
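The relationship between clean, noise, and noisy segments at a given SNR can be sketched as follows. This is not the challenge's synthesizer script: the function names, the sine-wave stand-in for speech, the Gaussian stand-in for noise, and the uniform sampling of the target SNR are all assumptions made for illustration. Only the 16 kHz rate, the 30 s segment length, the 500 hour total, and the −5 dB to 20 dB range come from the text above.

```python
import math
import random


def mean_power(signal):
    """Average power of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def snr_db(clean, noise):
    """Signal-to-noise ratio in decibels."""
    return 10 * math.log10(mean_power(clean) / mean_power(noise))


def mix_at_snr(clean, noise, target_db):
    """Scale the noise to reach the target SNR, then add it to the clean signal."""
    gain = math.sqrt(mean_power(clean) / (mean_power(noise) * 10 ** (target_db / 10)))
    scaled = [gain * n for n in noise]
    noisy = [c + n for c, n in zip(clean, scaled)]
    return noisy, scaled


rng = random.Random(0)
sample_rate = 16_000

# Stand-ins for real recordings: 0.1 s of a 440 Hz tone and white noise.
clean = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(1600)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]

target_db = rng.uniform(-5.0, 20.0)  # assumed sampling scheme
noisy, scaled_noise = mix_at_snr(clean, noise, target_db)
achieved_db = snr_db(clean, scaled_noise)

# Default dataset size: 500 hours split into 30 s segments.
num_segments = 500 * 3600 // 30
```

The segment count works out to 60 000, matching the default stated above, and `achieved_db` equals `target_db` up to floating-point round-off.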

The testing data for Intel N-DNS Challenge will be provided at a later point after participant models are frozen. Thereafter, there can be no changes to the submitted models in order to ensure a fair evaluation on the test set. The characteristics of the testing data will be similar to the training and validation set. Note that this model freeze is only a feature for administering the challenge in a fixed timeline with blinded test set, and we encourage the continued use of the Intel N-DNS Challenge resources and framework as a general, non-time-bound challenge problem for neuromorphic research.

In addition, we include general dataloader modules in the Intel N-DNS Challenge that load the clean, noise, and noisy audio from the training, validation, and testing samples. Optionally, the dataloader also provides metadata about synthesized audio samples like the clean audio sources, noise sources, the noise mixture level and so on.

6. Evaluation

There is no single metric that captures the overall performance of a solution in the Intel N-DNS Challenge. Instead, there are multiple metrics that characterize different dimensions of performance. Naturally, we must quantify the output audio quality of the N-DNS system, and so we define metrics for this related to signal-to-noise ratio and perceptual audio quality. Equally important for the objective of the challenge is to assess computational resource costs: latency to ensure real-time processing, power to quantify energy efficiency, and chip resources required to support the solution on neuromorphic hardware. With these four performance dimensions covered, we can comprehensively evaluate each solution. We also consider certain derived figures of merit, such as power-delay product, a common quantity used to represent the tradeoff between speed and energy efficiency in electronics systems.

This collection of metrics allows us to compare solutions designed for different points in performance space, i.e. to position each solution on a Pareto frontier on which top-performing solutions may target low power or high power, with correspondingly lower or higher audio quality.

6.1. Audio quality metrics and minimum audio quality improvement

6.1.1. SI-SNR metric

Task performance in the N-DNS Challenge is measured as the output audio quality, for which we use the Scale-Invariant Source-to-Noise Ratio (SI-SNR), a common metric in the audio processing literature (e.g. [69, 70]). SI-SNR measures how clear the human speech is above the noise in the output of the N-DNS system, similar to a Source-to-Noise Ratio (SNR) [70]. Importantly, SI-SNR is also scale-invariant, i.e. changing the overall magnitude (volume) of the output does not change the SI-SNR; intuitively, we do not wish to favor solutions that simply increase the output volume.

For a single input waveform, a real-valued zero-mean vector s, and the corresponding output waveform from the N-DNS system $\hat{s}$ , the SI-SNR is defined as

$\begin{equation} \textrm{SI-SNR} : = 10 \log_{10} \frac{|| s_\textrm{target} ||^2 }{||e_\textrm{noise} ||^2}, \end{equation} \tag{ 3 }$

where $s_\textrm{target} : = \frac{\langle \hat{s}, s \rangle s}{||s||^2}$ and $e_\textrm{noise} : = \hat{s} - s_\textrm{target}$ .
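Equation (3) translates directly into a few lines of code. The sketch below is an illustration, not the evaluation script from the challenge repository: the function name `si_snr`, the explicit mean removal (the definition assumes zero-mean vectors), and the small `eps` stabilizer are assumptions made here.

```python
import math

def si_snr(estimate, target, eps=1e-12):
    """SI-SNR in dB of `estimate` (s_hat) with respect to `target` (s), per equation (3)."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)
    # Enforce the zero-mean assumption of the definition.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    s_hat = [v - mean_e for v in estimate]
    s = [v - mean_t for v in target]
    # s_target is the projection of s_hat onto s.
    dot = sum(a * b for a, b in zip(s_hat, s))
    scale = dot / (sum(b * b for b in s) + eps)
    s_target = [scale * b for b in s]
    # e_noise is everything in s_hat that is not explained by s.
    e_noise = [a - b for a, b in zip(s_hat, s_target)]
    num = sum(v * v for v in s_target)
    den = sum(v * v for v in e_noise)
    return 10.0 * math.log10((num + eps) / (den + eps))

# Target plus an orthogonal error of one tenth the amplitude: energy ratio 100.
target = [1.0, -1.0, 1.0, -1.0]
estimate = [1.1, -0.9, 0.9, -1.1]
value_db = si_snr(estimate, target)
louder_db = si_snr([3.0 * v for v in estimate], target)
```

In this toy case the error has one hundredth of the target energy, so `value_db` is approximately 20 dB, and `louder_db` matches it because rescaling the estimate rescales $s_\textrm{target}$ and $e_\textrm{noise}$ by the same factor.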

We choose SI-SNR as one of our metrics for its simplicity and generality, rather than more complicated audio quality metrics, such as speech-to-text word accuracy used in the Microsoft DNS Challenge [18]. The focus of the N-DNS challenge is on neuromorphic algorithm innovation; this in itself constitutes a sufficiently challenging task. Moreover, we view the audio denoising task as a representative of a general audio processing workload, and some commercial applications may not specifically prioritize human-listener perceptual quality. Finally, the SI-SNR can be conveniently used as a loss function for machine learning approaches.

The mean $\textrm{SI-SNR}$ on the test set will be used to compare solutions. A script for computing mean SI-SNR is provided in the Intel N-DNS Challenge Github Repository.

6.1.2. Minimum SI-SNR improvement

Since solutions in the Intel N-DNS Challenge are evaluated holistically, solutions may target high audio quality by using a large amount of power, or lower audio quality using a smaller amount of power, or any audio quality-power point in between. However, to ensure that the audio denoising task is being solved to some significant extent, we require solutions to achieve a minimum audio quality improvement over the noisy input audio quality. Moreover, per our emphasis on neuromorphic computing, we require that the neuromorphic component of the N-DNS system be responsible for a significant portion of the audio quality improvement; a solution may optionally perform some denoising in the encoder and decoder, but the spirit of the Intel N-DNS Challenge is in performing neuromorphic denoising.

Therefore, we define two measures of audio quality (SI-SNR) improvement (i), relative to (1) the noisy data ( $\textrm{SI-SNRi}_{\textrm{data}}$ ) and (2) the encode+decode processing ( $\textrm{SI-SNRi}_{\textrm{enc+dec}}$ ), and require both to satisfy the following inequalities:

$\begin{align} \textrm{SI-SNRi}_{\textrm{data}} &> 3~\textrm{dB} \end{align} \tag{ 4 }$

$\begin{align} \textrm{SI-SNRi}_{\textrm{enc+dec}} &> 3~\textrm{dB}, \end{align} \tag{ 5 }$

where

$\textrm{SI-SNRi}_\textrm{data} = \textrm{SI-SNR}_{\textrm{full system}} - \textrm{SI-SNR}_{\textrm{data}}$ ,
$\textrm{SI-SNRi}_\textrm{enc+dec} = \textrm{SI-SNR}_{\textrm{full system}} - \textrm{SI-SNR}_{\textrm{enc+dec}}$ ,
$\textrm{SI-SNR}_{\textrm{full system}}$ is the mean test-set SI-SNR from the full N-DNS system (input audio $\rightarrow$ encode $\rightarrow$ neuromorphic denoiser $\rightarrow$ decode $\rightarrow$ output audio),
$\textrm{SI-SNR}_{\textrm{enc+dec}}$ is the mean test-set SI-SNR from running only encoder and decoder (input audio $\rightarrow$ encode $\rightarrow$ decode $\rightarrow$ output audio), and
$\textrm{SI-SNR}_{\textrm{data}}$ is the mean test-set SI-SNR on the noisy input audio (no transformations).

Equation (4) ensures that the solution achieves a minimum audio quality improvement, and equation (5) ensures that the neuromorphic denoiser itself is responsible for a minimum audio quality improvement. These definitions allow for some amount of denoising to occur in the encoder and decoder, but critically, adding the neuromorphic denoiser must further improve audio quality. Similarly, additional pre/post-processing could be performed within the neuromorphic denoiser itself, to reduce the amount of computation in the encoder and decoder. But importantly, the computations allocated to the encoder/decoder or the neuromorphic denoiser are accounted for differently in the computational resource and chip usage metrics, as described in later sections.
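As an illustration of how inequalities (4) and (5) could be checked, the following sketch computes both improvements from three mean SI-SNR values. The function names are hypothetical. The first example uses the SDNN baseline values reported in table 2 (12.50 dB for the full system, 7.62 dB for the noisy validation data, and the same 7.62 dB for encode+decode, which follows from its reported 4.88 dB improvement); the second uses made-up numbers.

```python
def si_snr_improvements(full_system_db, enc_dec_db, data_db):
    """Return SI-SNRi relative to the noisy data and to encode+decode only."""
    return {
        "data": full_system_db - data_db,
        "enc+dec": full_system_db - enc_dec_db,
    }

def meets_minimum_improvement(full_system_db, enc_dec_db, data_db, threshold_db=3.0):
    """True only if both improvements strictly exceed the threshold, as in (4) and (5)."""
    improvements = si_snr_improvements(full_system_db, enc_dec_db, data_db)
    return all(value > threshold_db for value in improvements.values())

# SDNN baseline values from table 2 (validation set).
baseline = si_snr_improvements(full_system_db=12.50, enc_dec_db=7.62, data_db=7.62)
baseline_ok = meets_minimum_improvement(12.50, 7.62, 7.62)

# Hypothetical solution whose encoder/decoder does most of the denoising:
# inequality (4) holds (4.88 dB) but inequality (5) fails (1.5 dB).
shifted_ok = meets_minimum_improvement(full_system_db=12.50, enc_dec_db=11.00, data_db=7.62)
```

The second case shows why both inequalities are needed: the overall improvement is unchanged, but the neuromorphic denoiser contributes too little of it.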

6.1.3. DNSMOS metric

For audio signals, the perceptual quality is important in addition to the signal quality measured by SI-SNR. We use the widely adopted DNSMOS [71] metric to evaluate the perceptual quality of the solution. In DNSMOS, the perceptual quality score is predicted by a deep network trained to reproduce the human perceptual quality ratings, expressed as Mean Opinion Scores (MOS), in its training corpus. MOS ranges from 1 to 5, where 1 corresponds to poor quality and 5 corresponds to excellent quality. DNSMOS is particularly effective because it has been shown to generate scores that are more highly correlated with human perceptual assessment [71] than those of similar methods like Perceptual Evaluation of Speech Quality (PESQ) [72], Perceptual Objective Listening Quality Analysis (POLQA) [73], or ViSQOL [74]. There exist commercial alternatives like 3QUEST, but their use is limited due to their proprietary nature.

A DNSMOS score consists of three values: speech signal quality (SIG), background noise quality (BAK), and overall audio quality (OVRL). From the perspective of speech enhancement, the SIG score reflects the change in speech quality due to processing. Usually, most denoising algorithms do not improve SIG score significantly compared to the unprocessed signal. BAK score reflects the degree of noise present in the signal. Thus, after a speech enhancement, a significant improvement in this score is expected. Finally, OVRL score reflects the general audio quality assessment. It is not a simple average of SIG and BAK scores, but rather a general assessment of audio quality. After denoising, a signal should have a higher OVRL score.

DNSMOS provides a valuable additional facet in evaluating audio quality in the Intel N-DNS Challenge. In addition, it gives another point of comparison to existing denoising systems; namely, DNSMOS (OVRL) was used in the Microsoft DNS Challenge [18]. However, we emphasize that while DNSMOS is an important metric, it is not the only metric used for the evaluation of audio quality in the Intel N-DNS Challenge; indeed, the spirit of the Intel N-DNS Challenge is directed toward holistic innovation on neuromorphic denoising systems. Furthermore, to minimize the complexity of the Intel N-DNS Challenge, we choose to not introduce additional audio quality metrics, such as STOI [75], as the pairing of SI-SNR and DNSMOS already provides an objective and a perceptual audio quality evaluation.

6.2. Computational resource usage and real-time requirement

Computational resource cost is evaluated in terms of power, latency, number of parameters, and model size. To qualify as a real-time solution, the end-to-end latency must not be greater than 40 ms. We measure power and latency on neuromorphic hardware in Track 2, but for Track 1, we introduce proxy metrics.

6.2.1. Latency

An audio denoising system takes some amount of time to process input audio as the audio streams into the system; this results in the output human speech being delayed relative to the input human speech. This delay is the latency of the denoising system. For the denoising system to be considered real-time, the latency must be less than some human perceptual threshold, which in our case we choose to be 40 ms.

We define latency as the maximum time difference between any corresponding segment of audio in the input and output of the N-DNS system. Intuitively, the longest delay of any audio segment sets the delay at which the whole output must be presented in order to avoid playback speed fluctuations in the output audio.

Latency should be calculated by considering a real-time input propagating through an entire N-DNS system (figure 3). This includes data buffer latency, encoder-decoder latency, and network latency (N-DNS latency):

1) Data buffer latency is the time required to collect the audio stream to process one discrete timestep, however that may be defined for a given encoding scheme. For the STFT encoder in our SDNN baseline solution, the data buffer latency is equal to the STFT window length.
2) Encoder-decoder latency is the wall clock processing time to encode one discrete timestep-worth of the audio data, to be processed by the N-DNS network, and decode it back.
3) Network latency (N-DNS latency) is the latency introduced by the neuromorphic denoising network. It is measured as the time lag that maximizes the cross-correlation between the clean target audio and the denoised audio from the network.

In Track 1, notably, the (CPU) processing time for the neuromorphic denoiser (N-DNS) portion of the solution is not included in the latency calculation. We assume that the neuromorphic processing time will be small relative to the real-time timestep due to the high degree of parallelization in neuromorphic algorithms and hardware. In the case of the baseline SDNN, for example, the network must process a new STFT frame every 8 ms, whereas Loihi 2 circuits typically complete all spike processing and neuron evaluations for a timestep within microseconds. We provide an example Track 1 latency calculation in section 7.
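To make the three latency components concrete, the sketch below sums them for a toy signal. It is not the challenge's latency script. The helper names, the brute-force lag search, and the synthetic 48-sample network delay are assumptions made for illustration; the 512-sample buffer at 16 kHz and the 0.036 ms encoder-decoder time are taken from the SDNN baseline description and table 2.

```python
import random

SAMPLE_RATE_HZ = 16000

def estimate_lag(reference, delayed, max_lag):
    """Lag in samples (0..max_lag) at which `delayed` best aligns with `reference`."""
    n = len(reference) - max_lag  # compare the same number of samples at every lag
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        score = sum(reference[i] * delayed[i + lag] for i in range(n))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def total_latency_ms(buffer_samples, enc_dec_ms, network_lag_samples,
                     sample_rate_hz=SAMPLE_RATE_HZ):
    """Sum of data buffer, encoder-decoder, and network latency in milliseconds."""
    buffer_ms = 1000.0 * buffer_samples / sample_rate_hz
    network_ms = 1000.0 * network_lag_samples / sample_rate_hz
    return buffer_ms + enc_dec_ms + network_ms

# Synthetic stand-ins: "denoised" is the clean signal delayed by 48 samples (3 ms).
rng = random.Random(0)
clean = [rng.gauss(0.0, 1.0) for _ in range(2000)]
true_lag = 48
denoised = [0.0] * true_lag + clean[:-true_lag]

lag = estimate_lag(clean, denoised, max_lag=200)
latency = total_latency_ms(buffer_samples=512, enc_dec_ms=0.036, network_lag_samples=lag)
is_real_time = latency <= 40.0  # real-time requirement from section 6.2
```

Here the buffer contributes 32 ms, the hypothetical network delay 3 ms, and the encoder-decoder 0.036 ms, so the toy system stays within the 40 ms budget.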

For Track 2, latency is simply measured on a reference CPU + Loihi 2 system. The measurement methodology and an example will be provided in the Intel N-DNS Challenge Github Repository later in the challenge.

6.2.2. Power

For Track 1, we calculate a power proxy by estimating the effective number of synaptic operations per second:

$\begin{equation} P_\textrm{proxy} = \textrm{Effective SynOPS} = \textrm{SynOPS} + 10 \times \textrm{NeuronOPS}, \end{equation} \tag{ 6 }$

where SynOPS and NeuronOPS are the mean number of synaptic operations and mean number of neuron updates, respectively, per second of audio processed in the N-DNS stage. Synaptic operations and neuron operations can be considered the computational primitives of a neuromorphic system, and energy usage is roughly proportional to their number, with the approximate weighting of the energy of one neuron operation being equal to that of about ten synaptic operations in our experience with the Loihi architecture [46]. While $P_\textrm{proxy}$ gives only a crude power estimate, it provides a simple and sufficiently reliable assessment of a neuromorphic power advantage without needing to run on neuromorphic hardware.
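A minimal sketch of equation (6) follows, assuming that total operation counts have already been collected over a known duration of audio. The function name and the example counts are hypothetical; only the weighting factor of 10 comes from the text.

```python
NEURON_OP_WEIGHT = 10  # one neuron update is weighted as ten synaptic operations, per equation (6)

def power_proxy(total_syn_ops, total_neuron_ops, audio_seconds):
    """Effective SynOPS: operations per second of audio, with neuron updates weighted."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    syn_ops_per_s = total_syn_ops / audio_seconds
    neuron_ops_per_s = total_neuron_ops / audio_seconds
    return syn_ops_per_s + NEURON_OP_WEIGHT * neuron_ops_per_s

# Hypothetical counts accumulated while processing one 30 s clip.
p_proxy = power_proxy(total_syn_ops=300_000_000,
                      total_neuron_ops=12_000_000,
                      audio_seconds=30.0)
p_proxy_mops = p_proxy / 1e6  # expressed in M-Ops/s, the unit used in table 2
```

With these made-up counts, synaptic operations contribute 10 M-Ops/s and neuron updates 4 M-Ops/s after weighting, which illustrates how sparse messaging (fewer synaptic events) and smaller networks (fewer neuron updates) both lower the proxy.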

The power consumption of the encoder and decoder is not taken into account in Track 1. We make this choice for simplicity, in expectation of the neuromorphic power dominating in realistic solutions. Note that the real-time requirement implicitly bounds the amount of computation that can be performed in the encoder and decoder.

In Track 2, the encoder and decoder are implemented on a CPU and the N-DNS stage is implemented on a Loihi 2 system. The power is simply measured on a reference CPU and Loihi 2 system. Note that since both CPU and Loihi 2 power components will be measured, any attempt to implement a disproportionate amount of the denoising functionality inside the encoding/decoding CPU stages will result in a very high power result. Details for measuring power on a reference system will be provided in the Intel N-DNS Challenge Github Repository.

6.2.3. Power delay product

The power delay product (PDP) metric combines both latency and power efficiency in one number that allows comparing between different solutions that make different tradeoffs between running faster at higher power versus running slower at lower power. For Track 1, a proxy PDP measure is given by

$\begin{equation} \mathrm{PDP}_{\textrm{proxy}} = P_{\textrm{proxy}} \times L, \end{equation} \tag{ 7 }$

which is in units of Ops because $P_\textrm{proxy}$ (equation (6)) has units of Ops/s and the latency, $L$, has units of seconds.

For Track 2, PDP is directly calculated from the measured power as

$\begin{equation} \mathrm{PDP} = P_\textrm{Track 2} \times L. \end{equation} \tag{ 8 }$
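Both forms of the power delay product are the same multiplication; the practical pitfall is the unit of latency, which is reported in milliseconds elsewhere in this paper but must be in seconds here. The sketch below uses a hypothetical function name and made-up inputs.

```python
def power_delay_product(power, latency_ms):
    """PDP = power x latency, with latency converted from milliseconds to seconds.

    With power in Ops/s (equation (7)) the result is in Ops;
    with power in watts (equation (8)) the result is in joules.
    """
    return power * (latency_ms / 1000.0)

# Hypothetical Track 1 style input: 14 M-Ops/s power proxy at 32 ms latency.
pdp_proxy_ops = power_delay_product(power=14e6, latency_ms=32.0)
pdp_proxy_mops = pdp_proxy_ops / 1e6

# Hypothetical Track 2 style input: 50 mW measured power at the same latency.
pdp_joules = power_delay_product(power=0.050, latency_ms=32.0)
```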

6.2.4. Chip resources

The physical resource cost of mapping networks into neuromorphic architectures is an important evaluation metric since chip resources impose a hard constraint on network complexity. Compared to conventional architectures that scale through the use of bountiful off-chip memory, neuromorphic architectures embed all network configuration on-chip, hence are limited by available state for representing synaptic weights, network routing tables, neuron parameters, and other configuration parameters.

For Loihi 2 and similar architectures, the ultimate measure of a workload's chip resource cost is core count. For Track 2, this is the definitive chip resource utilization metric used in this challenge.

Before networks are successfully mapped to chip, it is difficult to reliably estimate core count requirements, so for Track 1, we assess solutions by indirect measures of resource cost: parameter count and total model size.

A network's parameter count includes its total synaptic state (e.g. weights and delays) and neuron parameters such as decay factors. Only unique parameters are to be counted, as expected to be uniquely configured in on-chip memories and tables leveraging convolutional and other network compression features. Note that a network's trainable parameters will be a subset of its total unique configuration parameters.

Model size is the sum over the bit widths of all unique parameters, measured in bytes. Since Loihi 2 supports a range of synaptic weights from one to eight bits, it is possible for two networks with the same parameter counts to have very different model sizes. All else being equal, solutions with smaller model sizes are preferred.
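The distinction between parameter count and model size can be sketched as follows. The representation of a network as a list of `(count, bit_width)` groups, the function names, and the two example networks are hypothetical and chosen only to show that equal parameter counts can give very different model sizes.

```python
def parameter_count(groups):
    """Total number of unique parameters across (count, bit_width) groups."""
    return sum(count for count, _ in groups)

def model_size_bytes(groups):
    """Sum of the bit widths of all unique parameters, converted to bytes."""
    total_bits = sum(count * bits for count, bits in groups)
    return total_bits / 8

# Two hypothetical networks: 1000 synaptic weights plus 32 neuron parameters each,
# differing only in the precision chosen for the synaptic weights.
network_8bit = [(1000, 8), (32, 16)]
network_2bit = [(1000, 2), (32, 16)]

count_8bit = parameter_count(network_8bit)
count_2bit = parameter_count(network_2bit)
size_8bit = model_size_bytes(network_8bit)
size_2bit = model_size_bytes(network_2bit)
```

Both networks have 1032 unique parameters, but the 2-bit variant occupies 314 bytes against 1064 bytes for the 8-bit variant.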

7. Baseline solution

We provide a baseline solution for Track 1 of the Intel N-DNS Challenge, available in the Intel N-DNS Challenge Github Repository. In this section, we outline the baseline solution architecture, a sigma-delta neural network, and the evaluation of the baseline solution on the metrics defined in section 6. Later in the challenge, we will provide a Loihi 2 version of the baseline solution and evaluate it on a Loihi 2 system; we will also release the Track 2 baseline associated code.

7.1. Sigma-delta neural network architecture

The proposed neuromorphic solution is a simple feedforward sigma-delta ReLU neural network (SDNN). The solution makes use of two neuromorphic computation principles: sparse message passing using sigma-delta neurons and temporal computation using axonal delays.

The delta encoding exploits the temporal similarity in the data. It sparsifies the data communicated to the next layer by sending only a change that is higher in magnitude than a certain threshold. The sigma encoding, on the other hand, reconstructs the original signal at the receiving end. A combination of sigma and delta units wrapped around a dynamics or a non-linearity (ReLU in this case) is a sigma-delta neuron [76]. Sigma-delta neurons make use of the sparse messaging paradigm in neuromorphic hardware and result in a significant reduction in synaptic computations.
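The delta and sigma operations can be illustrated on a scalar signal. This is a simplified sketch under assumptions made here (floating-point values, a single channel, the sender tracking the receiver's reconstruction), not the Lava-dl or Loihi 2 implementation; the function names are hypothetical.

```python
import math

def delta_encode(signal, threshold):
    """Send the change since the last transmitted state only if it exceeds the threshold."""
    messages = []
    reconstructed = 0.0  # what the receiver currently believes the signal to be
    for x in signal:
        change = x - reconstructed
        if abs(change) > threshold:
            messages.append(change)
            reconstructed += change
        else:
            messages.append(0.0)  # no message: the receiver keeps its previous value
    return messages

def sigma_decode(messages):
    """Accumulate received changes to reconstruct the signal at the receiving end."""
    decoded, total = [], 0.0
    for m in messages:
        total += m
        decoded.append(total)
    return decoded

# A slowly varying signal yields mostly zero messages.
signal = [math.sin(2.0 * math.pi * t / 200.0) for t in range(400)]
threshold = 0.1
messages = delta_encode(signal, threshold)
decoded = sigma_decode(messages)

num_sent = sum(1 for m in messages if m != 0.0)
max_error = max(abs(a - b) for a, b in zip(signal, decoded))
```

In this sketch the reconstruction error never exceeds the threshold, while only the nonzero messages would trigger synaptic computation downstream, which is the source of the savings described above.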

The axonal delays endow the network with a short-term memory capability that allows the interaction of audio/features originating at different points in time. Learnable axonal delays have been shown to increase the expressivity and performance of networks, particularly for applications with spatio-temporal features [68, 77]. Audio denoising is one such application.

The structure of the SDNN baseline solution is illustrated in figure 4, and we describe the solution in the following.

Encoder: The encoder is a straightforward STFT [59] of the noisy audio waveform followed by delta encoding of the STFT magnitude. The STFT uses a window length of 512 with a hop length of 128 ( $^1\!/\!_4$ window length), leading to 8 ms per time-step, as the signal is at 16 kHz. These parameters are user-configurable. The delta encoding sparsifies the STFT magnitude which is then fed to the N-DNS network.

N-DNS: The neuromorphic denoiser (N-DNS) network is a three-layer feedforward sigma-delta ReLU network with axonal delays. The sigma-delta layer efficiently performs denoising in the sparse domain. The axonal delays provide the network with short term memory which can be used to incorporate previous temporal patterns during denoising. The N-DNS network predicts a multiplicative mask at some delay which is then used to mask the STFT magnitude. The STFT phase and magnitude from the encoder need to be delayed accordingly during the decoding phase.

Decoder: The decoder combines the multiplicative mask predicted by the N-DNS network with the delayed STFT phase and magnitude of the noisy audio signal and performs inverse STFT with the same window length and hop length as the encoder. The resulting output is the clean reconstruction (denoised) audio waveform.
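The masking step of the decoder can be sketched independently of the STFT itself. In the illustration below, a mask predicted at time-step $t$ is applied to the STFT frame from `delay_frames` steps earlier, scaling its magnitude and keeping its phase; the inverse STFT that would follow is omitted. The function name, the list-of-lists frame layout, and the example values are assumptions made here, not the baseline code.

```python
import cmath

def apply_delayed_mask(stft_frames, masks, delay_frames):
    """Scale the magnitude of delayed STFT frames by the predicted masks, keeping phase.

    stft_frames[t] is a list of complex bins; masks[t] is the real-valued mask
    that the network emits at step t for the frame at step t - delay_frames.
    """
    denoised = []
    for t in range(delay_frames, len(masks)):
        frame = stft_frames[t - delay_frames]
        denoised.append([
            cmath.rect(m * abs(z), cmath.phase(z))  # masked magnitude, original phase
            for m, z in zip(masks[t], frame)
        ])
    return denoised

# Toy example: 4 frames of 3 bins each, a network delay of 1 frame,
# and a mask that halves every bin.
frames = [[complex(t + 1, b - 1) for b in range(3)] for t in range(4)]
half_masks = [[0.5] * 3 for _ in range(4)]
output = apply_delayed_mask(frames, half_masks, delay_frames=1)
```

The output has one frame fewer than the input because the first mask refers to a frame before the start of the signal; this delay is the network latency that must be accounted for in the latency metric.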

The SDNN baseline network was trained with Lava-dl⁴ , which includes the extended version of the SNN backpropagation training tool SLAYER [77]. Lava-dl SLAYER uses a surrogate gradient method (e.g. see [78]) to address the critical challenge in training spiking neural networks: the non-differentiability of spikes. The baseline network was trained with Loihi 2's fixed precision computation in mind, with appropriate quantization for synapse and neuron dynamics. We used a combination of negative SI-SNR and a mean-square error measuring the STFT magnitude reconstruction quality as the minimization loss and a RAdam optimizer for training. The detailed training procedure, as well as Lava⁵ evaluation of the baseline network, are available in the Intel N-DNS Challenge Github Repository.

7.2. Evaluation metrics

We evaluated the SDNN baseline solution, Microsoft NsNet2 (the baseline network for Microsoft DNS 2022), and Intel DNS network using Track 1 metrics on the validation set. The metrics are summarized in table 2. All three networks use STFT encoding and ISTFT decoding.

Table 2. Evaluation metrics comparison.

| Network | SI-SNR (dB) | SI-SNRi data (dB) | SI-SNRi enc+dec (dB) | DNSMOS ^b OVRL | DNSMOS ^b SIG | DNSMOS ^b BAK | Latency enc+dec ^a (ms) | Latency total (ms) | Power proxy (M-Ops s⁻¹) | PDP proxy (M-Ops) | Param count ( $\times 10^3$ ) | Model size (KB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Microsoft NsNet2 | 11.89 | 4.26 | 4.26 | 2.95 | 3.27 | 3.94 | 0.024 | 20.024 | 136.13 | 2.72 | 2681 | 10 500 |
| Intel DNS network | 12.71 | 5.09 | 5.09 | 3.09 | 3.35 | 4.08 | 0.036 | 32.036 | — | — | 1901 | 3802 |
| SDNN baseline | 12.50 | 4.88 | 4.88 | 2.71 | 3.21 | 3.46 | 0.036 | 32.036 | 14.54 | 0.44 | 525 | 465 |
| Validation set (noisy) | 7.62 | — | — | 2.45 | 3.19 | 2.72 | — | — | — | — | — | — |

Higher is better ( $\uparrow$ ) for SI-SNR, SI-SNRi, and DNSMOS; lower is better ( $\downarrow$ ) for latency, power proxy, PDP proxy, parameter count, and model size.

^a Latency results measured on a system with Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz and 32 GB RAM as of February 2023 and may not reflect all publicly available security updates. Results may vary.

^b Please note that the DNSMOS scores in this table are not directly comparable to the DNSMOS scores presented in the results of the Microsoft DNS Challenge due to differing composition of validation/test sets.

Intel DNS network is an Intel proprietary network used in production. The model is causal, operates in real-time, and is built from LSTM and 2D convolution layers. Power metrics for this network are not available. The network was trained using proprietary datasets and augmentation techniques, and as such we view its audio quality results as upper-bound aspirational targets for challenge submissions.

The audio quality metrics include DNSMOS scores, SI-SNR, and improvement in SI-SNR (SI-SNRi). The encoder and decoder for all three networks perform lossless transformation using STFT and ISTFT. As a result, relative performance differences across models in SI-SNR and SI-SNRi are equal.

The latency was calculated by summing data buffer latency, encoder-decoder latency, and network latency (N-DNS latency), as described in section 6.

Power proxy and PDP proxy metrics provide some measure of the relative power and power-delay-product across the three networks suitable for Track 1 comparisons. For the SDNN baseline, these are calculated according to equations (6) and (7), respectively. For the conventional Microsoft NsNet2 network, Ops refer to multiply–accumulate operations without considering the negligible cost of per-neuron ReLU evaluation.

We see that our SDNN baseline is a promising neuromorphic solution to the audio denoising problem. In terms of audio quality, the SDNN baseline has a higher SI-SNR relative to the NsNet2 baseline solution from the Microsoft DNS Challenge 2022, and lower relative DNSMOS scores. Notably, our baseline solution training targeted an SI-SNR loss, thus better relative SI-SNR performance may be expected. Nonetheless, it is encouraging to see substantial DNSMOS improvement over the unprocessed noisy input in a system not trained specifically for perceptual quality. And importantly, the SDNN solution is an order of magnitude more efficient than the NsNet2 baseline in terms of the power proxy even though it processes data at a throughput 1.25× higher than the NsNet2 baseline, and it uses 5× fewer parameters. The quantization-aware training of the baseline SDNN solution further reduces the model size by 22× compared to NsNet2.

Naturally, the NsNet2 solution is a baseline and does not represent state-of-the-art for audio denoising today. For example, the Intel production DNS model (Intel DNS network) achieves higher SI-SNR and DNSMOS than both NsNet2 and the SDNN baseline solution (table 2). Given the simplicity of our SDNN baseline solution as a starting point for neuromorphic audio denoising, we believe it will be possible to significantly improve its denoising quality while also reducing its computational resources with further algorithmic innovations in the Intel N-DNS Challenge.

Notably, the sigma-delta approach in our baseline solution is quite general. Sigma-delta sparsification can be applied to any conventional ReLU-like nonlinearity as well as to the dynamics present in typical neuromorphic neuron models such as leaky integrators and resonators. Furthermore, sigma-delta sparsification represents just one of many neuromorphic features that participants in the challenge can exploit. We see a wide space of uncharted waters to explore for the Intel N-DNS Challenge. Our baseline solution represents just a first step, and we find it encouraging that it already provides promising results.

8. Additional information

Please see the Intel N-DNS Challenge Github Repository for the official competition rules, timeline, registration procedure, metrics boards, code, and datasets. Any additional clarifications that may arise during the challenge will be posted there.

9. Conclusion

We introduce the Intel Neuromorphic DNS Challenge to fulfill a vital need for a widely-applicable challenge problem that facilitates algorithm innovation leading to a clear demonstration of neuromorphic hardware benefits.

We include two tracks to encourage (1) algorithmic innovation and (2) demonstration on neuromorphic hardware, and we specify task performance metrics and computational cost metrics to make it easy to compare different solutions. Furthermore, we provide permissively-licensed dataloader scripts, evaluation scripts, and an example neuromorphic baseline solution for accessibility, convenience, consistency, and extensibility. We also offer a monetary prize to encourage participation.

We look forward to the learnings that we gain as a community through the Intel N-DNS Challenge, both in terms of the innovation that occurs in the solution space, as well as the insights that can inform the development of future neuromorphic challenge problems.

Data availability statement

The data that support the findings of this study are openly available at the following URL/DOI: https://github.com/microsoft/DNS-Challenge.
