Quantized non-volatile nanomagnetic domain wall synapse based autoencoder for efficient unsupervised network anomaly detection

Anomaly detection in real time using autoencoders implemented on edge devices is exceedingly challenging due to limited hardware, energy, and computational resources. We show that these limitations can be addressed by designing an autoencoder with low-resolution non-volatile memory-based synapses and employing an effective quantized neural network learning algorithm. We further propose nanoscale ferromagnetic racetracks with engineered notches hosting magnetic domain walls (DW) as exemplary non-volatile memory-based autoencoder synapses, where limited-state (5-state) synaptic weights are manipulated by spin-orbit torque (SOT) current pulses that write different magnetoresistance states. The anomaly detection performance of the proposed autoencoder model is evaluated on the NSL-KDD dataset. Training that accounts for the limited resolution and the stochasticity of the DW devices yields anomaly detection performance comparable to that of an autoencoder with floating-point precision weights. While the limited number of quantized states and the inherent stochasticity of DW synaptic weights in nanoscale devices are typically expected to degrade performance, our hardware-aware training algorithm leverages these imperfect device characteristics to improve anomaly detection accuracy (90.98%) over the accuracy obtained with floating-point synaptic weights, which are extremely memory intensive. Furthermore, our DW-based approach demonstrates a reduction of at least three orders of magnitude in weight updates during training compared to the floating-point approach, implying a significant reduction in operation energy for our method. This work could stimulate the development of extremely energy-efficient non-volatile multi-state synapse-based processors that can perform real-time training and inference on the edge with unsupervised data.


Introduction
In today's interconnected world, the security and integrity of computer networks are of paramount importance. By 2030, it is estimated that 500 billion devices will be connected to the internet [1], with a significant portion comprising internet of things (IoT) devices. While the rapid growth of network-based devices, applications, and services offers immense convenience, it has also led to an increasing number of cyber threats and attacks [2]. The proliferation of cybercrimes and network intrusions underscores the need for robust solutions that can safeguard network security. Anomaly detection plays a vital role in safeguarding these networks by identifying and mitigating abnormal or malicious activities that deviate from expected patterns [3]. Detecting such anomalies in real time is crucial to prevent potential damage, data breaches, and service disruptions [4]. As we look towards a future in which IoT and edge computing gain prominence, the need for efficient and effective anomaly detection becomes even more critical. In this work, we design a quantized autoencoder with low-resolution non-volatile magnetic domain wall (DW) synapses for unsupervised network anomaly detection. Our DW-based approach achieves a reduction of at least three orders of magnitude in weight updates during training in contrast to the floating-point approach, indicating substantial energy conservation benefits inherent to our method. While our study focuses on the use of the DW device as an illustrative example, the insights gained from our research can be extended to other non-volatile multi-state memory technologies. This exploration opens up avenues for implementing quantized autoencoders in anomaly detection, leading to improved efficiency and effectiveness.
The subsequent sections of the paper are structured as follows: section 2 explores related research in both anomaly detection with autoencoders and use of DW devices as non-volatile memory for synaptic weights. Section 3 delves into the data and the data preprocessing steps. Section 4 provides comprehensive information regarding the proposed quantized autoencoder-based anomaly detection, including details on the autoencoder model, quantization-aware training process, anomaly detection workflow, and algorithm. Section 5 presents the design of the DW synapse and the micromagnetic simulations required for the DW synapse-based autoencoder. Section 6 investigates the performance of anomaly detection using the proposed method. Finally, section 7 concludes the paper by discussing the outcomes of this study and outlining potential future directions.

Related work: anomaly detection with autoencoders and nanoscale magnetic DW synapses
In recent years, the increasing adoption of machine learning approaches for anomaly detection has been driven by the limitations and high costs associated with conventional signature-based intrusion detection techniques [28]. These traditional methods prove inadequate in effectively detecting zero-day attacks, which are characterized by their unknown and previously unseen nature [29]. Various supervised learning-based classification algorithms and hybrid models combining multiple algorithms have been explored to identify network anomalies and detect attacks with high accuracy. Notable algorithms include Support Vector Machine, Decision Tree, Naive Bayes Network, Naive Bayes Tree, J48, fuzzy logic, and artificial neural networks [27, 30-32]. However, the effectiveness of these algorithms hinges on accurate labels and balanced training data [33]. The availability of such data, particularly in the realm of network intrusion detection, is limited due to factors like privacy concerns and data confidentiality [34]. To overcome this constraint, researchers have turned to unsupervised learning techniques, such as anomaly detection algorithms based on autoencoders [6, 7]. Moreover, studies have been conducted on unsupervised deep learning circuits using memristors to enable real-time anomaly detection with autoencoders on low-power devices [35].
In the emerging field of spintronic memory devices, researchers have made significant advancements in leveraging their properties for efficient computing. It has been shown that non-volatile nanomagnetic devices can be controlled efficiently using voltage-controlled magnetic anisotropy [36-38], voltage-induced strain [39-43], current control [44, 45], and a combination of both current and voltage control [23, 46, 47]. Studies have demonstrated that, despite having imperfect device characteristics, nanomagnetic devices can be implemented as multi-state synapses for deep neural networks [23, 24, 48]. However, for quantized neural network implementation, weight gradients need to be stored in high-precision memory during the training phase to retain high accuracy [11, 49]. In recent studies, it has been observed that by incorporating an effective quantization-aware training algorithm, stochastic and extremely low-resolution (less than 3-bit) DW devices can achieve comparable accuracy levels to floating-point precision neural networks [24].
While the feasibility and implementation of various deep neural network architectures using emerging spintronic memory devices are actively being studied, the implementation of autoencoders using nanomagnetic devices remains relatively unexplored. This research gap motivates our exploration of the connection between unsupervised autoencoder-based anomaly detection and spintronic nanomagnetic synapse technology.

Data and preprocessing
In this section, we explore the characteristics of the NSL-KDD dataset and outline the preprocessing steps undertaken to prepare the data for further analysis.

NSL-KDD data
The NSL-KDD dataset is derived from the KDD Cup 1999 dataset, a comprehensive collection of network traffic data containing both normal and various attack instances [27]. The NSL-KDD dataset contains two distinct sets of data: KDDTrain+ and KDDTest+. The training data (KDDTrain+) consists of 125 973 packets, each categorized into one of 23 distinct data types. These types include the malicious categories Ipsweep, Guess_passwd, Warezclient, Neptune, Multihop, Perl, Smurf, Phf, Rootkit, Imap, Loadmodule, Portsweep, Nmap, Back, Pod, Spy, Land, Warezmaster, Satan, Buffer_overflow, Teardrop, and Ftp_write, as well as the category labeled Normal [50]. Among these, 58 630 packets in the KDDTrain+ dataset are labeled as malicious, while the remaining 67 343 packets represent normal traffic. The KDDTest+ dataset comprises a total of 22 544 packets, with 12 833 packets categorized as malicious and the remaining 9711 packets labeled as normal.
Each packet in the NSL-KDD dataset consists of 41 features that are identical to those in the KDD Cup 1999 dataset. Table 1 provides an overview of the 41 features along with their corresponding data types. The 42nd feature represents the data label (normal/attack) [50]. However, for training purposes, the packets do not include the data label. Among the selected features, 38 are numeric (int64/float64) and the remaining 3 are categorical (object) variables.

Preprocessing steps
Prior to the training phase, preprocessing is applied to all packets in the dataset. First, the categorical features are converted into numerical representations. Specifically, the second position (protocol type), third position (service), and fourth position (flag) contain categorical data. These three categorical features are converted to one-hot encoded numerical values. For instance, in the 'protocol type' feature, the strings 'tcp', 'udp', and 'icmp' are replaced with their respective one-hot encoded representations: [1 0 0], [0 1 0], and [0 0 1]. As a result, the three categorical features in the NSL-KDD dataset, namely 'protocol type', 'service', and 'flag', which have 3, 70, and 11 distinct strings respectively, are transformed into a total of 84 features. This one-hot encoding process leads to a combined total of 122 features, including the original 38 numeric features.
Additionally, the dataset undergoes normalization to ensure consistency. Each feature vector is normalized to the range [0, 1] by scaling it with the maximum value within that feature vector. All types of malicious data are labeled as '1', while normal data packets are labeled as '0'.
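As an illustration, these preprocessing steps can be sketched with pandas; the column names below are hypothetical placeholders for the NSL-KDD feature names, so this is a sketch rather than the exact pipeline used here.

```python
import pandas as pd

# Hypothetical column names for the three categorical features
# (positions 2-4 in each NSL-KDD record).
CATEGORICAL = ["protocol_type", "service", "flag"]

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # One-hot encode the categorical features: 3 + 70 + 11 distinct
    # strings become 84 binary columns, for 122 features in total.
    df = pd.get_dummies(df, columns=CATEGORICAL, dtype=float)
    # Scale every feature to [0, 1] by its maximum value.
    for col in df.columns:
        m = df[col].max()
        if m > 0:
            df[col] = df[col] / m
    return df
```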

Autoencoder
An autoencoder is a neural network architecture that employs unsupervised learning to reconstruct input data. Comprising multiple layers, including one or more hidden layers, the autoencoder maintains the same size for both the input and output layers. At the center of the network lies the bottleneck layer, which represents a compressed latent-space representation of the input data. The encoder maps the input to the bottleneck layer representation, while the decoder reconstructs it in the output layer [7]. Figure 1 illustrates the architecture of a standard autoencoder. In the encoding phase, an n-dimensional input vector $X = [x_1, x_2, x_3, \ldots, x_n]$ is mapped to the hidden layer representation H. The encoding operation is expressed as equation (1):

$H = F_1(W_1 X + b_1)$ (1)

where $W_1$ is the weight matrix, $b_1$ is the bias vector, and $F_1$ denotes the encoder activation function.
In the decoding phase, the latent-space representation H is mapped to reconstruct the input (X) as the output $R = [r_1, r_2, r_3, \ldots, r_n]$. The decoding operation is expressed as equation (2):

$R = F_2(W_2 H + b_2)$ (2)

where $W_2$ is the weight matrix, $b_2$ is the bias vector, and $F_2$ denotes the decoder activation function.
As the autoencoder goes through training with backpropagation, the weights and biases are updated to minimize the loss function (L), which measures the reconstruction error. It can be expressed as equation (3):

$L(\Theta) = \frac{1}{n} \sum_{i=1}^{n} (x_i - r_i)^2$ (3)

where Θ denotes the autoencoder parameters (weights and biases). Here, L is a mean squared error (MSE) loss function.
The reconstruction error is used to determine whether a network traffic sample is normal or malicious.During the testing phase, if a network sample shows a high reconstruction error, it is likely to be considered as a malicious packet.This is because an autoencoder trained on normal network traffic packets generally has low reconstruction error for normal data.

Autoencoder model
In this study, we use an autoencoder architecture comprising five layers. The number of nodes in each layer, ranging from the input to the output layer, is [122-32-10-32-122], influenced by the model architectures investigated in [7]. However, it is important to note that our primary focus is not the impact of different autoencoder model architectures; instead, we aim to compare the performance of an autoencoder with low-resolution quantized DW synapses to a similar autoencoder with floating-point synapses. The autoencoder comprises three hidden layers: the first hidden layer consists of 3904 synapses, the second hidden layer consists of 320 synapses, and the third hidden layer also contains 320 synapses. Furthermore, the output layer is composed of 3904 synapses. Thus, the total number of artificial synapses within the autoencoder architecture sums up to 8448.
The input features undergo a sequence of weighted-sum operations and activation functions within the autoencoder model until they are reconstructed in the output layer. Specifically, we employ the sigmoid function as the activation function across all hidden and output layers within the autoencoder architecture. This activation choice introduces essential non-linearity into the model, enhancing its ability to learn intricate patterns and representations embedded within the input data. The sigmoid function, which maps input values to the interval [0, 1], aligns well with our autoencoder's requirements, where both input data and reconstructed outputs are confined to this range. By utilizing sigmoid activation in the output layer, our model's reconstructed outputs naturally conform to the [0, 1] input data scale, simplifying direct comparison and reconstruction tasks. In contrast, alternative activation functions might necessitate additional normalization or scaling methods to match the input/output data range. Moreover, the sigmoid function's smoothness and differentiability contribute to stable and efficient training of our autoencoder model, particularly when employing gradient-based optimization techniques. The autoencoder is trained using the backpropagation algorithm in conjunction with stochastic gradient descent. This enables the adjustment of the weights and biases of the model based on the difference between the predicted output and the actual input. The performance of the autoencoder is measured using MSE as the loss function. Figure 2 illustrates the autoencoder architecture.
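For concreteness, a minimal PyTorch sketch of the [122-32-10-32-122] model described above follows; the layer sizes, sigmoid activations, MSE loss, and stochastic gradient descent mirror the text, while the learning rate is an arbitrary placeholder.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_features: int = 122):
        super().__init__()
        # Encoder: 122 -> 32 -> 10 (bottleneck); 3904 + 320 synapses.
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 32), nn.Sigmoid(),
            nn.Linear(32, 10), nn.Sigmoid(),
        )
        # Decoder: 10 -> 32 -> 122; 320 + 3904 synapses (8448 in total).
        self.decoder = nn.Sequential(
            nn.Linear(10, 32), nn.Sigmoid(),
            nn.Linear(32, n_features), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

model = Autoencoder()
loss_fn = nn.MSELoss()                                   # reconstruction loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # placeholder rate
```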

Autoencoder quantization
Since the proposed nanomagnetic synapses have a limited number of states, the autoencoder parameters are quantized in both the training and inference stages. The quantization process can be mathematically expressed with the following functions [51]:

$\Delta = \frac{h - l}{n - 1}, \qquad N_q = l + \Delta \cdot \mathrm{round}\!\left(\frac{N_{fp} - l}{\Delta}\right)$ (4)

where $N_q$ is the quantized value, $N_{fp}$ is the full-precision value, $[l; h]$ is the quantization range, and $n$ is the number of quantization levels.
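A sketch of this uniform quantizer in PyTorch is given below; the symmetric range [−1, 1] and the round-to-nearest rule are illustrative assumptions, and [51] should be consulted for the exact variant used. The straight-through estimator helper anticipates the training process described in the next section.

```python
import torch

def quantize(w: torch.Tensor, n: int = 5,
             l: float = -1.0, h: float = 1.0) -> torch.Tensor:
    """Map full-precision values to n evenly spaced levels in [l, h]."""
    delta = (h - l) / (n - 1)          # spacing between adjacent levels
    w = w.clamp(l, h)                  # restrict to the quantization range
    return l + delta * torch.round((w - l) / delta)

def quantize_ste(w: torch.Tensor, n: int = 5) -> torch.Tensor:
    # Straight-through estimator: quantized values in the forward pass,
    # identity gradients in the backward pass (the .detach() trick).
    return w + (quantize(w, n) - w).detach()
```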

Autoencoder training process
While KDDTrain+ and KDDTest+ carry both normal and malicious traffic samples, only the normal traffic samples from KDDTrain+ are used for training the quantized autoencoder. Figure 3 illustrates the workflow for quantized autoencoder-based anomaly detection. In the training phase, the processed normal traffic samples are sent to the autoencoder, where the original features are encoded to a latent-space representation. Next, output features are reconstructed from this latent space. The reconstruction error is assessed from the difference between the reconstructed traffic sample and the original sample. The reconstruction error is measured for all training samples and its standard deviation is estimated, which acts as the threshold for detecting anomalies. In the inference phase, network traffic samples (carrying both normal and malicious samples) are sent to the trained autoencoder, and the reconstruction error is calculated for each sample. The difference between the reconstruction error of a sample and the mean error over samples is referred to as the anomaly score (AS). The AS is then compared to the threshold: if the AS is greater than or equal to the threshold, the traffic sample is inferred as malicious; otherwise, the sample is inferred as benign. Quantization-aware training of the autoencoder is performed by quantizing the weights according to (4) in the feed-forward phase and using the backpropagation algorithm based on low-precision neural network training [11]. In this process, while all weights are quantized in the forward pass, weight gradients are stored in separate high-precision memory units during the backward pass to retain accuracy. However, the weight gradients need to propagate through non-differentiable quantization blocks in the backward pass. To tackle this issue, the straight-through estimator is used, which provides a workaround by treating the quantization operation as a simple identity function during backpropagation [11]. This allows the gradients to be backpropagated as if the quantization were not applied. The threshold calculation and anomaly scoring procedure are sketched below.
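The following is a reconstruction of the thresholding and scoring workflow from the prose above; the function names and array conventions are ours, and `model` is assumed to be any callable that maps an array of samples to their reconstructions.

```python
import numpy as np

def reconstruction_errors(model, X: np.ndarray) -> np.ndarray:
    """Per-sample MSE between inputs and their reconstructions."""
    R = model(X)
    return ((X - R) ** 2).mean(axis=1)

def fit_threshold(model, X_normal: np.ndarray):
    # Training phase: the standard deviation of the reconstruction errors
    # over normal traffic acts as the anomaly detection threshold.
    errors = reconstruction_errors(model, X_normal)
    return errors.mean(), errors.std()

def is_anomaly(model, X: np.ndarray, mean_err: float, threshold: float):
    # Inference phase: anomaly score (AS) = sample reconstruction error
    # minus the mean error; AS >= threshold -> inferred malicious.
    scores = reconstruction_errors(model, X) - mean_err
    return scores >= threshold
```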

DW synapse design
In this section, we discuss the proposed low-resolution (quantized) DW synapse design for the autoencoder. Being non-volatile, the DW memory retains data for a long time even in the absence of power. We design the device by simulating a thin ferromagnetic racetrack with five engineered notches at which DW positions can be controlled with current pulses. The racetrack has dimensions of 560 nm × 60 nm × 1 nm. Along the racetrack, notches are incorporated at regular intervals of 100 nm, with the first and fifth notches positioned at 80 nm and 480 nm, respectively. Figure 4 illustrates the design of the nanomagnetic DW-based synaptic device.
In the designed DW synapse configuration, fixed magnetization zones are strategically positioned at the ends of the DW track to ensure stability and control over the DW movement. Specifically, at the start of the DW track, a pinned up-state domain is established on the left, and conversely, a pinned down-state domain is established on the right at the end of the DW track. Due to this configuration, the left side of the DW track, adjacent to notch 1, remains consistently in the up-state, while the right side, near notch 5, remains in the down-state. This fixed-zone arrangement effectively confines the DW within the racetrack structure.
To control the DW position, current pulses with specific amplitude and fixed pulse duration are applied across the heavy metal layer. The current pulses generate SOT, which acts on the magnetic racetrack above. By varying the number and direction of the current pulses, the DW can be moved to different positions.
A magnetic tunnel junction (MTJ) is formed by combining the racetrack (free layer), an insulator (MgO tunneling layer), and a ferromagnetic reference layer, as seen in figure 4(b). The MTJ is used to read the state of the racetrack's magnetization. DW positions are encoded as the conductance of the MTJ, thereby creating a synapse that can be programmed with current pulses. For the DW-based MTJ device, resistance values of 16.75 kΩ and 23.45 kΩ represent the low and high resistance states respectively, with a difference of 6.7 kΩ between these states. This configuration results in a tunneling magnetoresistance value of 40% during read operations [52].

Micromagnetic simulations
Extensive micromagnetic simulations are performed using mumax3 [53], which simulate the magnetization dynamics of the DW in the magnetic racetrack while considering thermal noise at room temperature (300 K). The simulations provide insights into the evolution of the DW synapse. Table 2 presents the parameters employed for the micromagnetic simulations.
Figure 4(a) illustrates the micromagnetic configuration of the racetrack's free layer. A current pulse of fixed amplitude 85 × 10⁹ A m⁻² and fixed duration 0.5 ns is applied through the heavy metal layer to initiate the DW depinning from its initial pinned position and move it towards the intended adjacent notch. However, due to the DW tilting caused by the presence of the Dzyaloshinskii-Moriya interaction [54] and thermal noise, the DW exhibits significant stochastic motion when driven by the SOT current pulses. As a result, the DW might get pinned at a notch other than the intended one after the SOT current pulses are applied. Figure 5 illustrates the probabilistic distribution of the DW positions due to stochastic variation in the DW motion. The equilibrium pinned positions of the DWs are used to calculate the conductance of the MTJ using equation (5) [24]:

$G_{synapse} = \frac{1 + \langle m_z \rangle}{2} G_{max} + \frac{1 - \langle m_z \rangle}{2} G_{min}$ (5)

where $\langle m_z \rangle$ denotes the average magnetization moment of the ferromagnetic racetrack along the z-direction.
The magnetization of the reference ferromagnetic layer is assumed to point upward in the +z direction. $G_{max}$ and $G_{min}$ respectively denote the maximum and minimum conductance of the synaptic device. The conductance of the DW device ($G_{synapse}$), which corresponds to the position of the DW, represents the weight. However, this means that only positive weights can be attained using the DW device, since conductance values are inherently positive quantities. Consequently, synaptic weight updates are confined to positive values spanning from $G_{min}$ to $G_{max}$. To address the need for weight updates in both positive and negative directions, the circuit model illustrated in figure 6 is designed [24]. This model utilizes two separate rows connected to a single column (bit line or BL) in the crossbar. These rows are supplied with opposite-polarity voltages and connect the synaptic devices to an additional conductance ($G_{parallel}$) in parallel. A negative input voltage is applied to $G_{parallel}$ to accommodate both positive and negative linear weight updates. The synapse conductance can be calculated using Kirchhoff's current law for a single column [55]:

$I_{BL} = \sum_i V_i \left( G_{synapse,i} - G_{parallel} \right)$ (6)

where $G_{parallel} = \frac{G_{max} + G_{min}}{2}$ (the average of the maximum and minimum conductance values of the DW device). Solving equation (6) for a single synapse gives

$I_{BL} = V_1 W_{1,1}$ (7)

where $W_{1,1} = G_{synapse} - G_{parallel}$, with $G_{synapse}$ set by the position of the DW. The equivalent conductance and resistance can be formulated as

$G_{eq} = G_{synapse} - \frac{G_{max} + G_{min}}{2}$ (8)

$R_{eq} = \frac{1}{G_{eq}}$ (9)

$R_{eq} = \left( \frac{1}{R_{synapse}} - \frac{1}{2}\left( \frac{1}{R_{min}} + \frac{1}{R_{max}} \right) \right)^{-1}$ (10)

where $R_{synapse}$ is the DW device resistance corresponding to the position of the DW, and $R_{min}$ and $R_{max}$ are the resistances corresponding to $G_{min}$ and $G_{max}$, respectively. Consequently, this approach enables the realization of linear weights in both positive and negative directions.
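A small numerical sketch of this differential mapping follows, using the resistance values quoted in the DW synapse design section; the assumption of five evenly spaced conductance levels across the notch positions is ours, for illustration.

```python
import numpy as np

R_MIN, R_MAX = 16.75e3, 23.45e3        # ohms, from the MTJ stack above
TMR = (R_MAX - R_MIN) / R_MIN          # = 0.40, the 40% quoted above
G_MAX, G_MIN = 1 / R_MIN, 1 / R_MAX    # siemens
G_PARALLEL = (G_MAX + G_MIN) / 2       # parallel branch conductance

def g_synapse(mz: float) -> float:
    # Device conductance from the average magnetization, per equation (5).
    return (1 + mz) / 2 * G_MAX + (1 - mz) / 2 * G_MIN

# Five notch positions -> five (assumed evenly spaced) conductance states.
g_levels = np.linspace(G_MIN, G_MAX, 5)

# Effective signed weights after subtracting the parallel branch:
# symmetric about zero, enabling positive and negative values.
w_levels = g_levels - G_PARALLEL
print(w_levels * 1e6)  # in microsiemens
```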
To enable read and write operations, additional components such as the write and read word lines (WWL/RWL) and the source line (SL) are introduced (as shown in figure 6). It is worth noting that, for simplicity, the WWL for the parallel conductance is not shown in the figure. To perform column-sum read/write operations, the RWL or WWL is activated accordingly. When programming a device, the WWL is activated, and the SL and BL are adjusted to high or low levels based on the direction and number of current pulses. The WWL/RWL, SL, and BL are controlled and operated by external transistors, which function as switches regulating the current flow to these lines based on the required operations. These transistors play a pivotal role in overseeing the read and write processes of the synaptic devices, particularly in managing current pulses during the programming and inference phases.

Performance measures
We employ accuracy, precision, true positive rate (TPR), and F1 score to assess the performance of the autoencoder model. These performance measures can be expressed using four quantities: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). TP represents the number of correctly classified malicious samples, TN represents the number of correctly classified normal samples, FP represents the number of normal samples incorrectly classified as malicious samples, and FN represents the number of malicious samples incorrectly classified as normal samples.
Accuracy quantifies the overall correctness of the classification model and represents the ratio of correctly classified packets to the total number of packets. It is represented by equation (11):

$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$ (11)

Precision quantifies the proportion of correctly classified malicious samples out of all samples predicted as malicious. It is expressed as equation (12):

$\mathrm{Precision} = \frac{TP}{TP + FP}$ (12)

TPR measures the proportion of correctly classified malicious samples out of all malicious packets. It is expressed as equation (13):

$\mathrm{TPR} = \frac{TP}{TP + FN}$ (13)

The F1 score is the harmonic mean of precision and TPR, providing a balanced measure of both metrics. It is expressed as equation (14):

$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{TPR}}{\mathrm{Precision} + \mathrm{TPR}}$ (14)
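In code form, the four metrics amount to a few lines of NumPy (a straightforward sketch, using the label convention from the preprocessing section: 1 = malicious, 0 = normal):

```python
import numpy as np

def metrics(y_true: np.ndarray, y_pred: np.ndarray):
    # Confusion-matrix counts for the binary labels.
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = (tp + tn) / (tp + tn + fp + fn)      # equation (11)
    precision = tp / (tp + fp)                      # equation (12)
    tpr = tp / (tp + fn)                            # equation (13)
    f1 = 2 * precision * tpr / (precision + tpr)    # equation (14)
    return accuracy, precision, tpr, f1
```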

Results for quantized autoencoder
In this section, we compare the anomaly detection performance of the proposed quantized autoencoder (without DW synapses) with 2, 3, and 5-level weight quantization to a structurally identical autoencoder with floating-point precision (32-bit) weights. The performance evaluation is based on testing accuracy, precision, TPR, and F1 score. Figure 7 illustrates these performance metrics for the autoencoder with 2, 3, and 5-level quantized weights and with floating-point precision weights. Figure 7(a) demonstrates that employing only 2 quantization levels leads to random fluctuations in the resulting accuracy as training progresses through successive epochs. Thus, training with 2 quantization levels occasionally shows high accuracy over extended training cycles; however, predicting the number of epochs required to achieve high accuracy remains challenging since the accuracy does not converge over time. Similarly, training the autoencoder with 3 quantization levels yields a comparable outcome, showing random fluctuations in accuracy over extended training cycles, although the fluctuation pattern is less random than in the previous case. On the other hand, training the autoencoder with 5 quantization levels demonstrates a more deterministic accuracy progression. Notably, across all epochs, the accuracy for training with 5-level quantized weights is competitive with the accuracy for training with floating-point weights. Similar conclusions can be drawn for the remaining performance metrics illustrated in figures 7(b)-(d).
Based on the results illustrated in figure 7, it is evident that training with 5 quantization levels yields performance metrics competitive with those from training with floating-point weights, despite the significantly reduced number of bits and the reduced weight precision required by 5-level quantization. Quantization-aware training enables the autoencoder model to adapt to a lower-precision weight representation. While the use of a limited number of quantization levels is typically considered detrimental to training and testing performance, quantization-aware training can leverage quantization noise to achieve a significant improvement in performance metrics. In this case, quantization acts as a regularization operation that limits overfitting of the trained weights. By reducing the precision of numerical values in the network parameters, quantization reduces model complexity and prevents the network from memorizing outliers or noise in the training data. This reduction in precision introduces a controlled level of noise or approximation error, which helps smooth out decision boundaries and makes the network less sensitive to small variations in the input data. Additionally, the computational efficiency gained from quantization, such as faster inference and reduced memory requirements, indirectly contributes to regularization by reducing the risk of overfitting that can arise from longer training times or limited training data. Furthermore, the results demonstrate that, for this specific anomaly detection problem, 5-level quantization is optimal: it is the minimum level of quantization at which the performance metrics closely match those obtained with floating-point weights. A smaller number of quantization levels (2 or 3) would be more hardware efficient, but the resulting performance is not consistent. This motivates our study of a multi-state non-volatile synaptic memory that can maintain at least five different non-volatile resistance states.

Results for quantized DW-based autoencoder
In this section, we evaluate and compare the effectiveness of anomaly detection in three different configurations of autoencoders: one with quantized synapses (without DW device), another with quantized DW-based stochastic synapses, and the third with floating-point precision synapses.
Figure 8 illustrates the anomaly detection testing accuracy, precision, TPR, and F1 score for the different configurations of the autoencoder. From figure 8(a), it can be inferred that when using only five quantization levels (without the DW device), the quantized autoencoder achieves anomaly detection accuracy competitive with the autoencoder with floating-point precision weights. This performance can be attributed to quantization acting as a regularization operation, as explained in the previous section. However, the autoencoder with quantized DW-based stochastic synapses shows higher accuracy than both the autoencoder with quantized synapses (without DW device) and the autoencoder with floating-point precision synapses. For the autoencoder with DW-based synapses, the non-volatile synapses are designed using a specific hardware technology, the racetrack MTJ. The synapses can encode multiple non-volatile states, and the training process accounts for the characteristics of the device, such as noise and stochasticity. By introducing randomness during the training process, stochasticity serves as a regularization technique: it adds noise to the model and encourages exploration of different solutions, thus reducing the risk of overfitting to specific patterns in the training data [56]. The results obtained with the DW synapses show a higher anomaly detection accuracy (90.98%), surpassing even the floating-point accuracy (90.85%). Similar conclusions can be drawn for the remaining performance metrics illustrated in figures 8(b)-(d). The findings presented in figure 8 indicate that combining stochasticity with quantization further improves the performance of anomaly detection; this combination acts as a better regularization process. Moreover, the fact that the stochasticity arises from the inherent properties of the DW devices, rather than being added separately, makes it energy efficient, since generating random numbers in software can be energy inefficient. Thus, stochasticity inherent to nanoscale devices, which is detrimental to Boolean logic, is beneficial to hardware AI applications at no additional energy cost.

Total number of programmed weights
In this section, we conduct a comparison of the total number of programmed weights (weight updates) across different autoencoder synapse schemes. Figure 9(a) depicts the total programmed weights at each epoch instance against the number of training epochs for the 5-state quantized DW synapse-based autoencoder, the 2, 3, and 5-level quantized weight autoencoders (without DW device), and the floating-point weight-based autoencoder. Similarly, figure 9(b) illustrates the cumulative total programmed weights against the number of training epochs for the different autoencoder configurations. The data in figure 9 show a significant distinction: the proposed DW-based approach exhibits a remarkable reduction of at least three orders of magnitude in weight updates when compared to the floating-point approach. The 2, 3, and 5-level quantized weight autoencoders also demonstrate notably fewer weight updates than their floating-point counterpart. Among these three quantization approaches, the 5-level quantized weight autoencoder requires the fewest weight updates, and the weight updates are reduced further with the 5-state DW-based autoencoder. Moreover, it can be inferred from figure 9(a) that the inclusion of stochasticity in the 5-state DW-based autoencoder results in a diminishing number of weight updates at each epoch instance as the training epochs progress, in contrast to the pattern observed in the other autoencoder configurations. Consequently, the proposed DW device-based autoencoder demonstrates a higher degree of computational resource efficiency.
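The origin of this reduction can be made concrete: a DW device only needs SOT programming pulses when its quantized weight actually crosses a level boundary between updates, which most small gradient steps do not. The bookkeeping below is our illustration of this counting, not the authors' training loop.

```python
import numpy as np

def count_programmed(w_old_q: np.ndarray, w_new_q: np.ndarray) -> int:
    """Number of synapses whose quantized level changed in this update,
    i.e. the devices that must actually receive programming pulses."""
    return int(np.sum(w_old_q != w_new_q))

# With 5-level weights, most small gradient steps leave the quantized
# level unchanged, so this count stays far below the 8448 synapses --
# consistent with the >1000x fewer weight updates reported above.
```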

Energy dissipation
The energy consumption in DW synapses stems from the I²R losses induced by the SOT current pulses in the heavy metal layer. Assuming the heavy metal layer is composed of platinum (Pt) with a specific resistance of 100 Ω nm and dimensions of 560 × 60 × 5 nm³, the calculated resistance is approximately 186.67 Ω. The energy dissipated by a current pulse of 85 × 10⁹ A m⁻² applied for 0.5 ns in the heavy metal layer is then computed to be 0.06 fJ; hence, the energy expenditure for programming a synapse with a single current pulse is around 0.06 fJ. The highest testing accuracy is obtained from the autoencoder with 5-state DW synapses with a relatively low number of weight updates, considering a high noise tolerance margin during training. By incorporating a noise tolerance margin of α = 0.25, the overall count of weight updates reaches roughly 1.74 million after training for 9 epochs. Consequently, the energy cost for programming the DW synapses is approximately 1.56 fJ per training packet, averaged over the 67 343 normal packets of KDDTrain+.
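These figures follow from elementary arithmetic; the short sketch below recomputes them from the quoted geometry and pulse parameters (pure Python, no external data).

```python
# Heavy-metal (Pt) resistance: R = rho * L / (W * t)
rho = 100.0                    # specific resistance, ohm*nm
L, W, t = 560.0, 60.0, 5.0     # heavy-metal dimensions, nm
R = rho * L / (W * t)          # ~186.67 ohm (nm units cancel)

# Energy of one programming pulse: E = I^2 * R * duration
J = 85e9                       # current density, A/m^2
A = (W * 1e-9) * (t * 1e-9)    # cross-section, m^2
I = J * A                      # ~25.5 uA
E_pulse = I ** 2 * R * 0.5e-9  # ~0.06 fJ for a 0.5 ns pulse

# ~1.74 million updates over 9 epochs on 67 343 normal training packets
updates, packets = 1.74e6, 67343
E_per_packet = (updates / packets) * E_pulse  # ~1.56 fJ per packet
print(R, E_pulse, E_per_packet)
```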
Next, we study the energy requirements within both the analog and digital domains for the training and inference processes of the proposed network. In the analog domain, energy utilization is contingent on executing matrix-vector multiplication during both forward and backward propagation, as well as updating the weights of the DW devices. On the digital front, energy is expended on tasks such as computing neuron activations, determining error gradients, and accumulating gradients for subsequent weight updates. The analog computation phase incurs energy consumption across various stages. This includes the sequential input and output data transfer to and from the crossbar rows and columns, the conversion of input data to voltage pulses using pulse-width modulation (PWM), regulation of the column voltage to a specified value for a corresponding voltage drop across DW devices, reading the analog weighted sum in the crossbar arrays, and the analog-to-digital conversion (ADC) of the weighted sum before transmitting it to the crossbar for the implementation of subsequent layers in the autoencoder. We adhere to an 8-bit resolution for both PWM and ADC, as detailed in [26]. Given the parameters specified in [24, 26], the energy consumption during the analog computation phase is approximated at ∼0.32 nJ for forward propagation and ∼0.18 nJ for backward propagation per training packet. The energy calculation guidelines are outlined in [26]. For DW device weight updates, an average of ∼25 devices undergo updates during each training instance, resulting in an energy expenditure of ∼1.56 fJ per training instance, with a write energy of ∼0.06 fJ per update. Consequently, the cumulative energy in the analog computation phase is estimated to be ∼0.5 nJ per training instance.
For updating DW device weights, gradients are initially accumulated in a digital unit using 32-bit precision memory. However, when quantizing neuron activations (during forward propagation) and error gradients (during backward propagation) to 3 bits, a significantly reduced number of 32-bit memory accesses becomes feasible, owing to the diminished non-zero entries resulting from quantization, without compromising accuracy [26, 57]. Moreover, if memory access occurs for close to 1% of the total synapses (approximately 85 synapses, constituting 1% of the 8448 synapses), the energy expended for weight updates in the digital unit can be approximated at ∼2.64 nJ based on [26]. Considering that an analog read operation is preceded by a 4-bit PWM input signal (a 4-bit PWM is sufficient, given that the PWM output voltage resolution does not impact the reading of device conductance), followed by read voltage regulation and a subsequent ADC operation, the total read energy is estimated to be ∼0.21 nJ per training instance for reading the device conductances and parallel conductances. Assuming the comparator energy equals the ADC energy at ∼330 fJ, the combined energies for the quantizer, read (including comparator), and other operations are estimated at ∼3.51 nJ per training instance.
Furthermore, the energy dissipated in a digital unit for computing neuron activations (during forward propagation), error gradients (during backward propagation), and accurately addressing the analog DW devices for sending write pulses can be extrapolated from the application-specific integrated circuit design implemented in [26] using on-chip static random access memory. In their design, the energy consumption for the forward and backward passes in a digital unit is demonstrated to be ∼9 nJ and ∼3 nJ per training instance, respectively. Given that the number of synapses in [26] is approximately 23.5 times that of our network, we can estimate the corresponding energy consumption for our architecture at ∼0.38 nJ and ∼0.13 nJ. Consequently, the total energy consumption in the digital unit is projected to be ∼6.66 nJ per training instance. Combining the energies of the analog and digital units, the overall energy consumption is estimated to be ∼7.16 nJ per training instance and ∼0.7 nJ per inference instance (considering only the energies of forward propagation). Therefore, the energy consumption is comparable to state-of-the-art non-volatile technologies [24, 26] and would be significantly more efficient than traditional von Neumann schemes with purely CMOS devices [26]. As discussed, our algorithm also guarantees a substantially reduced frequency of weight programming, resulting in a minimal energy cost for training.

Conclusion
State-of-the-art autoencoder-based unsupervised anomaly detection methods have shown promising results in detecting network anomalies. However, implementing these methods on edge devices with limited hardware, computational resources, and energy has been a challenge. In this paper, we proposed a solution to this challenge by designing a quantized autoencoder with low-resolution non-volatile DW-based synaptic weights to detect anomalies efficiently on edge devices. We designed the synapses using a racetrack MTJ in which the synapses can encode multiple non-volatile states. The hardware-aware training performed on the 5-state quantized DW-based autoencoder yields higher anomaly detection performance compared to the floating-point weight autoencoder. Therefore, our proposed solution offers a promising avenue for implementing efficient anomaly detection methods on edge devices with limited hardware resources. This technology is particularly well suited for devices with size and power constraints, supporting applications in smart sensors, wearables, and IoT, where local processing and privacy are key considerations.
In the future, we would like to explore the compatibility of the proposed quantized DW-based autoencoder with diverse datasets and anomaly scenarios. Additionally, we are interested in integrating more advanced in-memory computing technologies into the autoencoder synapse design. Furthermore, we plan to investigate more complex networks, such as a transformer model designed with quantized DW-based stochastic synapses, to perform anomaly detection.

Figure 4. Schematic diagram of magnetic domain wall non-volatile synapses driven by SOT current pulses: (a) a sample of micromagnetic simulations showing pinned position of the domain wall, (b) configuration of DW device with 5 notches.

Figure 5. Probabilistic distribution of DW positions and DW counts after SOT current pulses are driven to move the DW to target notches from adjacent notch positions.

Figure 6. Implementation of the autoencoder in a crossbar architecture with non-volatile DW memory-based synapses.

Figure 9. (a) Comparison of the total programmed weights at each epoch instance vs. epoch, and (b) comparison of the cumulative total programmed weights vs. epoch for different autoencoder configurations. The proposed 5-state quantized DW synapse-based autoencoder shows significantly fewer weight updates during training compared to other autoencoder configurations.


Table 1. Features and data types of the NSL-KDD dataset.

Table 2. Parameters for the micromagnetic simulations [23].