ReARTSim: an ReRAM ARray Transient Simulator with GPU optimized runtime acceleration

The demand for computation driven by machine learning and deep learning applications has experienced exponential growth over the past five years (Sevilla et al 2022 2022 International Joint Conference on Neural Networks (IJCNN) (IEEE) pp 1-8), leading to a significant surge in computing hardware products. Meanwhile, this rapid increase has exacerbated the memory wall bottleneck within mainstream Von Neumann architectures (Hennessy and Patterson et al 2011 Computer architecture: a quantitative approach (Elsevier)). For instance, NVIDIA graphical processing units (GPUs) have gained nearly a 200x increase in fp32 computing power, transitioning from P100 to H100 in the last five years (NVIDIA Tesla P100 2023 (www.nvidia.com/en-us/data-center/tesla-p100/); NVIDIA H100 Tensor Core GPU 2023 (www.nvidia.com/en-us/data-center/h100/)), accompanied by a mere 8x scaling in memory bandwidth. Addressing the need to mitigate data movement challenges, process-in-memory designs, especially resistive random-access memory (ReRAM)-based solutions, have emerged as compelling candidates (Verma et al 2019 IEEE Solid-State Circuits Mag. 11 43–55; Sze et al 2017 Proc. IEEE 105 2295–329). However, this shift in hardware design poses distinct challenges at the design phase, given the limitations of existing hardware design tools. Popular design tools today can be used to characterize analog behavior via SPICE tools (PrimeSim HSPICE 2023 (www.synopsys.com/implementation-and-signoff/ams-simulation/primesim-hspice.html)), system and logical behavior using Verilog tools (VCS 2023 (www.synopsys.com/verification/simulation/vcs.html)), and mixed signal behavior through toolbox like CPPSIM (Meninger 2023 (www.cppsim.org/Tutorials/wideband_fracn_tutorial.pdf)). Nonetheless, the design of in-memory computing systems, especially those involving non-CMOS devices, presents a unique need for characterizing mixed-signal computing behavior across a large number of cells within a memory bank. This requirement falls beyond the scope of conventional design tools. In this paper, we bridge this gap by introducing the ReARTSim framework—a GPU-accelerated mixed-signal transient simulator for analyzing ReRAM crossbar array. This tool facilitates the characterization of analog circuit and device behavior on a large scale, while also providing enhanced simulation performance for complex algorithm analysis, sign-off, and verification.


Introduction
Resistive random-access memory (ReRAM), also known as memristor, has undergone significant evolution in semiconductor research over the past 20 years.It shows promise as a potential replacement for NAND flash in extensive storage or as embedded non-volatile memory for applications like the internet-of-things [1].Moreover, ReRAM's analog storage capability and advancements in in-memory computing have paved the way for emerging AI-driven applications, including large language models and neuromorphic computing.Recent demonstrations by companies like Panasonic [2] and CrossBar Inc [3] have showcased commercially viable ReRAM chips, utilizing 40 nm or more advanced CMOS technology.
Conventional ReRAM operates as a two-terminal device, featuring upper and lower electrodes that sandwich a dielectric layer capable of modulating its conductance through input voltage or current.A prevailing understanding attributes the device's resistive switching to the formation and rupture of a conductive filament (CF) within the dielectric switching material [4,5].This process involves CF creation during the forming stage under high current, followed by subsequent set and reset actions, triggered by reduced input current, which toggles the device between high resistance (HRS) and low resistance (LRS) states.Notably, the filamentary nature ensures switching current independence from device size, enhancing the device scalability [1].Based on the ions implicated in the CF, ReRAM diverges into two categories: metal oxide resistive memory (OxRAM) and conductive bridge resistive memory (CBRAM).In OxRAM, the CF is formed through the activation of oxygen vacancy (V o ) within the dielectric [6], while CBRAM involves the migration of metal cations (M + s) from the active electrode [7].CBRAM typically exhibits an elevated on/off ratio and lower programming currents, while OxRAM demonstrates superior endurance [7,8].
In addition to its role in memory applications by storing digital information as ReRAM's HRS and LRS, the ReRAM's unique analog storage ability-achieved by setting the device resistance to specific values between HRS and LRS-renders it suitable for building in-memory computing systems.Specifically, organizing ReRAM cells in a crossbar configuration, where every ReRAM cell occupies the intersection between a row (where wordline resides) and a column (where bitline resides), establishes a crossbar matrix in which the values correspond to the conductance of the respective ReRAM cell.The equivalent matrix-vector-multiplication (MVM) is realized by applying input voltages along rows and summing currents along columns.MVM is a key operation in diverse computational tasks like graph computation, image processing, and deep learning.Employing ReRAM crossbars for MVM computation eliminates data movement between memory and processor, reducing computation complexity to O(1) compared to the conventional O(n 2 ) time complexity.Compute-in-memory (CIM) hardware architectures based on ReRAM crossbars have been proposed for accelerating neural network inference [9][10][11].Furthermore, the ReRAM cell's second-order dynamics facilitate emulation of intricate synaptic behaviors akin to biological neurons, including spike-timing-dependent plasticity (STDP).Leveraging ReRAM crossbar setups, spiking neural networks (SNNs) capable of processing temporal spiking input information have been conceptualized [12,13].
Given the significant progress in ReRAM device and in-memory computing system development, it is crucial to accurately simulate ReRAM crossbar arrays before actual chip fabrication.For circuit-level simulation, one effective approach involves creating a ReRAM model that's compatible with SPICE software, and then integrating these models into modern SPICE simulators like LTSpice [14], NGSpice [15], and HSpice [16].These commercially available simulators are specifically designed for accurate circuit-level simulations.They excel at analyzing complex individual analog circuit components such as operational amplifiers and phase-locked loops, which are common in electronic systems.These simulators are well-equipped to perform direct current, alternating current, and quick changes (transient) simulations.By employing SPICE simulations, valuable insights into the behavior of ReRAM crossbars can be gained.
However, utilizing SPICE as the simulation framework for ReRAM crossbars presents certain challenges.Firstly, SPICE software heavily relies on the central processing unit (CPU) [17], and when tasked with simulating extensive parallel circuit structures like memory arrays, simulation speed becomes a bottleneck [18].Although some contemporary SPICE-based simulation tools, such as Xyce [19], have introduced support for parallel circuit simulation using multiple threads on multicore CPUs, the emergence of hardware acceleration platforms like graphical processing units (GPUs) or field-programmable gated arrays offers the potential for heightened parallelism in ReRAM crossbar circuit-level simulations.Notably, research has explored employing GPUs for repeated analog circuit simulations [20].Secondly, the SPICE software's core reliance on SPICE model scripts and circuit netlists may impede its utilization as a subroutine within other software systems.To facilitate seamless integration with contemporary high-level computational software libraries, enhancing the portability of circuit simulation software and providing a user-friendly suite of application programming interface (API) calls is desirable.
In addition to SPICE-based simulations, non-SPICE-based ReRAM simulators tailored specifically for deep learning (DL) systems have gained popularity in recent years [21][22][23][24].These simulators concentrate on capturing the intricacies of ReRAM cells during both deep learning network (DNN) training and inference stages.They serve to assess model behavior, such as inference accuracy and throughput, employing ReRAM crossbars and their peripheral circuits as the underlying computational hardware.These simulators are very well integrated with leading commercial deep learning frameworks like PyTorch [25], enabling the simulation of complex DNN models with minimal adaptations by software developers.However, these simulation frameworks primarily emphasize computations and lack the comprehensive implementation of circuit-level components present in SPICE-based simulators, thus constraining their transient simulation capabilities.Meanwhile, due to their integration with DNN frameworks, extending their application to other ReRAM application scenarios is not a straightforward process.
In this work, we have created the ReARTSim framework, a circuit-level transient simulator designed to model computational algorithms performed on ReRAM crossbars, augmented with GPU runtime acceleration (figure 1).Our simulator harnesses the strengths of both SPICE-based and non-SPICE-based simulation frameworks.Similar to SPICE-based simulators, it demonstrates the capability to conduct transient-level circuit simulations, and the simulation results with timestamps can be saved and further analyzed.The simulation library encompasses various ReRAM models, and adding new custom models based on user requirements is straightforward.The simulation framework incorporates non-ideal factors like device variations and peripheral circuit effects, including trace resistance and capacitance.Analogous to non-SPICE-based simulators, this simulation framework can integrate with contemporary high-level computational libraries like PyTorch for neural network (NN) computations.This integration allows us to estimate algorithm performance when executed on ReRAM crossbars.Furthermore, the framework can be extended to other application scenarios with notable programmability.Implemented in C++ on the CPU platform, our framework supports GPU acceleration through the CUDA library to enhance simulation speed [26].In the following sections, section 2 describes the software system architecture, section 3 presents testing outcomes involving diverse high-level computation tasks, and section 4 summarizes the article and identifies the future research directions.

Device and circuit model
The computational models of ReRAM have evolved along with the advancement of ReRAM devices.A shared feature among these models is the utilization of a state variable, denoted as w(t), to represent the resistance of the device at a given time t.This state variable, with a range between 0.0 and 1.0, serves as a proxy for the doping layer thickness within the ReRAM cell.This thickness dynamically changes in response to applied voltage or current.A higher value of w(t) represents a greater doping thickness, indicating a lower resistance, whereas a smaller w(t) value corresponds to higher resistance [27].The static I-V relation involving the applied voltage, ReRAM cell current, and the state variable can either be linear: in which R MEM is an equivalent ReRAM resistance, R on is the on-state resistance (minimum) and R off denotes the off-state resistance (maximum).Alternatively, this relation could be non-linear, as seen in the model described in [28].
Throughout the evolution of ReRAM modeling, two sets of models have been proposed to describe the behavior of the state variable w under applied voltage and current.The first set of models characterizes the rate of change for w in proportional to cell current: in which α is a scaling coefficient, f(w) denotes a window function, and i is the ReRAM cell current.The window function f(w) sets rate of state variable change to zero at boundary conditions, when w approaches

Model Model equation I-V relation Model parameters
Cell A [30] 0 or 1.Another set of models incorporates a programming threshold voltage, observed in ReRAM switching behavior as reported in [29].This set of model uses the equation as: in which α and f(w) retain their meanings from equation (2.2), while g(V) is a nonlinear function of applied voltage V, capable of simulating the threshold voltage effect.Various g(V) models have been proposed to emulate the I-V curves of ReRAM devices based on experimental data.In this study, we have chosen four example models from literatures (referred to as Cell A to Cell D) that fall within the above two sets.These models have been implemented within ReARTSim framework's device library, and detailed information is provided in table 1.For the models exhibiting a linear I-V relationship, we have selected a fixed R on value of 100 Ω and R off value of 10 000 Ω (resulting in R off /R on = 100, which is a reasonable ratio) to simplify the MVM weight mapping.The remaining device parameters have been chosen based on values reported in relevant literature or assigned approximate values for the purpose of this study.Note that for Cell D, since w represents undoped layer thickness in the original literature [32], we have made slight modifications to both the original equations and signs of parameters, to ensure consistency in the definition of w with respect to this work.It is important to note that these models have been utilized solely for demonstrating the system's ReRAM crossbar transient simulation capabilities.Users have the flexibility to effortlessly modify these model parameters or incorporate additional device models that align more closely with their specific requirements into the framework.
We simulate the time-stamped transient I-V response of the models using the Euler method: a fundamental numerical method to solve differential equations.In this method, the simulation timestep size, denoted as h, can have the unit of nanoseconds (ns), microseconds (µs), or milliseconds (ms).The variable w(t 0 ) denotes the known state variable at time t 0 , and w(t 0 + h) represents the state variable at the subsequent timestamp t 0 + h.By utilizing the calculated dw dt (t 0 ) obtained from equation (2.2) or equation (2.3) and substituting it into equation (2.4), the time-stamped function w(t) and consequently the transient I-V curve can be derived.More advanced numerical methods for solving differential equations, such as the Runge-Kutta method, can be implemented in the future to enhance simulation accuracy [33].In the example simulations, we applied triangular input voltage waves with both positive and negative amplitudes.Figures 2(a  the transition in cell current under the applied voltage V, with arrowheads indicating the rising and falling slopes of the applied triangular voltage.The triangular pulse counts are labeled for these plots.The simulated curves align with those reported in the initial paper [28,31,32].As further detailed in section 2.2, we leveraged the Euler method along with the device model outlined in table 1 to conduct comprehensive transient simulations of the ReRAM crossbar. As introduced in the earlier sections, an array of ReRAM cells can be utilized to construct a ReRAM crossbar, and used for performing MVM computation.Figure 3 provides an illustration of a typical ReRAM crossbar structure.Each individual ReRAM device possesses a conductance value denoted as G rc and is linked to a wordline in its corresponding row and a bitline in its column.Upon applying a voltage V r to the wordline in row r, the current in the bitline at column c will either be V r × G rc if the column is activated, or 0 if the column is deactivated.Due to finite values of R on and R off , the conductance G rc is confined within the range [1/R off , 1/R on ].In our ReRAM model, where we have chosen fixed R on and R off values (100 Ω and 10 000 Ω), the achievable conductance range within the crossbar spans from 0.0001 to 0.01 S. For standardization, we have chosen a conductance value of 0.001 S as our base unit, with this value corresponding to a matrix value of 1.0.Consequently, valid matrix values achievable by the ReRAM crossbar span from 0.1 to 10, covering In the process of MVM computation, the input vector is converted from digital to analog voltages in the unit of volt (V) using a digital-to-analog converter (DAC), subsequently driving the respective wordlines.For instance, an input vector of [1, 1.5] will be transformed into a 1 V pulse on the first row and a 1.5 V pulse on the second row, each with a specific duration.During the MVM computation, all columns are activated simultaneously, and within each bitline, the currents from all ReRAM devices in the same column are summed.The bitline currents are sampled by a sense amplifier (SA) and analog-to-digital converter, transforming them into digital outputs that constitute the output vector.As the base unit of conductance is set to 0.001 S, the bitline current value, measured in mA, is directly utilized as the output vector element.Continuing from the previous example, the output current in the first column is computed as 1 × 0.0002 + 1.5 × 0.0003 = 0.00065 A, which translates to 0.65 mA.This value corresponds to the first element of the MVM output vector, aligning with the direct MVM computation result.Notably, parasitic trace resistance and capacitance are present within the crossbar, exerting an impact on the computation outcome, particularly at high speeds.Streaming data can be fed into the input processing logic and results are retrieved at the output data processing logic, which are coordinated and controlled by the central control logic.
Several non-idealities could exist in ReRAM crossbar arrays.It is established that ReRAM switching behavior is subject to variations from device-to-device and cycle-to-cycle [34,35].Device-to-device variation is typically influenced by non-ideal fabrication processes [36], while cycle-to-cycle variation may arise due to stochastic effects during filament formation in ReRAM resistive switching [37].Such variations can potentially lead to incorrect computation results in MVM.Proposed training methods, such as those outlined in [38,39], seek to mitigate these issues by considering device variations during ReRAM crossbar network training.Meanwhile, similar to any resistive component, thermal noise can contribute to output bitline current.As a result, it is important for the software to possess the capability to conduct simulations while accounting for these non-idealities.

Software architecture
The ReARTSim simulation framework is implemented using the C++ programming language, featuring a class-based structure depicted in figure 4(a).The Cell class serves to describe the behavioral attributes of the ReRAM cell.Within this class, the ReRAM cell's model, model parameters, state variable w, and time-stamped voltage, current, and power are included.It also includes values for parasitic trace resistance and capacitance, essential for high-speed simulation.The initiation of time-stamped voltage originates from user-defined inputs, and one approach is through the use of the Instruction class, detailed below.Subsequently, by executing the runSimEuler() method, the temporal evolution of w is computed in accordance with equations (2.2)-(2.4),simultaneously updating the values of current and power.
The Instruction class is utilized to generate a set of user-specific instructions, providing guidance on supplying time-stamped voltage input to individual ReRAM cells.These instructions cover a variety of operational scenarios, spanning basic cell read/write operations as well as more intricate MVM instructions.In the case of the read/write instruction, which targets a single cell, voltage is applied to a specific wordline for a specified duration, and the resulting current is collected from the corresponding activated bitline.Conversely, the MVM instruction involves applying voltage across all wordlines based on the input vector for a specified duration, followed by the collection of currents from all bitlines.The duration of applied voltage in read/write or MVM instruction is a variable specified by user.
The Array class functions as the foundational component that defines a ReRAM crossbar array.It incorporates a 2D vector containing Cell objects that represent ReRAM cells within the crossbar array.Additionally, it includes an array of Instruction objects, facilitating user input.Time-stamped data, including voltage for each row (VvDrives), the active status for each column (VvActives), current across columns (IvSenses), and the aggregate crossbar power (PvSys) are also stored within this class.Simulation parameters, including the timestamp unit (ns, µs, or ms), as well as the total count of timestamps in the simulation, can be configured in the Array class.For simulations requiring an extensive timestamp count, the software optimizes CPU memory usage by segmenting the simulation into multiple cycles.Within its runSimEuler() method (equally named at both array and cell levels), the Array object initiates by processing the instruction list in the current simulation cycle, subsequently updating the VvDrives and VvActives vectors.Following this, each individual cell invokes its respective runSimEuler() methods to simulate cell responses within the CPU-based workflow.Upon completion of cell simulations, the Array class computes the IvSenses and PvSys vectors by summing values from individual cells.
The Matrix class is developed to facilitate the initialization of ReRAM cells for subsequent use in MVM operations.This class contains a 2D vector that stores the original weight matrix values, which indicate the desired conductance for each ReRAM cell in the crossbar.In the Matrix class, a MatrixHelper object is employed to assist in generating write instructions for initializing the conductance of ReRAM cells.Within the MatrixHelper object, a pre-calculated data structure, gmap, is constructed in advance for each ReRAM cell model.This gmap maps a finite number of ReRAM conductance values to specific write instructions with fixed applied voltage levels and durations.To set the ReRAM cell to its target conductance value, the Matrix object queries the MatrixHelper to search the pre-calculated gmap, determining the optimal write instruction.In our framework, we have implemented a basic weight initialization algorithm wherein cells are individually programmed, which is briefly discussed in section 3.1.In this manner, for an r × c ReRAM array, the Matrix object generates a total of r × c write instructions.More efficient crossbar initialization algorithms, as described in [40], can also be incorporated into the framework.
To facilitate the need for verifying MVM results, the Array class includes functions capable of sampling the bitline current for each MVM operation.Subsequently, these functions generate an output vector using the bitline current values measured in mA, which are then compared with reference values computed directly through mathematical equation.The sampling takes place at a user-defined delay after the initiation of each MVM instruction.Additionally, a ConvKernel class has been developed to support 2D matrix convolution operations on the ReRAM crossbar, detailed further in section 3.2.In brief, the 2D convolution is translated into a sequence of MVM operations, utilizing a fixed weight matrix that has been programmed in the crossbar.The Array object then samples the bitline current and constructs convolutional output for this sequence of MVM operations, facilitating a comparison with the computed reference.
While the primary focus of the framework is on cell-level transient simulation, it also offers performance and power benchmarks, similar to non-SPICE-based ReRAM simulators.Following the procedures outlined in [22], the array object can report performance in giga floating-point operations per second (GFLOPS), energy and power in nJ and W, and energy efficiency in GFLOPS/W after completing each batch of MVM simulations.Users can leverage these metrics to analyze system power efficiency and determine the peak MVM performance for each array and cell model before any inference accuracy drop occurs.A more detailed example is provided in section 3.3, the AI inference section.
To account for device-to-device variations, the software possesses the capability to import pre-defined device parameters into each ReRAM Cell object.Furthermore, APIs have been provided to enable users to configure ReRAM cell parameters with specific trends of change, such as linear changes along rows or columns, aiming to mirror actual non-idealities encountered during the fabrication process.The software also permits users to set the noise level across bitlines within the simulation.These features facilitate the verification of training methods proposed in [38,39] for accommodating device variations, and [41] for enhancing noise tolerance, utilizing the framework.
The flowchart for a typical MVM computation program performed in this framework is depicted in figure 4(b).Upon simulation initiation, users begin by utilizing APIs to import both the weight matrix and a list of input vectors.A pre-processing step is executed on the imported weight matrix to constrain the matrix values within the range of 0.1 and 10, suitable for mapping onto the ReRAM crossbar.To initialize ReRAM cells, a Matrix object based on the imported weight matrix is declared first.Then APIs from the Matrix class are utilized to generate a specific write instruction for each weight matrix element, facilitating the setting of cell conductance to the desired value.Additionally, APIs have been developed to generate an MVM instruction for each input vector, where the voltage on each wordline aligns with the corresponding input vector values.The duration of the MVM instruction can also be specified through the APIs.Users then instantiate an Array object, specifying the ReRAM cell type and optionally incorporating ReRAM cell parameters.Throughout the simulation phase, users can select from two workflow choices: the CPU-based workflow and the GPU-based workflow.The CPU-based workflow, devoid of the need for specialized hardware support, conducts transient simulations one cell after another, in line with previous descriptions.The GPU-based workflow, that will be described in detail in section 2.3, requires GPU hardware installed to accelerate computation.In the GPU-based workflow, the CPU loads the simulation data-including cell parameters and timestamped row voltages-onto the GPU global memory, then waits for GPU-driven crossbar simulations.Subsequent to simulation, the CPU retrieves the data sent back from GPU.The CPU then undertakes data processing, perform tasks including bitline sampling to derive MVM outputs, and subsequent comparison against reference values.

GPU acceleration
The GPU, initially designed for display image processing, has gained attraction in various computational fields due to its architecture featuring numerous parallel processing cores [42].In the context of large ReRAM crossbar arrays, the computation of transient responses for a total of r × c cells (where r represents row count and c denotes column count) through a single-threaded program can prove time-consuming.While multithreaded CPU programming might enhance performance, scalability is restricted by the number of available CPU cores.To address this limitation, we have developed GPU kernel software utilizing the Nvidia CUDA library [26] to accelerate ReRAM simulations.Utilizing the common structure of each ReRAM cell in the crossbar, the GPU emerges as a robust hardware choice for simulating crossbar arrays.Its capability to handle programming patterns involving thousands of threads performing similar tasks reinforces its suitability for this purpose.
Figure 5 illustrates the architecture of the GPU acceleration software deployed in the GPU-based workflow.Analogous to the CPU-based workflow, the entire simulation duration is partitioned into multiple simulation cycles.Within each cycle, the CPU initiates the process by requesting the GPU to allocate global memory for all cells, which is used to store cell parameters, state variable w, time-stamped voltage, and current.These values are then loaded onto the GPU's global memory.Subsequently, a total of r × c computation kernels are instantiated on the GPU.These kernels, akin to the runSimEuler() methods in CPU-based workflow, calculate the transient response for each of the r × c cells.After these computation kernels are completed and synchronized, reduction kernels are used to aggregate current across each bitline and overall array power.With the simulation time in one cycle partitioned into t segments, a total of t reduction kernels are launched, with each reduction kernel computing current and power for one such segment.After the completion and synchronization of the reduction kernels, the GPU forwards the computed transient data to the CPU for post-processing, awaiting the next simulation cycle computation.To further reduce simulation time, user can opt to stop the retrieval of individual cell simulation result from the GPU.When this option is activated, only bitline current and total power are transferred back to CPU memory post-simulation.which cuts down the data transfer time.Given that modern GPUs can accommodate thousands of threads and blocks, substantial parallelization of these kernels can be achieved [43,44].

System performance evaluation
To assess the transient simulation capabilities of the ReARTSim framework, we executed illustrative tasks including matrix weight initialization and MVM computations.The operation schematics for these tasks are depicted in figures 6(a) and (b), respectively.In the matrix weight initialization task, a 4 × 4 ReRAM Array object with ReRAM cell type C was declared, chosen for its simplicity and inclusion of threshold programming voltage.It was slightly adjusted to incorporate a linear I-V relation, resulting in a fixed resistance under varying applied voltages.This adjustment was made to demonstrate the framework's MVM capabilities.Additionally, a 4 × 4 Matrix object was declared, representing the target weight matrix to be programmed onto the ReRAM array.The matrix element values were constrained within positive ranges, allowing their conversion to valid ReRAM cell conductance.Following this, the Matrix object generated a series of write instructions to initialize ReRAM cells.As described earlier, we implemented a basic weight initialization algorithm, programming each cell individually.The operation is illustrated in figure 6(a), where the voltage on each wordline was applied until all cells in the present row were initialized, and then the voltage was shifted to the next wordline.Bitlines were activated in a round-robin fashion, allowing cells from different columns to be programmed consecutively.The duration that cells in row m were being programmed was labeled as r m , and this duration is marked in figure 6(a).
The simulation was then executed until all ReRAM cells within the crossbar were initialized to the desired target weight.Figure 7(a) presents the transient current across the four bitlines, while figure 7(b) displays the transient power of the array.As seen from the figures, the time scale for the cell weight initialization task is relatively lengthy (measured in seconds).This extended duration results from the time required to alter the cell's state variable w, which is in line with the ReRAM cell model.The figures show noticeable peaks in current and power (∼1.5 s).Additionally, the user API supports the printing of the initialized conductance in comparison to the reference matrix.This functionality facilitates the evaluation of the accuracy of the matrix weight initialization process.
In the MVM computation task, we imported multiple input vectors of length 4, declaring MVM instructions for these vectors, and performed transient simulations on the Array object.As illustrated in figure 6(b), during each MVM operation, the voltages converted from the input vector were simultaneously applied to all wordlines, and all bitlines were activated concurrently.The current from each bitline was then Y Sui et al  amplified by SA and then subsequently sampled by DAC to form the output vector.Because MVM computation can be significantly faster compared with weight initialization task, since the state variable w change is not involved, a shorter simulation timescale (ns) was used for plotting the transient response.Additionally, parasitic resistors and capacitors were enabled to enhance the precision of modeling.MVM result computed directly.The minor discrepancy between the simulated MVM and the golden result is due to the deviations during the initialization of the weight matrix.Other factors, including device variations, voltage and current noise, and sampling time, may also contribute to minor computational discrepancies and can be assessed through the console output.
To assess the performance enhancement achieved through the transition from CPU-based to GPU-based workflows, we conducted transient simulations for MVM tasks across various ReRAM crossbar array sizes using both CPU and GPU platform.We measured the computation time for each task, and compared the values between two workflows.Figure 8 shows the MVM computation time for array sizes of 8 × 8, 16 × 16, 32 × 32, 64 × 64, and 128 × 128, all including a total of 10 000 timestamps, for both CPU and GPU platform (64 × 64 and 128 × 128 datapoints are indicated in the text rather than plotted).The findings reveal that GPU computation outperforms CPU for larger crossbar sizes-specifically, arrays larger than 16 × 16.At array sizes of 64 × 64 and 128 × 128, GPU performance is approximately 50 times faster than CPU.This disparity can be explained that when array size is small, the CPU is faster due to absence of the GPU kernel launch overhead, and relatively limited parallelism is achievable by GPU kernels.As the number of cells running in parallel increases, the kernel launch overhead becomes insignificant, and GPU kernels leverage higher parallelism [42,45].Still, the performance is constrained to a certain limit, which is likely due to GPU memory bound since only global memory is employed in the GPU acceleration kernel [46].Further optimization strategies, such as utilizing shared memory and wrapped synchronization, could be adopted to refine performance [42,47].

AI inference
To exemplify the simulation capabilities of the ReARTSim framework for practical applications on the ReRAM crossbar platform, we selected sample NN architectures.We integrated high-level APIs to handle the neural network structures and inputs, and deployed part of their computations onto our simulation framework.Figure 9(a) shows two representative NN architectures: LeNet [48] and Alexnet [49], that were simulated in our framework.Both architectures feature common computation layers: convolution layer (CONV) and fully-connected layer (FC).The FC layer includes MVM with non-linear activation for the output vector, and the MVM is a direct match to the MVM instruction in our framework.However, post-training, matrix weight values typically span from −1.0 to 1.0, while our ReRAM crossbar valid weight lies between 0.1 and 10.As a result, for MVM in the FC layer, we employed a differential-matrix method to handle weight values beyond the intended range.Two auxiliary matrices-each containing positive and negative values extracted from the original matrix-were independently constructed.A uniform offset was then applied to both matrices, elevating all weight values above 0.1 to address elements within the 0-0.1 absolute value range.After MVM computation, the output vector from the negative matrix was subtracted from the positive matrix output vector to yield the final MVM output.
For the CONV layer, we employed the technique outlined in [50] to transform the convolution computation into a series of MVM operations.The conceptual process is depicted in figure 9(b).The computation kernel, characterized by a 4D dimension of C out × C in × h × w (where C out is the number of output channels, C in is the number of input channels, and w and h are the 2D convolution kernel size), was flattened into a 2D vector of dimensions (C in × h × w) × C out .Similar to the convolution operation, an input vector of size C in × h × w was retrieved from the input data at each convolution step, with a sliding window traversing along h and w directions for subsequent convolution steps.This input vector was multiplied with the flattened matrix, equivalent to the element-wise convolution operation.The resulting output vector from the MVM operation was used to update the convolution output at a specific location (h i , w j ) across all C out output channels in the convolution output layer.In our framework, we have introduced a ConvKernel class responsible for holding the convolution kernel.This class features a flatten() method to transform the 4D convolution kernel into a 2D matrix, and can help to compute the convolution reference result.Padding zeros and variable stride lengths are supported in the framework.The Array class incorporates methods to store the sampled MVM outputs from the convolution steps, and updates the outcomes when a convolution operation concludes.Similarly, the previously mentioned differential matrix approach used in MVM was employed for convolutions.
Finally, we integrated our framework with the PyTorch API and conducted complete neural network inference.Initially, we trained the LeNet model using the CIFAR-10 dataset and utilized the built-in pre-trained model for AlexNet.We then input the testing data into both LeNet and AlexNet models and analyzed the inference result.During the inference process, the first two convolutional layers from both models were executed within our ReRAM crossbar simulation framework (highlighted by the red dashed rectangular in figure 9(a)) and the remaining layers were computed using PyTorch.The performance and power benchmarks of the first CONV layer for both NN network architectures, including total simulation time in ns, total number of floating-point (FP) operations, performance measured in GFLOPS, overall energy and power quantified in nJ and W, and energy efficiency measured in GFLOPS/W, are shown in figure 9(c).Similar performance benchmarks for the other NN layers can be presented in a comparable manner.Figure 9(d) shows the inference accuracy under different NN throughput of the first CONV layer in both NN networks, which reflects different MVM computation speed.At higher MVM speed, the first-order RC response is more apparent, and the sampling is more likely to occur before the bitline current stabilizes, leading to a drop in inference accuracy.Figure 9(e) shows the inference accuracy (top-1 error for LeNet, top-5 error for AlexNet) under varying noise levels, while figure 9(f) shows the inference accuracy with different levels of device variation.Since we trained the NN architecture in the conventional approach, a decline in inference accuracy is expected at higher noise and device variation level.This serves to demonstrate the functionality of our simulator.

PCA
In addition to its application in neural network inference, our framework extends its utility through the development of APIs capable of simulating other computation tasks that can be effectively mapped onto the ReRAM crossbar network.Notably, we implemented a simulation of principal component analysis (PCA), a statistical technique used for dimensionality reduction in input datasets based on computed feature vectors.Traditionally, PCA involves computationally intensive processes such as computing the covariance matrix and solving for eigenvalues and eigenvectors of that covariance matrix, with costs increasing quadratically with input dimensions [51].Alternatively, we employed Sanger's Rule, also known as the Generalized Hebbian Algorithm [52], to compute principal components.The algorithm suggests that for an input data dimension of n, we initialize an m × n matrix with random values.Weight updates for each input data point are calculated using the equation: in which α is the learning rate, x j is the input at index j, y i is the output at index i, and w ij represents the weight in row i and column j of the matrix.Initially, vector y is computed by multiplying input data x with the matrix.Subsequent steps involve using y as input and transposing the matrix for further computation ∑ i k=1 w kj y k .Through iterative processing of input data, the m rows of the initial m × n matrix progressively approach convergence with the first m principal components.
The generalized Hebbian algorithm has been successfully implemented in a ReRAM crossbar, as demonstrated in [53].Figure 10(a) demonstrates the custom API flow adopted within this software framework.The initial m × n matrix is mapped onto the crossbar with n rows and m columns, following the previously described MVM mapping routine.During the initial timestamp, the input vector x is applied across the rows, and the resulting current across the columns is sampled to compute the output vector y.This y vector is stored, then with all elements marked as zero except for the first element, y[0].In the subsequent m timestamps, the y vector, which is progressively updated, is applied across the columns, transitioning one value at a time from zero to the previously stored value.The transition occurs incrementally, start with y[1], followed by y [2], and so on, in accordance with the term ∑ i k=1 w kj y k defined in equation (3.1).Current sampling takes place across the rows for each timestamp.In the Instruction class, a backward MVM instruction is introduced to facilitate applying the input voltage across the columns and sampling output current at each row within the crossbar.This PCA concept is effectively illustrated using 2D Gaussian data.In figure 10(b), the sampled current across columns and rows during a single computation step is depicted, while figure 10(c) illustrates the computed principal components using the framework.

Conclusion and future work
In conclusion, we have developed ReARTSim framework, a C++ based circuit-level transient simulator for ReRAM crossbar, and demonstrated the framework applications.This simulation framework possesses the capability to perform transient simulations on the ReRAM devices and their associated peripheral circuits, using the ReRAM cell behavioral differential equations and fundamental circuit laws.The utilization of GPU acceleration enhances simulation speed, particularly for large-sized ReRAM crossbar arrays.We have designed a set of user-friendly APIs that empower users to define high-level computational tasks directly.The software maps these tasks onto the simulation framework, thereby evaluating their performance when executed on ReRAM crossbar configurations.We have successfully demonstrated the simulation framework's effectiveness through a range of tasks, including matrix multiplication up to 4096 × 4096, inference for multiple neural network structures, and PCA.Remarkably, these demonstrations have yielded over 50x acceleration on the GPU platform.The simulation framework facilitates exploration and analysis of ReRAM-based crossbar systems with fast simulation speed.
In terms of future work, one avenue involves the incorporation of event-based simulation into the software framework.The event-based simulation method, which is commonly employed in digital circuit simulations [54], involves modeling system state changes as discrete events and assumes that there are no system state changes between these events.Given that the state of ReRAM cells can remain static for extended periods when the input voltage remains constant and within threshold values, integrating event-based simulation could cut unnecessary computations during these intervals without compromising simulation accuracy.
Another avenue of development involves expanding the framework's capabilities to encompass more intricate ReRAM models capable of emulating synaptic behaviors like STDP [12,13].By incorporating these advanced ReRAM models, the software package could extend its transient simulation capabilities to accommodate more biologically realistic neural networks deployed on ReRAM crossbars, including SNNs.The simulation of SNNs could also benefit from the event-based simulation method.Given that SNN inputs appear as discrete voltage pulses at specific time points, applying event-based simulation could significantly reduce the overall simulation time by disregarding system state changes between input pulses.Furthermore, there is potential for expanding the simulation capabilities of the framework beyond ReRAM array-level simulations to full CIM architecture simulations.As mentioned in introduction, CIM architectures, leveraging multiple duplicates of ReRAM crossbars, have been proposed for both neural network (NN) inference and training [9][10][11].Enabling transient simulation at the full CIM architecture level requires synchronous simulation of multiple ReRAM arrays.These arrays may be organized in a pipelined fashion to optimize NN inference throughput.Additionally, the framework should be capable of simulating communication between ReRAM arrays, facilitated by additional digital components like first-in-first-out buffers.The allocation of these simulation tasks between CPU and GPU presents an open-ended question.
Lastly, the current software package, along with its capabilities for higher-level computation tasks, has been developed through C++ APIs.Users are required to manually code the APIs to define these computation tasks within our framework.In the domains of machine learning and scientific computing, well-established frameworks like PyTorch [25] and Matlab [55] hold a high degree of maturity and boast thriving user communities.A potential avenue for future development involves the creation of wrapper functions to automatically map target computation tasks from these well-established frameworks onto CIM Y Sui et al architectures or ReRAM crossbars, which can then be simulated by our framework.By doing so, developers utilizing these frameworks could readily incorporate our software package to simulate targeted computations on ReRAM crossbars.This endeavor would facilitate the utilization of the software we have developed within the context of these widely-used environments.
), (c) and (e) depict the simulated time-stamped voltage and current responses of Cell B, C and D. Figures 2(b), (d) and (f), offering an alternative perspective on figures 2(a), (c) and (e), illustrate

Figure 2 .
Figure 2. (a) Simulated Cell B current under applied triangular voltage.(b) Simulated Cell B I-V curve.Labels show pulse count in figure (a).(c) Simulated Cell C current under applied triangular voltage.(d) Simulated Cell C I-V curve.Labels show pulse count in figure (c).(e) Simulated Cell D current under applied triangular voltage.(f) Simulated Cell D I-V curve.Labels show pulse count in figure (e).

Figure 4 .
Figure 4. (a) The ReARTSim framework software class definition implemented in C++.(b) The flowchart of MVM performed in ReARTSim simulator.

Figure 7 (
c) shows the transient current in the four bitlines.The figure shows the effect of the first-order RC response due to the trace resistance and capacitance, with sampling conducted when all bitline currents are stable.Figure 7(d) presents the console output of the simulated array's MVM result, along with the reference

Figure 8 .
Figure 8. MVM simulation time comparison for CPU-based and GPU-based workflow.

Figure 9 .
Figure 9. (a) NN architectures used for simulation in ReARTSim framework.(b) Convolution kernel computation flow using MVM in ReRAM crossbar.(c) Simulation performance and power benchmark summary.(d) Simulated NN inference accuracy under various first CONV layer throughput.(e) Simulated NN inference accuracy under various noise level.(f) Simulated NN inference accuracy under various device variation level.

Figure 10 .
Figure 10.(a) PCA computation flow using Sanger's rule in ReRAM crossbar.(b) Sampled bitline and wordline current in PCA computation flow.(c) Computed PCA for Gaussian input data using ReARTSim framework.

Table 1 .
The simulated ReRAM model in this work.