Integration of Ag-CBRAM crossbars and Mott ReLU neurons for efficient implementation of deep neural networks in hardware

In-memory computing with emerging non-volatile memory devices (eNVMs) has shown promising results in accelerating matrix-vector multiplications. However, activation function calculations are still implemented with general-purpose processors or large and complex neuron peripheral circuits. Here, we present the integration of Ag-based conductive bridge random access memory (Ag-CBRAM) crossbar arrays with Mott rectified linear unit (ReLU) activation neurons for scalable, energy- and area-efficient hardware (HW) implementation of deep neural networks. We develop Ag-CBRAM devices that achieve a high ON/OFF ratio and multi-level programmability. Compact and energy-efficient Mott ReLU neuron devices implementing the ReLU activation function are directly connected to the columns of Ag-CBRAM crossbars to compute the output from the weighted-sum current. We implement convolution filters and activations for VGG-16 using our integrated HW and demonstrate the successful generation of feature maps for CIFAR-10 images in HW. Our approach paves a new way toward building a highly compact and energy-efficient eNVM-based in-memory computing system.


Introduction
Deep neural networks (DNNs) have been widely successful in solving difficult problems in computer vision, speech recognition, machine translation, playing board and video games, and medical diagnosis, and have constantly made breakthroughs in improving state-of-the-art computational accuracy [1]. Large-scale DNNs require a very large number of matrix-vector multiplication (MVM) operations in each layer, followed by non-linear neuron activations between the layers (figure 1(a)). By introducing a non-linear transformation of the input, the activation function plays an important role in making the network capable of learning complex representations and performing more sophisticated tasks [2]. Although in-memory computing with emerging non-volatile memory (eNVM) arrays considerably accelerates the computation of MVMs [3, 4], current approaches still require external processors or complex peripheral circuits to implement neuron activations. Analogue-to-digital converters (ADCs) are typically used to compute activation functions and propagate data through eNVM layers. However, it has been shown that 9-bit successive approximation register ADCs (SAR-ADCs) dissipate roughly 1 W, compared to 0.3 W dissipated by a 4096 × 4096 eNVM array for MVM operations [5]. The energy and latency overheads associated with the separate implementation of the weights and activations significantly increase energy consumption and constitute a major bottleneck for the scalability of the hardware (HW) with ever-evolving neural network architectures. Recent advances have explored using analogue complementary metal-oxide-semiconductor (CMOS) circuits [6] or an ADC with reconfigurable function mapping [7] to implement activation functions in HW. Although these approaches improve processing speed, they are difficult to integrate directly as part of the eNVM array due to their area mismatch with the compact array [8].
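As a point of reference, the layer computation described above, an MVM followed by a ReLU activation, can be written in a few lines of NumPy; the shapes and random values below are purely illustrative:

```python
import numpy as np

# One DNN layer: matrix-vector multiplication (MVM) followed by a ReLU
# activation. Shapes and values are illustrative, not taken from the paper.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))   # layer weight matrix
x = rng.standard_normal(8)        # input vector

z = W @ x                         # MVM: the operation accelerated in-memory
a = np.maximum(z, 0.0)            # ReLU: the operation the Mott neuron emulates
```

The MVM dominates the arithmetic cost, which is why offloading it to a crossbar, and then computing the ReLU in-place at the array edge, removes the data movement this section identifies as the bottleneck.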
To overcome this limitation, we have previously developed a Mott activation neuron that implements the rectified linear unit (ReLU) function in the analogue domain [8]. In that work [8], we focused on emulating the ReLU activation function with four-terminal single Mott neuron devices. However, integrating resistive memory synaptic arrays with Mott activation neurons is key to proving that Mott ReLU neurons can serve as activation units in a full HW implementation of DNNs with in-memory computing systems.
To that end, in this work, we focus on the development and characterization of resistive crossbar memory arrays and study their integration with Mott ReLU neurons for a direct combination of MVM operations with activation functions. We experimentally demonstrate Ag conductive bridging random access memory (Ag-CBRAM) synaptic crossbars for MVMs and compact Mott neuron devices for activation functions, and implement a large-scale image detection task using a deep convolutional neural network (VGG-16) in HW. Our demonstration concentrates on convolutional and activation layers, which are the main building blocks of VGG-16. Our Ag-CBRAM device exhibits a high ON/OFF ratio (∼10^10) and 4-bit multi-level switching, which are suitable for performing large-scale MVMs in DNNs. Mott activation neurons integrated with CBRAM arrays emulate the characteristics of the ReLU activation function, which is the most frequently used activation function in DNNs [9]. As shown in figure 1(b), a crossbar comprises Ag-CBRAM devices implementing a synaptic layer that accepts inputs on the wordlines (WLs) and generates weighted-sum currents on the bitlines (BLs). Each column of the Ag-CBRAM crossbar is connected to a nano-scale Mott neuron device for direct computation of the ReLU activation from the weighted sum. The outputs of the Mott ReLU devices can be directly fed to the WLs of the following Ag-CBRAM layers (figure 1(c)). The rest of the paper is organized as follows. First, we present the characterization of Ag-CBRAM devices, including direct current (DC) switching behavior, variation, retention, endurance, and multi-level switching. Then, we share our results on the volatile four-terminal Mott activation neuron device based on vanadium dioxide (VO2) and the experimentally measured input-output characteristics for implementation of the ReLU activation function. The transient response of the device is also measured to validate its low energy consumption.
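The crossbar operation described above, with inputs applied to the WLs and weighted-sum currents collected on the BLs, amounts to Ohm's law per cell and Kirchhoff's current law per column. A minimal sketch, using hypothetical conductance values rather than measured device parameters:

```python
import numpy as np

# Idealized crossbar MVM: each cell holds a conductance G[i, j]; applying
# voltages V[i] to the WLs yields BL currents I[j] = sum_i V[i] * G[i, j].
# The conductance values below are assumptions for illustration only.
G_LRS, G_HRS = 1 / 20e3, 1e-12          # LRS/HRS conductances (siemens)
G = np.array([[G_LRS, G_HRS],
              [G_HRS, G_LRS],
              [G_HRS, G_LRS]])          # 3 WLs x 2 BLs
V = np.full(3, 0.1)                      # 100 mV read inputs on the WLs

I = V @ G                                # weighted-sum current per BL
```

Because the HRS conductance is many orders of magnitude below the LRS conductance, HRS cells contribute negligibly to the column current, which is what makes the high ON/OFF ratio reported below useful for weight mapping.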
Lastly, we demonstrate HW implementation of a Canadian Institute for Advanced Research, 10 classes (CIFAR-10) image classification task using VGG-16 by integrating Ag-CBRAM crossbars and Mott ReLU neurons on a custom printed circuit board (PCB). Our results based on this integration suggest that the small size and energy efficiency of the Mott activation neuron offer a promising approach to replace CMOS circuits with a more area- and energy-efficient device-level solution for ReLU activation and to allow direct stacking of multiple synaptic layers.

Ag-based CBRAM
The simplicity of fabrication makes lateral eNVM devices desirable for direct integration on the back end of line (BEOL) of CMOS circuitry. For implementing synaptic layers, we first developed a lateral Ag-based CBRAM crossbar that can be fabricated at the wafer scale, as shown in figure 2(a). The 4-inch wafer contains 16 × 16 and 32 × 32 crossbar arrays as well as single devices for electrical characterization (figure 2(b)). For crossbar fabrication, we started with a 300 nm SiO2/Si wafer and deposited a 50 nm thick Ag layer via DC sputtering. Then 5 µm × 20 µm Ag channels, along with their BLs, were patterned via photolithography and wet etching. Next, 250 nm of SiO2 was deposited as an insulating layer by plasma-enhanced chemical vapor deposition. After patterning the SiO2 to open via holes, 200 nm of Cr/Au was deposited and patterned for the WLs as well as to define contact pads for the BLs and WLs. Single test devices were fabricated in a similar way.
The Ag-CBRAM devices (figure 2(c)) exhibit a low resistance as fabricated and need to undergo an oxidation step whereby the device is transformed from its highly conductive 'pristine' state (figure 2(c)) to an oxidized high-resistance state (HRS) using a low-amplitude voltage sweep, which initializes the subsequent switching (figures 2(d) and (e)). Figures 3 and 4(a) demonstrate this forming process. The forming process here differs from the conventional forming process, which involves the formation of a conductive filament in metal oxide-based resistive random-access memory (RRAM) or CBRAM devices. Instead, here the forming step transforms the conductive metal layer into an oxidized state that exhibits resistive switching. As the input voltage was swept from 0 V to 1 V, the device current increased nearly proportionally up to ∼45 mA. However, when the input bias reached ∼0.8 V, the resistance of the Ag channel suddenly increased to ∼10^12 Ω (figure 4(a)). By comparing the optical images of the pre-formed (figure 2(c)) and post-formed (figure 2(d)) device, we noticed that the left part of the Ag channel became visibly darker, as shown in figure 2(e), likely due to the formation of resistive silver oxide. This observation explains the forming-induced transition to the HRS.
To further validate that oxidation occurs during forming, we performed energy-dispersive x-ray spectroscopy (EDS) analysis to investigate the material composition of the pristine and formed devices. The channel of the pristine device consists of Ag only (figure 3(a)), while the color change in the formed device shows oxidation in the left part of the channel (figure 3(b)). In addition, we performed a line scan of the channel to better understand the oxygen concentration. As shown in figures 3(c) and (d), the oxygen concentration increases significantly between 4 µm and 9 µm while the Ag concentration decreases, indicating that the left part of the channel has been successfully oxidized through the forming process.
Thereafter, the device operates as a resistive switching memory. Figure 4(b) shows the bipolar I-V characteristics of the Ag-CBRAM as measured from a DC double-sweep cycle. Here, the applied voltage was increased in 5 mV steps for the positive (0 V to 2 V) and negative (0 V to −1 V) voltage ramps, while enforcing compliance currents of 500 µA and 10 mA to achieve SET and RESET, respectively. To investigate the consistency of switching operations, we characterized the statistical distribution of switching voltages and device resistances by performing 50 DC switching experiments (figure 4(c)). The average switching voltage for SET was ∼1.77 V, whereas that for RESET was ∼−0.35 V. Figure 4(d) records average low resistance state (LRS) and HRS resistances of ∼340 Ω and ∼3 × 10^13 Ω with good uniformity. These values translate to an ultra-high ON/OFF ratio of ∼10^10, which favors flexible mapping of a wide range of neural network weights [10]. It is noteworthy that the low LRS resistance of the Ag-CBRAM promises low-latency operation, whereas the high HRS resistance can help lower the static power consumption of the device by suppressing leakage currents. In addition, we characterized device-to-device (D2D) variation (figure 5). Figure 5(a) shows D2D variation in switching voltages, indicating that both SET and RESET voltages have tight distributions across devices. Figure 5(b) displays D2D variation in resistance. The LRS has a small variation, while the HRS shows some spatial fluctuations.
Device reliability is essential for implementing network training and inference, which require frequent switching and long-term storage of the weights [11]. Figure 6(a) shows that our device can retain the HRS and LRS for more than 10^4 s at room temperature. In addition to being non-volatile with long retention, our Ag-CBRAM devices exhibit high endurance. Figure 6(b) shows that the device can be switched between the LRS (∼10^4 Ω) and HRS (∼10^12 Ω) for at least 10^4 cycles. The HRS and LRS distributions over 10^4 switching cycles are presented in figure 6(c), which indicates that our device maintains low variation in both states. These results confirm the capability of Ag-CBRAM for achieving reliable crossbar operation with long-term stability.
While we demonstrated a conventional memory application for our Ag-CBRAM device, gradual resistance switching is key to achieving higher storage density by mapping multi-bit weights to a single device. By controlling the current compliance levels from 100 pA to 1 mA, the device can reliably switch between 16 states spanning seven orders of magnitude in resistance, as shown in figure 7(a). Moreover, we characterized the mean (figure 7(b)) and standard deviation (figure 7(c)) of each distinct resistance level versus compliance current. Our results validate that the Ag-CBRAM device has multi-level programmability with minimal overlap between different levels.
[Figure 6 caption: (a) Retention of Ag-CBRAMs. The device is first SET to the LRS and a 10^4 s sampling measurement constantly monitors the device resistance; the device is then switched to the HRS and the same measurement is performed. (b) Endurance of Ag-CBRAMs. 10^4 cycles are achieved by alternately applying SET (3 V/5 ms) and RESET (−4 V/100 µs) voltage pulses to the device while monitoring its resistance; a 1 kΩ discrete resistor connected in series with the device limits the SET current. (c) CDF of resistance in pulse programming, extracted from the endurance measurements in (b).]
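The 16-level behaviour above suggests a simple programming abstraction: one of 16 log-spaced resistance targets per 4-bit weight, each selected in practice by a different compliance current. The sketch below is a hypothetical mapping assuming log-uniform levels over seven orders of magnitude; the endpoint resistances are illustrative, not the measured values:

```python
import numpy as np

# Hypothetical 4-bit weight -> resistance-level mapping: 16 levels spaced
# log-uniformly over 7 orders of magnitude (cf. figure 7(a)). The endpoint
# resistances R_MIN and R_MAX are assumptions for illustration.
R_MIN, R_MAX = 1e4, 1e11
levels = np.logspace(np.log10(R_MIN), np.log10(R_MAX), 16)

def program(weight_4bit: int) -> float:
    """Return the target resistance for a 4-bit weight (0..15)."""
    if not 0 <= weight_4bit <= 15:
        raise ValueError("weight must fit in 4 bits")
    return levels[weight_4bit]
```

Log spacing mirrors the roughly exponential dependence of resistance on compliance current and keeps adjacent levels separated by a constant ratio, which is what minimizes overlap between neighbouring states.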

Mott ReLU activation neuron
We have previously developed an array of Mott ReLU neuron devices to implement the activation function layer [8]. Each Mott ReLU neuron device has four terminals, which allows exploiting the thermally driven Mott transition of VO2 to emulate the ReLU activation function in a single device (figure 8(a)). Mott ReLU devices were fabricated by depositing a 70 nm VO2 film via reactive sputtering. The device switching area was defined by two Ti (20 nm)/Au (30 nm) electrodes with a 50 nm gap using e-beam lithography. Figure 8(c) explains the operation of the device. The resistance of the heater is ∼30 Ω and the initial resistance of the VO2 gap is ∼10 kΩ. The ReLU activation function can be emulated by applying a current bias to the nanowire heater, which precisely modulates the temperature of the VO2 gap and induces thermally driven gradual resistive switching. The VO2 gap forms a voltage divider circuit with a load resistor to generate a voltage output (V_out) that can be directly fed as an input to the next synaptic layer. As a result, the resistance change of the VO2 gap modulates V_out across the load resistor and successfully emulates the ReLU function, as shown in figure 8(c). The device shows precision higher than 4 bits. As shown in figure 8(d), Mott ReLU neurons show a low latency of ∼61.4 ns while consuming 199.5 pJ per operation.
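The voltage-divider readout described above can be captured in a small behavioural model. Only V_DD = 1.1 V, the 3.3 kΩ load, the ∼10 kΩ initial gap resistance, and the 7 mA offset current come from the text; the piecewise-linear R_VO2(I_heater) transition and the metallic-state resistance are illustrative assumptions:

```python
# Behavioural sketch of the Mott ReLU output stage: the VO2 gap and a load
# resistor form a voltage divider, V_out = V_DD * R_load / (R_load + R_VO2).
# The linear transition of R_VO2 with heater current is an assumed model,
# not a fitted device characteristic.
V_DD, R_LOAD = 1.1, 3.3e3        # bias and load resistor (from the HW demo)
R_INS, R_MET = 10e3, 200.0        # insulating / metallic VO2 (R_MET assumed)
I_TH, I_SPAN = 7e-3, 2e-3         # threshold ~ offset current; span assumed

def r_vo2(i_heater: float) -> float:
    """Thermally driven gradual VO2 transition (assumed piecewise-linear)."""
    if i_heater <= I_TH:
        return R_INS
    frac = min((i_heater - I_TH) / I_SPAN, 1.0)
    return R_INS + frac * (R_MET - R_INS)

def v_out(i_heater: float) -> float:
    """ReLU-like transfer: flat below threshold, rising above it."""
    return V_DD * R_LOAD / (R_LOAD + r_vo2(i_heater))
```

Below the threshold the divider output sits at a constant floor; above it, the falling gap resistance pulls V_out up, reproducing the rectifying input-output shape of figure 8(c).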
To further understand how to achieve optimal energy efficiency for our device, we developed an empirical thermal model in SPICE to project the energy consumption of the device. This compact thermal model consists of a Joule-heating model of the heater, a thermal model of the VO2 gap, and a coupling model between the heater and the VO2 gap (figure 9(a)). Figure 9(b) lists the model parameters, and equations (1) and (2) govern the heater current and latency estimation in the model. By varying the heater thermal resistance, our model indicates that the heater current can be reduced by 3.4× when the thermal resistance of the nanowire heater is increased by 10× (figure 9(c)). Therefore, replacing the heater material with a higher thermal resistance material such as Ti can significantly improve the thermal coupling and allow the generated heat to be more confined within the VO2 gap. In our HW demonstration in section 2.3, we use the device with a heater thermal resistance of 1.22 × 10^3 K W^−1, highlighted by the red arrow in figure 9(c). In addition, the latency can be further reduced to ∼3.8 ns by minimizing the parasitic capacitance of the Mott ReLU below 10^−11 F, as shown in figure 9(d). As a result, our model estimates that the energy consumption per cycle of Mott ReLU neurons can be minimized down to ∼0.638 pJ at the single-device level by carefully engineering the heater material to enhance the thermal coupling and by reducing the parasitic capacitance.
Table 1 summarizes the energy, latency, area, and leakage performance of Mott ReLU activation devices against other activation devices or circuits at the single-ReLU level [8]. The CMOS analogue implementation uses op-amps and analogue switches [6] to implement the ReLU activation function, while the digital CMOS implementation uses a 4-bit ADC and a look-up table [7]. Since all three implementations listed in the table emulate ReLU characteristics, a fair comparison can be made between them. Our Mott ReLU device already achieves ∼17× energy reduction compared with analogue CMOS circuits [6]. With optimized thermal coupling, the device is projected to achieve ∼30× energy reduction compared with the digital ADC implementation [7]. The device can also provide a 450-1500× improvement in area and a 1.5-3× improvement in latency. These substantial performance gains in both area and energy of activation layers motivate our integration with the Ag-CBRAM array in section 2.3, which can achieve a more efficient DNN implementation in HW.

Ag-CBRAM and Mott ReLU integration for a DNN application
To demonstrate the core operations of DNN inference with our HW, we focused on VGG-16 for the CIFAR-10 image classification task. First, we designed a custom PCB to integrate the Ag-CBRAM crossbar arrays with Mott ReLU arrays (figure 10(a)). The board is capable of monitoring two arrays simultaneously and verifying weighted-sum and activation results. In our board design, the end of each BL (column) in the Ag-CBRAM array is connected to the heater terminal of a Mott ReLU device. As a result, the weighted-sum currents drive the heater and induce switching in the Mott ReLU device to emulate the ReLU activation function. We used a 16 × 16 Ag-CBRAM crossbar for this demonstration (figure 10(b)). The callout window of figure 10(b) shows a representative device in the crossbar. Figure 10(c) shows an array that contains 44 Mott ReLU devices that can be individually connected to the BLs of the crossbar via the PCB.
We then investigated how to efficiently map VGG-16 to our HW. VGG-16 is a 16-layer-deep convolutional neural network that has been widely used for computer vision applications. Figure 11(a) shows representative CIFAR-10 images from the 10 classes and the network architecture. VGG-16 consists of 13 convolutional layers, each followed by a ReLU activation layer, interleaved with 5 max-pooling layers and followed by 3 fully connected layers. For the HW demonstration, we focused on the convolutional layers; the max-pooling layers and fully connected layers were implemented in software (SW). Before mapping the full-precision (64-bit) weights of VGG-16 into HW, we performed post-training uniform quantization of both the weights and the activation functions at various precisions and investigated the impact on inference accuracy. Figure 11(b) shows that 5-bit weight precision and 4-bit activation precision are the minimal bit precisions that ensure no significant accuracy degradation. Although each Ag-CBRAM cell in our crossbar array has gradual resistive switching capability, as shown in figure 7(a), the analogue approach requires custom peripheral neuron circuits to precisely vary the current compliance and realize fine control of the resistance levels. Therefore, we chose a digital implementation for this array-level demonstration to ensure better controllability of the resistance states. Figure 11(c) explains the mapping of the network to the Ag-CBRAM crossbar arrays using binary weights (HRS ∼ 10^12 Ω, LRS ∼ 20 kΩ). In this illustration, N 3 × 3 convolutional filters are each unrolled into a 9 × 1 vector and mapped to columns of the crossbar. We quantized the filter weights into a 5-bit binary representation to minimize memory size while maintaining high accuracy. As a result, five columns are used to represent the MSB to LSB of the weights.
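The mapping in figure 11(c) — unroll each 3 × 3 filter into a 9 × 1 column and bit-slice its 5-bit weights across five binary columns (MSB to LSB) — can be sketched as follows. The toy weights and inputs are illustrative, and the digital shift-and-add recombination of the five partial sums is how any bit-sliced mapping is read out, not a claim about the specific periphery used here:

```python
import numpy as np

# Sketch of the filter-to-crossbar mapping: unroll a 3x3 filter into a 9x1
# column, then bit-slice its 5-bit weights into five binary columns.
# Weights and inputs are toy values for illustration.
rng = np.random.default_rng(1)
filt = rng.integers(0, 32, size=(3, 3))        # 5-bit unsigned toy weights

col = filt.reshape(9)                           # unroll 3x3 -> 9x1
powers = 2 ** np.arange(4, -1, -1)              # MSB..LSB place values
bits = (col[:, None] >> np.arange(4, -1, -1)) & 1   # 9x5 binary columns

# Each binary column is stored as one crossbar column of binary devices;
# the per-column weighted sums are recombined with powers of two.
x = rng.integers(0, 2, size=9)                  # binary WL inputs
weighted_sum = (x @ bits) @ powers              # recombined full-precision sum
```

Because recombination is exact (each weight equals its bits weighted by powers of two), the five binary columns reproduce the full 5-bit MVM result.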
As the filter slides across the input image, the part of the input (W × W) that overlaps with the filter is also unrolled into a 9 × 1 vector and fed into the WLs of the crossbar. The crossbar performs the MVM, and the weighted-sum current is accumulated at the end of each column. Activation layers are implemented by connecting a Mott ReLU to each column. Mott ReLU neurons rectify the weighted sum and produce the final pixel values in the output feature maps (OFMs). Before implementing the CIFAR-10 classification task in our HW, we first tested whether Ag-CBRAM crossbars can drive Mott ReLU neurons (figure 12(a)). We varied the input voltage (V_in) to a column of the CBRAM array by sweeping it from −250 mV to 250 mV while ∼2/3 of the devices on the column were set to the LRS and the others to the HRS (figure 12(b)). To program each device in the Ag-CBRAM array to its binary state, we adopted a V_DD/2 write scheme, in which the selected WL and BL are biased to V_DD/2 and −V_DD/2, and all other unselected lines are grounded to prevent sneak-path currents. Moreover, we varied the fraction of LRS devices in the column of the Ag-CBRAM array from 0% to 100% (figure 12(c)). For both cases, 1.1 V is applied as V_DD to the VO2 gap of the Mott ReLU with a 3.3 kΩ load resistor connected in series, and a 7 mA offset current is applied to the heater. As can be seen in figures 12(b) and (c), the Mott ReLU neuron shows ReLU input-output characteristics.
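The V_DD/2 write scheme just described can be checked with a few lines: the selected cell sees the full write bias across it, half-selected cells see only V_DD/2 (intended to stay below the switching threshold), and fully unselected cells see no bias. The V_DD value here is illustrative, not the programming voltage used in the paper:

```python
# Voltage across a cell under the V_DD/2 write scheme: selected WL at
# +V_DD/2, selected BL at -V_DD/2, all other lines grounded. V_DD is an
# illustrative value only.
V_DD = 3.0

def cell_voltage(wl_selected: bool, bl_selected: bool) -> float:
    v_wl = V_DD / 2 if wl_selected else 0.0
    v_bl = -V_DD / 2 if bl_selected else 0.0
    return v_wl - v_bl

assert cell_voltage(True, True) == V_DD           # selected: full write bias
assert cell_voltage(True, False) == V_DD / 2      # half-selected (same WL)
assert cell_voltage(False, True) == V_DD / 2      # half-selected (same BL)
assert cell_voltage(False, False) == 0.0          # unselected: no disturb
```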
After verifying that the Mott ReLU neuron can be driven by the Ag-CBRAM crossbar, we performed further array analysis. Figure 13 shows a representative resistance map and its distribution across the array. As seen in figure 13, the target convolutional filters (3 × 3) are unrolled (9 × 1) and successfully mapped to columns of the CBRAM array. Since we use a digital implementation in our array, figure 13(a) shows the filters represented as binary ('1' or '0') values in SW. A high-resistance device corresponds to '1' while a low-resistance device corresponds to '0' (figure 13(b)). In addition, figure 13(c) shows the current distribution at a read voltage of 100 mV, which indicates that the array has distinct binary states to represent the filter weights.
We then converted CIFAR-10 images into 8-bit pulse trains and fed them into the crossbar to generate the weighted-sum current, I_sum, which is the result of applying the convolution filters to the images. Mott ReLUs rectify I_sum and generate the ReLU output (V_out). Figure 14 shows representative experimental results from network operations performed for the first (layer 1) and last (layer 13) convolution layers of VGG-16 on a 32 × 32 input image from the dog class. For each layer, we present the SW-simulated results and the measured HW results side by side for comparison. Figures 14(a) and (g) show the 3 × 3 quantized convolution filters. After mapping these filters to the Ag-CBRAM array using the approach described previously, I_sum is measured at the end of each BL. Figures 14(b) and (h) show I_sum measured in real time for 4 representative patches (each patch contains 5 × 5 pixels), highlighted as red boxes in figures 14(c) and (i). The slide number represents the position of the filter as it slides across each patch. I_sum from each BL drives an individual Mott activation neuron on the PCB, and the output voltage of the neuron device is shown in figures 14(d) and (j). These results indicate that our HW implementation of convolution filters and activations can reliably generate OFMs and ReLU outputs without additional driver circuits, achieving results close to the ideal SW results (figures 14(e), (f), (k) and (l)). The learned OFMs represent abstract features of the dog class in layers 1 and 13. Based on the measured HW results, the estimated classification accuracy for the entire CIFAR-10 dataset using our HW is 93.04%, approaching the ideal SW accuracy (∼94%). The energy efficiency is estimated to be 25.7 TOPS/W.
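The 8-bit pulse-train encoding above can be read as bit-serial input streaming: each pixel is applied as eight binary pulses (MSB to LSB), and the per-cycle weighted sums are shifted and added to recover the full-precision result. A sketch under that assumption, with a binary weight column and toy data:

```python
import numpy as np

# Bit-serial input encoding: stream each 8-bit pixel as 8 binary pulses
# (MSB first) and shift-and-add the per-cycle weighted sums. The
# shift-and-add readout is an assumption about the periphery; the pixel
# and weight values are toy data.
rng = np.random.default_rng(2)
pixels = rng.integers(0, 256, size=9)       # 8-bit pixel inputs on 9 WLs
weights = rng.integers(0, 2, size=9)        # one binary crossbar column

acc = 0
for b in range(7, -1, -1):                  # one pulse per bit plane
    pulse = (pixels >> b) & 1               # binary WL inputs this cycle
    acc = (acc << 1) + int(pulse @ weights) # shift previous sum, add new one
```

Bit-serial streaming keeps every WL input binary, so the crossbar never needs multi-level input drivers; the cost is one read cycle per input bit.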

Discussion
There are several considerations for the array-level implementation of the proposed integration. In our experiments, we chose a low read voltage (100 mV) to prevent disturbance of the memory state. In our V_DD/2 biasing scheme, pulse programming is used for the write operations, with write voltages well above the read voltage; this separation eliminates the possibility of disturbance during read operations. The choice of a low read voltage and pulse programming for crossbar operation thus effectively inhibited unwanted write disturbance in our HW implementation. Even though single devices exhibit gradual switching characteristics via current compliance, it is challenging to program the devices to multi-level states via current compliance in crossbar arrays. To further increase the margin as well as limit the programming current in crossbar arrays, self-selectivity characteristics can be achieved by further engineering the material composition of the device [12, 13]. Moreover, one-transistor-one-resistor (1T1R) or one-selector-one-resistor (1S1R) cell structures can be adopted [14, 15] to eliminate the risk of unwanted disturbance of the memory state during read and write operations.
In this work, we demonstrated the integration using a custom PCB. Moving one step forward requires wafer-scale growth and integration of VO2 with the Si process. Various approaches have already been demonstrated for wafer-scale growth [16] and heterogeneous integration of VO2 films [17], as well as electrical circuits based on high-quality VO2 films grown on silicon [18]. Large-scale atomic layer deposition (ALD) of polycrystalline VO2 has already been achieved with decent metal-insulator transition characteristics [19]. These results offer potential routes for the on-chip integration of RRAM devices with Mott ReLU neurons. From the device operation perspective, the Ag-CBRAM requires forming before entering its normal switching regime, and programming of the devices is required when mapping weights to the array. Since the BLs of the Ag-CBRAM array will drive the Mott ReLUs directly when they are integrated into a single chip, on-chip integration may require additional peripheral circuitry to realize forming and array write operations.

Conclusion
In this work, a direct integration of the Ag-CBRAM array with Mott ReLU activation neurons is successfully demonstrated in HW. Our Ag-CBRAM device shows an ultra-high ON/OFF ratio, low variation, reliable endurance, and long retention. In addition, the Ag-CBRAM has multi-level switching capability with 16 states, making it an ideal synaptic device for neural network operation. The simplicity of fabrication of the lateral Ag-CBRAM array makes it easy to integrate with the BEOL of CMOS chips. The four-terminal Mott ReLU device embodies ReLU characteristics and can be directly driven by the weighted-sum currents generated in the Ag-CBRAM array. The small footprint of the device allows stacking between synaptic layers for a scalable in-memory computing system. The HW demonstration shows that Ag-CBRAM arrays integrated with Mott ReLU devices offer a compact and scalable solution for accelerating DNNs with close-to-SW accuracy. Our approach opens new avenues for implementing deeper and more complex network architectures with higher area and energy efficiency using eNVM-based synaptic arrays and Mott ReLU activation devices.

Data availability statement
The data cannot be made publicly available upon publication because no suitable repository exists for hosting data in this field of study. The data that support the findings of this study are available upon reasonable request from the authors.