Analysis of Verilog-based improvements to memory transfer

Memory copying is widely used for data transfer between the CPU and memory, is an essential step in all kinds of operating systems and drivers, and is one of the bottlenecks limiting current computing speed. This paper surveys the historical development, theoretical basis, and experimental details of different memory copy accelerators, summarising and comparing them, with emphasis on two related papers whose acceleration theory and experiments are reviewed in detail. The paper concludes that memory copy accelerators employing a variety of innovative techniques have substantially improved copy speed and reduced energy consumption, laying a solid foundation for the future development of memory copy acceleration technology; the performance gains of this technology have already improved the user experience. At the same time, the development of memory copy accelerators powers more advanced technologies, such as machine learning.


Introduction
For operating systems and drivers, memory copying is an unavoidable major step and one of the most time-consuming operations; data transfer between the CPU and memory is a bottleneck for increasing overall execution speed. Direct Memory Access (DMA) is widely used in important areas such as messaging, operating systems, and device drivers. In addition, in cloud computing architectures, live migration of a virtual machine (VM) also requires copying memory: virtual machine memory pages are transferred so that the VM can be resumed on another host without interrupting its services or applications.
This paper examines numerous memory copy accelerators and summarises and compares them. Many relevant theories and historical developments are studied; for example, data transfer between the CPU and memory in an operating system can be accelerated by reducing the number of CPU instructions executed and by creating a multi-channel pipeline that allows data transfer and computation to proceed in parallel. Many concepts are clarified; for example, in live migration of virtual machines, three key metrics determine the effectiveness of migration: migration time, downtime, and the total number of pages transferred. Numerous approaches have been read and understood by the authors, and this paper analyses the advantages and disadvantages of each approach and the associated theory. Chapter 2 reviews the history and development of memory copy acceleration techniques across a wide range of domains, including software and hardware as well as virtual machine migration. Chapter 3 introduces FPGAs and DMA, including the advantages and disadvantages of DMA and the relationship and shared goals between DMA and FPGAs, as a theoretical basis. Chapter 4 highlights the limitations of some current approaches in the field and selects two methods for presentation and analysis, highlighting their advantages and interpreting their results.
Memory copy acceleration can significantly increase the speed of a computer, making it run more smoothly. In virtual machines, faster memory copying also reduces the number of memory pages that must be transferred, which shortens virtual machine downtime and further improves the experience of network users.

History of memory copy acceleration
Because of its significance, many memory copy accelerators have appeared throughout history, the most traditional being the DMA accelerator: DMA transfers data directly from one device to another, reducing CPU involvement by performing the transfer in hardware, relieving the CPU of computational pressure and increasing overall computing speed. On top of this, there are a number of software- and algorithm-based improvements that further increase the efficiency of DMA accelerators.
For example, Intel developed an I/O Acceleration Technology (I/OAT) component, the Asynchronous DMA Copy Engine (ADCE), which relieves the CPU of the burden of large data transfers but is of limited use in practice because of its high overhead [1]. Wong et al. proposed a cache-based hardware accelerator for memory copying: they reduced the memory copy workload and achieved acceleration by adding an extra cache index to avoid writes to the actual memory address, but the method is limited by the size and organization of the system cache, which restricts its generality and therefore the applicability of their accelerator. They also proposed a DMA accelerator based on a hardware load/store unit that takes over the data about to be copied, but this increases the overhead of memory operations and reduces accelerator performance [2]. In contrast to the traditional zero-copy approach to address translation in network-stack memory copies, H. Tezuka et al. proposed a pin-down cache of pinned memory regions to reduce frequent virtual-address remapping, with good results [3]. In virtual machine live migration, the more traditional methods are pre-copy and post-copy, but they either do not generalize well or have unbalanced performance. Z. Wang et al. of Shanghai Jiao Tong University proposed adaptive copy for virtual machine live migration, which greatly improves on the traditional methods [4].

DMA
DMA is a high-speed transfer mechanism that reads and writes data directly: the transfer does not pass through the CPU, and the entire operation is carried out under the control of a DMA controller. The CPU can do other work during the transfer, so CPU processing and data exchange proceed in parallel, which significantly improves overall system performance.
The essential features of DMA are as follows: (1) each channel is directly coupled to dedicated hardware, software triggering is also supported, and software can decide which request takes priority over others; (2) circular buffer management is supported; (3) there are three event flags per channel (DMA half transfer, DMA transfer complete, and DMA transfer error), each of which can generate an interrupt request.
DMA, however, has two significant drawbacks: (1) heavy setup requirements, since at least three writes (source address, destination address, and length) are needed to configure a DMA channel before each transfer; (2) poor exception management, since software cannot observe the progress of a DMA-controlled transfer, making it difficult to handle exceptions correctly.
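To make drawback (1) concrete, the following is a minimal Python sketch (not a real driver) that models the fixed per-transfer setup cost of a DMA channel against a plain CPU copy; the cycle counts are illustrative assumptions, not measurements.

# Minimal sketch: models the per-transfer setup cost of a DMA channel versus
# a CPU copy, using illustrative (assumed) cycle costs.

def dma_copy_cycles(length_bytes, setup_writes=3, cycles_per_setup_write=50,
                    bytes_per_cycle=8):
    """Cost of a DMA transfer: fixed setup (source address, destination
    address, length) plus the streaming transfer itself."""
    setup = setup_writes * cycles_per_setup_write
    transfer = length_bytes / bytes_per_cycle
    return setup + transfer

def cpu_copy_cycles(length_bytes, bytes_per_cycle=4):
    """Cost of a plain CPU copy loop: no setup, but slower streaming and the
    CPU is busy for the whole transfer."""
    return length_bytes / bytes_per_cycle

if __name__ == "__main__":
    for n in (64, 1024, 64 * 1024):
        print(n, dma_copy_cycles(n), cpu_copy_cycles(n))

Under these assumed numbers the fixed setup dominates for small copies (so the CPU copy is cheaper), while DMA wins for large transfers, which is exactly the trade-off behind drawback (1).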

FPGA
FPGAs are based on the idea of a logic-cell array, which includes input/output blocks as well as configurable logic blocks. An FPGA is programmable and differs structurally from conventional logic circuits in that it can be customized: configuration data is loaded into the FPGA's internal static memory cells, and the values stored in these memory cells define both the logic function of each logic cell and the connection structure between blocks, and therefore the functions the FPGA can perform.
In conclusion, DMA enables high-speed data transfer by allowing peripheral devices to access system memory directly, and FPGAs can implement a wide range of digital logic and signal-processing functions. For neural networks, the two can be combined to achieve high-speed data movement and processing. However, neural network inference requires a large amount of computation and data transport between memory and the processing units, and the effectiveness of these techniques must keep advancing as the size and complexity of neural networks grow. A good memory copy accelerator should therefore provide: (1) high data-transfer bandwidth with pipelined memory read and write operations; (2) lightweight channel overhead; (3) a channel-specific control and communication interface for data transfer.

Improvement method
Method 1: layer conscious memory management

Background.
Deep neural networks (DNNs) are a computationally intensive learning model with increasingly promising applications. FPGAs are one of the hardware platforms for DNNs, thanks to their energy efficiency and reconfigurability. However, one of the main limitations of previous FPGA-based DNN accelerators is insufficient off-chip memory bandwidth, which can cause significant delays in transferring data to and from the FPGA and thus hurt the overall performance of the accelerator [5]. This situation gets worse when the DNN model has a large number of layers.
Another disadvantage is that previous designs use a uniform memory management (UMM) scheme for all layers. UMM typically allocates a fixed amount of memory to each layer of the DNN, regardless of the layer's actual memory requirements. Because different layers of a DNN often have different memory requirements, fixed memory allocation causes inefficiency and underutilization of on-chip memory resources.
Overall, these limitations highlight the need for new techniques to optimize memory usage and minimize data transfer in DNN accelerators. The Layer Conscious Memory Management (LCMM) methodology introduced next overcomes the limitations of fixed memory allocation by dynamically allocating on-chip memory resources based on the characteristics of each layer [6]. LCMM aims to maximize data reuse and minimize data transfer between on-chip and off-chip memory, which improves the performance and efficiency of FPGA-based DNN accelerators. The operation sequence of previous DNN accelerators is shown in Fig. 1. Convolutional layers are the most computationally and memory intensive parts of DNNs, so designing efficient architectures for convolutions is crucial for DNN acceleration. Convolution is built from multiply-and-accumulate operations: it takes a feature map tensor and a weight tensor as input and produces an output feature map tensor. Previous DNN accelerators apply a loop methodology that partitions the input tensors into tiles and repeatedly loads them from off-chip to on-chip memory for processing [7]. More specifically, the input tensors are split into tiles that are loaded and processed sequentially; the middle loops feed data from the input buffers to the compute array's buffer; the inner loops execute in parallel within the compute array in a pipelined fashion; and finally, the output tile is written back to off-chip memory. All layers adopt this strategy.
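As an illustration of this tiling scheme, the following is a minimal Python sketch in which NumPy arrays stand in for off-chip memory and array slices stand in for on-chip tiles; the tile sizes (TM, TN) and the layer shape are illustrative assumptions rather than values from the paper.

# Sketch of a tiled convolution loop nest: outer loops step through tiles of
# the output and input channels, tiles are "loaded" into on-chip buffers,
# inner loops do the multiply-and-accumulate, and the result is written back.
import numpy as np

def tiled_conv_layer(ifmap, weights, TM=4, TN=4):
    """ifmap:   (N_in, H, W) input feature map in "off-chip" memory
       weights: (N_out, N_in, K, K) kernels in "off-chip" memory
       Returns the (N_out, H-K+1, W-K+1) output feature map."""
    n_out, n_in, k, _ = weights.shape
    oh, ow = ifmap.shape[1] - k + 1, ifmap.shape[2] - k + 1
    ofmap = np.zeros((n_out, oh, ow))
    for m0 in range(0, n_out, TM):              # output-channel tiles
        for n0 in range(0, n_in, TN):           # input-channel tiles
            # "Load" tiles from off-chip into on-chip buffers.
            w_tile = weights[m0:m0 + TM, n0:n0 + TN]
            i_tile = ifmap[n0:n0 + TN]
            # Inner loops: multiply-and-accumulate over the tile.
            for m in range(w_tile.shape[0]):
                for y in range(oh):
                    for x in range(ow):
                        patch = i_tile[:, y:y + k, x:x + k]
                        ofmap[m0 + m, y, x] += np.sum(patch * w_tile[m])
            # The output tile is written back to off-chip memory (ofmap).
    return ofmap

# Example: a small layer with 8 input and 6 output channels.
ifmap = np.random.rand(8, 10, 10)
weights = np.random.rand(6, 8, 3, 3)
print(tiled_conv_layer(ifmap, weights).shape)   # (6, 8, 8)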

LCMM approach.
The goal of the authors' research is to optimize the performance of DNNs by keeping some layer data on chip to reduce data transfer. The authors focused on a model named Inception-v4 and examined one section of it, which contains six convolution operations connected by feature outputs and weight data sources, as shown in Fig. 2 (a) [8]. Using a UMM approach, buffers were allocated for each tensor in off-chip memory, as shown in Fig. 2 (b). However, the authors found that the performance of certain layers was restricted by computation, while others were limited by memory. To address this, the memory-restricted layers were selected and their data was stored in on-chip memory, as shown in Fig. 2 (c). Exploring the design space of memory allocation (Fig. 3), it was discovered that more on-chip memory did not always imply better performance, because different layers have different memory bandwidth requirements and tensor sizes. The intricacy of existing DNN models makes on-chip memory allocation challenging.
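A simple way to picture the selection of memory-restricted layers is a roofline-style comparison of estimated compute time and transfer time for each layer. The sketch below is a hedged illustration with made-up operation counts and hardware budgets, not the paper's actual analysis.

# Sketch: flag layers whose estimated data-transfer time exceeds their
# estimated compute time; such layers are candidates for on-chip buffers.
# All numbers are illustrative assumptions.

def is_memory_restricted(macs, bytes_moved,
                         peak_macs_per_s=1e12, dram_bytes_per_s=10e9):
    compute_time = macs / peak_macs_per_s
    transfer_time = bytes_moved / dram_bytes_per_s
    return transfer_time > compute_time

layers = {
    "conv_3x3": dict(macs=2e9, bytes_moved=8e6),   # compute-heavy layer
    "conv_1x1": dict(macs=5e7, bytes_moved=6e6),   # bandwidth-heavy layer
}
on_chip_candidates = [name for name, p in layers.items()
                      if is_memory_restricted(**p)]
print(on_chip_candidates)   # ['conv_1x1'] with these assumed numbers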
By analyzing the liveness (period of validity) of different feature tensors, it is possible to determine whether they can share the same buffer. An interference graph is constructed and a colouring algorithm is applied to it, as shown in Fig. 4 (a) [9]. The aim is to minimize the total capacity of the buffers rather than the number of buffers. The resulting arrangement of tensors and buffers is illustrated in Fig. 4 (b); if several tensors can use the same buffer, the largest tensor determines the size of that buffer. Unlike feature tensors, weight tensors remain valid throughout the entire computation, so they can be prefetched before they are needed and reused across multiple inference runs. A prefetching technique is designed to hide the prefetching overhead: it computes the time required to load the weight tensor data from off-chip to on-chip memory and backtracks to locate a node that guarantees enough elapsed time for the load. In a prefetching dependence graph (PDG), prefetching edges describe the order of operations for weight tensors; if two prefetching edges do not intersect, the corresponding nodes can share a weight tensor buffer. To save on-chip buffer capacity, an interference graph and buffer allocation similar to those used for feature tensors can be applied.
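The liveness-based buffer sharing can be sketched as a greedy allocation in Python: tensors with overlapping live intervals interfere and may not share a buffer, and a buffer's capacity is set by the largest tensor assigned to it. The tensor sizes and live intervals below are illustrative assumptions, and greedy largest-first assignment only approximates the paper's capacity-minimising colouring.

# Sketch of interference-based buffer sharing for feature tensors.

def overlaps(a, b):
    # Two live intervals (start, end) interfere if they overlap in time.
    return a[0] < b[1] and b[0] < a[1]

def allocate_buffers(tensors):
    """tensors: {name: (size_bytes, (live_start, live_end))}
       Returns a list of buffers, each {'capacity': int, 'tensors': [...]}."""
    buffers = []
    for name, (size, live) in sorted(tensors.items(),
                                     key=lambda kv: -kv[1][0]):
        for buf in buffers:
            # A tensor may join a buffer only if it interferes with none of
            # the tensors already placed there.
            if all(not overlaps(live, tensors[t][1]) for t in buf["tensors"]):
                buf["tensors"].append(name)
                buf["capacity"] = max(buf["capacity"], size)
                break
        else:
            buffers.append({"capacity": size, "tensors": [name]})
    return buffers

feature_tensors = {
    "t1": (4096, (0, 2)), "t2": (2048, (1, 3)),
    "t3": (4096, (2, 4)), "t4": (1024, (3, 5)),
}
for buf in allocate_buffers(feature_tensors):
    print(buf)
# Two buffers of 4096 and 2048 bytes suffice instead of four separate buffers.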

Evaluation. The LCMM framework can be integrated with existing FPGA-based accelerator designs.
The authors use a specific array architecture with UMM as the baseline and evaluate the performance of LCMM on three popular DNN models: ResNet-152, GoogLeNet, and Inception-v4 [10]. Two data types (16-bit and 32-bit) are involved, and the evaluation includes a comparison with the baseline and with state-of-the-art designs in terms of performance improvement and resource utilization. The detailed results are shown in Table 1 [11].
Table 1. Comparison results using three models.

The results show that LCMM beats UMM on all benchmarks, by 1.36 times on average. This improvement stems from two factors: first, the performance of memory-restricted layers is improved through on-chip buffer allocation; second, computation efficiency is improved by the use of tensor buffers. The authors also note that the improvement is higher for ResNet-152 than for GoogLeNet and Inception-v4 because of its simpler network structure and the smaller number of buffers required for feature map tensors. In addition, the authors observe that efficiency drops when the precision changes from 16-bit to 32-bit because of the increased buffer sizes [12].
Moreover, as shown in Table 2, the authors' design achieves performance speedups of 1.35X and 1.12X in throughput over two state-of-the-art designs. One of these designs stores all intermediate feature maps on chip and therefore uses more on-chip capacity than the authors' design, which stores only the outputs of the memory-limited layers on chip and thus achieves better computation efficiency. The authors' architecture also outperforms the other design, although at the expense of higher DSP utilization and on-chip memory consumption [13]. In the future, LCMM could be integrated with a heterogeneous design methodology to increase performance density even further.

Method 2: memory-centric reconfigurable accelerator for classification

Limitations.
As datasets grow from the PB scale to the EB scale and beyond, it becomes increasingly difficult to perform complex analyses (especially machine learning algorithms) in a reasonable amount of time and within a realistic power budget using standard architectures. General-purpose graphics processing units (GPGPUs) and off-the-shelf CPUs are frequently used for computing and acceleration. These systems are inefficient because they are general purpose and still rely heavily on data movement between storage devices and computing components. Many works address this issue with hardware acceleration, heavily exploiting the parallel processing capacity of GPGPU devices. Unfortunately, these methods have two primary drawbacks: (1) the communication requirements between applications vary greatly when accelerating problems for more complicated analysis, where parallelism may be less visible;
(2) even though these accelerators are usually linked to the host computer through a high-speed interconnect, they still suffer from the energy cost of data transfer. Regardless of the accelerators' speed and efficiency, the data transport requirements can drastically reduce their effectiveness when processing very large data sets.

Related theories.
According to application analysis, the most frequent data path operations are addition, multiplication, shifting, and comparison, which are heavily used in classification, clustering, and neural-network-based methods. All of these applications display, to some extent, a combination of instruction-level parallelism (ILP) and data-level parallelism (DLP), which a multi-core environment can readily exploit.
From a system-level perspective, the memory analysis accelerator is located between the last level of memory devices and the I/O controller, as shown in Fig. 5 [14]. During normal activity, the accelerator intercepts and relays I/O commands on the I/O interface as needed. Acceleration mode is triggered when the host issues a read command to a specific disk location containing configuration data. The accelerator can then read files from a set of specified locations or directories containing the data to be processed, distribute that data, and carry out processing according to a predetermined configuration.
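A hedged sketch of this trigger mechanism: the accelerator front end watches host I/O commands and enters acceleration mode when a read targets a designated configuration location. The command fields and the CONFIG_LBA constant are illustrative assumptions, not values from the paper.

# Sketch of the command-interception logic that switches the accelerator
# between normal pass-through operation and acceleration mode.

CONFIG_LBA = 0x7FFF0000   # hypothetical "magic" disk address holding the config

class AcceleratorFrontEnd:
    def __init__(self):
        self.mode = "PASS_THROUGH"

    def handle_host_command(self, op, lba):
        """Normal I/O is relayed to the disk; a read of CONFIG_LBA switches
        the accelerator into acceleration mode."""
        if op == "READ" and lba == CONFIG_LBA:
            self.mode = "ACCELERATION"
            return "load config, read input files/directories, start processing"
        return "relay to SATA/SAS controller"

fe = AcceleratorFrontEnd()
print(fe.handle_host_command("READ", 0x1000))      # normal I/O
print(fe.handle_host_command("READ", CONFIG_LBA))  # triggers acceleration mode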
As indicated in the schematic of system-level memory management in Fig. 6, a system memory manager in the accelerator oversees data arriving from the on-board SATA/SAS controller. Data scheduling is a feature of the reconfigurable read/write memory and is selected at compile time. The memory manager receives the SATA commands from the operating system, allowing it to safely write back results and retrieve file names or memory locations. Each cluster has a simple data router called a Data I/O Block (DIOB) that receives directed data transfers and distributes them to the appropriate cluster and PE. Every data byte is preceded by a 9-bit header containing an MSB, a 5-bit cluster ID, and a 2-bit PE ID, and is automatically forwarded to the correct PE.
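The per-byte routing header can be illustrated with a small pack/unpack sketch. The field order and the use of the one remaining bit are assumptions made here for illustration, since the text only specifies an MSB flag, a 5-bit cluster ID, and a 2-bit PE ID.

# Sketch of packing and unpacking the 9-bit DIOB routing header
# (assumed layout: [msb | 5-bit cluster | 2-bit PE | 1 reserved bit]).

def pack_header(msb, cluster_id, pe_id, reserved=0):
    assert 0 <= cluster_id < 32 and 0 <= pe_id < 4
    return (msb << 8) | (cluster_id << 3) | (pe_id << 1) | reserved

def unpack_header(h):
    return {"msb": (h >> 8) & 1,
            "cluster_id": (h >> 3) & 0x1F,
            "pe_id": (h >> 1) & 0x3,
            "reserved": h & 1}

h = pack_header(msb=1, cluster_id=13, pe_id=2)
print(bin(h), unpack_header(h))   # 0b101101100, cluster 13, PE 2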
A dedicated data line is used internally to and from each PE so that data transfer and execution can overlap without occupying the cluster bus. Front and back data buffers, designated "F" and "B" in the diagram, are present in each PE and are reserved for I/O.

Specific method and evaluation.
For hardware, the experiments used a desktop system running Ubuntu 12.04 x64 Server with an Intel Core2 Quad Q8200 2.33 GHz CPU, 8 GB of 800 MHz DDR2 RAM, and a 384-core NVIDIA Quadro K2000D GPU. For the simulated CPU processing, data is transmitted from the "SSD" to the "host" via the SPI connection, simulating the behaviour of the SATA interface, and the data is written back to the "SSD" after processing. Finally, for data compression, the data set is compressed using Huffman entropy coding to reduce the amount of data transferred. Decompression is simple and is achieved with a lookup table, requiring a total of 14 cycles: one for the actual lookup and the rest for data handling and formatting for the NBC, KMC, or CNN algorithms.
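The lookup-table decompression step can be sketched as follows with a toy prefix code: the table is indexed by the next few bits and returns the symbol and its code length, so each decoded symbol needs only a single lookup. The code table and lookup width are illustrative assumptions, and the hardware-level 14-cycle breakdown is not modelled.

# Sketch of table-based Huffman decoding with a toy prefix-free code.

LOOKUP_BITS = 3
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}

# Build the lookup table: every LOOKUP_BITS-wide pattern that starts with a
# code word maps to that code word's symbol and length.
table = {}
for sym, code in CODE.items():
    pad = LOOKUP_BITS - len(code)
    for i in range(2 ** pad):
        idx = int(code + format(i, f"0{pad}b") if pad else code, 2)
        table[idx] = (sym, len(code))

def decode(bits):
    out, pos = [], 0
    while pos < len(bits):
        window = bits[pos:pos + LOOKUP_BITS].ljust(LOOKUP_BITS, "0")
        sym, length = table[int(window, 2)]   # one lookup per symbol
        out.append(sym)
        pos += length
    return "".join(out)

encoded = "".join(CODE[s] for s in "abacad")
print(decode(encoded))   # abacad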
In terms of throughput, both GPUs and the accelerator significantly outperform CPUs, by a factor of about 6 on average. Because the accelerator cores are focused on the target operations, a larger proportion of the chip's usable area is devoted to useful work.
In terms of energy efficiency, the accelerator meets its design goal of high energy efficiency: it is typically two orders of magnitude more energy efficient than CPUs and GPUs. On this accelerator, a mix of lookup operations and custom data paths effectively reduces transfer energy and latency, by an average of 212 times over single-threaded CPUs and 74 times over GPUs. The results of each experiment are shown in Fig. 7.

Conclusion
In conclusion, the Layer Conscious Memory Management (LCMM) framework is a method for optimizing the performance of Deep Neural Network (DNN) accelerators on FPGAs. It does this by allocating on-chip memory based on layer diversity and memory lifespan information. By analysing the liveness of feature tensors and the prefetching time span of weight tensors, LCMM improves buffer utilization through reuse. Experiments have shown that this technique can yield a 1.36X performance improvement compared with previous designs.
The memory-centric reconfigurable accelerator uses a hierarchical memory architecture to store and access data efficiently, resulting in improved performance and energy efficiency compared with traditional accelerator architectures. The experimental results show that the proposed accelerator achieves up to a 66% speedup and a large reduction in energy consumption across various machine learning workloads. The authors suggest that this accelerator architecture can be scaled to larger datasets and extended to other machine learning tasks.
Based on these two strategies, it is anticipated that future DNN accelerators could use LCMM to speed up memory transfer; by reducing off-chip data access, this approach manages memory on the FPGA efficiently and leads to further performance improvements. Additionally, as neural networks become more complex and diverse, reconfigurable accelerators will be needed to support more advanced and flexible architectures. Overall, FPGA-based accelerators are evolving rapidly in response to the need for faster and more effective machine learning solutions in numerous applications.

Figure 1. Operation sequence of previous DNN accelerators.

Figure 2. Memory footprints of UMM and LCMM: (a) computation graph, (b) uniform memory management, (c) layer conscious memory management.

Figure 3. Design space of memory allocation.

Figure 4. Analysis of feature buffer lifespan: (a) feature interference graph, (b) feature buffer allocation.

Figure 5. The processing element architecture [15].

Figure 7. Data on the results of each experiment.

Table 2. Comparison with the two latest designs.