The promise of training deep neural networks on CPUs: A survey

This survey presents a comprehensive analysis of the potential benefits and challenges of training deep neural networks (DNNs) on CPUs, summarizing existing research in the field. Five distinct approaches are examined: Ternary Neural Networks (TNNs), Binary Neural Networks (BNNs), Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and a novel method called the Sub-Linear Deep Learning Engine (SLIDE), designed specifically for CPU-based network training. The survey emphasizes the advantages of using CPUs for DNN training, such as low cost, compact size, and broad applicability across domains. It also collects concerns about CPU acceleration, including the absence of a unified programming model and the inefficiency of DNN training as floating-point operation counts grow. To tackle these issues, the survey explores algorithmic and hardware optimization strategies, incorporating compressed network structures, innovative techniques such as SLIDE, and the RISC-V instruction set. The survey argues that CPUs are likely to become a viable alternative for developers with limited resources. Through continued algorithm optimization and hardware enhancements, CPUs can provide more cost-efficient neural network training solutions, excelling in areas such as mobile servers and edge computing.


Introduction
The rising prominence of deep learning and neural networks has necessitated the development of efficient hardware platforms for effective training and inference. While CPUs have not traditionally been the first choice for training neural networks, their portability makes them suitable for a growing number of models, especially given the increasing demand for DNN training in mobile and embedded systems. Enhancing CPU-based DNN training could reduce the need for dedicated processors and lower costs for developers.
The following two-chapter overview analyses the feasibility of training DNNs on CPUs, with an emphasis on the varied techniques and hardware modifications that enhance CPU-based DNN training. To evaluate the viability of CPU-centric training and inference methodologies, four DNN models, together with the SLIDE algorithm, algorithmic upgrades, and hardware developments, are examined. In the second part, TNNs, BNNs, CNNs, RNNs, and the SLIDE algorithm are examined as candidates for CPU-based DNN training. The deep learning algorithms are divided into three categories: TNNs and BNNs, distinguished by the number of quantization levels; CNNs and RNNs, distinguished by their architectures and the data they process; and the SLIDE algorithm, which is presented as a separate category. In the third part, this survey explores the reasons for using CPUs for DNN training, highlighting possible benefits such as reduced cost, larger memory capacity, a smaller footprint, and superior performance on small datasets compared to GPUs. In addition, that part examines the limitations CPUs face as mainstream training hardware and evaluates recent advances in algorithmic optimization and RISC-V hardware optimization that enable fast neural network training on CPUs. Finally, this survey concludes by evaluating the performance, accuracy, and efficiency of each approach in a CPU-based deep learning environment, providing insights into their use for various deep learning tasks and situations.

Algorithms applied on CPUs
This survey now reviews four traditional deep neural network (DNN) models and one new algorithm. This part focuses primarily on the structure and operational aspects of the networks, as well as their applicability in specific domains. TNNs and BNNs represent two compression methods for neural networks, which differ in the computational cost of training on CPUs. CNNs and RNNs are two network architectures that apply different operations to their datasets. In contrast to the four traditional algorithms above, SLIDE is a new algorithm specializing in network training on CPUs.

TNN, BNN
The standard for distinguishing between ternary and binary neural networks is the level of quantization applied to the network's weights and activations.
2.1.1. TNN. Ternary Neural Networks (TNNs) restrict weights to the values {-1, 0, +1}. Due to their potential for faster inference and greater power efficiency compared to full-precision networks, TNNs have garnered considerable interest. In contrast to their binary counterparts, TNNs inherently prune smaller weights by setting them to zero during training, resulting in sparser and more energy-efficient networks.
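As a concrete illustration, threshold-based ternarization can be sketched as follows; the 0.7 * mean(|w|) threshold is one common heuristic from the ternary-weight-network literature, not a detail taken from the works surveyed here:

```python
def ternarize(weights, delta_scale=0.7):
    # Weights with magnitude below the threshold are pruned to 0,
    # the rest are mapped to -1 or +1 by sign. The 0.7*mean(|w|)
    # threshold is one heuristic choice, not the only option.
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = delta_scale * mean_abs
    return [0 if abs(w) < delta else (1 if w > 0 else -1) for w in weights]

print(ternarize([0.9, -0.05, 0.4, -0.7, 0.02]))  # -> [1, 0, 1, -1, 0]
```

Note how the two small weights are zeroed out, which is exactly the implicit pruning effect described above.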
2.1.2. BNN. Binary Neural Networks (BNNs) constrain weights and activations to {-1, +1} at runtime. Gradients and real-valued weights are stored in full precision during training, allowing networks to be trained effectively on systems with limited resources. BNNs have made significant progress in recent years because they can be deployed on resource-constrained devices, saving storage, computation cost, and energy.
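The arithmetic saving is easy to see in code: once weights and activations are in {-1, +1}, a dot product reduces to XNOR plus popcount. A minimal pure-Python sketch (the bit-packing layout here is our own illustrative choice):

```python
def binarize(vec):
    # Map real values to {-1, +1} by sign (zero treated as +1 here).
    return [1 if v >= 0 else -1 for v in vec]

def to_bits(bvec):
    # Pack a {-1, +1} vector into an integer bitmask (+1 -> 1, -1 -> 0).
    bits = 0
    for i, v in enumerate(bvec):
        if v == 1:
            bits |= 1 << i
    return bits

def xnor_dot(a_bits, w_bits, n):
    # XNOR counts matching signs; dot = matches - mismatches = 2*matches - n.
    matches = bin(~(a_bits ^ w_bits) & ((1 << n) - 1)).count("1")
    return 2 * matches - n

a = binarize([0.5, -1.2, 0.3, -0.4])   # [1, -1, 1, -1]
w = binarize([0.9, 0.1, -0.3, -0.8])   # [1, 1, -1, -1]
print(xnor_dot(to_bits(a), to_bits(w), 4))  # -> 0, same as sum(x*y)
```

On real hardware the same idea runs over 64-bit words with hardware popcount, which is the source of the large speed and storage gains.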

CNN, RNN
The standard for dividing deep neural networks into CNNs and RNNs is based on their architectural properties and the nature of the input data they are designed to process.

CNN.
Convolutional neural networks (CNNs), a class of artificial neural network, have recently seen a significant uptick in popularity. CNNs are built from convolutional layers, pooling layers, normalization layers, and fully connected (classification) layers. CNNs minimize the number of network parameters by linking lower and higher layers through shared convolution kernel parameters. This allows CNNs to exploit the local correlations in image data while simultaneously reducing the number of network parameters.
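The effect of shared kernel parameters can be illustrated by counting parameters; the layer sizes below are hypothetical:

```python
def conv_params(k, c_in, c_out):
    # A k x k convolution reuses the same kernel at every spatial
    # position, so its parameter count is independent of image size.
    return k * k * c_in * c_out + c_out  # weights + biases

def dense_params(h, w, c_in, n_out):
    # A fully connected layer on the same input needs one weight per
    # input pixel per output unit.
    return h * w * c_in * n_out + n_out

print(conv_params(3, 3, 64))        # -> 1792
print(dense_params(32, 32, 3, 64))  # -> 196672
```

For a 32x32 RGB input, the shared 3x3 kernels need roughly 100x fewer parameters than a dense layer producing the same number of feature channels.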

RNN.
Recurrent Neural Networks (RNNs), encompassing popular variants such as Long Short-Term Memory (LSTM) networks, include recurrent connections that support sequence analysis and are often used for producing or identifying time-series data, such as voice or music. RNNs generate output depending on both the current input and the history of previous inputs, enabling long-term dependencies to influence the result [1]. For instance, LSTMs use three gates (the input gate, the output gate, and the forget gate) to control the influence of the input at any given time on the memory. While RNN-specific hardware acceleration has received scant attention to date, RNNs are well suited to training on CPUs due to their low computational burden, small memory footprint, and use of sophisticated instructions [2].
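A single LSTM time step with scalar state makes the three-gate mechanism concrete; the parameter names and values below are illustrative, not taken from any trained model:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    # One LSTM time step with scalar state. p holds per-gate weights
    # (w_* on the input, u_* on the hidden state, b_* biases).
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])    # input gate
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])    # forget gate
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])    # output gate
    g = math.tanh(p["w_g"] * x + p["u_g"] * h_prev + p["b_g"])  # candidate memory
    c = f * c_prev + i * g   # forget gate scales old memory, input gate admits new
    h = o * math.tanh(c)     # output gate controls what leaves the cell
    return h, c

params = {k: 0.0 for k in ("w_i", "u_i", "b_i", "w_f", "u_f", "b_f",
                           "w_o", "u_o", "b_o", "w_g", "u_g", "b_g")}
h, c = lstm_step(1.0, 0.0, 0.0, params)  # with zero weights the cell stays empty
```

The forget gate's multiplicative path through `c` is what lets information persist across many steps, which is the long-term dependency mechanism mentioned above.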

SLIDE
Developed by Rice University, the Sub-Linear Deep Learning Engine (SLIDE) applies Locality Sensitive Hashing (LSH) and adaptive dropout to neural networks. SLIDE comprises two phases [3].

Pre-processing phase.
A hash table (L) is created that stores pointers to input data elements rather than copies, promoting memory efficiency.

Query phase.
A nearest-neighbour search returns a set of active neurons from the buckets of the previously constructed hash table (L). In SLIDE, only a subset of activations is computed at each iteration: the hash code of the input is queried to obtain the active neuron IDs from the corresponding bucket of the hash table. Once the neural network computation is complete, the output is evaluated and the error is backpropagated to compute the gradient descent update. Neuron weights are then updated, and their positions are adjusted in the corresponding hash tables.
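The two phases can be sketched with random-hyperplane (SimHash) hashing; SLIDE itself supports several LSH families and maintains multiple tables per layer, so this single-table sketch is a deliberate simplification:

```python
import random

def simhash(vec, planes):
    # Sign of the dot product with each random hyperplane gives one
    # bit of the bucket id; similar vectors land in the same bucket.
    code = 0
    for i, p in enumerate(planes):
        if sum(v * w for v, w in zip(vec, p)) >= 0:
            code |= 1 << i
    return code

random.seed(0)
dim, n_neurons, n_bits = 8, 100, 4
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
neurons = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_neurons)]

# Pre-processing phase: bucket neuron ids by the hash of their weights.
table = {}
for nid, w in enumerate(neurons):
    table.setdefault(simhash(w, planes), []).append(nid)

# Query phase: hash the input and activate only the matching bucket,
# so only a small subset of activations is ever computed.
x = [random.gauss(0, 1) for _ in range(dim)]
active = table.get(simhash(x, planes), [])
outputs = {nid: sum(a * b for a, b in zip(x, neurons[nid]))
           for nid in active}
print(f"computed {len(outputs)} of {n_neurons} activations")
```

After a weight update, a real implementation would rehash the modified neurons into the tables, mirroring the final step described above.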

Comparison on CPUs
In this part, the survey discusses the reasons that motivate training DNNs on CPUs and the challenges CPUs face in current and future neural network training. In the context of the algorithms above, it then discusses software and hardware optimizations of CPUs for accelerating DNNs.

Reasons to choose CPU
CPUs excel at executing logical operations and efficiently handling complex instructions. However, due to their limited parallel processing capabilities, CPUs are commonly perceived as less proficient than GPUs at accelerating neural network computations. Results may be influenced by factors including CPU architecture, batch size, hidden layer dimensions, and transfer learning strategies [4]. Increasing the CPU's core count can substantially bolster overall performance. CPUs are a suitable alternative for training neural networks due to their compact form and low cost. Using Intel Knights Mill CPUs, which provide dedicated vector instructions for deep learning, developers can train neural networks without requiring large training facilities [2]. With smaller models or datasets, the performance gap between CPUs and GPUs is far less pronounced. Moreover, the time needed for data transfer may negate the GPU's acceleration advantage, especially with smaller datasets, so CPUs can outperform GPUs in such instances [5].
CPU hardware can be enhanced with RISC-V instruction sets. RISC-V offers a simpler, more effective open-source alternative to conventional instruction sets. Its modular design supports a variety of processing capabilities, and its flexibility and scalability provide reserved encoding space and easy subset extension, reducing design cost and time while enhancing CPUs' computational power and parallelism [6].
CPUs are likely to become more popular in the future. They provide further advantages such as vast memory capacity, compatibility with mobile systems, and applicability in challenging environments, such as those encountered in the space and military sectors.

Challenges of CPU acceleration
As the capacity for model feature extraction improves and the number of model parameters and FLOPs increases, it becomes more difficult to achieve rapid inference on mobile devices with ARM architecture or x86-based CPUs [7].
There is still a great deal of promise and opportunity for CPUs in DL, but there are also obstacles and restrictions. For instance, CPUs lack a unified programming model and tool chain supporting DL applications of various types and scales; CPUs continue to face inefficiency and complexity when dealing with sparse DNNs; CPUs must solve load balancing, communication overhead, and cache affinity issues; and CPUs must consider compatibility, scalability, and reliability when working with other accelerators in heterogeneous computing environments [8].

Algorithmic optimization
According to the research conducted by Y. Liu and colleagues [9], the acceleration of inference tasks for CNN models on their CPU platforms follows the TVM design philosophy. This technique couples code-generation optimization at the operator level with optimization at the graph level, pushing the limits of hardware performance via an all-encompassing optimization strategy. To reduce resource contention, individual worker threads are pinned to distinct CPU physical cores, and global variables are padded to cache-line boundaries to avoid false sharing across threads.
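The thread-to-core pinning described above can be sketched on Linux via `os.sched_setaffinity`; the cache-line padding part is a C-level detail not reproduced here, and this is an illustrative sketch, not the authors' implementation:

```python
import os

def pin_to_core(core_id):
    # Pin the calling process/thread to one physical core so workers
    # do not migrate and contend for each other's caches.
    # sched_setaffinity is Linux-only, hence the feature check.
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {core_id})  # 0 means "this process"
        return core_id in os.sched_getaffinity(0)
    return False  # unsupported platform: leave scheduling to the OS
```

Each worker would call `pin_to_core` with its own core index at startup; the C equivalent is `pthread_setaffinity_np`, and padding each worker's hot globals to 64-byte cache-line boundaries prevents false sharing between them.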
C. Cui et al. summarize a variety of strategies designed to reach an ideal balance between precision and speed for DNN training on CPUs [7]. These include changing BaseNet's activation function from ReLU to HSwish; adding an SE-Block near the network's tail for a better accuracy-speed equilibrium; using a large convolutional kernel for low latency and good accuracy; and adding a 1x1 convolution with 1280 dimensions after the GAP layer to improve fitting.
M. Rastegari et al. successfully reduce network size by a factor of 32 through binarization, which not only accelerates processing but also enables real-time execution of state-of-the-art deep neural network inference on a CPU rather than a GPU. This advancement opens the door to integrating highly sophisticated neural networks into memory-constrained portable devices [10].
Ternary weight quantization also proves highly beneficial for running deep learning networks on CPUs. Seokhyeon Choi et al. developed TernGEMM, a technique that employs logical operations rather than multiplications and additions to enable fast DNN inference on CPUs, making full use of the CPU's 8-bit precision to efficiently execute quantized deep neural networks in real time on CPU-equipped embedded devices [11].
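The principle of replacing multiply-accumulate with logical operations can be illustrated for the simplest case of {-1, 0, +1} weights against binary {0, 1} activations; TernGEMM's actual kernel handles wider bit widths, so this is only a sketch of the idea:

```python
def encode_ternary(wvec):
    # Encode a {-1, 0, +1} weight vector as two bitmasks:
    # one marking the +1 positions, one marking the -1 positions.
    pos = neg = 0
    for i, w in enumerate(wvec):
        if w == 1:
            pos |= 1 << i
        elif w == -1:
            neg |= 1 << i
    return pos, neg

def ternary_dot(a_bits, pos, neg):
    # For a binary {0, 1} activation bitmask, the dot product reduces
    # to two AND + popcount operations instead of multiply-accumulates.
    return bin(a_bits & pos).count("1") - bin(a_bits & neg).count("1")

pos, neg = encode_ternary([1, -1, 0, 1])
print(ternary_dot(0b1011, pos, neg))  # -> 1, matching 1*1 + (-1)*1 + 0*0 + 1*1
```

On real hardware the same two masks are processed a machine word at a time with SIMD AND and popcount instructions, which is where the speedup over scalar multiply-add comes from.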
Hongxu Yin et al. implemented a compressed LSTM model for improved CPU deployment [12]. By cyclically expanding and pruning the weights and columns/rows of LSTM models, they produce a hardware-friendly LSTM model that is small, precise, and very efficient. Utilizing the delay lag effect (LHE) present in the hardware, LSTM latency can be decreased while accuracy is improved.
Training neural networks on CPUs may also benefit from innovative approaches. Rice University developed an innovative deep learning technique dubbed SLIDE [13]. This technique eliminates the matrix-multiplication approach enforced by frameworks such as TensorFlow (TF) and PyTorch, offering an alternative way for CPUs to train neural networks. Compared with TF-CPU, which makes poor use of its cores, SLIDE achieves much higher utilization of the computational cores. SLIDE requires only a general-purpose CPU, with no specialized graphics hardware, to execute its machine learning algorithms. While GPUs often use networks comprising millions or billions of neurons to evaluate massive datasets, not every neuron must be trained for every potential input. By selectively activating relevant neurons, SLIDE makes efficient use of the CPU's processing capabilities, and it can significantly reduce convergence time as the number of CPU cores increases [3]. More recently, Shabnam Daghaghi et al. [14] created a vectorized optimization utilizing AVX-512 (Advanced Vector Extensions). With this technology, the processing speed of SLIDE on an x86 CPU was increased by a factor of seven, demonstrating the full potential of CPU memory optimization and quantization techniques. Since SLIDE only targets the neurons essential for learning, the CPU can complete deep learning training efficiently, which is cost-effective for developers.
Finally, Table 1 is compiled below, listing the CPU model, the neural network model, the dataset, the training (deployment) time, the accuracy of the deployed model, and the characteristics of the CPU:

Hardware optimization
The RISC-V architecture enables a compact CPU design while promoting more efficient data transmission.Its open-source nature and inherent scalability make it an ideal choice for future advancements in CPU development.Furthermore, the compatibility of RISC-V allows for the integration of specialized coprocessors to accelerate neural network training, significantly enhancing the performance of key operations such as convolution, pooling, ReLU activation, and matrix addition.

Table 1. Performances of various deep neural networks on CPU.

Conclusion
In conclusion, this survey has offered a comprehensive review of the possibilities for training DNNs on CPUs, covering techniques, hardware improvements, and obstacles. After reviewing the structure, operational aspects, and domain applicability of four DNN models (TNN, BNN, CNN, and RNN) as well as the novel SLIDE algorithm tailored for CPU-based network training, this survey concludes that CPUs can be a viable alternative for neural network training due to their logical processing capabilities, low cost, and compatibility with mobile systems, providing an economical route to training neural networks. Nonetheless, obstacles such as restricted parallelism, sparse DNN management, and load balancing remain. To balance accuracy and speed while maximizing hardware performance, algorithmic enhancements such as the TVM design, BaseNet's revised activation function, ternary weight quantization, compressed LSTM models, and SLIDE have been proposed, dramatically decreasing convergence time. On the hardware side, the RISC-V architecture and customized coprocessors enhance CPU performance on essential neural network training operations. By addressing current obstacles and improving both algorithms and hardware, CPUs may gain popularity for DNN training, especially in mobile systems and memory-constrained devices. These findings may provide valuable insights for researchers and practitioners in this field.