Optimization of ALU with gated clock and its internal modules in RVIM64 processor

Low power consumption and small area are main research directions for RISC-V processors, and the arithmetic logic unit (ALU) is a key component that affects the overall performance of the processor. To improve the performance of a RISC-V processor, we reduce the overall power consumption and area of the ALU by adding a gated clock circuit and by optimizing the structure of the comparison module and the shift module in the ALU. The design is synthesized in a 90 nm technology with the temperature and voltage set to 125 °C and 1 V, respectively. The synthesis results of the ALU are as follows: the area is 111061 μm², the power consumption is 20.58 mW, and the delay is 20.58 ps. Compared with a traditional ALU, the area is reduced by 46% and the power consumption is reduced by about 48%.


Introduction
With the development of science and technology, modern systems increasingly need low-cost, high-performance processors, which makes a good processor architecture particularly important [1]. As an open-source architecture, RISC-V [2] has developed rapidly, and there have been many studies of processors based on it. For example, Rocket is a simple 64-bit processor with a 5-stage pipeline, designed by a foreign university, that also supports branch prediction; BOOM is a 10-stage out-of-order processor designed by a foreign laboratory [3] [4]; and the E203, developed in China, is a processor with low area and low power consumption. As the core unit of the processor, the ALU must also meet the requirement of low power consumption. Experimental results show that the ALU accounts for 40% of the power consumption and 38% of the area of the CPU, so it is particularly important to optimize the area and power consumption of the ALU.
Traditional ALU optimization includes optimizing the adder by using a modified carry-select adder or a carry-look-ahead adder [5]. In an ALU that supports multiplication, the adder accounts for only a small proportion, so it can be combined with the comparison module to reduce the area. Multipliers are optimized by using modified Booth multipliers [6] or radix-4 and radix-8 Booth-encoded multi-modular multipliers [7]. Although a high-radix multiplier is optimized, it does not achieve both performance and speed, so a radix-4 multiplier, or the Booth algorithm combined with a shift-based multiplier, can be considered instead.
In this paper, a gated clock unit [8] is added to the traditional ALU. A carry-look-ahead adder (CLA) [9] is used in the comparator module and reused there. In the shift module, a mask algorithm is used to realize the arithmetic right shift, and an inverter is used to reduce the number of shifters. With these optimizations, the area of the proposed ALU is reduced by 46% and the power consumption is reduced by about 48%.
Section 2 introduces the optimization methods of the ALU. Section 2.1 introduces the gated clock in the ALU. Section 2.2 covers the comparator module, including the adder and its reuse in the comparator module. Section 2.3 covers the shift module and the optimization of its structure. Section 2.4 demonstrates the feasibility of the scheme through experimental data. Section 3 gives the results of the experiment: after DC synthesis, the optimized ALU is superior to the traditional ALU.

ALU optimization
The optimization of the ALU targets two aspects: power consumption and area. Power consumption can be classified into static power consumption and dynamic power consumption. We mainly focus on reducing dynamic power consumption, which consists of switching (flip) power and short-circuit power. In its standard form, the switching power can be calculated as $P_{switch} = V_{dd}^{2} \cdot C_{load} \cdot T_{r}$, where $V_{dd}$ is the supply voltage, $C_{load}$ is the equivalent load capacitance of the following-stage circuit, and $T_{r}$ is the toggle rate of the input signal. It can also be written as $P_{switch} = \alpha \cdot C_{load} \cdot V_{dd}^{2} \cdot f$, where $f$ is the clock frequency and $\alpha$ is the activity factor, also called the flip factor; adding a gated clock is the way to reduce $\alpha$. The original adder is generated from IP and suffers from a delay problem. To reduce the delay, a CLA can be used, whose carries in standard form are $C_{0} = G_{0} + P_{0} C_{in}$ (5) and $C_{i} = G_{i} + P_{i} C_{i-1}$ (6), where $G_{i} = A_{i} B_{i}$ and $P_{i} = A_{i} \oplus B_{i}$ are the generate and propagate signals of bit $i$. Based on these formulas, the area and power consumption of the traditional ALU are optimized. The original structure of the ALU is shown in Figure 1: the addition, subtraction, and comparison instructions are implemented with operator symbols, and the shift module uses three shifters. To reduce power consumption and area, the gated clock is added first, and then the internal structure of the ALU is modified. The optimized structure is shown in Figure 2.
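The effect of the activity factor on dynamic power can be illustrated with a small calculation. The sketch below assumes the common activity-factor form P = α · C_load · V_dd² · f, where the clock frequency f is an assumption not named in the text, and all numeric values are hypothetical and chosen only for illustration.

```python
def dynamic_power(alpha, c_load, v_dd, freq):
    """Switching power in watts, assuming P = alpha * C_load * V_dd^2 * f."""
    return alpha * c_load * v_dd ** 2 * freq


# Hypothetical operating point (not taken from the paper).
C_LOAD = 1e-12   # equivalent load capacitance: 1 pF
V_DD = 1.0       # supply voltage: 1 V
FREQ = 500e6     # clock frequency: 500 MHz

# Without gating every cycle toggles the logic (alpha = 1.0);
# with gating we assume only a quarter of the cycles do (alpha = 0.25).
ungated = dynamic_power(1.0, C_LOAD, V_DD, FREQ)
gated = dynamic_power(0.25, C_LOAD, V_DD, FREQ)
saving = 1 - gated / ungated

print(f"ungated: {ungated * 1e3:.3f} mW, gated: {gated * 1e3:.3f} mW")
print(f"relative saving: {saving:.0%}")
```

Because power is linear in α under this model, any reduction in switching activity translates directly into the same relative reduction in dynamic power.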

Gated clock
Adding a gated clock to the ALU reduces the dynamic power consumption of the chip by shutting off idle sub-logic and sub-components [10]. The design of the ALU is shown in Figure 3: an enable signal is added at each signal input. When only one signal is enabled, only that channel is active, and all other channels remain static.
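The idea of per-channel enables can be sketched with a small behavioural model. This is not the circuit of Figure 3: the channel names (`add`, `cmp`, `shift`), the `OperandRegister` class, and the use of register bit toggles as a proxy for switching activity are all assumptions made for illustration.

```python
class OperandRegister:
    """Input register of one ALU channel; counts bit toggles as an activity proxy."""

    def __init__(self):
        self.value = 0
        self.toggles = 0

    def clock(self, data, enable=True):
        # With the clock gated off, the register holds its value and nothing toggles.
        if not enable:
            return
        self.toggles += bin(self.value ^ data).count("1")
        self.value = data


def total_toggles(ops, gated):
    """Feed (channel, operand) pairs to three channels and sum the toggles."""
    channels = {name: OperandRegister() for name in ("add", "cmp", "shift")}
    for op, operand in ops:
        for name, reg in channels.items():
            # Gated: only the selected channel is clocked. Ungated: all of them are.
            reg.clock(operand, enable=(name == op) or not gated)
    return sum(reg.toggles for reg in channels.values())


# Hypothetical instruction stream.
ops = [("add", 0xFF), ("shift", 0x0F), ("cmp", 0xF0), ("add", 0x00)]
print("ungated toggles:", total_toggles(ops, gated=False))
print("gated toggles:  ", total_toggles(ops, gated=True))
```

In this toy stream the gated model toggles a third as many bits as the ungated one, because the two idle channels stay static in every cycle.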

Adder and adder reuse in the comparison module
We design a 64-bit CLA that outputs carry and flag information. A wide adder can be built by connecting narrower adders in series. The structure of the 4-bit CLA is shown in Figure 4. Two 4-bit CLAs are combined in series to form an 8-bit CLA, four 8-bit CLAs are combined in series to form a 32-bit CLA, and 32-bit CLAs are then combined into a 64-bit CLA, as shown in Figure 5. An adder composed in this way uses fewer resources than an adder implemented directly at the full bit width.
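The series composition can be sketched in software as follows. This is a behavioural model under stated assumptions, not the RTL of Figures 4 and 5: the function names `cla4` and `cla_add` are hypothetical, the 64-bit adder is formed by chaining sixteen 4-bit blocks directly rather than through the intermediate 8-bit and 32-bit stages, and the flag definitions (CF as carry out, OF as signed overflow, SF as sign bit) are the conventional ones.

```python
def cla4(a, b, c0):
    """4-bit carry-look-ahead block. Returns (4-bit sum, carry out)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]    # generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]  # propagate
    # Every carry is expressed directly in terms of g, p and c0 (look-ahead).
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    carries = [c0, c1, c2, c3]
    s = 0
    for i in range(4):
        s |= (p[i] ^ carries[i]) << i
    return s, c4


def cla_add(a, b, c_in=0, width=64):
    """Chain 4-bit CLA blocks in series. Returns (sum, flags)."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    total, carry = 0, c_in
    for k in range(0, width, 4):
        s, carry = cla4((a >> k) & 0xF, (b >> k) & 0xF, carry)
        total |= s << k
    sf = (total >> (width - 1)) & 1
    # Signed overflow: operands share a sign that differs from the result's sign.
    of = ((~(a ^ b) & (a ^ total)) >> (width - 1)) & 1
    return total, {"CF": carry, "OF": of, "SF": sf}


print(cla_add(1, 2))
print(cla_add(2**63 - 1, 1))
```

The second call adds 1 to the largest positive 64-bit value, which sets OF and SF while leaving CF clear.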

Shifter design
First, a masking algorithm is used to distinguish between 64-bit and 32-bit shifts and, at the same time, to realize the arithmetic right shift, which reduces power consumption.
Second, to reduce the area of the whole shifter module, a single shifter is used for both the logical right shift and the logical left shift, as shown in Figure 7. The module first determines whether the shift given by shift_num is to the left or to the right. A right shift is output directly by the shifter. For a left shift, the operand is first bitwise reversed by the inverter, that is, its bit order is mirrored; the reversed value is shifted by the shifter, and the shift result is passed through an inverter again to produce the left-shift result.
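One way a mask can produce an arithmetic right shift from a logical one is sketched below. The paper does not give the details of its masking algorithm, so this is an assumed formulation: the function name `sra_mask` is hypothetical, the `width` parameter selects the 64-bit or 32-bit operation, and the 32-bit result is sign-extended to 64 bits as RV64 word instructions do.

```python
MASK64 = (1 << 64) - 1


def sra_mask(value, shamt, width=64):
    """Arithmetic right shift built from a logical right shift plus a sign mask."""
    word_mask = (1 << width) - 1
    value &= word_mask            # keep only the 64-bit or 32-bit operand
    shamt &= width - 1            # 6-bit or 5-bit shift amount
    logical = value >> shamt      # logical right shift fills with zeros
    sign = (value >> (width - 1)) & 1
    # Mask with ones in the top `shamt` bit positions of the word.
    fill = (word_mask ^ (word_mask >> shamt)) if sign else 0
    result = logical | fill
    # A negative 32-bit result is sign-extended into the upper 32 bits.
    if width == 32 and (result >> 31) & 1:
        result |= MASK64 ^ 0xFFFFFFFF
    return result


print(hex(sra_mask(0x8000000000000000, 4)))      # 64-bit negative operand
print(hex(sra_mask(0x80000000, 4, width=32)))    # 32-bit negative operand
print(hex(sra_mask(0x7FFFFFFF, 4, width=32)))    # 32-bit positive operand
```

With this formulation the same logical shifter serves both logical and arithmetic right shifts, and only the mask differs between the 64-bit and 32-bit cases.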
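Sharing one right shifter between both directions relies on the identity that a left shift equals a right shift applied to the bit-mirrored operand, mirrored back afterwards. The sketch below is a behavioural model of that identity; the function names and the `left` flag are assumptions for illustration, and "inverter" is interpreted here as a bit-order reversal stage.

```python
WIDTH = 64
MASK = (1 << WIDTH) - 1


def reverse_bits(x, width=WIDTH):
    """Mirror the bit order: bit i moves to bit (width - 1 - i)."""
    bits = format(x & ((1 << width) - 1), f"0{width}b")
    return int(bits[::-1], 2)


def right_shifter(x, shamt):
    """The single shared logical right shifter."""
    return (x & MASK) >> (shamt & (WIDTH - 1))


def shift(x, shamt, left):
    """Logical shift in either direction using only `right_shifter`."""
    if not left:
        return right_shifter(x, shamt)
    # Left shift: mirror, shift right, mirror back.
    return reverse_bits(right_shifter(reverse_bits(x), shamt))


print(hex(shift(0xFF, 4, left=False)))
print(hex(shift(0x1, 4, left=True)))
```

In hardware the bit mirroring is only wiring plus multiplexers, which is why replacing a second shifter with two reversal stages can save area.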

Data comparison before and after optimization
The power consumption and area data are obtained by DC synthesis. The typical conditions are 25 °C and 1 V, the slow conditions are 125 °C and 0.9 V, and the fast conditions are -40 °C and 1.1 V. Table 1 shows the comparison results before and after optimization, in which clock_G and clock_UG denote the ALU with and without a gated clock, respectively. The results show that adding a gated clock reduces power consumption by about 13% under slow conditions.
On the basis of adding a gated clock, we modify the adder and the comparator. The original structure has no dedicated adder or comparator; both are implemented only by IP, and this version is named com_IP. After analysis, using a CLA reduces power consumption (com_A), and reusing the CLA in the comparison module reduces both power consumption and area (com_RA). Under slow conditions, the area is reduced by 36% and the power consumption is reduced by 13%.
On the basis of the previous two optimizations, the shifter is optimized. At first, the shift module uses only IP, which is called shift_IP. Then the mask algorithm is used to realize the arithmetic right shift, which is called shift_M. After that, inverters are added to reduce the number of shifters, which is called shift_MR and further reduces the area and power consumption. Under slow conditions, the area decreases by 16% and the power consumption decreases by 12%. The power consumption and area of the proposed ALU are greatly reduced within a single cycle, with little added delay. Under extreme conditions, the area is reduced by 46% and the power consumption is reduced by about 48%.

Conclusion
This paper further optimizes area and power consumption on the basis of a traditional ALU. A gated clock is used in the proposed ALU, which reduces the overall dynamic power consumption. In the comparison module, a CLA is designed and reused, which reduces the area. A mask and inverters are added to multiplex the shifter, which reduces the area and power consumption. With these optimizations, the area is reduced by 46% and the power consumption is reduced by about 48% under extreme conditions. When the temperature is 25 °C and the voltage is 1 V, the frequency reaches 556 MHz.

Figure 2. ALU with the added gating unit and structural optimization.

Figure 3. Implementation of the gated clock in ALU.

Figure 6. RTL that implements part of the shift instructions. The slt, sltu, bge, and bgeu instructions in the comparison module are realized by multiplexing the CF, OF, and SF flag bits of the proposed adder.
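How the four comparisons can be derived from the adder flags is sketched below. This is an assumed formulation rather than the paper's RTL: the subtraction a - b is computed as a + ~b + 1, Python integer addition stands in for the CLA, CF is taken as the carry out of that addition (1 means no borrow), and the helper name `sub_flags` is hypothetical.

```python
MASK64 = (1 << 64) - 1


def sub_flags(a, b):
    """Compute a - b as a + ~b + 1 and return the (CF, OF, SF) flags."""
    a &= MASK64
    b &= MASK64
    raw = a + (~b & MASK64) + 1
    diff = raw & MASK64
    cf = raw >> 64                            # carry out: 1 means no borrow
    sf = diff >> 63                           # sign bit of the difference
    # Overflow: operands have different signs and the result's sign differs from a's.
    of = (((a ^ b) & (a ^ diff)) >> 63) & 1
    return cf, of, sf


def slt(a, b):
    """Signed a < b: the sign flag differs from the overflow flag."""
    _, of, sf = sub_flags(a, b)
    return sf ^ of


def sltu(a, b):
    """Unsigned a < b: the subtraction borrowed, so the carry is clear."""
    cf, _, _ = sub_flags(a, b)
    return 1 - cf


def bge(a, b):
    """Signed a >= b (branch taken): complement of slt."""
    return 1 - slt(a, b)


def bgeu(a, b):
    """Unsigned a >= b (branch taken): complement of sltu."""
    return 1 - sltu(a, b)


minus_one = MASK64  # two's-complement encoding of -1
print(slt(minus_one, 1), sltu(minus_one, 1), bge(5, 5), bgeu(1, minus_one))
```

All four results come from one subtraction, so a single adder can serve the arithmetic and the comparison instructions.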

Table 1. Comparison results of data before and after optimization.

After a series of optimizations, the initial and final states of the ALU are compared and analyzed, as shown in Table 2. The ALU before optimization is called the traditional ALU, and the ALU after the module optimizations proposed in this paper is named the modular optimization ALU.

Table 2. Comparison of the proposed ALU and the traditional ALU.