Design and Implementation of RISC-V Based Pipelined Multiplier

Among the many microprocessor architectures, RISC-V, an open-source instruction set, is steadily gaining popularity in academia and industry. The performance of the multiplier in a microprocessor constrains the computational power of the processor. To improve the efficiency of multiplication instructions, a pipelined multiplier is implemented in this paper. First, the partial products are generated using Radix-4 Booth encoding. Second, a Wallace tree structure is used to accelerate the compression of the partial products. Then, a parallel prefix adder is used to sum the compressed partial products, which improves the timing. Finally, registers are added as pipeline stages to achieve high-efficiency multiplication. With the operating voltage and temperature set to typical conditions, the area of the synthesized multiplier is 50260.6 μm², and the power consumption is 20.41 mW. The final frequency of the multiplier is 1 GHz in gate-level simulation.


Introduction
RISC-V is an open instruction set architecture. Its instructions are concise, their encoding is novel, and the indices of the general-purpose registers used by an instruction are placed at fixed positions, which makes decoding in the processor more convenient. Arithmetic units have always been an essential part of microprocessor chips. As an important computing unit in a microprocessor, the multiplier plays a major role in improving overall performance. The multiplier path is usually the critical path for data processing in a microprocessor.
The main frequency of the microprocessor is determined by the frequency at which the multiplier completes a set of data calculations [2]. In 1951, A. D. Booth proposed the Booth encoding method [3], and in 1961 O. L. MacSorley improved the Booth algorithm; the improved version is also known as the Radix-4 Booth algorithm. This method reduces the number of partial products by half. In 1964, Wallace introduced the well-known Wallace tree structure for compressing partial products, further reducing the number of counters [4]. To alleviate the complex interconnections in the Wallace tree, Itoh implemented a 600 MHz two-stage multiplier [5]. In pursuit of low power consumption, Tanya Mendez proposed a pipelined Vedic multiplier [6]. Although it has lower power and delay, it does not fit the RISC-V instructions. All of the multipliers above share a common problem: significant delay in the last-stage adder. Extending pipelining to other types of multipliers, T. Krishnan and S. Saravanan proposed a pipelined floating-point multiplier in 2020 [7].
Because the multiplier has a long calculation cycle, computing multiple sets of multiplications can cause resource dependencies and block the data path. To solve this problem, this paper proposes a pipelined multiplier. However, the last stage must add wide operands, and a carry-lookahead adder (CLA) introduces a large carry-propagation delay. Therefore, the design also optimizes the final-stage adder by using a parallel prefix adder with lower latency. Compared with the traditional CLA multiplier, the frequency of the proposed multiplier increases by nearly 100%. Section 2 describes the method of the pipeline multiplier. Section 3 verifies the improvement in multiplier performance through synthesis and simulation. Section 4 summarizes the content of this paper and the results of the simulation.

Proposed pipeline-multiplier
The traditional multiplier uses the Radix-4 Booth algorithm to generate the partial products. After the partial products are compressed with a Wallace tree, the summation is executed with a serial adder. To improve performance, this paper proposes a pipeline multiplier that uses a parallel prefix adder to shorten the timing path and divides the worst path by adding pipeline registers. A comparison of the two multiplier structures is shown in Figure 1.
Figure 1. The structure of the multiplier before (a) and after (b) using a parallel prefix adder and dividing the worst path by adding a pipeline.

Radix-4 Booth encoding
In the proposed multiplier, the Radix-4 modified Booth encoding shown in Table 1 is used. An n-bit multiplier produces n/2 partial products. Therefore, it is more efficient than traditional Booth encoding [8].
Table 1. Radix-4 modified Booth encoding (MBE).
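As an illustration of how an n-bit multiplier yields n/2 partial products, the following is a minimal behavioral sketch of the standard Radix-4 modified Booth recoding. It is not the authors' RTL; the function names `booth_radix4_digits` and `booth_multiply` are hypothetical, and the sketch assumes an even bit width and two's-complement operands.

```python
def booth_radix4_digits(multiplier: int, n: int) -> list[int]:
    """Recode an n-bit two's-complement multiplier into n/2 digits in {-2,-1,0,1,2}."""
    if n % 2 != 0:
        raise ValueError("n must be even for Radix-4 recoding")
    y = multiplier & ((1 << n) - 1)  # keep the low n bits (two's-complement view)
    y <<= 1                          # append the implicit bit y[-1] = 0
    digits = []
    for i in range(n // 2):
        window = (y >> (2 * i)) & 0b111  # overlapping bits y[2i+1], y[2i], y[2i-1]
        hi, mid, lo = (window >> 2) & 1, (window >> 1) & 1, window & 1
        digits.append(-2 * hi + mid + lo)
    return digits


def booth_multiply(a: int, b: int, n: int) -> int:
    """Signed product as the sum of n/2 partial products a * d_i * 4**i."""
    digits = booth_radix4_digits(b, n)
    return sum((a * d) << (2 * i) for i, d in enumerate(digits))
```

For example, `booth_radix4_digits(6, 4)` gives the two digits `[-2, 2]`, since 6 = -2 + 2·4, so a 4-bit multiplier needs only two partial products.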

Partial product summation
A 3-2 compressor is mainly used to compress the partial products and produce the sum of each bit; the circuit is shown in Figure 2. The Cout generated by each bit of compression is used as the Cin of the next stage. The carry-out (C) generated by the last stage of each bit is combined with the carry-outs generated by the other bits to form one addend of the final summation. The sum (S) generated by each bit's compression circuit is combined with the sums generated by the other bits to form the other addend, as shown in Figure 3.
Figure 3. Compression.
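The compression described above can be modeled at word level as follows. This is a simplified sketch under stated assumptions: rows are non-negative integers, each 3-2 compressor is modeled as a bitwise full adder, and rows are grouped three at a time per layer. The exact wiring of the authors' tree (Figures 2 and 3) is not reproduced, and the names `compress_3_2` and `reduce_to_two_rows` are hypothetical.

```python
def compress_3_2(x: int, y: int, z: int) -> tuple[int, int]:
    """Bitwise 3-2 compression: x + y + z == sum_row + carry_row."""
    sum_row = x ^ y ^ z                              # per-bit sum, no carry propagation
    carry_row = ((x & y) | (x & z) | (y & z)) << 1   # per-bit carry, weighted one bit higher
    return sum_row, carry_row


def reduce_to_two_rows(rows: list[int]) -> tuple[int, int]:
    """Apply layers of 3-2 compressors until only two addends remain."""
    rows = list(rows)
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3   # rows that form complete groups of three
        next_rows = []
        for i in range(0, full, 3):
            next_rows.extend(compress_3_2(*rows[i:i + 3]))
        next_rows.extend(rows[full:])      # leftover rows pass through to the next layer
        rows = next_rows
    while len(rows) < 2:
        rows.append(0)
    return rows[0], rows[1]
```

The two returned rows correspond to the S and C addends that are handed to the final adder; their ordinary sum equals the sum of all input rows.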

Sklansky adder
Compressing the partial products eventually yields two rows of final operands. A CLA is slow and would lengthen the timing path of the entire multiplier. Therefore, a Sklansky adder (SKA) is selected to compute the final-stage result. The SKA first calculates the generate and propagate signals g_i and p_i for every bit. The node units at each level are then calculated from these signals.
Finally, the sum and the carry are obtained. As shown in Figure 4, the SKA reduces the delay in calculating the intermediate prefixes. However, its fan-out increases at each stage [9].
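The steps above (bitwise g_i and p_i, prefix nodes level by level, then sum and carry) can be sketched behaviorally as below. This is an illustrative software model of a generic Sklansky prefix network, not the authors' circuit; the function name `sklansky_add` and the default width of 64 bits are assumptions.

```python
def sklansky_add(a: int, b: int, width: int = 64, cin: int = 0) -> tuple[int, int]:
    """Add two width-bit numbers with a Sklansky prefix network; returns (sum, carry_out)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]    # bitwise generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # bitwise propagate
    G, P = g[:], p[:]                                      # group signals, updated per level
    span = 1
    while span < width:
        for i in range(width):
            if i & span:                       # node lies in the upper half of its block
                j = (i // span) * span - 1     # top node of the lower half (shared source)
                G[i] = G[i] | (P[i] & G[j])
                P[i] = P[i] & P[j]
        span *= 2                              # fan-out of node j doubles at each level
    carry = cin                                # carry into bit 0
    total = 0
    for i in range(width):
        total |= (p[i] ^ carry) << i
        carry = G[i] | (P[i] & cin)            # carry into bit i + 1
    return total, carry
```

In each level, the single node `j` drives every node in the upper half of its block, which is the growing fan-out noted in the text; the number of levels is about log2 of the width.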

RISC-V based multiplier
To make the multiplier better suited to RISC-V, the multiplier and the multiplicand are selected by a multiplexer before the calculation. If they are signed numbers, the operand is inverted and incremented by 1. If not, the operand is sent directly to the multiplier through the multiplexer for calculation. Finally, the sign of the result must also be determined before the result of the calculation is sent to the register. The circuit structure is shown in Figure 5.
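The operand selection and sign handling can be sketched as follows. This is one possible reading, not the circuit of Figure 5: it assumes XLEN = 32, assumes the invert-and-add-1 step is applied only when a signed operand has its sign bit set, and uses Python's `*` as a stand-in for the unsigned multiplier core. The names `to_magnitude` and `multiply` are hypothetical. In the RISC-V M extension, MULH treats both operands as signed, MULHSU treats rs1 as signed and rs2 as unsigned, and MULHU treats both as unsigned.

```python
XLEN = 32                      # assumed register width
MASK = (1 << XLEN) - 1


def to_magnitude(x: int, signed: bool) -> tuple[int, bool]:
    """Return (magnitude, is_negative) for an XLEN-bit register value."""
    x &= MASK
    negative = signed and bool((x >> (XLEN - 1)) & 1)
    if negative:
        x = (~x + 1) & MASK    # invert and add 1 (two's complement)
    return x, negative


def multiply(rs1: int, rs2: int, rs1_signed: bool, rs2_signed: bool) -> tuple[int, int]:
    """Return (low word, high word) of the 2*XLEN-bit product."""
    m1, n1 = to_magnitude(rs1, rs1_signed)
    m2, n2 = to_magnitude(rs2, rs2_signed)
    product = m1 * m2                                   # stand-in for the unsigned core
    if n1 != n2:                                        # result is negative
        product = (~product + 1) & ((1 << (2 * XLEN)) - 1)
    return product & MASK, (product >> XLEN) & MASK
```

With this model, `multiply(a, b, True, True)[1]` corresponds to MULH, `multiply(a, b, True, False)[1]` to MULHSU, `multiply(a, b, False, False)[1]` to MULHU, and the low word of any variant to MUL.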

Pipeline structure
Pipelining is an essential method of trading area for performance, or space for time. Because the pipeline stages do not reuse each other's resources, the area increases. However, pipelining optimizes timing and increases throughput because the processor performs different operations in different pipeline stages at the same time [10]. In Figure 6, where no pipeline is added, each instruction must wait for all three clock cycles of the previous one. In Figure 7, with the pipeline added, a result is produced every cycle, which increases the completion rate. A pipelined implementation of the proposed multiplier is shown in Figure 8.
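The effect on cycle counts can be worked through with a small model. It assumes the three-cycle operation mentioned above, an idealized pipeline with one new operation issued per cycle and no stalls, and hypothetical function names.

```python
def cycles_without_pipeline(n_ops: int, cycles_per_op: int = 3) -> int:
    """Each operation occupies the whole multiplier until it finishes."""
    return n_ops * cycles_per_op


def cycles_with_pipeline(n_ops: int, stages: int = 3) -> int:
    """The first result appears after `stages` cycles, then one result per cycle."""
    if n_ops == 0:
        return 0
    return stages + (n_ops - 1)
```

For ten operations this model gives 30 cycles without the pipeline and 12 cycles with it; once the pipeline is full, one result completes in every cycle.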

Stability Analysis
Synthesizing the non-pipelined multiplier with the final adder built first as a CLA and then as a Sklansky (SK) adder shows that the parallel prefix adder improves the frequency of the whole multiplication. The performance of the three multipliers is listed in Table 2. The addition of a pipeline leads to an increase in area and power consumption due to the introduction of registers, but at the same time, the timing is improved due to the shortening of the worst-case path.

EEICE-2023
Journal of Physics: Conference Series 2625 (2023) 012006 IOP Publishing doi:10.1088/1742-6596/2625/1/012006

Gate-level simulation results
The simulation is executed by applying ten sets of operands to the three multipliers. The CLA multiplier takes 19.6 ns to execute ten multiplications (Figure 9), and the SK multiplier takes 19.4 ns (Figure 10). For the proposed pipeline multiplier, the clock period is 1 ns, and one set of stimuli is input in each cycle. It takes 10 ns to generate ten sets of results, so the multiplier reaches a frequency of 1 GHz. The frequency results are shown in Figure 11.
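The reported times translate into throughput figures as in the short calculation below. It uses only the three simulation times quoted above; the helper name `throughput_hz` and the dictionary keys are illustrative.

```python
def throughput_hz(n_ops: int, total_ns: float) -> float:
    """Average number of completed multiplications per second."""
    return n_ops / (total_ns * 1e-9)


# Total time for ten multiplications, as reported in the gate-level simulation.
times_ns = {"CLA": 19.6, "SK": 19.4, "pipeline": 10.0}

for name, total_ns in times_ns.items():
    print(f"{name}: {throughput_hz(10, total_ns) / 1e6:.0f} MHz")

# Ratio of CLA time to pipelined time for the same ten multiplications.
speedup_over_cla = times_ns["CLA"] / times_ns["pipeline"]
```

The ratio of 19.6 ns to 10 ns is 1.96, which is consistent with the statement that the pipeline multiplier nearly doubles the frequency of the CLA multiplier.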

Conclusion
In this paper, the structure of a multiplier circuit based on the RISC-V instruction set architecture is optimized to address the low efficiency of instruction execution. Since the last-stage adder must add wide operands and a conventional CLA has a large latency, the structure of the last-stage adder is optimized by using a parallel prefix adder to reduce the latency. In addition, a pipeline is added to improve the multiplier operation speed and to solve the problem of the delay caused by the worst path.
The multiplier was synthesized in a 90 nm process, and the optimized structure improves instruction execution efficiency. Compared with the traditional CLA multiplier, the proposed multiplier increases area by 19.7% and power consumption by 44.6%. Although the proposed multiplier increases area by 6.5% and power consumption by 9.8% compared with the SK multiplier without a pipeline, it nearly doubles the frequency of the CLA multiplier. It can therefore increase the throughput of instruction execution, which substantially improves the computing performance of the processor.

Figure 2. Compression circuit for each bit of partial product.

Figure 8. The circuit of the pipeline multiplier.

Table 2. Performance comparison of three multipliers.

Table 3. VCS simulation result.