An RSFQ flexible-precision multiplier utilizing bit-level processing

An RSFQ flexible-precision multiplier is proposed. The circuit can perform multiplication with a specified bit-width within a predefined range. The calculation bit-width can be changed for every operation. When the bit-width of a calculation decreases, the latency in cycles is reduced. The proposed circuit calculates the multiplication result with bit-level processing to save circuit area: it carries out multiplication by counting pulses on a signal line. An RSFQ flexible-precision matrix multiplication circuit based on the proposed multiplier is also proposed. Its internal multipliers share many component circuits, so it can be implemented in a compact area.


Introduction
RSFQ circuits [1] are expected to realize energy-efficient high-performance computing systems for the post-Moore era. RSFQ circuits use pulse logic: voltage pulses are used for realizing logic circuits. In the logic design of RSFQ circuits, it is important to consider whether a design can be implemented with pulse logic.
Multiplication is an important arithmetic operation. Many applications involve a large number of multiplications, and some of them tolerate small errors. Recently, neural networks have been utilized to process various intelligent tasks. In the inference phase of neural networks, many multiplications are performed. Thus, implementing many multipliers on a chip is desired. Because the processing tolerates small errors, low-precision or approximate arithmetic circuits [2] and flexible-precision circuits [3,4] that can change calculation precision online have been proposed for CMOS circuits.
In this paper, we propose a flexible-precision multiplier for RSFQ circuits. It treats operand values whose bit-width $v$ is in a predefined range from $n_{\min}$ to $n$. We can specify the precision for each multiplication. For a $v$-bit multiplication, it uses $2^v$ clock cycles. In other words, we can achieve higher multiplication performance when we use lower precision for processes tolerating larger error.
Parallel processing arithmetic circuits can achieve high performance with large layout area and are suitable for ALUs of microprocessors. Several RSFQ parallel processing circuits [5,6] have been proposed. On the other hand, hardware-efficient designs utilizing bit-serial or bit-level processing [7] are suitable for massively parallel applications. Hardware efficiency is also an important factor for implementability of RSFQ circuits. The multiplier utilizes the bit-level processing proposed in our previous paper [7] for hardware efficiency. The bit-level processing converts two operands of a multiplication fed in parallel into two bit-streams on two lines. We only use the two lines for multiplication, and we can implement the multiplier in a compact area easily. Though it performs the multiplication with an AND gate like stochastic computing [8], the multiplication is not stochastic and is carried out deterministically as a truncated multiplication.

Figure 1. Partial product bits summed up in the flexible-precision multiplication. (Legend: bits to be summed up; bits for compensation (ulp).)
In the flexible-precision multiplier, we modify the generation circuits of bit-streams from the original ones to calculate a multiplication with reduced bit-width correctly. We insert some selectors for masking bits.
We also propose a flexible-precision matrix multiplication circuit based on the multiplier. Matrix multiplication is a computational kernel operation used commonly in a wide variety of signal processing applications and neural network applications. Because internal multipliers in the circuit share many component circuits, we can realize the circuit in a compact area.
We designed a layout of the proposed multiplier and a layout of the matrix multiplication circuit for the AIST ADP2 process for evaluation purposes. They can perform 3- and 4-bit multiplication. There have been designs of RSFQ bit-serial multipliers such as [9,10]. The number of Josephson junctions (JJs) in the layout of the proposed multiplier is smaller than in those bit-serial designs. The number of JJs in the proposed matrix multiplication circuit is smaller than the number of JJs in the previously proposed circuit in [11].

Flexible-Precision Multiplication
We consider multiplication of unsigned fixed-point numbers. In the circuits proposed in this paper, we can choose the bit-width $v$ of calculation from a predefined range from $n_{\min}$ to $n$. We denote the width of this range, i.e., $n - n_{\min}$, by $f$. We represent the multiplicand $X$ and the multiplier $Y$ as $[0.x_1 \cdots x_v]_2$ and $[0.y_1 \cdots y_v]_2$, respectively. The multiplication result $Z$ is a $v$-bit fixed-point number $[0.z_1 \cdots z_v]_2$. The unit in the last place (ulp) of the result is $2^{-v}$.
The proposed multiplier performs truncated multiplication. Namely, it discards the lower part of the partial product bits as shown in Fig. 1. It sums up the upper part of the bits enclosed by the solid line in the figure, whose weights are larger than $2^{-v-1}$, and uses the bits whose weight is $2^{-v-1}$, enclosed by the dashed lines, to compensate the result. In this paper, we double the weight of the compensation bits. In other words, we treat the weight of these bits as $2^{-v}$. This corresponds to rounding each partial product to the nearest value. Thus, we can represent the calculation result as
\[ Z = \sum_{t+u \le v} x_t y_u 2^{-(t+u)} + \sum_{t+u = v+1} x_t y_u 2^{-v}. \]
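The truncated multiplication described above can be checked numerically with a short Python sketch. The function names and the integer encoding of the operands (bit $x_1$ stored as the most significant of $v$ bits) are assumptions made for this illustration and are not part of the proposed circuit.

```python
from fractions import Fraction


def truncated_product_ulps(x: int, y: int, v: int) -> int:
    """Return the truncated product of two v-bit fractions in units of 2**-v.

    x and y encode [0.x_1 ... x_v]_2 as integers, with x_1 as the most
    significant bit (an encoding assumed for this sketch).
    """
    total = 0
    for t in range(1, v + 1):
        x_t = (x >> (v - t)) & 1
        for u in range(1, v + 1):
            y_u = (y >> (v - u)) & 1
            if not (x_t and y_u):
                continue
            if t + u <= v:
                # Bit inside the solid line: weight 2**-(t+u).
                total += 2 ** (v - (t + u))
            elif t + u == v + 1:
                # Compensation bit: weight doubled to one ulp.
                total += 1
            # Bits with t + u > v + 1 are discarded.
    return total


def error_ulps(x: int, y: int, v: int) -> Fraction:
    """Signed error of the truncated product against the exact one, in ulp."""
    exact = Fraction(x * y, 2 ** v)
    return truncated_product_ulps(x, y, v) - exact


if __name__ == "__main__":
    v = 3
    x = y = 0b101  # [0.101]_2 = 0.625
    print(truncated_product_ulps(x, y, v))  # 4, i.e. [0.100]_2
    print(float(error_ulps(x, y, v)))       # 0.875
```

For $v = 3$ and $X = Y = [0.101]_2$, the exact product is 3.125 ulp while the truncated multiplication yields 4 ulp, so the error is 0.875 ulp.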

RSFQ Multiplier Utilizing Bit-Level Processing
We previously proposed an RSFQ multiplier in [7]. Its inputs are the $n$-bit fixed-point multiplicand $X (= [0.x_1 x_2 \cdots x_n]_2)$ and the $n$-bit fixed-point multiplier $Y$, and it outputs the $n$-bit fixed-point result $Z$. The bit-width of the inputs and the bit-width of the output are fixed and are not flexible. It performs the same truncated multiplication as the circuits shown in this paper. We show the structure of the multiplier in Fig. 2. There are two bit-generators and each of them consists of a weighted-bits generator and a selector-and-merger. They convert the operands to sequences of bits whose period is $2^n - 1$ cycles. We show the design of the bit-generator used in [7] in Fig. 3. The weighted-bits generator is a leading-one detector. It receives an $n$-bit vector and detects the index of the first "1" in the vector. The lower the index of an output, the more pulses it produces during a period of $2^n - 1$ cycles. The selector-and-merger filters pulses from the weighted-bits generator according to the internal state of the non-destructive readouts (NDROs). We set the operand value as the internal state of the NDROs through $q_1, \ldots, q_n$ to convert the operand value into a sequence.

Figure 2. Design of the bit-level multiplier in [7].

Figure 3. Design of the bit generator in [7]. (Blocks: weighted-bits generator; selector and merger.)
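To illustrate why outputs with a lower index produce more pulses, the following Python sketch counts how often each leading-one output fires. It assumes that the weighted-bits generator sees every non-zero $n$-bit vector exactly once per period, which is what a maximal-length source with a period of $2^n - 1$ cycles would provide; the function names are hypothetical.

```python
from collections import Counter


def leading_one_index(vector):
    """Return the 1-based index of the first 1 in vector, or 0 if there is none."""
    for index, bit in enumerate(vector, start=1):
        if bit:
            return index
    return 0


def pulses_per_output(n):
    """Count how often each output w_h fires over all non-zero n-bit vectors."""
    counts = Counter()
    for state in range(1, 2 ** n):
        vector = [(state >> i) & 1 for i in range(n)]  # vector[0] plays the role of r_1
        counts[leading_one_index(vector)] += 1
    return [counts[h] for h in range(1, n + 1)]


if __name__ == "__main__":
    print(pulses_per_output(4))  # [8, 4, 2, 1]
```

Under this assumption, output $w_h$ fires $2^{n-h}$ times per period, so the outputs carry binary weights and the counts add up to $2^n - 1$.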
The circuit calculates the logic AND of the two sequences. We obtain the result by counting the number of pulses generated by the AND gate. In the sequences, the weight of each bit is the same and is $2^{-n}$. If the AND gate generates a pulse in each cycle of a period of $2^n - 1$ cycles, the result is $[0.11 \cdots 1]_2 (= (2^n - 1) \cdot 2^{-n})$. If the AND gate generates no pulses in a period, the result is $[0.00 \cdots 0]_2$.

Flexible-Precision Multiplier Utilizing Bit-Level Processing
We propose a flexible-precision multiplier. We can choose the bit-width v for every operation. We show its structure first. Then, we explain the steps to calculate the multiplication with it and discuss why it can perform the flexible-precision multiplication. Finally, we also propose a flexible-precision matrix multiplication circuit performing multiple multiplications in parallel.

Structure
We show the structure of the flexible-precision multiplier in Fig. 4. $XI$ and $YI$ are $n$-bit inputs $(xi_1, \cdots, xi_n)$ and $(yi_1, \cdots, yi_n)$, respectively. We connect each bit of $XI$ and $YI$ to the bit with the same subscript of the $Q$ input of the corresponding selector-and-merger. $reset_{XY}$ is an input terminal for resetting the operand values held in the selector-and-mergers. $Z (= [z_n \ldots z_1]_2)$ is the output of the $n$-bit pulse counter. $reset_{counter}$ is an input terminal for resetting the pulse counter. The pulse counter outputs the counted value when it receives a reset pulse. We use a binary counter for the multiplier in place of the linear feedback shift register (LFSR) in the original design in Fig. 2.
We realize the flexible-precision capability by masking the most significant $(n - v)$ bits of the $n$-bit vector fed from the binary counter. The latency in cycles to perform a $v$-bit multiplication is $2^v$. When we mask the most significant $(n - v)$ bits, the binary counter generates all $v$-bit vectors in a period of $2^v$ cycles, while an LFSR does not guarantee this property. We mask the bits by padding 0s in $XI$, and mask the bits fed to the weighted-bits generator for $Y$ by inserting NDROs between the binary counter and the weighted-bits generator. We insert $f$ NDROs at inputs $r_1, \ldots, r_f$ of the weighted-bits generator. We name the NDRO connected to $r_i$ as NDRO$_i$ and name its reset input and its set input as $tr_i$ and $ts_i$, respectively. To configure the multiplier to $v$-bit precision, we feed pulses through $ts_1$ to $ts_{n-v}$ to set the internal state of NDRO$_1$, ..., NDRO$_{n-v}$.

Figure 4. Design of the flexible-precision multiplier.

Algorithm 1. Flexible-precision multiplication with the proposed multiplier. ($[0.z_v z_{v-1} \cdots z_1]_2$ is the result.)
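The coverage property of the masked binary counter can be checked with a small Python sketch. It models the counter as an integer that increments modulo $2^n$ and models masking as keeping only the low $v$ bits; both are simplifying assumptions for illustration, and the function names are hypothetical.

```python
def masked_vectors(n, v, start):
    """Low v bits of an n-bit binary counter over 2**v consecutive cycles."""
    mask = (1 << v) - 1
    return [((start + cycle) % (1 << n)) & mask for cycle in range(1 << v)]


def covers_all_vectors(n, v, start):
    """True if every v-bit vector appears exactly once within the 2**v cycles."""
    return sorted(masked_vectors(n, v, start)) == list(range(1 << v))


if __name__ == "__main__":
    n = 4
    print(all(covers_all_vectors(n, v, start)
              for v in (3, 4)
              for start in range(1 << n)))  # True
```

Because $2^v$ divides $2^n$, any $2^v$ consecutive counter states visit each $v$-bit pattern exactly once, whatever the starting state.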

Operation
We show the process to perform multiplication with the circuit in Algorithm 1. First, we feed a pulse to $reset_{XY}$ to reset the operand values and feed a pulse to each of the $tr$ inputs to reset the precision of calculation. Then we feed a pulse to each of the $ts_1, \ldots, ts_{n-v}$ inputs to set the bit-width. Note that it is not necessary to feed pulses to the $tr$ and $ts$ inputs when we do not change the bit-width from the last multiplication. We set the $v$-bit operands through the $n$-bit $XI$ and $YI$ inputs. We feed the bits of $X$ to the bit inputs with lower subscripts of $Q$ of the selector-and-merger through $XI$, and the bits of $Y$ to the bit inputs with higher subscripts of $Q$ through $YI$. We pad 0s for unused bits of $XI$ and $YI$. We finally feed a pulse to the reset input of the pulse counter and obtain the result.

We now discuss the operation of the multiplier in detail and explain why the circuit performs the multiplication. We can write the outputs of the weighted-bits generator, which has the same function as a leading-one detector, as $w_h = r_h \wedge \overline{r_{h-1}} \wedge \cdots \wedge \overline{r_1} = r_h \wedge \bigwedge_{k<h} \overline{r_k}$. Thus, we can represent the output values of the selector-and-mergers as follows: Here, we let the output for $Y$ be $b_Y$ and let the output for $X$ be $b_X$. We omit some bits in the above formula because we mask the upper bits of the binary counter with NDROs and we pad 0s for unused bits of $XI$.

The output of the AND gate, $b_X \wedge b_Y$, is the logic OR of $P_{t,u}$ for each pair of $x_t$ and $y_u$ ($2 \le t + u \le v + 1$):
\[ b_X \wedge b_Y = \bigvee_{2 \le t+u \le v+1} P_{t,u}. \]
The value of $P_{t,u}$ is determined by the output vector of the binary counter $(s_n, \ldots, s_1)$ and the values of $x_t$ and $y_u$. $P_{t,u}$ takes logic-1 when both $x_t$ and $y_u$ are logic-1 and the output vector of the binary counter is as follows: where "$*$" denotes "don't care".

When the binary counter feeds $2^v$ vectors, $(s_v, \ldots, s_1)$ takes all $v$-bit vectors from $(0, \ldots, 0)$ to $(1, \ldots, 1)$ during the period regardless of the starting state of the binary counter. When both $x_t$ and $y_u$ ($2 \le t + u \le v$) are logic-1 and the binary counter feeds $2^v$ vectors, there are $2^{v-(t+u)}$ vectors that make $P_{t,u}$ logic-1 because there are $v - (t + u)$ don't cares in formula (1). Among the $2^v$ vectors, there is one vector that makes $P_{t,u}$ logic-1 for $t + u = v + 1$ when both $x_t$ and $y_u$ are logic-1. Therefore, when the binary counter feeds $2^v$ vectors, the number of pulses the AND gate outputs is
\[ \sum_{t+u \le v} x_t y_u 2^{v-(t+u)} + \sum_{t+u = v+1} x_t y_u. \]
We consider the weight of each pulse as 1 ulp, i.e., $2^{-v}$. The obtained result is
\[ \sum_{t+u \le v} x_t y_u 2^{-(t+u)} + \sum_{t+u = v+1} x_t y_u 2^{-v}. \]
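The pulse count derived above can be reproduced with a cycle-by-cycle behavioral model in Python. This is a minimal sketch, not the circuit itself: it assumes that the generator for $X$ scans the counter bits from $s_1$ upward and the generator for $Y$ scans them from $s_v$ downward, a wiring chosen here because it is consistent with the derived count. All function names are hypothetical.

```python
def to_bits(value, v):
    """Encode an integer as [b_1, ..., b_v] with b_1 the most significant bit."""
    return [(value >> (v - t)) & 1 for t in range(1, v + 1)]


def weighted_bits(r):
    """Leading-one detector: w_h fires when r_h is 1 and all earlier inputs are 0."""
    outputs, seen_one = [], False
    for bit in r:
        outputs.append(bool(bit) and not seen_one)
        seen_one = seen_one or bool(bit)
    return outputs


def count_pulses(x_bits, y_bits, start=0):
    """Count AND-gate pulses over 2**v cycles of a v-bit binary counter."""
    v = len(x_bits)
    pulses = 0
    for cycle in range(2 ** v):
        state = (start + cycle) % (2 ** v)
        s = [(state >> i) & 1 for i in range(v)]  # s[0] is s_1
        w_x = weighted_bits(s)        # assumption: X side scans from s_1 upward
        w_y = weighted_bits(s[::-1])  # assumption: Y side scans from s_v downward
        b_x = any(q and w for q, w in zip(x_bits, w_x))  # selector-and-merger for X
        b_y = any(q and w for q, w in zip(y_bits, w_y))  # selector-and-merger for Y
        pulses += int(b_x and b_y)
    return pulses


def expected_pulses(x_bits, y_bits):
    """Pulse count given by the closed-form sum in the text."""
    v = len(x_bits)
    total = 0
    for t in range(1, v + 1):
        for u in range(1, v + 1):
            if x_bits[t - 1] and y_bits[u - 1]:
                if t + u <= v:
                    total += 2 ** (v - (t + u))
                elif t + u == v + 1:
                    total += 1
    return total


if __name__ == "__main__":
    x = y = to_bits(0b101, 3)
    print(count_pulses(x, y), expected_pulses(x, y))  # 4 4
```

In this model the count does not depend on the `start` argument, which mirrors the statement that the result is independent of the starting state of the binary counter.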

Flexible-Precision Matrix Multiplication Circuit
We propose a flexible-precision matrix multiplication circuit with the proposed multiplier. We consider matrix multiplication $C = AB$ where each of $A$, $B$, and $C$ is an $m \times m$ matrix with elements $A_{j,k}$, $B_{k,i}$, and $C_{j,i}$ ($0 \le i, j, k < m$). Each element of the input matrices is a $v$-bit fixed-point number ($n_{\min} \le v \le n$).

We show the structure of the circuit in Fig. 5. In the figure, we omit several control signals such as clock and reset signals. $BI$ and $AI^k$ ($0 \le k < m$) are $n$-bit inputs $(bi_1, \cdots, bi_n)$ and $(ai^k_1, \cdots, ai^k_n)$, respectively. We extend the bit-width of the pulse counters to $(n + l)$ bits where $l = \lceil \log_2 m \rceil$. $CO^k$ ($0 \le k < m$) are the $(n + l)$-bit outputs $[co^k_{n+l} \cdots co^k_1]_2$ of the pulse counters. The circuit calculates a column of the resultant matrix by carrying out the following calculation.
\[ \begin{pmatrix} C_{0,i} \\ \vdots \\ C_{m-1,i} \end{pmatrix} = \sum_{k=0}^{m-1} B_{k,i} \begin{pmatrix} A_{0,k} \\ \vdots \\ A_{m-1,k} \end{pmatrix} \tag{2} \]
For each $B_{k,i}$ ($0 \le k < m$), the circuit performs the $m$ multiplications of each term simultaneously. We realize the accumulation of multiplication results with the pulse counters by counting pulses without resetting them during a series of multiplications. The $m$ multipliers share the binary counter and the bit generator for one operand. They also share the weighted-bits generator for the other operand. Thus, the circuit is smaller than a circuit implementing multipliers naively. We previously proposed a matrix multiplication circuit [12] based on the original bit-level multiplier in [7]. The proposed circuit can change the precision in matrix multiplication.

Figure 5. Structure of the flexible-precision matrix multiplication circuit.

Algorithm 2 Flexible-precision matrix multiplication with the proposed circuit.
...
9: Feed a pulse for resetcount and read a column of the result
10: end for

Algorithm 2 shows the steps for calculating matrix multiplication with the circuit. The circuit carries out each term in the right-hand side of formula (2) in $2^v$ clock cycles. In other words, for values fed to its inputs, it performs $AI^0 \cdot BI, \cdots, AI^{m-1} \cdot BI$ in parallel. We obtain the left-hand side of formula (2), i.e., a column of the resultant matrix, in $m \times 2^v$ clock cycles. For the calculation of each term in formula (2), we feed elements of $A$ through the $AI^0, \ldots, AI^{m-1}$ inputs and an element of $B$ through the $BI$ input every $2^v$ clock cycles. We realize accumulation of terms by counting pulses without resetting the pulse counters. Thus, we reset the pulse counters at the end of the calculation of each column, $m \times 2^v$ clock cycles after the beginning of the calculation, and observe the result of the column at the pulse counters. The total number of clock cycles for an $m \times m$ matrix multiplication is $m^2 \times 2^v$.
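The column-by-column schedule described above can be sketched in Python as a behavioral model. The sketch assumes that each multiplier contributes the truncated product in ulp to its pulse counter and that matrix elements are given as $v$-bit integers with the first fractional bit as the most significant bit; the function names are hypothetical, and the model only tracks counts and clock cycles, not the circuit structure.

```python
def truncated_product_ulps(x, y, v):
    """Truncated product of two v-bit fractions in ulp (closed form from the text)."""
    total = 0
    for t in range(1, v + 1):
        for u in range(1, v + 1):
            if (x >> (v - t)) & 1 and (y >> (v - u)) & 1:
                if t + u <= v:
                    total += 2 ** (v - (t + u))
                elif t + u == v + 1:
                    total += 1
    return total


def counter_width(n, m):
    """Pulse counter width n + l with l = ceil(log2(m))."""
    return n + (m - 1).bit_length()


def matrix_multiply(a, b, v):
    """Return (C in ulp, total clock cycles) for m x m matrices a and b."""
    m = len(a)
    c = [[0] * m for _ in range(m)]
    cycles = 0
    for i in range(m):              # one column of C at a time
        counters = [0] * m          # pulse counters are reset once per column
        for k in range(m):          # one term of the column sum per 2**v cycles
            for j in range(m):      # the m multipliers operate in parallel
                counters[j] += truncated_product_ulps(a[j][k], b[k][i], v)
            cycles += 2 ** v
        for j in range(m):
            c[j][i] = counters[j]
    return c, cycles


if __name__ == "__main__":
    m, v = 4, 3
    a = [[0b111] * m for _ in range(m)]
    b = [[0b111] * m for _ in range(m)]
    c, cycles = matrix_multiply(a, b, v)
    print(c[0][0], cycles)  # 28 128
```

With $m = 4$ and $v = 3$, the model takes $4^2 \times 2^3 = 128$ cycles, and the largest accumulated count, 28, fits in $v + l = 5$ bits.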

Layout Designs and Evaluation Results
We have designed a layout of the flexible-precision multiplier and a layout of the matrix multiplication circuit for evaluation purposes. Both designed layouts perform 3-bit and 4-bit multiplication. The designed matrix multiplication circuit is for $4 \times 4$ matrices, i.e., $m = 4$. We used the cell library for the AIST ADP2 process [13]. We show the designed layouts in Figs. 6 and 7. We tuned the delay times in them for 40 GHz operation.
For the layouts, we used the design in Fig. 8 for the binary counter. RTFFB is a resettable toggle flip-flop (TFF) with both inverted and non-inverted outputs. Each RTFFB outputs a pulse every cycle at one of two output terminals alternately. We use those two outputs, dout0 and dout1, as the sum signal and the carry signal, respectively. Each of the NDROs is set by a pulse on the sum signal, i.e., dout0, and is reset by the carry signal, i.e., dout1.
The number of Josephson junctions (JJs) in the multiplier is 1,049 and its area is 0.43 mm$^2$ (0.51 × 0.84 mm). The number of JJs in the matrix multiplication circuit is 2,308 and its area is 1.44 mm$^2$ (1.71 × 0.84 mm). We verified the valid operation of the logic-level designs of those two circuits with the logic-level simulation tool [14] before designing the layouts, and verified their valid operation with Verilog netlists extracted from the layouts. There have been small bit-serial multipliers such as [9,10]. The number of JJs in the 4-bit design [9] utilizing special cells was 1,097 and the number of JJs in the four PEs in [10] was 2,556. Though the figures depend on the design style of the layouts, the number of JJs in the proposed multiplier is smaller. When we implement multiple proposed multipliers, they can share many component circuits. The number of JJs in the matrix multiplication circuit is smaller than the estimated number of JJs in the previously proposed circuit in [11].

The proposed multiplier performs truncated multiplication and its result contains a small error. We evaluated the error of the proposed circuits. We show the result in Table 1, along with the error for rounding the true value to the nearest $v$-bit fixed-point value and the error for rounding towards zero, i.e., cutting the lower bits. When we round to the nearest value or round towards zero, the maximum error is 0.5 ulp or $(1 - 2^{-v})$ ulp, respectively. The maximum error of the proposed multiplier increases from 0.9 ulp to 1.6 ulp depending on the bit-width because the number of partial products, i.e., the rows in Figure 1, increases. An example of a pair of $X$ and $Y$ giving the maximum error in Table 1 is $X = Y = [0.1010 \cdots 101]_2$ for odd $v$, and $X = [0.11010 \cdots 101]_2$ and $Y = [0.1010 \cdots 1011]_2$ for even $v$. The maximum error is not very large up to around 5 bits compared with the maximum error of rounding towards zero. The average error of the multiplier is smaller than the average error of rounding towards zero for all bit-widths.
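An error evaluation of this kind can be sketched in Python by enumerating all operand pairs for a given bit-width. The sketch assumes an exhaustive sweep over all $2^v \times 2^v$ pairs with uniform weighting and absolute errors measured in ulp; how the values in Table 1 were actually obtained is not stated here, so the numbers it produces are not claimed to match the table. The function names are hypothetical.

```python
from fractions import Fraction


def truncated_product_ulps(x, y, v):
    """Truncated product of two v-bit fractions in ulp (closed form from the text)."""
    total = 0
    for t in range(1, v + 1):
        for u in range(1, v + 1):
            if (x >> (v - t)) & 1 and (y >> (v - u)) & 1:
                if t + u <= v:
                    total += 2 ** (v - (t + u))
                elif t + u == v + 1:
                    total += 1
    return total


def error_statistics(v):
    """Return {scheme: (max error, mean error)} in ulp over all operand pairs."""
    errors = {"proposed": [], "nearest": [], "toward_zero": []}
    for x in range(2 ** v):
        for y in range(2 ** v):
            exact = Fraction(x * y, 2 ** v)                  # exact product in ulp
            proposed = truncated_product_ulps(x, y, v)
            nearest = (2 * x * y + 2 ** v) // 2 ** (v + 1)   # round half up
            toward_zero = (x * y) // 2 ** v                  # cut the lower bits
            errors["proposed"].append(abs(proposed - exact))
            errors["nearest"].append(abs(nearest - exact))
            errors["toward_zero"].append(abs(toward_zero - exact))
    return {name: (max(e), sum(e) / len(e)) for name, e in errors.items()}


if __name__ == "__main__":
    for name, (worst, mean) in error_statistics(3).items():
        print(f"{name}: max {float(worst):.3f} ulp, mean {float(mean):.3f} ulp")
```

For $v = 3$, the sketch reproduces the stated bounds of 0.5 ulp for rounding to nearest and $1 - 2^{-3} = 0.875$ ulp for rounding towards zero.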

Conclusion
We proposed a flexible-precision multiplier for RSFQ circuits. It utilizes bit-level processing and extends the previously proposed fixed-precision circuit to perform flexible-precision multiplications. As the bit-width of the operation decreases, the latency in cycles decreases. We can achieve higher multiplication performance for processes tolerating larger error. We also showed a flexible-precision matrix multiplication circuit utilizing the proposed multiplier. Because its internal multipliers share many component circuits, we can realize the circuit in a compact area. The number of JJs is small compared with previously proposed designs for RSFQ circuits. The proposed circuits are suitable for applications that tolerate small errors and require many multiplications.