Investigation of timing margin in single-flux-quantum 4 bit adders for increasing clock frequency of gate-level-pipelined circuits

This study investigates the timing margin required to handle fluctuations and variations in superconductor single-flux-quantum gate-level-pipelined adders; a smaller timing margin would improve the clock frequencies of gate-level-pipelined circuits. To evaluate timing margins, we demonstrated three 4 bit adders with 50-, 75-, and 100 GHz target clock frequencies using a 1.0 μm process. We estimated that the required timing margin of the adders was 2.1 ps. This indicates that previously reported gate-level-pipelined circuits operating at 30–60 GHz could operate at higher clock frequencies by reducing the timing margins.


Since the early 2000s, the clock frequency of CMOS ICs has remained in the range of a few GHz due to the power-wall problem. 1) Superconductor single-flux-quantum (SFQ) logic 2) and the associated energy-efficient logic families 3–7) show good promise to break this status quo. SFQ logic can enhance clock frequency because voltage-pulse logic with fast (10⁻¹² s) and low-energy (10⁻¹⁹ J) switching of Josephson junctions (JJs) mitigates the power-wall problem. For example, a recent study showed that an SFQ-based neural-network accelerator based on a currently available 1.0 μm process 8) operated at 52.6 GHz and 1.9 W, 9) whereas a CMOS-based neural-network accelerator operated at 0.7 GHz and 40 W. 10)
Gate-level pipelining 11,12) can maximize the throughput of SFQ circuits. Unlike CMOS logic circuits, additional pipeline registers are unnecessary because SFQ logic gates are clocked gates with latch functions. 13–17) The throughput of gate-level pipelining is directly related to the clock frequency. It is possible to increase the clock frequencies of SFQ circuits by reducing the clock-data and data-clock intervals, that is, by controlling the clock and data arrival times at each logic gate. In principle, the clock cycle can be reduced to the largest sum of setup and hold times among the logic gates. However, we must add a timing margin to handle fluctuations and variations, such as timing jitter 18–21) and circuit parameter spread in the fabrication process; 8,22) this is essential for physical implementation. Because it is very difficult to quantify the delays with timing fluctuations and variations in all paths, the timing margins of conventional gate-level-pipelined circuits were set rather wide (e.g. 10 ps), which limits the clock frequency to 30–60 GHz.
In this study, we investigated the timing margins using 4 bit adders to further increase the clock frequencies of gate-level-pipelined circuits. We designed and fabricated three adders with 50-, 75-, and 100 GHz target clock frequencies and evaluated the bias margins. A clock frequency of 50 GHz is the typical target for circuits fabricated using the 1.0 μm process. The target frequency of 100 GHz was chosen as a clock frequency near the upper limit determined experimentally. 23)
First, we explain the wiring technologies used in SFQ circuits from the perspective of timing design. There are two wiring technologies: the Josephson transmission line (JTL) and the passive transmission line 24,25) (PTL). Their timing designs must be considered separately because their timing characteristics differ. A JTL is a chain of superconductor rings with JJs; its delay depends on the supplied bias voltage. JTL-based wirings can also contain various wiring segments, such as an SFQ pulse splitter (fan-out) and a confluence buffer (CB) (merger). Microstripline or stripline structures are used for the PTLs, the delay of which does not depend on the supplied bias voltage. A PTL is mainly used for long-distance wiring because of the rapid data propagation at the speed of light. The PTL delay is determined by its length. In layout design, it is possible to control the PTL delay to an accuracy of <0.2 ps in our design environment. JTL timing design is much more complex. Each JTL-based wire segment exhibits different delay characteristics, even when the numbers of JJs in all paths are the same, rendering precise timing design difficult. Moreover, each logic gate is similar to a JTL but has unique delay characteristics that also depend on the bias voltage.
Next, we explain the strategy of adder design. In this paper, we define "controllable delay" as the delay that is adjustable in simulation, i.e. deterministic and free from timing fluctuations and chip fabrication variations. We tried to reduce the effects of controllable delay by tuning the timing as precisely as possible in layout design. In this study, we obtained the timing parameters (i.e. delay, setup, and hold times) of JTL-based wirings and logic gates using the analog simulator JSIM. 26) The PTL delay was obtained from the experimental result. 27)
Figure 1(a) shows the timing design method in the previous study. 15) We employ concurrent-flow clocking, in which the clock and the data flow in the same direction, because the adder has no feedback loops. The main clock line runs from the first to the last pipeline stage in the middle of the circuit layout. Each pipeline stage has a clock distribution sub-tree that branches from the main clock line and supplies clock signals to all logic gates in the same pipeline stage. We control all delays in clock/data PTLs in the same pipeline stage by equalizing their lengths. We also control the delays of clock/data JTLs in the same pipeline stage by equalizing their delays. In the clock JTL lines, only a slight (less than 1 ps) delay difference arises between clock signals, because all clock lines in the same pipeline stage have the same depth of stacked splitters; this is achieved using dummy splitters 28) in which one branch is terminated by a resistor. In the data JTL lines, the delay in each path, including the delay of the logic gate in the preceding pipeline stage, is controlled by inserting additional JJs that take the delay characteristics of each wire segment into account. However, the data JTL lines in the same pipeline stage exhibit a delay difference on the order of picoseconds, because it is difficult to align the delays of paths that contain various JTL-based wire segments and different logic gates.
As logic gates exhibit unique setup and hold times, the timing window that allows data input is smaller than the clock cycle. We define the timing window as the difference between the clock cycle time and the sum of the setup and hold times. The logic gate with the largest sum of setup and hold times in the designed adders is the XOR gate. The setup and hold times of the XOR gate of our cell library 29) are 3.7 ps and 4.1 ps, respectively, at the design bias voltage (2.5 mV). The timing window of the adder at 100 GHz is therefore only 2.2 ps (a 10 ps clock cycle minus 7.8 ps). As the delay difference associated with the timing design shown in Fig. 1(a) is large compared to the 2.2 ps timing window, we must reduce the data delay differences caused by JTL-based wire segments and logic gates.
Figure 1(b) shows the timing design used in this study. We modified the schematic so that the data paths in each pipeline stage consisted of logic gates and wire segments with the same delay characteristics; this minimized the data delay difference. First, we replaced the CBs with OR gates, as shown by the dotted line in Fig. 1(b), to eliminate the delay difference between JTL data lines with and without a CB. As shown in the Nth-(N+1)th pipeline stage in Fig. 1(b), this allowed us to configure all the JTL data lines using the same wire segment (i.e. there is no branch JTL). Second, we replaced several delay flip-flops (DFFs) with OR/XOR gates (e.g. the OR gate filled with gray) so that each pipeline stage is composed of identical logic gates or gates with similar delay characteristics. One input of each such OR or XOR gate is a dummy input that is terminated to ground. These overheads caused an 18% increase in the number of JJs used in logic gates.
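The timing-window definition given in this section (clock cycle time minus the sum of setup and hold times) can be checked with a short calculation. The following is a minimal sketch, not part of the original design flow: the function and constant names are illustrative, and only the XOR setup and hold times (3.7 ps and 4.1 ps at the 2.5 mV design bias) are taken from the text.

```python
# Illustrative sketch: timing window of a clocked SFQ gate.
# Only the XOR setup/hold values come from the text; names are hypothetical.
XOR_SETUP_PS = 3.7  # XOR setup time at the 2.5 mV design bias
XOR_HOLD_PS = 4.1   # XOR hold time at the 2.5 mV design bias


def clock_period_ps(freq_ghz: float) -> float:
    """Clock cycle time in picoseconds for a clock frequency in GHz."""
    return 1000.0 / freq_ghz


def timing_window_ps(freq_ghz: float,
                     setup_ps: float = XOR_SETUP_PS,
                     hold_ps: float = XOR_HOLD_PS) -> float:
    """Timing window = clock cycle time - (setup time + hold time)."""
    return clock_period_ps(freq_ghz) - (setup_ps + hold_ps)


# Timing windows of the XOR gate at the three target clock frequencies.
for freq in (50.0, 75.0, 100.0):
    print(f"{freq:.0f} GHz: timing window = {timing_window_ps(freq):.2f} ps")
```

At 100 GHz this reproduces the 2.2 ps window quoted above; the much wider windows at 50 and 75 GHz show how quickly the window closes as the clock cycle approaches the sum of setup and hold times.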
Figure 2 shows a schematic of an adder based on the Kogge-Stone adder. 30) The operands are the 4 bit unsigned integers A = (a3, a2, a1, a0) and B = (b3, b2, b1, b0). The output is a 5 bit unsigned integer S = (s4, s3, s2, s1, s0). Six pipeline stages are used. The CBs in the prefix boxes surrounded by the dotted lines were replaced with OR gates. The DFFs in the 3rd, 5th, and 6th pipeline stages were replaced with OR and XOR gates (filled with gray).
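For readers unfamiliar with the Kogge-Stone structure, the logical function of the adder can be sketched as a purely behavioral model. This is an assumption-laden illustration: it reproduces only the 4 bit in, 5 bit out arithmetic of a generic Kogge-Stone parallel-prefix adder and does not model the six pipeline stages, the clocking, or the gate substitutions described above. The function name is hypothetical.

```python
def kogge_stone_add4(a: int, b: int) -> int:
    """Behavioral model of a 4 bit Kogge-Stone adder with a 5 bit result."""
    width = 4
    if not (0 <= a < 16 and 0 <= b < 16):
        raise ValueError("operands must be 4 bit unsigned integers")

    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]

    # Bitwise propagate (XOR) and generate (AND) signals.
    p = [x ^ y for x, y in zip(a_bits, b_bits)]
    g = [x & y for x, y in zip(a_bits, b_bits)]
    half_sum = p[:]  # kept for the final sum XOR

    # Parallel-prefix network: combine spans at distances 1, 2, ...
    dist = 1
    while dist < width:
        new_g, new_p = g[:], p[:]
        for i in range(dist, width):
            new_g[i] = g[i] | (p[i] & g[i - dist])
            new_p[i] = p[i] & p[i - dist]
        g, p = new_g, new_p
        dist *= 2

    # After the prefix network, g[i] is the carry out of bit position i.
    s_bits = [half_sum[0]]
    s_bits += [half_sum[i] ^ g[i - 1] for i in range(1, width)]
    s_bits.append(g[width - 1])  # s4 is the final carry
    return sum(bit << i for i, bit in enumerate(s_bits))


print(kogge_stone_add4(0b1111, 0b0001))  # 15 + 1
```

The final XOR with the incoming carry and the initial propagate XOR are why XOR gates appear in the first and last pipeline stages of such an adder.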
We control the clock and data lines so that all data in the same pipeline stage are input at the center of the timing window at the target clock frequency. For example, when the target clock frequency is 100 GHz, we set the clock-data and data-clock intervals in the sixth pipeline stage to 5.2 ps and 4.8 ps, respectively; these are the XOR gate hold time (4.1 ps) and setup time (3.7 ps), each plus half of the 2.2 ps timing window (1.1 ps). We insert a JTL-based delay (a clock skew) into the main clock line of each pipeline stage to ensure that the clock and data exhibit the same delay dependencies on the supplied bias voltage. We create appropriate clock-data and data-clock intervals by controlling the PTL lengths of the clock lines that branch from the main clock line. This means that the only difference between the 50-, 75-, and 100 GHz adders is the PTL lengths of the clock lines that branch from the main clock line. In other words, the JTL-based delays that depend on the bias voltage are the same in the three adders. We can therefore compare the bias margins of the three adders without any difference in bias-dependent delay among them. The simulated bias margins normalized to the design value for the 50-, 75-, and 100 GHz adders were 80%–125%, 80%–125%, and 84%–125%, respectively, at the target clock frequencies. The wide bias margins show that our timing design can successfully suppress the controllable delay difference. Except for the lower bias margin of the 100 GHz adder, the bias margins were determined not by timing violations but by the operating margins of the cells used in the adder. A timing violation caused by the XOR gates limited the lower bias voltage of the 100 GHz adder.
054501-2 © 2024 The Author(s). Published on behalf of The Japan Society of Applied Physics by IOP Publishing Ltd
Figure 3 shows a micrograph of the test chip containing the 50-, 75-, and 100 GHz adders. Each test circuit contained shift registers (SRs) and an on-chip clock generator (CG) for on-chip high-speed testing. 31) The CG of each adder was designed to generate clock pulses close to the target clock frequency at the design bias voltage. The number of JJs in each adder, excluding the SRs and the CG, was 1,273.
We verified correct operation of all three adders on the same chip. Figure 4 shows oscilloscope screen images captured during on-chip high-speed testing. Figure 5 shows the measured bias margins of the three adders on the same chip. The clock frequency was obtained via JSIM simulation using the measured bias current supplied to the CG and the device parameters. The maximum clock frequencies of the 50-, 75-, and 100 GHz adders were 72, 95, and 101 GHz, respectively.
We discuss the timing margin with reference to the maximum clock frequencies. The maximum clock frequencies of the 50 and 75 GHz adders were approximately 20 GHz higher than their target clock frequencies. As explained before, data are input at the center of the timing window at the target clock frequency; the most restricted timing windows are those of the 1st and 6th pipeline stages, which contain XOR gates. In such situations, the clock-data and data-clock timing windows are relatively wide (50 GHz adder: 6.10 ps, 75 GHz adder: 2.77 ps) at the target clock frequencies and the design bias voltage. When the clock frequency increases, the data-clock timing window decreases, but the clock-data timing window remains constant. For the 50 and 75 GHz adders, the sum of the simulated hold time of the XOR gate, the abovementioned clock-data timing window, and the simulated setup time of the XOR gate is 13.90 ps and 10.57 ps, respectively. These values are almost the same as the clock cycle times at the measured maximum clock frequencies of 72 and 95 GHz. This shows that the data-clock timing windows are rather small at the measured maximum clock frequencies of the 50 and 75 GHz adders. On the other hand, the maximum clock frequency of the 100 GHz adder was almost the same as its target clock frequency. The clock-data and data-clock timing windows at 101 GHz were each 1.05 ps.
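The interval and cycle-time figures in this discussion follow from simple arithmetic, which can be reproduced as a check. The sketch below is a simplified model under stated assumptions: data are centered in the XOR timing window at the target frequency, the clock-data interval then stays fixed, and operation is assumed to fail once the data-clock window shrinks to zero. The function names are hypothetical; the setup and hold times are the values quoted in the text.

```python
# Simplified model; only the XOR setup/hold values come from the text.
XOR_SETUP_PS = 3.7
XOR_HOLD_PS = 4.1


def centered_intervals_ps(target_ghz: float) -> tuple[float, float, float]:
    """Return (half_window, clock_data, data_clock) in ps for centered data."""
    period = 1000.0 / target_ghz
    half_window = (period - XOR_SETUP_PS - XOR_HOLD_PS) / 2.0
    clock_data = XOR_HOLD_PS + half_window   # clock edge to data arrival
    data_clock = XOR_SETUP_PS + half_window  # data arrival to next clock edge
    return half_window, clock_data, data_clock


def zero_margin_max_ghz(target_ghz: float) -> float:
    """Frequency at which the data-clock window closes completely,
    with the clock-data interval fixed at its design value."""
    _, clock_data, _ = centered_intervals_ps(target_ghz)
    min_period = clock_data + XOR_SETUP_PS
    return 1000.0 / min_period


half, cd, dc = centered_intervals_ps(100.0)
print(f"100 GHz design: clock-data {cd:.1f} ps, data-clock {dc:.1f} ps")

for target in (50.0, 75.0):
    half, _, _ = centered_intervals_ps(target)
    print(f"{target:.0f} GHz design: half window {half:.2f} ps, "
          f"zero-margin limit {zero_margin_max_ghz(target):.1f} GHz")
```

Under these assumptions the 50 and 75 GHz designs reach their limits at cycle times of 13.90 ps and about 10.57 ps, close to the measured 72 and 95 GHz, which is the comparison the text draws.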
Given the different timing windows at the maximum clock frequencies of the adders, the required timing margin was 2.1 ps (1.05 ps each for the clock-data and data-clock sides), whereas the numerical calculation showed that the JJ switching time and timing jitter in the critical path were 1.8 and 0.1 ps, respectively. In fact, the bias voltages of the adders at the maximum clock frequency deviated from the design value by ±5%. As the sum of the setup and hold times of the XOR gate differs by 0.5 ps between 95% and 100% bias voltage and by 0.3 ps between 100% and 105% bias voltage, the data-clock and clock-data timing margins may differ by approximately 0.5 ps. Thus, it is important to ensure a timing margin of approximately 3 ps (1.5 ps each for the clock-data and data-clock sides) for adders that are conservatively designed.
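To see what these margins imply for clock frequency, one can invert the timing-window relation. The following is a rough sketch, assuming that the XOR gate (3.7 ps setup, 4.1 ps hold) is the limiting gate and that the minimum clock cycle equals setup plus hold plus the total timing margin; other circuits have other limiting gates, so the 10 ps case is only indicative. The function names are hypothetical.

```python
# Rough estimate; assumes the XOR gate limits the clock cycle.
XOR_SETUP_PS = 3.7
XOR_HOLD_PS = 4.1


def margin_at_ps(freq_ghz: float) -> float:
    """Total timing margin left at a given clock frequency."""
    return 1000.0 / freq_ghz - (XOR_SETUP_PS + XOR_HOLD_PS)


def max_freq_ghz(margin_ps: float) -> float:
    """Highest clock frequency that still leaves the given total margin."""
    return 1000.0 / (XOR_SETUP_PS + XOR_HOLD_PS + margin_ps)


print(f"margin at 101 GHz: {margin_at_ps(101.0):.2f} ps")
for margin in (2.1, 3.0, 10.0):
    print(f"margin {margin:4.1f} ps -> up to {max_freq_ghz(margin):.1f} GHz")
```

With this model a 10 ps margin caps the clock in the mid-50 GHz range, whereas a 3 ps margin allows operation above 90 GHz, in line with the argument that earlier designs were margin-limited.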
The timing margin aside, we discuss why the lower bias margin decreases in adders with higher target clock frequencies. The lower bias margins were limited by timing errors. We consider that the decreases in the lower bias margins were caused by hold time violations in XOR gates as the clock-data timing windows narrowed. We observed that the lower bias margin remained constant, and was thus independent of the clock frequency, for each adder at low clock frequencies (e.g. 26–56 GHz for the 50 GHz adder). When the clock frequency decreased at the same bias voltage, the clock-data timing window remained unchanged, but the data-clock timing window increased. As the timing errors were independent of the width of the data-clock timing window in this region, they must have been hold time violations. The most likely sources of the violations are the XOR gates, because an XOR gate has the largest sum of setup and hold times. Although we carefully tuned the timings, it was impossible to completely equalize the clock and data delay dependencies on the supplied bias voltage because of gate delay. The input timing difference between the clock and data corresponds to the delay difference between the two paths that start at the splitter in the main clock line of the preceding pipeline stage and end at the clock input and the data input, respectively, of the logic gate in the subsequent pipeline stage. It is possible to equalize the JTL-based wire segments of the clock and data paths by inserting dummy wire segments; however, the presence of a logic gate in one path and its absence in the other remains as a difference between the clock and data paths. This difference narrows the clock-data timing window as the bias voltage is lowered, in addition to the increased setup and hold times of the logic gates. For example, in the 50 GHz adder, the clock-data interval of the XOR gate in the 6th pipeline stage was reduced by approximately 2 ps when the bias voltage decreased to 2.0 mV.
In summary, we investigated the timing margins required to handle uncontrollable fluctuations and variations using 4 bit SFQ gate-level-pipelined adders. We designed adders with 50-, 75-, and 100 GHz target clock frequencies, and the measured maximum operating frequencies were 72, 95, and 101 GHz, respectively. The estimated timing margin of the adders was 2.1 ps. This indicates that earlier gate-level-pipelined circuits operating at 30–60 GHz had excessive timing margins and that higher-clock-frequency operation can be achieved by reducing the timing margins. Investigation of the required timing margins for larger-scale, more complex circuits is left for future work.

Fig. 1. The timing designs of (a) a previous study 15) and (b) the present study. The thick gray lines indicate PTLs; the other wirings are JTLs. The black dots and open circles indicate splitters and confluence buffers, respectively. The boxes with plus marks denote additional JJs inserted for timing control. The boxes marked DFF are delay flip-flops.

Fig. 4. An oscilloscope screen image taken during on-chip high-speed testing.