PicoTDC: a flexible 64 channel TDC with picosecond resolution

The PicoTDC is a 64 channel TDC (Time to Digital Converter) ASIC, with 3 ps or 12 ps time binning, developed at CERN for use in a large variety of high channel count scientific instrumentation. Acquired time measurements are processed on-chip with a programmable buffering and triggered data flow architecture before being read out on 1 to 4 byte-wise readout ports with a total readout bandwidth of up to 10 Gbps. Leading edge, falling edge or leading edge together with pulse width can be captured at burst rates as high as 1.2 GHz and a sustained rate of 320 MHz per channel. Effective single-shot time resolution has been measured to be 3.7 ps RMS across all 64 channels over the full dynamic range, and as good as 1.35 ps RMS when using an on-chip adjustment feature optimized for a specific channel (0.43 ps if averaged over multiple measurements). The PicoTDC is implemented in a 65 nm CMOS process; 20 k chips have been produced and packaged in a 400 pin plastic BGA package.

a known reference pulse shape (given by detector type). This approach is, however, often limited in the number of channels per chip, power consumption, triggering latency, rate and programmable flexibility, and finally by the need for analog samples to be digitized with an (off-chip) ADC [11,12]. In recent years, multiple groups have also proposed implementing 10-100 ps time resolution TDCs in FPGAs [13,14], with a limited number of channels per chip. These often require non-trivial off-line corrections to be applied per channel and can have significant temperature and supply voltage dependencies.

Architecture
A high stability time reference is required to assure high precision and highly stable time measurements at the ps accuracy level and the TDC must assure continuous self-calibration to this time reference, to compensate for possible temperature and power supply variations. This is accomplished using an external (system) clock as absolute time reference, which can conveniently be distributed from a centralized source, generated by a stabilized crystal oscillator, possibly phase locked to an atomic clock. As indicated in figure 1, the PicoTDC uses a 40 MHz reference input clock (reference for all LHC experiments) that is multiplied by a very low jitter L-C based PLL to a 1.28 GHz clock, used to drive a high-resolution Delay Locked Loop (DLL) based time interpolator circuit.
The DLL generates evenly spaced 12.2 ps time taps over the 1.28 GHz clock cycle, which are further interpolated with an analog fine time interpolator to 3.05 ps time taps, as shown further below. The generated 256/64 sampling clocks/taps, with 3/12 ps phase skew, are used to sample the 64 differential hit inputs. The DLL and fine time interpolator are implemented in full custom, driving two banks (an upper and a lower) of 32 channel time digitizers. Signal transitions (leading and/or trailing) in the sampled channels are identified and encoded into an 8 bit fine time binary scale and appended to a 5 bit medium scale counter driven by the 1.28 GHz clock. Time measurements are buffered in a small 4 deep channel derandomizer, to handle short fast bursts, before being passed to the digital part of the chip, which processes hit measurements at 320 MHz. At the entry to the 320 MHz processing, each time measurement is appended with a 13 bit coarse time from a 40 MHz counter. This results in a large dynamic range (204 µs) high-resolution time measurement of 26 bits: 2 bit ultra fine time (3 ps) + 6 bit fine time (12 ps) + 5 bit medium time (0.781 ns) + 13 bit coarse time (25 ns). All fields are tightly phase locked to the 40 MHz input reference, without any time discontinuities, on a convenient linear scale. A de-glitcher can optionally be applied to hit inputs to remove unwanted glitches on the terminated high slew rate differential hit inputs (a differential high slew rate hit signal is required to make reliable ps time measurements). Acquired full range time measurements are buffered in each channel in 512 deep latency/derandomizer buffers, awaiting trigger extraction processing and/or readout. A trigger extraction/matching function, with configurable latency and time window (both with 25 ns resolution), and also an individual time offset per channel (at full resolution), can extract time measurements of interest.
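The 26 bit time word described above can be illustrated with a short sketch. It assumes that the four fields are concatenated with the coarse time as most significant bits and the ultra fine time as least significant bits; this ordering, and the names `pack_time` and `time_ps`, are illustrative assumptions and not taken from the chip documentation. The bin width follows directly from the 40 MHz reference, the multiplication to 1.28 GHz and the 256 time taps per clock cycle.

```python
from fractions import Fraction

REF_CLOCK_HZ = 40_000_000   # 40 MHz reference clock (from the text)
PLL_MULTIPLIER = 32         # 40 MHz x 32 = 1.28 GHz (from the text)
TAPS_PER_CYCLE = 256        # 64 DLL taps x 4 interpolated taps (from the text)

# Field widths from the text; the MSB-first ordering below is an assumption.
FIELD_BITS = (("coarse", 13), ("medium", 5), ("fine", 6), ("ultra_fine", 2))

# One LSB in ps, kept as an exact fraction (3125/1024 ps, about 3.05 ps).
LSB_PS = Fraction(10**12, REF_CLOCK_HZ * PLL_MULTIPLIER * TAPS_PER_CYCLE)


def pack_time(coarse, medium, fine, ultra_fine):
    """Concatenate the four fields into one 26 bit integer (hypothetical layout)."""
    values = {"coarse": coarse, "medium": medium,
              "fine": fine, "ultra_fine": ultra_fine}
    word = 0
    for name, bits in FIELD_BITS:
        value = values[name]
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} does not fit in {bits} bits")
        word = (word << bits) | value
    return word


def time_ps(word):
    """Convert a 26 bit time word to picoseconds on a linear scale."""
    return float(word * LSB_PS)


if __name__ == "__main__":
    print(time_ps(pack_time(0, 0, 0, 1)))   # one ultra fine bin
    print(time_ps(pack_time(0, 0, 1, 0)))   # one DLL tap
    print(time_ps(pack_time(0, 1, 0, 0)))   # one 1.28 GHz clock cycle
    print(time_ps(pack_time(1, 0, 0, 0)))   # one 40 MHz clock cycle
    print(time_ps(1 << 26) / 1e6, "us full range")
```

Because every field is a power-of-two subdivision of the 25 ns reference period, the concatenated word is a plain linear count of 3.05 ps bins, which is why no discontinuities appear at field boundaries.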
Trigger processing is performed such that consecutive triggers can have overlapping time windows, enabling individual time measurements to be extracted multiple times for different triggers (required in certain high rate HEP applications). A dedicated trigger FIFO, containing time tags of triggers, assures that high rate trigger bursts can be handled independently of required processing time to extract hits for each trigger. Extracted time measurements are passed to 512 deep readout FIFOs, each shared by 16 channels (4 channel groups of 16 channels), preceded by trigger/event info. Triggering can be disabled, whereby time measurements are passed into the readout buffers. In case a readout FIFO gets full, the chip can be configured to delete hits or back propagate buffering to the channel buffers.
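The trigger matching with overlapping windows can be sketched as follows. This is a minimal software model under stated assumptions, not the on-chip implementation: the window is assumed to open at the trigger time minus the latency and to stay open for the window length, and the function name `match_hits` is hypothetical. Only the 25 ns granularity of latency and window is taken from the text.

```python
COARSE_NS = 25  # latency and window resolution (from the text)


def match_hits(hit_times_ns, trigger_time_ns, latency_ns, window_ns):
    """Return the hits inside the matching window of one trigger.

    Assumed window definition (not from the text):
    [trigger - latency, trigger - latency + window).
    """
    if latency_ns % COARSE_NS or window_ns % COARSE_NS:
        raise ValueError("latency and window must be multiples of 25 ns")
    start = trigger_time_ns - latency_ns
    stop = start + window_ns
    # Hits stay in the buffer, so a later trigger can select them again.
    return [t for t in hit_times_ns if start <= t < stop]


if __name__ == "__main__":
    hits = [100.0, 130.5, 180.25, 400.0]
    first = match_hits(hits, trigger_time_ns=300, latency_ns=200, window_ns=100)
    second = match_hits(hits, trigger_time_ns=325, latency_ns=200, window_ns=100)
    print(first)   # hits in 100..200 ns
    print(second)  # hits in 125..225 ns, overlapping the first window
```

In this model the hits at 130.5 ns and 180.25 ns are extracted for both triggers, which is the behavior described above for consecutive triggers with overlapping time windows.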
Readout data is organized in 32 bit words containing time measurements of leading or trailing edges or alternatively as a leading edge plus pulse width/TOT. In case of leading edge with TOT pulse width, the resolution and dynamic range of the two fields can be configured according to the needs of the specific application. Leading edge time can optionally be relative to the time of the trigger (at 25 ns level), which reduces required dynamic range of leading edge in many HEP applications. In triggering mode, each event is preceded by an event header (even if no hits found), containing a trigger ID and trigger time tag. 32 bit words are read out on each readout port byte-wise on differential signals at a rate of 320/160/80/40 MHz, giving an effective rate of up to 80 M hit measurements per second per port (320 M hits/s for whole chip). For low rate applications, all readout data can be passed to a single readout port.
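The readout rates quoted above follow from simple arithmetic, sketched below. The function name `port_word_rate` is illustrative; the port width (one byte), word size (32 bits), clock rates and number of ports are taken from the text. Event headers in triggered mode also occupy 32 bit words, so the achievable hit rate in that mode is somewhat lower than the raw word rate.

```python
WORD_BITS = 32        # readout word size (from the text)
PORT_WIDTH_BITS = 8   # byte-wise readout port (from the text)
MAX_PORTS = 4         # up to 4 readout ports (from the text)


def port_word_rate(byte_clock_hz):
    """32 bit words per second carried by one byte-wise port."""
    return byte_clock_hz * PORT_WIDTH_BITS // WORD_BITS


if __name__ == "__main__":
    for clock_mhz in (320, 160, 80, 40):
        words = port_word_rate(clock_mhz * 1_000_000)
        print(f"{clock_mhz} MHz: {words / 1e6:.0f} M words/s per port")
    raw_bits = MAX_PORTS * PORT_WIDTH_BITS * 320_000_000
    print(f"raw bandwidth with 4 ports: {raw_bits / 1e9:.2f} Gbps")
```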
The PicoTDC includes a configurable digital test pulse generator with programmable phase and pulse width, with 3 ps resolution. This can be used to inject calibration pulses into analog hit pre-processing for time calibration of the signal chain.
I2C is used as control and monitoring interface. Configuration registers can be written (and read) and monitoring information (monitoring counters, error status, etc.) can be read during operation.

Fine time interpolator
The ps level time resolution of the PicoTDC is obtained with a two stage time interpolation, based on the low jitter 1.28 GHz clock from the PLL [10]. The PLL is implemented with an L-C based oscillator, with a functional schematic shown in figure 2. The PLL has been measured to have an RMS jitter of 340 fs. An L-C based PLL has a limited locking range, as the varactor used in the PLL control loop has a limited regulation range. This makes it difficult to guarantee locking to the external reference over all process, temperature and voltage corners. The tank capacitor is therefore implemented with a small switchable capacitor bank connected to the varactor diodes. At initialization, an Automatic Frequency Calibration (AFC) state-machine determines the optimal switchable capacitor value, to cover temperature and voltage variations.
A 64 stage DLL is driven from the 1.28 GHz clock and precisely locked to cover a complete clock cycle, achieving 12 ps time binning as shown in figure 3. Each delay stage is made from a differential delay cell, as indicated in figure 4, with a delay controlled by the differential pair tail current (VBN of T1). The differential loads are regulated via a replica delay cell circuit (Vctrl of T5, T6) to assure correct function and appropriate signal amplitude of the delay cell over the required delay range. Additional load transistors (T4, T7) with dynamic R-(parasitic C) feedback speed up the cell to guarantee a 12 ps delay over worst-case process, voltage and temperature conditions. Transistor sizes in the delay cell have been optimized for high speed operation and acceptable mis-match, as this is critical to obtain adequate effective time resolution at the ps level. DLL locking is implemented with a classical bang-bang phase detector, coupled to an analog charge pump controlling the delay cells. Layout and signal distribution have been carefully optimized for precise time tap generation with minimized mis-match, within an acceptable power budget. Configurable time skewing features have been included in the phase detector to compensate for possible static phase error (this has not been used in practice, as the carefully optimized phase detector has been seen to have very small static phase error). A differential delay chain has been used as it has superior speed and power supply rejection, at the cost of higher power consumption. This is, however, negligible compared to the total chip power consumption. Simulations of jitter and mis-match induced DNL and INL along the delay chain are shown in figure 5. Differential signals from the DLL are converted (T8, T9, T10, T11) into single ended time taps that drive a second stage analog R-(C) time interpolation circuit. It is important to notice that the rise time of these signals is longer than the 12 ps interpolation period and that their slew rate is kept nearly constant by the DLL control loop.

JINST 18 P07012
A relatively simple analog R-(parasitic C) weighting circuit [5], as shown in figure 6, is driven by neighboring DLL time taps, generating 4 second-level interpolation time taps with 3 ps binning for each 12 ps time tap from the DLL. By appropriately matching the resistors to the parasitic capacitance loading, such a simple resistive time interpolation works very well when the effective slew rate of the time taps is stabilized by the DLL control loop. To get appropriate interpolation at the beginning and the end of the delay chain, optimized "dummy" delay and interpolation circuits have been added.
In each channel, the 256 time taps sample the distributed and buffered hit signal with 3/12 ps binning. Sampling flip-flops, as shown in figure 7, are optimized for precise and fast resolving sampling (metastability), while at the same time presenting minimal capacitive loading on their clock inputs with acceptable timing mismatch (this effectively determines power consumption and obtained time resolution). Leading and trailing edges are identified in the sampled data and decoded into binary form with a fast and delicate pipeline, using the precisely aligned multi-phase clocks. The hit decoder is constrained to one hit transition per clock cycle, with glitch filtering, finding the first hit transition. This results in an effective channel dead time, and minimum TOT pulse width, of 0.78 ns. Decoded time information is loaded into a 4 deep 1.28 GHz derandomizer FIFO before being passed to the 320 MHz data processing.
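The function of the hit decoder, finding the first transition among the 256 samples of one clock cycle while ignoring glitches, can be modelled in software as below. This is a purely behavioral sketch: the glitch criterion (a new level must persist for `min_run` consecutive samples) and the name `first_transition` are assumptions made for illustration and do not describe the actual pipelined hardware decoder.

```python
LSB_PS = 3.05  # ultra fine time bin (from the text)


def first_transition(samples, previous_level, min_run=2):
    """Return (tap_index, new_level) of the first stable transition, or None.

    samples        -- the 0/1 values captured by the time taps, in tap order
    previous_level -- the hit level at the end of the previous clock cycle
    min_run        -- assumed glitch filter: the new level must hold this long
    """
    for index, sample in enumerate(samples):
        if sample == previous_level:
            continue
        run = samples[index:index + min_run]
        if all(value == sample for value in run):
            return index, sample  # only one transition per cycle is reported
    return None


if __name__ == "__main__":
    cycle = [0] * 256
    cycle[5] = 1                      # single-sample glitch, to be ignored
    cycle[100:] = [1] * 156           # real leading edge at tap 100
    tap, level = first_transition(cycle, previous_level=0)
    print(tap, level, f"{tap * LSB_PS:.1f} ps into the clock cycle")
```

Since only the first transition per 1.28 GHz cycle is reported, a second edge within the same cycle is not decoded, which corresponds to the 0.78 ns dead time stated above.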
The delicate fine time interpolator part of the chip has on purpose been made to have constant power consumption across the clock cycle (and, to the extent possible, also independent of hits) to assure highly stable operation and very low jitter in all parts of the time interpolator. This has come at the cost of power consumption, compared to alternative low power TDC architectures using Vernier type gated coupled oscillators, which have calibration and jitter issues from largely varying activity levels. The effective RMS time resolution of the TDC has been estimated based on the following time uncertainties from detailed circuit simulations:

Implementation
The low jitter PLL was prototyped in a small test chip to verify correct function and characterize its detailed jitter performance, as shown in figure 8. Based on this prototype, the final PLL was implemented with minor improvements. The DLL and analog time interpolator had previously been submitted in a 130 nm CMOS test chip [4]. The implementation in 65 nm has enabled a factor 2 improvement in time resolution, over full PVT (Process, Voltage and Temperature) corners on 64 channels, at half the power, and includes extensive digital buffering with flexible triggering. Delay and sampling cells have been redesigned for low power and reduced mis-match effects.

The DLL, the resistive fine time interpolator, together with the large number of sampling flip-flops, have been implemented in a full custom design flow, with significant effort spent to optimize circuit details for best possible timing and minimal crosstalk, combined with acceptable power consumption. Mis-match analysis was made extensively to assure that no specific part of the design would give unexpected dominating mis-match effects. It was seen that the dominating mis-match comes from the sampling flip-flops (as confirmed in the first prototype chip). Minor improvements were made to the sampling flip-flops, with an acceptable power consumption increase. Timing performance estimates made during the design match the observed measurements relatively well. It was realized that initial mis-match simulations of the sampling flip-flops underestimated their effective mis-match by a factor 2 because of a modeling issue.
The chip and IO are powered by 1.2 V. Hit and clock signals are differential with 1.2 V max amplitude, received by a dedicated differential receiver with on-chip 100 ohm termination, optimized for low jitter reception. Readout ports have 100 ohm differential drivers to prevent possible disturbances (e.g. ground bounce) from propagating to the chip and the hit/clock receivers. The critical low jitter PLL has its own power supply domain that is isolated from the possibly noisy chip substrate using a triple well isolation feature available in the technology used. The time digitizer array with its DLL, resistive time interpolator and channel sampling flip-flops is also a separate power supply domain with on-chip decoupling and triple well isolation. The remainder of the chip, with 320 MHz digital processing, does not have triple well isolation.
The final 65 nm CMOS chip layout can be seen in figure 9, with different parts of the chip clearly visible. The characteristic "keyhole" shape is the L-C structure of the PLL. An initial full chip MPW prototype was submitted in 2018, followed by final production masks in 2019, shared with another CERN chip project. Initial prototype samples were carefully characterized by direct wire-bonding of the naked chip die to a test board as shown in figure 9. After confirmation of correct function and good timing performance, a customized 400 pin BGA package was developed. Unfortunately, because of the ASIC crisis, which also heavily affected the IC packaging industry, a delay of nearly 2 years accumulated before packaged chips were obtained. Testing of packaged chips has confirmed their correct function and good timing performance.

Characterization
The PicoTDC has been extensively timing characterized at both bare die level [3] and in its packaged form with an instrument setup as shown in figure 10: a source of random hits, a 0.5 ps resolution motorized trombone, a Keysight 81134A pulse generator and a high precision and low jitter programmable PLL chip from Silicon Labs [15]. No significant differences have been found between the naked chip die and the chip in the customized BGA package. Detailed code density tests with 3 ps binning are shown in figure 11. Code density tests (with random hits) are very efficient to characterize differential and integral non-linearity, but by their nature filter out jitter and noise effects. The relatively large DNL, with 3 ps binning, is dominated by mis-match effects in the sampling flip-flops. When used as a single/dual channel TDC, the time tap adjustment feature works very well, as can be seen in the right plot of figure 11. When using all channels, only a limited improvement is obtained using the time tap adjustment feature, as mismatch in the sampling registers is dominating, as indicated by the color mapping of the code density test across 32 channels in figure 12. It was considered to significantly improve the sampling flip-flop mis-match, but this would have had a major power consumption impact and was not implemented, as the obtained time resolution is significantly better than what is normally required in HEP applications.
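The evaluation of a code density test can be sketched as follows. With hits uniformly distributed in time, the number of hits collected in each time bin is proportional to the width of that bin, so DNL and INL can be read from the histogram. The function name `dnl_inl` is illustrative, and the INL is taken here as the running sum of the DNL, which is one common convention; the convention used for the figures in this paper is not stated, so this is an assumption.

```python
def dnl_inl(histogram):
    """Return (dnl, inl) in LSB units from a code density histogram.

    histogram -- hit counts per time bin, from hits uniformly random in time
    """
    total = sum(histogram)
    if total == 0:
        raise ValueError("empty histogram")
    ideal = total / len(histogram)       # expected count for an ideal bin
    dnl = [count / ideal - 1.0 for count in histogram]
    inl = []
    running = 0.0
    for value in dnl:
        running += value                 # assumed INL convention: cumulative DNL
        inl.append(running)
    return dnl, inl


if __name__ == "__main__":
    # Toy histogram with 4 bins; a real test would use the 256 bins of one cycle.
    dnl, inl = dnl_inl([50, 150, 100, 100])
    print(dnl)  # first bin too narrow, second bin too wide
    print(inl)
```

Because each hit only increments one bin, jitter on individual hits averages out in the histogram, which is why such a test does not capture jitter and noise effects.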
Effective single shot time resolution of the time interpolator (covering 0.78 ns) has been measured with a high precision and low jitter time sweep as shown in figure 13 and figure 14. This includes in addition to DNL/INL effects also jitter and noise effects and is the best measure of effective TDC time resolution in real applications.
An extended time sweep covering a 25 ns period is shown in figure 14. A small but visible jump is seen at the transition from one 25 ns clock cycle to the next, at a delay of ~8000 ps. This is most likely from the test board/setup, as on-chip cross talk is expected to show a 320 MHz pattern, since the digital logic is running at this frequency. The 320 MHz effect is barely visible as a pattern of 0.78 ns repeated 32 times over the 25 ns delay range. Longer time sweeps have, as expected, not shown any additional visible timing effects, as the dynamic range is expanded with a 40 MHz counter. It must be mentioned that detailed TDC time characterization at the ps level is quite challenging and delicate, as instrumentation, system and test board effects can be very hard to disentangle from real TDC chip effects.
Table 1 below shows a summary of the main timing performance parameters. Entries where the time adjustment feature has been used with optimized parameters (based on DNL/INL code density tests) only give minor improvements when using all channels, as the feature cannot compensate for random sampling register mis-match in the 32 channels sharing the time tap adjust feature. If the adjust feature is used to compensate for mis-match in a single specific channel, then effective single shot resolution can be improved from 3.74 ps to 1.35 ps, including INL, jitter and noise. It was attempted to find an optimized time tap adjustment data set that would give improved time resolution across all channels (so compensating for DNL/INL in the DLL and analog interpolator), as shown in figure 12. This only resulted in a minor improvement from 3.48 ps RMS to ~3.06 ps RMS across all channels, so in practice it is not worth the effort of determining the best possible time tap adjustment parameters.
When averaging is used for repeated measurements (N = 100) together with tuned time tap adjusts for a specific channel, an effective resolution as good as 0.43 ps RMS has been obtained in a time sweep. This is better than what can be expected from the intrinsic time binning of 3.05 ps/√12 = 0.88 ps RMS. This is not a contradiction, as jitter (clock reference, PLL, DLL and hit signal) in this case effectively smears the binning, which, when averaged over multiple measurements, can result in better RMS resolution than given by the intrinsic time quantization. Timing performance has been measured with changing temperature (20-50 °C) and supply voltage (1.10-1.30 V), and these have been confirmed to have only minor effects. A simple time offset shift has been observed (no change of time bin, DNL, INL, jitter) at the level of less than 1 ps/°C and less than 0.5 ps/mV power supply change. The observed time offset shift with temperature/voltage is caused by different sensitivity in the hit signal path (short) and the more complex clocking path (with PLL, DLL, analog interpolator). The hit signal path could artificially have been made longer, to have the same sensitivity as the clock path, but this would risk introducing additional jitter.
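The effect of jitter acting as dither on the time quantization can be illustrated with a small Monte Carlo sketch. It models an ideal TDC with uniform 3.05 ps bins and Gaussian jitter; the jitter value of 1.5 ps and the function names are assumptions chosen for illustration only, and the simulated numbers are not meant to reproduce the measured 0.43 ps RMS, which also includes non-linearity and instrument effects.

```python
import math
import random

LSB_PS = 3.05  # intrinsic time bin (from the text)


def quantize(time_ps):
    """Ideal TDC: report the center of the bin containing the hit time."""
    return (math.floor(time_ps / LSB_PS) + 0.5) * LSB_PS


def rms_error(jitter_ps, n_avg, n_points=400, seed=1):
    """RMS error of the mean of n_avg quantized measurements of the same delay."""
    rng = random.Random(seed)
    squared = 0.0
    for _ in range(n_points):
        true_time = rng.uniform(0.0, 100.0 * LSB_PS)
        measured = [quantize(true_time + rng.gauss(0.0, jitter_ps))
                    for _ in range(n_avg)]
        squared += (sum(measured) / n_avg - true_time) ** 2
    return math.sqrt(squared / n_points)


if __name__ == "__main__":
    print(f"quantization limit LSB/sqrt(12): {LSB_PS / math.sqrt(12):.2f} ps")
    print(f"no jitter, single shot:   {rms_error(0.0, 1):.2f} ps")
    print(f"no jitter, N = 100:       {rms_error(0.0, 100):.2f} ps")
    # 1.5 ps jitter is an assumed value, not a measured PicoTDC parameter.
    print(f"1.5 ps jitter, N = 100:   {rms_error(1.5, 100):.2f} ps")
```

Without jitter, repeated measurements of the same delay always fall in the same bin, so averaging brings no gain; with jitter comparable to the bin size, the measurements spread over neighboring bins and their mean can resolve delays finer than one bin.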
An absolute worst-case cross-talk measurement has been made with a fixed timing for a single channel and all other channels (so 63 channels) being exercised in parallel and swept over a period covering both the rising edge and the falling edge of the aggressor channels. A max time shift of the victim channel of 2 LSBs has been measured, as shown in figure 15, when used in 12 ps binning mode. The leading and trailing edges of the 25 ns pulse width aggressor signal on the 63 channels are clearly visible. It has in practice not been possible to determine where this small cross talk occurs (test board, wire-bonding, chip IO, digitizer array, substrate coupling or simply power supply coupling). With a single channel aggressor, no clearly visible crosstalk has been seen. The power consumption has been measured to be 1.3 W with all 64 channels having 3 ps binning and a hit rate of 1 MHz/channel. In 12 ps binning mode, total chip power is reduced to 0.85 W, which can be further reduced to 0.55 W if only 32 channels are enabled (lower 32 channels disabled).

Conclusions
A 64 channel TDC ASIC with ps time resolution and stability has been developed for HEP and scientific high channel count instrumentation. Extensive on-chip buffering and triggering enables its use at very high hit rates (320 MHz/channel) combined with flexible triggering to select relevant measurements for readout. The PicoTDC has significantly better time resolution than needed in typical HEP applications, so in most cases the 12 ps binning mode (lower power consumption) is sufficient. The 3 ps binning mode is very useful during detector R&D studies and can be used for applications with very high time resolution requirements.
A first batch of 20 k PicoTDCs has been produced and packaged, with a production yield of 94%, and the chips are available to the scientific community with relevant documentation [2]. A small, simple evaluation system with a PicoTDC FMC plug-in card for a commercial FPGA evaluation board has been made available to the HEP community with FPGA firmware and Python based DAQ and characterization software (the same as used for PicoTDC characterization). PicoTDC evaluation systems and chips have been distributed to more than 20 institutes for their R&D on high time resolution sensors and detectors. Multiple instrumentation modules/systems based on the PicoTDC are becoming available [17,18], and other applications are evaluating its possible use.
To the best of our knowledge, no similar TDC is currently available with effective ps time resolution and stability on 64 channels.