A Two-Stage Interpolation Time-to-Digital Converter Implemented in 20 nm and 28 nm FPGAs

Yu Wang, Wujun Xie, Haochang Chen, Chengquan Pei, and David Day-Uei Li

Abstract—This article presents a two-stage interpolation time-to-digital converter (TDC), combining a Vernier gray code oscillator TDC (VGCO-TDC) and a tapped-delay line TDC (TDL-TDC). The proposed TDC uses the Nutt method to achieve a broad measurement range together with a high resolution. It utilizes look-up table (LUT)-based gray code oscillators (GCOs) to build a VGCO-TDC as the first-stage interpolation for fine-time measurements. Then the overtaking residual from the VGCO-TDC is measured by a TDL-TDC to achieve the second-stage interpolation. Owing to the two-stage interpolation architecture, the carry-chain-based delay line only needs to cover the resolution of the VGCO-TDC. Hence, we can reduce the delay-line length and related hardware resource utilization. We implemented and evaluated a 16-channel TDC system in Xilinx 20-nm Kintex-UltraScale and 28-nm Virtex-7 field-programmable gate arrays (FPGAs). The Kintex-UltraScale version achieves an average resolution (least significant bit, LSB) of 4.57 picoseconds (ps) with 4.36 LSB average peak-to-peak differential nonlinearity (DNL_{pk-pk}). The Virtex-7 version achieves an average resolution of 10.05 ps with 2.85 LSB average DNL_{pk-pk}.

Index Terms—Two-stage interpolation, hybrid time-to-digital converter (TDC), field-programmable gate array (FPGA), low hardware utilization.

I. INTRODUCTION

High-resolution time-interval (TI) measurements play a crucial role in time-resolved scientific applications, including particle physics [1]–[5], time-of-flight positron emission tomography (ToF-PET) [6]–[9], Raman spectroscopy [10], [11] and fluorescence lifetime imaging microscopy (FLIM) [12]–[14]. Besides, industrial applications (such as hardware Trojan detection [15], [16], light detection and ranging (LiDAR) [17]–[20], and analog-to-digital conversion [21]–[24]) also benefit from high-quality TI measurements. Hence, time-to-digital converters (TDCs) have attracted considerable attention for their picosecond (ps)-level resolutions.

The Nutt-TDC architecture (including both coarse counters and fine-time measurements) [25] is the mainstream for modern TDCs since it can simultaneously achieve a wide measurement range and a high resolution [26]–[28]. The coarse counter can be easily implemented through a clock-driven counter. So, most research focuses on fine-time measurements [29]. The main parameters to evaluate fine-time measurements are resolution, linearity, and precision. The resolution, also referred to as the least significant bit (LSB), is the quantization step for a TDC and defined as \( Q = \frac{T}{n} \), where \( T \) is the period of the coarse-counting clock and \( n \) is the number of quantization steps in a period. However, quantization steps are not uniform, and this difference is characterized by differential nonlinearity (DNL) and integral nonlinearity (INL). They are respectively defined as \( DNL[k] = \frac{W[k] - Q}{Q} \) and \( INL[k] = \sum_{j=0}^{k} DNL[j] \), where \( W[k] \) is the time interval of the \( k \)-th quantization step. Due to jitters and quantization errors, measurement results fluctuate for repetitive fixed-TI measurements. This measurement uncertainty is characterized by precision (called the RMS resolution) and calculated as the repetitive measurements’ standard deviation (\( \sigma \)). It is defined as \( \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N-1} \), where \( x_i \) is the \( i \)-th measurement and \( \mu \) is the average value for \( N \) measurements when the TI is constant.
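The definitions above can be illustrated with a short numerical sketch. The bin widths and repeated measurements below are hypothetical values chosen only for illustration, and the function names `dnl_inl` and `precision` are not from the original work.

```python
import math

def dnl_inl(bin_widths, period):
    """Return (Q, DNL, INL) for bin widths W[k] that cover one clock period T."""
    n = len(bin_widths)
    q = period / n                               # Q = T / n
    dnl = [(w - q) / q for w in bin_widths]      # DNL[k] = (W[k] - Q) / Q
    inl, acc = [], 0.0
    for d in dnl:                                # INL[k] = sum_{j<=k} DNL[j]
        acc += d
        inl.append(acc)
    return q, dnl, inl

def precision(samples):
    """Sample standard deviation of repeated measurements of a fixed TI."""
    mu = sum(samples) / len(samples)
    var = sum((x - mu) ** 2 for x in samples) / (len(samples) - 1)
    return math.sqrt(var)

# Hypothetical 4-bin TDC over a 40 ps period (illustrative values only).
q, dnl, inl = dnl_inl([8.0, 12.0, 10.0, 10.0], period=40.0)
dnl_pk_pk = max(dnl) - min(dnl)
sigma = precision([1.0, 2.0, 3.0, 4.0, 5.0])
```

Peak-to-peak figures such as DNL_{pk-pk} follow directly as the difference between the largest and smallest entries of the returned lists.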

With rapid advances in complementary metal-oxide-semiconductor (CMOS) transistor technologies, TDCs can be implemented by application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs). Compared with FPGA-TDCs, ASIC-TDCs can perform better due to customized placing and routing strategies [30], [31]. However, FPGA-TDCs are more prevalent in scientific experiments and prototype designs, benefiting from low development costs and short developing cycles. Tapped delay line (TDL), Vernier, and multi-phase-clock-driven counter (MPCDC) architectures are the most used architectures for fine-time measurements among FPGA-TDCs [32]. However, most MPCDC-TDCs have resolutions from 100 ps to 1000 ps [32]. Hence, high-resolution TDCs (LSB < 30 ps) are usually TDL-TDCs and Vernier ring oscillator (VRO)-TDCs. For example, Kwiatkowski et al. proposed a TDL-based multi-sampling wave union type B (WU-B) architecture achieving a 0.4 ps resolution with a 5.95 LSB maximum bin [33]. Wang et al. used a TDL-based bin-decimation architecture achieving a 6 ps resolution with a 2.5...
LSB maximum bin, and Cui et al. proposed a VRO-based multistep fine time measurement architecture achieving a resolution finer than 10 ps with a 1.3 LSB maximum bin [34]. A TDL-TDC’s resolution depends on the propagation delay of its delay elements (τ in Fig. 1a). In contrast, a VRO-TDC’s resolution depends on the oscillation-speed difference between the fast and slow oscillators, as shown in Fig. 1b. TDL-TDCs and VRO-TDCs therefore have different strengths. TDL-TDCs can achieve a high throughput thanks to pipelined sampling and encoding. However, they have higher hardware utilization due to complex de-bubble circuits and thermometer-code-to-binary-code converters. For example, the TDC in Ref. [35] achieves a throughput of 350M samples/second but uses 646 LUTs and 1112 D-type flip-flops (DFFs) per channel. VRO-TDCs are more hardware-efficient than TDL-TDCs but suffer from a long dead time. For example, the TDC in Ref. [36] consumes only 104 LUTs and 319 DFFs per channel, but its dead time is around 400 ns, corresponding to 2.5M samples/second. Besides, the propagation delay of CARRY4/CARRY8 elements has been reduced to less than 10 ps, benefiting from advances in manufacturing processes [37], [38]. This allows TDL-TDCs to deliver better resolutions. However, a lower propagation delay also means that a TDL needs more CARRY4s/CARRY8s to cover a whole period of the coarse-counting clock, so more taps must be sampled and encoded. Similarly, for VRO-TDCs, more oscillation cycles are required for a fixed TI when the resolution is improved, resulting in a longer dead time and degraded measurement precision.

Nowadays, the multi-channel design is an increasing trend for photon-electric detectors. For example, Ref. [39] introduces a 128 × 128 single-photon avalanche diode (SPAD) array with 16×2 TDCs, and Ref. [40] presents a 16-channel SiPM with 16 TDCs. Benefiting from high hardware-utilization efficiency, VRO-TDCs are appropriate for multi-channel designs. However, suffering from a long dead time, VRO-TDCs are only suitable for applications accepting low conversion rates, such as FLIM [41]. Besides, an improved TDC’s resolution is also demanded for accurate fluorescence-lifetime evaluation.

Therefore, we propose a new architecture, aiming at a shorter
delay line and less hardware resource utilization than conventional TDL-TDCs, with a better resolution and a lower
dead time than conventional VRO-TDCs. The contributions of this work are:
1) We propose a new TDC architecture combining a TDL-TDC and a VRO-TDC and introduce the measurement
principle of this architecture in detail.
2) Unlike the carry-chain-based ring oscillator (RO), we use LUTs to construct a Vernier gray code oscillator
TDC (VGCO-TDC) to reduce carry-element utilization.
3) The TDL-TDC only needs to cover the resolution of the
VGCO-TDC in our proposed architecture, reducing the
length of the delay line and related hardware utilization.
4) We developed and evaluated 16-channel TDCs in 20 nm
Kintex-UltraScale XCKU040 and 28 nm Virtex-7
XC7V690T FPGAs to show our methods.

This article is structured as follows: Section II describes the principle and design of the proposed TDC. Section III
presents the experimental results and implementation details, Section IV compares with other designs, and Section V
summarizes our TDC.

II. PRINCIPLE AND DESIGN

As shown in Fig. 2a, the proposed TDC works with a coarse
counter, and each fine-time interpolator consists of a VGCO-
TDC, a TDL-TDC, and a calculation and output module
(Calc.&Output in Fig. 2a). The VGCO-TDC is responsible for
the first-stage fine-time measurement with a resolution of
several hundred picoseconds. Then the overtaking residual (δ)
from the VGCO-TDC is measured by the TDL-TDC with a ~10
ps resolution for the second-stage measurement. After finishing
both measurements, the VGCO-TDC and TDL-TDC output
measurement results from the oscillation counter and encoder,
respectively. Then these two results are calculated in the
Calc.&Output module for the final result. Like the conventional
Nutt-method-based TDL-TDC [26], the combination of outputs
from the oscillation counter (VGCO-TDC) and encoder (TDL-
TDC) shown in Fig. 2a ensures that the second-stage fine-time
measurement is immune to synchronization problems caused by
the TDL-TDC’s input and sampling clock. Similarly, the
proposed two-stage interpolation TDC is also immune to these
synchronization problems because it works with a coarse
counter, as Fig. 2a shows. Besides, we also use block random
access memories (BRAMs) for the on-chip histogram and
asynchronous output (Histo. BRAM and Asyn. Output BRAM
in Fig. 2a), respectively. For parameters (highlighted in yellow
in Fig. 2a) input to the multiplier and subtractors, we use a state-
machine-based parameter core (Para. Core in Fig. 2a) to
configure them channel by channel according to histograms stored in
BRAMs.

A. Measurement Principle

When a hit comes, the input hit respectively launches the
slow and fast GCOs (highlighted in orange) as shown in Fig. 2a.
For launching the slow GCO, the input hit first arrives at the
Input_shaper_start (ISA) and changes this module’s output to
“1” (high-logic level) when the input hit’s rising edge occurs.
Then, the ISA output stays at “1” to launch the slow GCO until
the global asynchronous clear (CLR in Fig. 2a) is asserted.
Simultaneously, the input hit is also transferred to the fast GCO.

On this path, however, the input hit first reaches the Coarse_clk_sync
(CCS) module and is synchronized with the coarse-counting clock (coarse_clk in Fig. 2a) after two rising edges of the
coarse-counting clock. Then, the synchronized input
(input_sync in Fig. 2a) launches the fast GCO, similar to the
input hit launching the slow GCO. The timing diagram of the
proposed TDC is shown in Fig. 2b. The resolution of the
VGCO-TDC (\(R_{VRO}\) in Fig. 2b) is determined by the oscillation
speed difference between two GCOs. And the output of the
oscillation counter is stored when the fast GCO first overtakes
the slow GCO. However, the TI between slow and fast GCOs’
launch is \(\tau + T\) rather than \(\tau\), where \(\tau\) is the measured TI
between the rising edges of the input hit and the subsequent
coarse-counting clock, and \(T\) is the period of the coarse-
counting clock. Hence, \(T\) should be subtracted from the VGCO-
TDC measurement result. But an extra \(T\) between GCOs’
launches is necessary. Without it, when an input hit appears
close to the rising edge of the coarse clock, the launch
order of the slow and fast GCOs can be reversed due to uneven
timing delays of internal connections, causing the VGCO-TDC to
work incorrectly.

Unlike previous VRO-TDCs using a DFF as the phase
detector to detect overtaking [36], [42], we use a TDL-TDC to
detect it in the proposed two-stage interpolation TDC. Besides,
we also use the TDL-TDC to measure the \(\delta\) at a ~10 ps
resolution. For the TDL-TDC in our design, the fast GCO’s
output is fed into the delay line. And the slow GCO’s output
is used as the clock for the sampling DFFs, encoder, and so on, as
shown in Fig. 2a. The sampled outputs from the TDL are all “0”s
(low-logic level) when the fast GCO runs behind the slow GCO.
But a thermometer code (“11100...000”) is output when the fast
GCO first overtakes the slow GCO (shown in Fig. 2b).
Simultaneously, the thermometer code is also encoded to a
binary code as a measurement of \(\delta\). Hence, the measured TI
is calculated as follows:

\[
\tau = \left( N_{osc} + \frac{1}{2} \right) R_{VRO} - T - \delta, \tag{1}
\]

where \(N_{osc}\) is the oscillation number of the slow GCO when
the fast GCO first overtakes the slow GCO.
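As a worked illustration of the expression above, the sketch below evaluates the measured TI from the two interpolation stages. The function name `measured_ti` and the example numbers are assumptions made here; only the 2.5 ns coarse-clock period (400 MHz) is taken from the article, and all quantities are in picoseconds.

```python
def measured_ti(n_osc, r_vro, t_coarse, delta):
    """tau = (N_osc + 1/2) * R_VRO - T - delta, all times in ps."""
    return (n_osc + 0.5) * r_vro - t_coarse - delta

# Hypothetical reading: the fast GCO overtakes after 5 slow-GCO oscillations,
# the VGCO-TDC resolution is 800 ps, and the TDL-TDC reports a 300 ps residual.
tau_ps = measured_ti(n_osc=5, r_vro=800.0, t_coarse=2500.0, delta=300.0)
```

The subtraction of T reflects the extra coarse-clock period inserted between the launches of the slow and fast GCOs.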

B. VGCO-TDC and TDL-TDC

Vernier-TDCs utilize the oscillation speed difference
between the two oscillators to achieve fine-time measurements.
Theoretically, two asynchronous clocks with different
frequencies are appropriate for Vernier-TDCs. However, the
oscillators in the Vernier-TDC launch independently, meaning
every Vernier-TDC consumes at least two PLLs/MMCMs [43],
[44], which is unaffordable. Therefore, we use GCOs here to
construct VGCO-TDCs. Unlike using a GCO to directly
measure a TI in Ref. [45], [46], we use two GCOs in the Vernier
way (as slow and fast oscillators). The diagram of the VGCO-
TDC is shown in Fig. 2a, and the working principle has also
been introduced in Sec. II-A. Here, we present the GCO in detail.

Fig. 3. (a) Block diagram of the GCO. (b) Timing diagram of the GCO. (c) Block diagram of the TDL-TDC in the Virtex-7 FPGA. (d) Principle of Sub-TDL.

For a GCO, the output changes following the gray-code sequence. Hence, only one bit experiences a transition between two consecutive states. Benefiting from this feature, the GCO is immune to the “race and competition” phenomenon, a common problem in traditional counters in which more than one bit toggles simultaneously [47]. So, the GCO can be implemented by combinational logic, and we can use free-running (not driven by a clock) GCOs as slow and fast oscillators in the Vernier architecture. The GCO is shown in Fig. 3a. In 7-series and more advanced Xilinx FPGAs, each LUT has up to six inputs [43], [44]. Hence, we use five 6-input LUTs to achieve a 32-state GCO. In each LUT, one of its inputs is connected to “EN” (highlighted in red in Fig. 3a) to launch and reset the GCO. The other inputs are used to get feedback from outputs (Gi[4:0] in Fig. 3a). We instantiate LUTs with Vivado primitives and use G[4] in Fig. 3a as the slow and fast oscillators’ outputs fed into the TDL-TDC. The timing diagram of the GCO is shown in Fig. 3b.
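The single-bit-transition property of a 32-state GCO can be checked with a small behavioural model. This sketch assumes the standard reflected-binary gray code; the actual LUT truth tables of the design are not given in the text, so `next_state` only mimics the feedback from the five outputs to the next state.

```python
N_BITS = 5  # five LUT outputs G[4:0] give 32 states

def to_gray(i):
    """Binary index -> reflected-binary gray code."""
    return i ^ (i >> 1)

def from_gray(g):
    """Gray code -> binary index (inverse of to_gray)."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b

def next_state(g):
    """Model of the combinational feedback: current outputs -> next outputs."""
    return to_gray((from_gray(g) + 1) % (1 << N_BITS))

def one_cycle(start=0):
    """List the 32 states visited in one full oscillation."""
    states, g = [], start
    for _ in range(1 << N_BITS):
        states.append(g)
        g = next_state(g)
    return states

states = one_cycle()
# Number of bits that toggle at each transition, including the wrap-around.
flips = [bin(a ^ b).count("1") for a, b in zip(states, states[1:] + states[:1])]
# Most significant bit, used here as the oscillator output G[4].
msb = [(s >> (N_BITS - 1)) & 1 for s in states]
```

In this model every transition toggles exactly one bit, and G[4] is low for 16 states and high for 16 states, so it behaves as a square wave with one period per 32 state transitions.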

As shown in Fig. 3c, we use CARRY4s in the Virtex-7 FPGA (CARRY8s in the Kintex-UltraScale FPGA) as delay elements to achieve the second-stage interpolation at a ~10 ps resolution. In previous TDL-TDCs [37], [38], the input hit is fed into the delay line directly, and outputs from delay elements are sampled by coarse-clock-driven DFFs, as shown in Fig. 1a. However, in the proposed TDC shown in Fig. 2a, we use the fast GCO’s output as the delay line’s input and use the slow GCO’s output as the clock for the sampling DFFs.

C. Result Calculation and Parameter Configuration

The TI is measured, and then corresponding results are output from the VGCO-TDC and TDL-TDC. However, these two outputs still require post-processing for the final result. Hence, we design a Calc.&Output module shown in Fig. 2a for this task. The calculation is conducted according to Eq. (1). However, the TDL-TDC’s output is a raw binary code rather than a calibrated timestamp. Therefore, it cannot be directly used as \( \delta \) in Eq. (1). So, we use a raw binary code instead of a calibrated timestamp as the final output of the proposed TDC, considering the complexity of hardware-implemented bin-by-bin calibration [50], and we conduct the bin-by-bin calibration on a PC as shown in Fig. 4. Referring to Eq. (1), the two-stage interpolation TDC’s output is calculated as:

\[ Out_{TDC} = N_{osc} \times N_{TDL} - Out_{TDL} - Offset, \tag{2} \]

where \( N_{TDL} \) is the number of the TDL-TDC’s time bins covering the VGCO-TDC’s resolution, \( Out_{TDL} \) is the raw output from the TDL-TDC, and \( Offset \) is the offset caused by the CCS module, uneven routing delays and so on. Besides, in the designed Sub-TDL TDC, the TDC’s output is valid only when all Sub-TDLs have non-zero outputs, so the minimum valid output of the TDL-TDC is greater than 1. To cancel this offset, we calculate \( N_{TDL} \) and \( Out_{TDL} \) as:

\[ N_{TDL} = Out_{max} - Out_{min} + 1, \]

\[ Out_{TDL} = Out_{bin} - Out_{min} + 1, \]

where $Out_{bin}$, $Out_{max}$ and $Out_{min}$ are the raw binary code, the maximum output and the minimum output of the TDL-TDC, respectively.
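The raw-code calculation can be sketched as follows. This is a minimal illustration, assuming that the multiplier scales the oscillation-counter output \( N_{osc} \) by \( N_{TDL} \), which is how the multiplier coefficient is described for the Para. Core; the function name and the example values are hypothetical.

```python
def two_stage_output(n_osc, out_bin, out_max, out_min, offset):
    """Combine both interpolation stages into one raw output code (in TDL bins)."""
    n_tdl = out_max - out_min + 1      # TDL bins spanning one VGCO-TDC step
    out_tdl = out_bin - out_min + 1    # TDL output with its minimum shifted to 1
    # Assumption: the first-stage count N_osc is scaled by N_TDL.
    return n_osc * n_tdl - out_tdl - offset

# Hypothetical values: the valid TDL codes span 11..90 and the raw code is 40.
code = two_stage_output(n_osc=5, out_bin=40, out_max=90, out_min=11, offset=20)
```

Working in units of TDL bins keeps the on-chip arithmetic to one multiplication and a few subtractions, which matches the decision to defer bin-by-bin calibration to the PC.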

The parameters in Eq. (2) can be calculated and configured manually, but this is time-consuming for all 16 channels. Hence, we design a Para. Core to calculate and configure parameters channel by channel. The Para. Core is implemented as a state machine, and its workflow is shown in Fig. 5. It requires two code density tests (CDTs) [29], [51], CDT1 and CDT2. CDT1 is only for the TDL-TDC, with the switch in Fig. 2a selecting data from the TDL-TDC (highlighted in blue) as output, while CDT2 is for the two-stage interpolation TDC, with the switch outputting data after calculation (highlighted in pink in Fig. 2a). From the CDT1 result, the Para. Core extracts $Out_{max}$ and $Out_{min}$ and calculates $N_{TDL}$. Then, the Para. Core configures $N_{TDL}$ as the coefficient for the multiplier and configures $Out_{min}$ as the subtrahend for the subtractor highlighted in brown in Fig. 2a. After configuration, CDT2 is conducted for the two-stage interpolation TDC. Offset in Eq. (2) can then be extracted and configured as the subtrahend for the subtractor highlighted in pink in Fig. 2a, ensuring that the CDT histogram of the proposed two-stage interpolation TDC begins from 1. CDT histograms for the TDL-TDC and the two-stage interpolation TDC after offset cancellation are shown in Fig. 6. As shown in Fig. 6b, the pattern (inverted from the histogram of the TDL-TDC and highlighted in blue in Fig. 6b) appears periodically in the histogram of the proposed two-stage interpolation TDC, matching the expected behavior of the proposed TDC.
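The two-pass parameter extraction can be mimicked in software. The sketch below is only a behavioural model under stated assumptions: the CDT1 histogram is indexed by raw TDL code, occupied bins are those with at least one hit, and the offset is taken as the smallest uncorrected CDT2 code minus one so that the corrected histogram starts at 1. The function names and data are hypothetical.

```python
def tdl_parameters(cdt1_hist):
    """From the CDT1 histogram, return (Out_min, Out_max, N_TDL)."""
    occupied = [code for code, hits in enumerate(cdt1_hist) if hits > 0]
    out_min, out_max = occupied[0], occupied[-1]
    return out_min, out_max, out_max - out_min + 1

def offset_from_cdt2(uncorrected_codes):
    """Offset that shifts the smallest CDT2 output code to 1."""
    return min(uncorrected_codes) - 1

# Hypothetical CDT1 histogram: only raw codes 3..6 are ever produced.
out_min, out_max, n_tdl = tdl_parameters([0, 0, 0, 5, 9, 7, 6, 0, 0])

# Hypothetical uncorrected CDT2 codes; after subtraction the minimum becomes 1.
codes = [23, 25, 40, 31]
offset = offset_from_cdt2(codes)
corrected = [c - offset for c in codes]
```

In hardware the same steps would run once per channel, with the resulting values written to the multiplier coefficient and the two subtrahend registers.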

III. EXPERIMENTAL RESULTS AND IMPLEMENTATION DETAILS

We implemented and evaluated the proposed TDC on KCU105 [52] and NetFPGA-SUME [53] evaluation boards. We used an uncorrelated 3.7777777 MHz hit from SRS-635 (Stanford Research System) as a random input for CDTs [29], [51], and we used on-chip delay macros (IDELAY3 [43] in the Kintex-UltraScale FPGA and IDELAY2 [44] in the Virtex-7 FPGA) to generate controllable delays (relative to coarse-counting clocks) for TI tests. Coarse-counting clocks are from low-jitter crystal oscillators on the boards (SI-570 in KCU105 and DSC-1103 in NetFPGA-SUME) and are configured to 400 MHz in both FPGAs. However, clocks for TDL-TDCs are sourced from the designed GCOs. Hence, we set timing constraints according to the GCOs’ measured oscillation frequencies (measured by a Teledyne LeCroy 640Zi). Besides, the temperature and voltage are maintained during experiments.

A. Resolution and Linearity

We measured oscillation periods of GCOs in both evaluation boards to calculate the resolutions of VGCO-TDCs. Besides,

![Fig.6. Histograms of time bins for (a) the TDL-TDC and (b) the proposed two-stage interpolation TDC from the Virtex-7 FPGA.](image)

![Fig.7. RMS resolution and measured TIs of the (a) channel-1, (b) channel-5, (c) channel-9 and (d) channel-13 in the Kintex-UltraScale FPGA, and (e) channel-1, (f) channel-5, (g) channel-9 and (h) channel-13 in the Virtex-7 FPGA.](image)

Table II

<table>
<thead>
<tr>
<th>Reso. (ps)</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>10</th>
<th>11</th>
<th>12</th>
<th>13</th>
<th>14</th>
<th>15</th>
<th>16</th>
<th>Ave.</th>
</tr>
</thead>
<tbody>
<tr>
<td>DNL_{pk-pk} (LSB)</td>
<td>0.54</td>
<td>0.51</td>
<td>0.50</td>
<td>0.50</td>
<td>0.62</td>
<td>0.54</td>
<td>0.63</td>
<td>0.52</td>
<td>0.65</td>
<td>0.55</td>
<td>0.63</td>
<td>0.53</td>
<td>0.63</td>
<td>0.66</td>
<td>0.55</td>
<td>0.57</td>
<td></td>
</tr>
<tr>
<td>INL_{pk-pk} (LSB)</td>
<td>4.29</td>
<td>5.32</td>
<td>4.14</td>
<td>4.27</td>
<td>3.91</td>
<td>3.86</td>
<td>4.25</td>
<td>5.30</td>
<td>4.22</td>
<td>5.72</td>
<td>3.80</td>
<td>4.35</td>
<td>4.58</td>
<td>4.12</td>
<td>3.77</td>
<td>3.79</td>
<td>4.36</td>
</tr>
<tr>
<td>( \omega_{eq} ) (ps)</td>
<td>1.92</td>
<td>2.16</td>
<td>1.82</td>
<td>1.96</td>
<td>2.03</td>
<td>1.91</td>
<td>1.80</td>
<td>1.94</td>
<td>1.81</td>
<td>1.95</td>
<td>1.83</td>
<td>1.97</td>
<td>2.00</td>
<td>1.98</td>
<td>2.01</td>
<td>1.99</td>
<td>2.02</td>
</tr>
</tbody>
</table>

The oscillation periods of GCOs and resolutions of VGCO-TDCs are summarized in TABLE I. As TABLE I shows, the oscillation periods are designed from 7 ns to 11 ns for TDL-TDCs’ fast timing closure in the implemented FPGAs. The resolutions of Kintex-UltraScale-implemented VGCO-TDCs range between 750 ps and 910 ps with an average resolution of 831 ps. In contrast, the resolutions of Virtex-7-implemented VGCO-TDCs range between 460 ps and 800 ps with an average resolution of 579 ps. GCOs in the Virtex-7 FPGA have faster oscillation frequencies than GCOs in the Kintex-UltraScale FPGA. So, the VGCO-TDC in the Virtex-7 FPGA can achieve a similar dead time (for a 5-ns maximum measurement range, \( \tau + T \)) even with a finer resolution. The resolutions and linearity of two-stage TDCs implemented in both evaluation boards are summarized in TABLE II. For TDCs in the Kintex-UltraScale FPGA, resolutions fluctuate between 4.50 ps and 4.66 ps with an average resolution of 4.57 ps, showing good uniformity between channels. The average DNL_{pk-pk} and INL_{pk-pk} are 4.36 LSB and 18.26 LSB, respectively, with a 5.72 LSB maximum DNL_{pk-pk} and a 23.66 LSB maximum INL_{pk-pk}. Besides, the \( \omega_{eq} \) ranges from 9.16 ps to 10.68 ps with an average \( \omega_{eq} \) of 9.93 ps, determined jointly by the resolution and linearity. For TDCs in the Virtex-7 FPGA, the resolutions range from 9.65 ps to 10.29 ps, with an average resolution of 10.05 ps. The average DNL_{pk-pk} and INL_{pk-pk} are 2.85 LSB and 13.61 LSB, respectively, with a 4.29 LSB maximum DNL_{pk-pk} and a 19.71 LSB maximum INL_{pk-pk}. Moreover, the \( \omega_{eq} \) fluctuates between 14.26 ps and 19.54 ps with an average value of 15.97 ps. Compared with TDCs in the Virtex-7 FPGA, TDCs in the Kintex-UltraScale FPGA have an average resolution enhanced by more than 2-fold, from 10.05 ps to 4.57 ps. However, the \( \omega_{eq} \) only improves by 1.6-fold (from 15.97 ps to 9.93 ps), suffering from worse linearity.

B. Time Interval Test

We use the standard deviation introduced in Section I to evaluate measurement uncertainty induced by quantization errors and jitters. To avoid jitters introduced by the input signal, we use on-chip programmable delay macros (IDELAY3 [43] in the Kintex-UltraScale FPGA and IDELAY2 [44] in the Virtex-7 FPGA) to delay a 5 MHz clock (synchronized with the coarse-counting clock), and use the delayed clock as the input for TI tests. Besides, we also use bin-by-bin calibration [50] to minimize the impacts of quantization errors and INL on measurements. It is calculated as:

\[
t_k = \frac{W[k]}{2} + \sum_{j=1}^{k-1} W[j],
\]

where \( t_k \) is the calibrated timestamp corresponding to the center of the \( k \)-th time bin.
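A minimal sketch of this calibration is given below. It assumes that each bin width \( W[k] \) is estimated from a code density test as the bin's share of hits multiplied by the clock period, which is the usual CDT approach but is stated here as an assumption; the histogram is hypothetical, and the list index is zero-based whereas \( k \) in the formula starts at 1.

```python
def calibrate(hist, period):
    """Return (bin widths, bin-centre timestamps) from a CDT histogram."""
    total = sum(hist)
    widths = [h / total * period for h in hist]   # assumed W[k] estimate
    stamps, acc = [], 0.0
    for w in widths:
        stamps.append(acc + w / 2)                # t_k = W[k]/2 + sum of earlier widths
        acc += w
    return widths, stamps

# Hypothetical 4-bin histogram over a 10 ps span (illustrative values only).
widths, stamps = calibrate([100, 300, 200, 400], period=10.0)
```

A raw output code is then converted to time by looking up its entry in the timestamp list, which replaces the uniform-bin assumption with the measured bin centres.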

The TDCs’ RMS resolutions in both FPGAs are shown in Fig. 7. We use 60 delay taps to cover a period of the coarse-counting clock (2.5 ns @ 400 MHz) in the Kintex-UltraScale FPGA. However, only 32 delay taps are required to cover the same period in the Virtex-7 FPGA, because IDELAY2 has a coarser resolution than IDELAY3. For TDCs in both FPGAs, most measured groups containing different TIs (highlighted in red in Fig. 7) have a sharp change caused by spanning coarse-counting clock cycles. However, the delay-tap numbers corresponding to the sharp changes vary due to different path delays of the input signal, which are caused by different placement and routing strategies. In general, RMS resolutions deteriorate with increasing measured TIs. Jitter accumulation in the VGCO-TDCs causes this phenomenon. As Fig. 2a shows, the slow GCO’s output drives the oscillation counter, and more oscillation cycles of GCOs are required for longer TIs, resulting in more accumulation of GCOs’ jitters. The accumulated jitters from VGCO-TDCs then deteriorate the RMS resolutions of two-stage interpolation TDCs. Besides, the accumulated jitters of the VGCO-TDC are influenced by the period of the coarse-counting clock as well as by the stability of GCOs, because the period of the coarse-counting clock determines the maximum measured TI of the VGCO-TDC. It is also worth noting that the trend of the proposed TDC’s RMS resolution slightly differs from that of the measured TI, especially in the Kintex-UltraScale FPGA. There are two reasons for this phenomenon. Firstly, measurement uncertainty is not sourced from GCOs only. Measurement uncertainty from the TDL-TDC, unrelated to the measured TI, also contributes to RMS resolutions. Secondly, for the same measured TI, the GCOs in the Virtex-7 FPGA oscillate more times than GCOs in the Kintex-UltraScale FPGA, due to the finer resolutions of its VGCO-TDCs. Therefore, the trend of accumulated jitters is more prominent in the Virtex-7 FPGA.

Figure 7 shows RMS resolutions for different intervals (less than one coarse-clock period). However, we must also characterize the RMS resolution for the whole coarse-clock period. Hence, the valid RMS resolution \( \sigma_{\text{valid}} \) [55] is used, defined as \( \sigma_{\text{valid}}^2 = \frac{1}{H} \sum_{i=1}^{H} \sigma_i^2 \), where \( \sigma_i \) is the standard deviation of measurements for the \( i \)-th fixed TI and \( H \) is the number of different TIs. The valid RMS resolution for each channel in both FPGAs is summarized in TABLE II.
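The valid RMS resolution is the root mean square of the per-TI standard deviations, as the short sketch below illustrates. The function name and the input values are hypothetical.

```python
import math

def valid_rms(sigmas):
    """sigma_valid = sqrt( (1/H) * sum_i sigma_i^2 ) over H fixed-TI tests."""
    return math.sqrt(sum(s * s for s in sigmas) / len(sigmas))

# Hypothetical standard deviations (ps) measured at two different TIs.
sigma_valid = valid_rms([3.0, 4.0])
```

Because the squares are averaged, TIs with larger jitter weigh more heavily than they would in a plain arithmetic mean of the standard deviations.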

C. Hardware Resource Utilization and Constraints for Design

We implemented the proposed TDCs in both FPGAs. The hardware resource utilization of both FPGAs is summarized in TABLE III. For TDCs in the Kintex-UltraScale FPGA, each channel consumes 402 LUTs and 544 DFFs. Besides, each channel also uses 1.5 BRAMs for Histo. BRAM and 1.5 BRAMs for Asyn. Output BRAM. Moreover, 29 CARRY8s are needed to construct TDL, and 23 CARRY8s for calculation, including accumulation, multiplication and subtraction. In addition to the hardware utilization mentioned above, 614 LUTs and 419 DFFs are also used for the Para. Core which calculates and configures parameters for all 16-channel TDCs in the Kintex-UltraScale FPGA. However, the hardware utilization is less in the Virtex-7 FPGA due to VGCO-TDCs' better resolutions and worse TDL propagation delays. For TDCs in the Virtex-7 FPGA, each channel consumes 257 LUTs and 360 DFFs. Besides, each channel also uses 1 BRAM for Histo. BRAM and 1 BRAM for Asyn. Output BRAM. And 20 CARRY4s are used for TDLs, and 38 CARRY4s as arithmetic units. Unlike the Para. Core in the Kintex-UltraScale FPGA, the Para. Core in the Virtex-7 FPGA only consumes 574 LUTs and 396 DFFs, due to parameters’ shorter bin-width. The hardware resource utilization indicates the proposed design is more hardware-efficient than conventional TDL-TDCs [38], and has similar hardware utilization compared with the VRO-TDC presented in Ref. [42] (a comparison is shown in TABLE IV).

The TDC’s implementation layouts in the Kintex-UltraScale FPGA are shown in Fig. 8a. We placed the VGCO-TDC close to the corresponding TDL-TDC to minimize jitters and skews induced by internal connections. Besides, as shown in Fig. 8b and 8c, the GCO is manually routed (manually constrained routes are highlighted as yellow dotted lines in Fig. 8b) and placed (LUTs are highlighted in red in Fig. 8c) to ensure uniformity between channels. We used the Tcl commands “set_property BEL” and “set_property LOC” to place LUTs, and used “set_property FIXED_ROUTE” to perform routing manually. Moreover, timing constraints differ from previous TDL-TDCs since the designed TDL-TDCs’ sampling and encoding clocks are from GCOs rather than MMCMs [43], [44]. Hence, we use “create_clock -period” to declare the GCO’s output as a clock and set its period. Meanwhile, we also need to use “set_clock_groups -asynchronous” to set asynchronous clock groups to avoid unnecessary timing checks between different clock domains.

<table>
<thead>
<tr>
<th>TABLE III: HARDWARE RESOURCE UTILIZATION</th>
<th>LUT</th>
<th>DFF</th>
<th>CARRY</th>
<th>CLB/Slice</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>UltraScale</strong></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Available</td>
<td>242400</td>
<td>484800</td>
<td>30300</td>
<td>30300</td>
</tr>
<tr>
<td>1-ch</td>
<td>402</td>
<td>544</td>
<td>52</td>
<td>155</td>
</tr>
<tr>
<td>16-ch</td>
<td>6431</td>
<td>8704</td>
<td>832</td>
<td>2495</td>
</tr>
<tr>
<td>Para. Core</td>
<td>614</td>
<td>419</td>
<td>1</td>
<td>165</td>
</tr>
<tr>
<td><strong>Virtex-7</strong></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Available</td>
<td>433200</td>
<td>866400</td>
<td>108300</td>
<td>108300</td>
</tr>
<tr>
<td>1-ch</td>
<td>257</td>
<td>360</td>
<td>58</td>
<td>177</td>
</tr>
<tr>
<td>16-ch</td>
<td>4113</td>
<td>5760</td>
<td>928</td>
<td>2695</td>
</tr>
<tr>
<td>Para. Core</td>
<td>574</td>
<td>396</td>
<td>5</td>
<td>248</td>
</tr>
</tbody>
</table>

Note: the CARRY column counts CARRY8s in the UltraScale FPGA and CARRY4s in the Virtex-7 FPGA; the CLB/Slice column counts CLBs in the UltraScale FPGA and slices in the Virtex-7 FPGA.

IV. COMPARISON AND DISCUSSION

Table IV summarizes recently published TDL-TDCs and VRO-TDCs. We use the maximum oscillation number and maximum oscillation period for the proposed TDC to evaluate the dead time for all 16 channels. Hence, the dead time of the design is calculated as follows:

\[ T_{\text{Dead}} = \left( N_{\text{Max}}^{\text{Oscil}} + T_{\text{code}} + T_{\text{calc}} + T_{\text{his}} + T_{\text{reset}} \right) \times P_{\text{max}} , \]  

where \( N_{\text{Max}}^{\text{Oscil}} \) is the maximum oscillation number of the slow GCO for the 5-ns maximum measurement range, \( T_{\text{code}}, T_{\text{calc}}, T_{\text{his}} \), and \( T_{\text{reset}} \) are required clock cycles for the encoder, result calculation, histogram and resetting the TDC (they are 2, 3, 1 and 1 clocks cycles, respectively), and \( P_{\text{max}} \) is the maximum oscillation period for the slow GCOs.
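The dead-time expression can be evaluated as in the sketch below. The cycle counts for the encoder, calculation, histogram and reset (2, 3, 1 and 1) come from the text, whereas the oscillation number and oscillation period used in the example are hypothetical and are not the measured values of either implementation.

```python
def dead_time_ns(n_osc_max, p_max_ns, t_code=2, t_calc=3, t_his=1, t_reset=1):
    """T_Dead = (N_osc_max + T_code + T_calc + T_his + T_reset) * P_max."""
    cycles = n_osc_max + t_code + t_calc + t_his + t_reset
    return cycles * p_max_ns

# Hypothetical case: at most 7 slow-GCO oscillations with an 11 ns period.
t_dead = dead_time_ns(n_osc_max=7, p_max_ns=11.0)
rate_msps = 1e3 / t_dead   # maximum conversion rate in Msamples/second
```

The expression makes clear why a coarser first stage helps: fewer oscillations are needed to cover the measurement range, so the oscillation term shrinks and the fixed pipeline cycles become a larger share of the total.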

As shown in TABLE IV, the TDL-TDC is the mainstream design, and the VRO-TDC is also well developed. In this work, however, we propose for the first time a two-stage interpolation architecture combining a VGCO-TDC and a TDL-TDC, achieving a shorter dead time and a finer resolution than conventional VRO-TDCs and lower hardware utilization than conventional TDL-TDCs.

TDL-TDCs normally have a one-cycle or two-cycle dead time, benefiting from pipelined sampling and encoding. The dead time of VRO-TDCs is much longer because of their measuring principle (the measurement is conducted by the fast oscillator “chasing” the slow oscillator). However, thanks to the two-stage interpolation architecture, the oscillation number in our design is reduced dramatically even with a finer resolution, which shortens the dead time. For example, the dead times of the TDCs in Ref. [36] and Ref. [42] are 400 ns and 602 ns, respectively, whereas the dead time of our design is 155 ns in the Kintex-UltraScale FPGA and 144 ns in the Virtex-7 FPGA.
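The dead-time comparison above can be turned into reduction factors with a few lines of Python. The four dead times are the ones quoted in the text; the conversion-rate helper is only an upper bound under the assumption that back-to-back measurements are limited by dead time alone, which the paper does not state explicitly.

```python
# Dead times in ns, as quoted in the comparison above.
VRO_DEAD_TIME_NS = {"Ref. [36]": 400.0, "Ref. [42]": 602.0}
PROPOSED_DEAD_TIME_NS = {"Kintex-UltraScale": 155.0, "Virtex-7": 144.0}


def reduction_factors(reference, proposed):
    """Ratio of each reference dead time to each proposed dead time."""
    return {
        (ref, dev): reference[ref] / proposed[dev]
        for ref in reference
        for dev in proposed
    }


def max_rate_msps(dead_time_ns):
    """Upper bound on conversion rate in MS/s, assuming dead time is the only limit."""
    return 1e3 / dead_time_ns


for (ref, dev), factor in reduction_factors(VRO_DEAD_TIME_NS, PROPOSED_DEAD_TIME_NS).items():
    print(f"{ref} vs. {dev}: {factor:.2f}x shorter dead time")
```

The factors range from roughly 2.6 to 4.2, depending on which reference design and which FPGA are paired.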

Moreover, the reduced oscillation number also benefits precision since less jitter is accumulated. For example, our designs have precision similar to those in Ref. [36] and Ref. [42], although our designs have much finer resolutions. The proposed TDC also has hardware utilization similar to that of the design in Ref. [42]. However, our method is less hardware-efficient than the design in Ref. [36] because of the on-chip calculation and histogram.

The TDL-TDCs in Ref. [38] and Ref. [58] have resolutions similar to those of the proposed TDCs. However, the TDL-TDC in the proposed two-stage interpolation TDC only needs to cover the resolution of the VGCO-TDC, indicating that the proposed TDC is more hardware-efficient than conventional TDL-TDCs. The designs in Ref. [56] and Ref. [59] have hardware utilization similar to ours. The TDL-TDC in Ref. [56] requires 228 LUTs and 678 DFFs, similar to the design in the Kintex-UltraScale FPGA. However, our design has a much finer resolution (improved from about 22 ps to about 5 ps). Compared with the design in Ref. [59], the design in the Virtex-7 FPGA has similar hardware utilization, but our design has a worse resolution and precision. Here, the advantage of our method is fewer CARRY4s to construct the TDL and a lower-frequency

<table>
<thead>
<tr>
<th>Ref-year</th>
<th>Methods</th>
<th>Device Process (nm)</th>
<th>LSB (ps)</th>
<th>\( \sigma_{eq} \) (ps)</th>
<th>Prec. (ps)</th>
<th>DNL (LSB)</th>
<th>INL (LSB)</th>
<th>Dead Time (ns)</th>
<th>LUT</th>
<th>DFF</th>
<th>CARRY</th>
<th>CLB /Slice</th>
</tr>
</thead>
<tbody>
<tr>
<td>TDL-TDCs</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>[38]-19</td>
<td>Sub-TDL, Bin-width compensation and calibration.</td>
<td>20</td>
<td>5.02</td>
<td>5.03</td>
<td>7.81</td>
<td>[-0.12,0.11]</td>
<td>[-0.18,0.46]</td>
<td>NS(^2)</td>
<td>703</td>
<td>1195</td>
<td>80(^4)</td>
<td>NS(^2)</td>
</tr>
<tr>
<td></td>
<td></td>
<td>28</td>
<td>10.54</td>
<td>10.55</td>
<td>14.59</td>
<td>[-0.05,0.08]</td>
<td>[-0.09,0.11]</td>
<td>NS(^2)</td>
<td>1145</td>
<td>1916</td>
<td>NS(^2)</td>
<td>712(^4)</td>
</tr>
<tr>
<td>[56]-22</td>
<td>Dual-mode encoder.</td>
<td>22</td>
<td>22.1</td>
<td>NS(^2)</td>
<td>22.35</td>
<td>[-0.71,1.05]</td>
<td>[0.85,0.86]</td>
<td>4</td>
<td>228</td>
<td>678</td>
<td>48(^4)</td>
<td>NS(^2)</td>
</tr>
<tr>
<td>[57]-22</td>
<td>Wave union A, DSP delay line.</td>
<td>28</td>
<td>NS(^2)</td>
<td>NS(^2)</td>
<td>NS(^2)</td>
<td>11.49</td>
<td>13.60</td>
<td>NS(^2)</td>
<td>NS(^2)</td>
<td>NS(^2)</td>
<td>NS(^2)</td>
<td>10%(^2)</td>
</tr>
<tr>
<td>[58]-22</td>
<td>Wave union A, bin merging.</td>
<td>28</td>
<td>10</td>
<td>NS(^2)</td>
<td>17</td>
<td>[-0.13,0.15]</td>
<td>[-2.26,5.34]</td>
<td>NS(^2)</td>
<td>1136</td>
<td>2716</td>
<td>NS(^2)</td>
<td>NS(^2)</td>
</tr>
<tr>
<td>[26]-23</td>
<td>Wave union A, dual-sampling, bidirectional encoder.</td>
<td>16</td>
<td>0.46</td>
<td>1.81</td>
<td>&lt;9</td>
<td>[-0.99,6.42]</td>
<td>[-8.79,51.56]</td>
<td>NS(^2)</td>
<td>11773</td>
<td>13547</td>
<td>234(^4)</td>
<td>NS(^2)</td>
</tr>
<tr>
<td>[33]-23</td>
<td>Multi-sampling wave union B.</td>
<td>28</td>
<td>0.4</td>
<td>0.55</td>
<td>&lt;5.2</td>
<td>[-0.97,5.95]</td>
<td>[-8.02,219.30]</td>
<td>NS(^2)</td>
<td>2840</td>
<td>1165</td>
<td>NS(^2)</td>
<td>953(^4)</td>
</tr>
<tr>
<td>[59]-23</td>
<td>Folding-TDC.</td>
<td>24</td>
<td>4.4</td>
<td>NS(^2)</td>
<td>4.6</td>
<td>NS(^2)</td>
<td>NS(^2)</td>
<td>4.4</td>
<td>339</td>
<td>740</td>
<td>NS(^2)</td>
<td>NS(^2)</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VRO-TDCs</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>[36]-17</td>
<td>Period difference recording.</td>
<td>65</td>
<td>[23,37]</td>
<td>NS(^2)</td>
<td>[32,39]</td>
<td>[-0.4,0.4]</td>
<td>[-0.7,0.7]</td>
<td>400</td>
<td>104</td>
<td>319</td>
<td>NS(^2)</td>
<td>NS(^2)</td>
</tr>
<tr>
<td>[42]-20</td>
<td>Bidirectional-Operating.</td>
<td>65</td>
<td>24.5</td>
<td>NS(^2)</td>
<td>28</td>
<td>[-0.20,0.25]</td>
<td>[0.03,0.82]</td>
<td>602</td>
<td>172</td>
<td>986</td>
<td>NS(^2)</td>
<td>NS(^2)</td>
</tr>
<tr>
<td>Other-TDCs</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>[60]-23</td>
<td>SSP-SFC-TDC.</td>
<td>28</td>
<td>625</td>
<td>NS(^2)</td>
<td>180(^2)</td>
<td>0.03</td>
<td>0.05 (^4)</td>
<td>NS(^2)</td>
<td>212</td>
<td>333</td>
<td>64</td>
<td>84(^4)</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

\(^1\) Single shot precision; \(^2\) NS = not specified; \(^3\) CARRY8; \(^4\) Slice; \(^5\) CARRY4; \(^6\) Percentage of used DSPs in a Xilinx-7 200T FPGA; \(^7\) Averaged value for 19 channels; \(^8\) \(\text{DNL}_{\text{pk-pk}}\), \(\text{INL}_{\text{pk-pk}}\); \(^9\) Average \(\text{DNL}_{\text{pk-pk}}\), \(\text{INL}_{\text{pk-pk}}\) for 16 channels; \(^{10}\) Average \(\text{INL}_{\text{pk-pk}}\) for 16 channels; \(^{11}\) Each channel’s average hardware utilization with the Para. Core; \(^{12}\) CLB.
clock required for the TDL-TDC’s sampling and encoding. In Ref. [59], over 200 taps (50 CARRY4s) are needed to cover the period of the 554 MHz sampling clock. In our design, however, only 80 taps (20 CARRY4s) are used to construct the TDL, which covers the resolution of the VGCO-TDC rather than the period of the coarse clock. Compared with conventional TDL-TDCs, this architecture makes the TDL’s length independent of the coarse-counting clock, which eases timing closure (in conventional TDL-TDCs, a high-frequency sampling clock is preferred to shorten the TDL). Besides, although 257 LUTs and 360 DFFs are used per channel for the TDC in the Virtex-7 FPGA (402 LUTs and 544 DFFs in the Kintex-UltraScale FPGA), only 86 LUTs and 116 DFFs are used for the VGCO-TDC and TDL-TDC themselves (213 LUTs and 284 DFFs in the Kintex-UltraScale FPGA), indicating that our design can be made more compact.
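The resource figures quoted above can be checked with a short calculation. The sketch below uses only numbers stated in the text (per-channel totals, core-only counts, tap counts, and the 554 MHz clock of Ref. [59]); the dictionary and function names are illustrative choices, not part of the design.

```python
# Per-channel totals and VGCO-TDC + TDL-TDC core-only counts, from the text.
PER_CHANNEL = {
    "Virtex-7": {"LUT": 257, "DFF": 360},
    "Kintex-UltraScale": {"LUT": 402, "DFF": 544},
}
CORE_ONLY = {
    "Virtex-7": {"LUT": 86, "DFF": 116},
    "Kintex-UltraScale": {"LUT": 213, "DFF": 284},
}


def core_share(device):
    """Fraction of per-channel resources used by the VGCO-TDC and TDL-TDC."""
    total = PER_CHANNEL[device]
    core = CORE_ONLY[device]
    return {name: core[name] / total[name] for name in total}


def clock_period_ns(freq_mhz):
    """Clock period in ns for a frequency given in MHz."""
    return 1e3 / freq_mhz


for device in PER_CHANNEL:
    share = core_share(device)
    print(f"{device}: LUT {share['LUT']:.1%}, DFF {share['DFF']:.1%}")

# Ref. [59] spans one 554 MHz clock period with over 200 taps; this design uses 80.
print(f"554 MHz period: {clock_period_ns(554.0):.3f} ns, tap ratio: {80 / 200:.2f}")
```

On these figures, the two TDC stages account for about one third of the per-channel LUTs and DFFs in the Virtex-7 FPGA and about half in the Kintex-UltraScale FPGA, with the remainder going to supporting logic.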

V. CONCLUSION

In this work, we use GCOs instead of CARRY4s/CARRY8s to build oscillators for Vernier TDCs and propose a two-stage interpolation architecture. With the new architecture, the TDL-TDC in our design only needs to cover the resolution of the VGCO-TDC, reducing the hardware utilization of the designed TDC. Besides, the length of the TDL is independent of the frequency of the TDL-TDC’s clock, which eases timing closure (in conventional designs, a high-frequency TDL-TDC clock is preferred to shorten the TDL). Compared with previous VRO-TDCs, the proposed TDCs improve the dead time and precision, even with a finer resolution, by reducing the number of oscillations.

We implemented the proposed 16-channel TDC in Kintex-UltraScale and Virtex-7 FPGAs to evaluate our design. Experimental results indicate that the proposed TDC is hardware-efficient compared with VRO-TDCs and TDL-TDCs and has competitive performance compared with VRO-TDCs. It is appropriate for multi-channel and low-conversion-rate applications such as FLIM, although its precision needs to be further improved compared with TDL-TDCs. Besides, the multiphase clock method [61] for the VGCO-TDC and the WU method [50] for the TDL-TDC can be implemented in future work to improve the precision.

VI. ACKNOWLEDGEMENTS

This work has been funded by the EPSRC (EP/T00097X/1): the Quantum Technology Hub in Quantum Imaging (QuantiC), Innovate UK HYDRI (10005391), and the University of Strathclyde.

REFERENCE




Yu Wang was born in Chongqing, China, in 1995. He received the B.Eng. degree in measurement and control from the Harbin University of Science and Technology, in 2017, and the M.Eng. degree in electronics and communication engineering from Harbin Engineering University, in 2020. Since 2020, he has been working toward the Ph.D. degree, funded by the China Scholarship Council, at the University of Strathclyde, Glasgow, U.K. His current research interest is FPGA-based mixed-signal circuits.

Wujun Xie was born in Hunan, China, in 1996. He received the M.S. degree in embedded systems from the University of Southampton, Southampton, U.K., in 2018 and the Ph.D. degree from the University of Strathclyde, Glasgow, U.K., in 2023. He is currently working with Adaps Photonics, Shenzhen, as a senior system engineer. His current research interests are in SPAD-based photon-sensing techniques.

Haochang Chen was born in Xi’an, China, in 1990. He received the M.S. degree in embedded digital systems from the University of Sussex, Brighton, U.K., in 2013, and the Ph.D. degree from the University of Strathclyde, Glasgow, U.K., in 2020. His current research interests include FPGA-based high-precision time metrology systems for ranging and biomedical imaging applications.

Chengquan Pei received the Ph.D. degree from Xi’an Jiaotong University, Xi’an, China, in 2017. He was a postdoctoral researcher at Tsinghua University from October 2018 to October 2021. He is currently a lecturer with the School of Artificial Intelligence, Xidian University, China. His research interests include computational photography, circuits and systems, and computational femtosecond imaging.

David Day-Uei Li received the Ph.D. degree in electrical engineering from National Taiwan University, Taipei, Taiwan, in 2001. He then joined the Industrial Technology Research Institute, working on complementary metal-oxide-semiconductor (CMOS) optical and wireless communication chipsets. From 2007 to 2011, he worked at the University of Edinburgh, Edinburgh, on two European projects focusing on CMOS single-photon avalanche diode sensors and systems. He then took up a lectureship in biomedical engineering at the University of Sussex, Brighton, in mid-2011, and in 2014, he joined the University of Strathclyde, Glasgow, as a senior lecturer. He has published more than 100 research articles and patents. His research interests include time-resolved imaging and spectroscopy systems, mixed-signal circuits, CMOS sensors and systems, embedded systems, optical communications, and field-programmable gate array/GPU computing. His research exploits advanced sensor technologies to reveal low-light but fast biological phenomena.