A high speed serializer/deserializer design

Yifei Luo

University of New Hampshire, Durham

Recommended Citation
https://scholars.unh.edu/dissertation/536

Abstract
A Serializer/Deserializer (SerDes) is a circuit that converts parallel data into a serial stream and vice versa. It helps solve clock/data skew problems, simplifies data transmission, lowers the power consumption and reduces the chip cost. The goal of this project was to solve the challenges in high speed SerDes design, which included the low jitter design, wide bandwidth design and low power design.

A quarter-rate multiplexer/demultiplexer (MUX/DEMUX) was implemented. This quarter-rate structure decreases the required clock frequency from one half to one quarter of the data rate. It is shown that this significantly relaxes the design of the VCO at high speed and achieves lower power consumption.

A novel multi-phase LC-ring oscillator was developed to supply a low noise clock to the SerDes. This proposed VCO combined an LC-tank with a ring structure to achieve both wide tuning range (11%) and low phase noise (-110dBc/Hz at 1MHz offset).

With this structure, a data rate of 36 Gb/s was realized with a measured peak-to-peak jitter of 10ps using 0.18µm SiGe BiCMOS technology. The power consumption is 3.6W with 3.4V power supply voltage. At a 60 Gb/s data rate the simulated peak-to-peak jitter was 4.8ps using 65nm CMOS technology. The power consumption is 92mW with 2V power supply voltage.

A time-to-digital (TDC) calibration circuit was designed to compensate for the phase mismatches among the multiple phases of the PLL clock using a three dimensional fully depleted silicon on insulator (3D FDSOI) CMOS process. The 3D process separated the analog PLL portion from the digital calibration portion into different tiers. This eliminated the noise coupling through the common substrate in the 2D process. Mismatches caused by the vertical tier-to-tier interconnections and the temperature influence in the 3D process were attenuated by the proposed calibration circuit.

The design strategy and circuits developed from this dissertation provide significant benefit to both wired and wireless applications.

Keywords
Engineering, Electronics and Electrical
A HIGH SPEED SERIALIZER/DESERIALIZER DESIGN

By

Yifei Luo
B. S., Huazhong University of Science and Technology, China, 2002
M. S., Huazhong University of Science and Technology, China, 2005

DISSERTATION

Submitted to the University of New Hampshire
in Partial Fulfillment of
the Requirements for the Degree of

Doctor of Philosophy
in
Electrical Engineering

September, 2010
ALL RIGHTS RESERVED

© 2010

Yifei Luo
This dissertation has been examined and approved.

Dissertation Director, Kuan Zhou, Assistant Professor, ECE

Allen D. Drake, Associate Professor, ECE

Richard A. Messner, Associate Professor, ECE

Valencia M. Joyner, Assistant Professor, ECE, Tufts University

Bo Zhang, Senior Technical Staff Engineer, Intel, U.S.

12 August 2010
Date
Acknowledgements

First, I want to thank my parents for their unfailing help throughout my life.

I want to thank my advisor, Dr. Kuan Zhou, for guiding me through my research and helping me overcome the difficulties of the past five years. His guidance and encouragement were very important in helping me overcome obstacles during my Ph.D. study. I thank him also for providing me with this unique opportunity to learn current state-of-the-art techniques in high speed integrated circuit design.

I want to thank all my committee members for their kind support during my Ph.D. study: thanks to Dr. Drake for helping me adjust to life at UNH; thanks to Dr. Messner for instructing me in study techniques and in preparing the dissertation; thanks to Dr. Joyner of Tufts University for giving me advice on my research; and thanks to Dr. Zhang of Intel Corporation for helping me with his extensive experience in IC design.

I also want to thank the professors and students at UNH who gave me help during the past five years of study.

I want to thank all my friends at UNH, especially Shunfu Hu, Gang Chen, James Brandt, Xiaolu Li and Jiayin Tian. They helped me a lot during my time at UNH.

I wish to thank the ECE department at UNH for providing a good study environment and the U.S. National Science Foundation (NSF under the contract ECCS-0702109) for funding parts of my Ph.D. research. Finally, I want to thank UNH for awarding me the 2009-2010 Dissertation Year Fellowship to support my Ph.D. dissertation preparation.
    3.3.4 VCO
    3.3.5 Jitter Transfer
    3.3.6 Jitter Tolerance
    3.3.7 Jitter Generation
    3.3.8 The CDR in this Project
  3.4 Simulation Results and Layouts
  3.5 Discussion and Future Work
    3.5.1 Discussion
    3.5.2 Future Work
      3.5.2.1 Line Decoding
      3.5.2.2 Speed Enhancing
4. TIME-TO-DIGITAL CALIBRATION
  4.1 Technology Overview
  4.2 The Proposed Phase Calibration Circuit
    4.2.1 Structure of the Proposed TDC
      4.2.1.1 Timing Resolution Determine Loop (TRDL)
      4.2.1.2 Phase Comparison (PC)
      4.2.1.3 The Comparator
      4.2.1.4 Sampling Circuit
  4.3 Simulation Results and Layouts
  4.4 Discussion and Future Work
    4.4.1 Discussion
    4.4.2 Future Work
5. CHIP FABRICATION AND MEASUREMENT
  5.1 Chip Design Flow
  5.2 Equipment Setup
  5.3 Testing Methods
6. CONCLUSION AND DISCUSSION
  6.1 Conclusion
  6.2 Discussion
LIST OF REFERENCES
APPENDIX A LINEAR FEEDBACK SHIFT REGISTER
APPENDIX B LINE ENCODING
LIST OF TABLES

Table 1-1 SONET/SDH hierarchy
Table 1-2 Comparison of previously published SerDes designs
Table 1-3 Comparison of previously published PLL designs
Table 2-1 Device comparison between the D latch and the delay component I
Table 2-2 Different PLL types with different loop filter orders
Table 2-3 PLL performance comparison
Table 2-4 Simulated power consumption of the data path and the clock path in the proposed 16:1 MUX
Table 2-5 Simulated power consumption of the data path and the clock path in the proposed 4:1 MUX with 65nm CMOS technology
Table 3-1 Component value in the output driver
Table 3-2 A MUX/DEMUX comparison of this work with some published designs
Table 4-1 Timing resolution of some previous published designs
Table 4-2 Jitter performance comparison at different temperatures with and without the proposed calibration circuit
Table 5-1 Testing equipment and specifications
Table A-1 The polynomial coefficients of the LFSR with different numbers of bits
Table B-1 Examples of 8B/10B encoding
Table B-2 8B/10B encoding lists
LIST OF FIGURES

Figure 1-1 Data skew elimination with SerDes
Figure 1-2 Chip area change with/without a Serializer as technologies scale down
Figure 1-3 Block diagram of the 16-bit SerDes
Figure 1-4 Mean data rate for high speed I/O links, data from ITRS 2007
Figure 1-5 Wireless communication application spectrums, data from ITRS 2007
Figure 1-6 OC-192 jitter transfer and jitter tolerance masks
Figure 2-1 Block diagram of the Serializer
Figure 2-2 Block diagram of the proposed 16:1 MUX and the data transfer of the 4:1 MUX in the first stage
Figure 2-3 Block diagram of the last 4:1 MUX
Figure 2-4 Current mode logic (CML) implementations of the three different logic gates in this project
Figure 2-5 Schematics of the three 2:1 MUXs in the last 4:1 MUX
Figure 2-6 Bandwidth comparison of the 2:1 MUXs shown in Fig. 2-5 (a) and (b)
Figure 2-7 Data transfer diagram of the last 4:1 MUX
Figure 2-8 Schematic of the delay component I
Figure 2-9 Structure of a multi-phase phase-locked loop
Figure 2-10 Two different types of oscillators
Figure 2-11 Block diagrams of the proposed VCO and its tuning structure
Figure 2-12 Schematic of the proposed VCO delay buffer
Figure 2-13 The simplified VCO buffer model
Figure 2-14 Output phase relationship of the proposed VCO

Figure 2-15 The simplified phasor diagram of the proposed VCO under two boundary conditions

Figure 2-16 Structure of the 3-state phase detector

Figure 2-17 Schematic of the proposed one-level AND gate

Figure 2-18 Schematic of the NIA charge pump with a second order loop filter

Figure 2-19 Phase detector and charge pump outputs with different input signals

Figure 2-20 Structure of the programmable integer-N divider

Figure 2-21 PLL diagram showing how the VCO control voltage reflects the phase error

Figure 2-22 The proposed PLL model in the frequency domain

Figure 2-23 Simulation results of the PLL step response and frequency response

Figure 2-24 Block diagram of the PLL with the noise sources

Figure 2-25 Noise generated in the loop filter

Figure 2-26 Simulation results of the PLL component phase noise in Matlab

Figure 2-27 Schematic of the 12-bit LFSR

Figure 2-28 Simulation results of the VCO tuning and the VCO control voltage vs. the VCO phase noise

Figure 2-29 Simulated VCO phase noise: -110dBc/Hz at 1MHz offset

Figure 2-30 Spectrum and output waveform of the proposed PLL: center frequency is 2.24GHz×4=8.96GHz

Figure 2-31 The measured phase noise of the PLL: -118.64dBc/Hz at 1MHz offset

Figure 2-32 The measured PLL phase noise vs. the power supply voltage
Figure 2-33 Simulated and measured eyediagram of the transmitter at 36 Gb/s data rate

Figure 2-34 Layout of each part in the transmitter with 0.18μm BiCMOS technology

Figure 2-35 Schematic of the three-level MS latch

Figure 2-36 Schematic of the proposed level shifter

Figure 2-37 Schematic of the MS latch in the frequency divider

Figure 2-38 Simulation waveforms of the frequency divider with 15GHz VCO output

Figure 2-39 Simulated phase margin of the PLL with Matlab

Figure 2-40 PLL step response with Matlab

Figure 2-41 Simulation waveform of the VCO phase noise

Figure 2-42 Simulated eyediagram of the 4:1 MUX at 60 Gb/s rate

Figure 2-43 Layouts of the 4:1 MUX and PLL with 65nm CMOS technology

Figure 2-44 Diagram of wire bonding

Figure 2-45 Diagram of the flip-chip bonding

Figure 3-1 Block diagram of the deserializer

Figure 3-2 TX output data with different duty cycles of the clock signal

Figure 3-3 Received data at different data patterns

Figure 3-4 Two methods to broaden the circuit bandwidth and their effects on the eyediagram

Figure 3-5 Structure of the proposed 1:16 DEMUX

Figure 3-6 Traditional 1:4 DEMUX

Figure 3-7 The proposed 1:4 DEMUX

Figure 3-8 Data transfer diagram of the proposed 1:4 DEMUX

Figure 3-31 Eyediagram of the recovered data and clock: the data rate is 15 Gb/s and the clock frequency is 15GHz

Figure 3-32 Simulated waveforms of the input and output signals of the proposed SerDes

Figure 3-33 Chip layout of the 4:1 MUX/DEMUX with IBM 65nm CMOS technology: layout area is 1.75mm × 2.00mm

Figure 3-34 Chip photo of the 16-bit SerDes with Jazz Semiconductor 0.18µm BiCMOS technology: the chip area is 2.235mm × 2.092mm

Figure 4-1 Profile structure of the 3D process

Figure 4-2 Profiles of the standard CMOS and the SOI CMOS processes

Figure 4-3 Elimination of the substrate noise coupling with SOI technology

Figure 4-4 Structure of a multi-phase PLL and its existing phase mismatch

Figure 4-5 Block diagram of the proposed TDC embedded PLL

Figure 4-6 A typical TDC with Vernier delay line

Figure 4-7 The proposed TDC and calibration structure

Figure 4-8 Schematic of the proposed VCO in the TRDL loop

Figure 4-9 The input and output waveforms of the TDC block

Figure 4-10 Schematic of the proposed D flip-flop with holding and resetting

Figure 4-11 Schematic of the 6-bit synchronous counter

Figure 4-12 Schematic of the 6-bit comparator

Figure 4-13 Schematic and waveforms of the sampling circuit

Figure 4-14 Simulation waveforms of the proposed TDC, from top to bottom: VCO1, VCO2 and D latch output
Figure 4-15 VCO output in the TRDL loop under activating and deactivating conditions

Figure 4-16 Simulated eyediagram of the multi-phase PLL output with and without the calibration circuit

Figure 4-17 Layout of the proposed TDC embedded PLL (1.4mm×1.0mm)

Figure 5-1 Design flow of this project

Figure 5-2 Equipment setup of the chip testing in this project

Figure A-1 LFSR with Fibonacci implementation

Figure A-2 LFSR with Galois implementation

Figure A-3 3-bit LFSR with Galois implementation

Figure B-1 Schematic of the 8B/10B encoding

Figure B-2 8B/10B encoding of the symbol “000 00000”

Figure B-3 Schematic of the 64B/66B encoding..........................................170
CHAPTER 1

INTRODUCTION

1.1 Motivation and Goals

With the rapid growth of the internet and the development of new storage techniques, larger transmission capacity is required: more information must be delivered in less time. Multi-channel circuits are used in parallel to transfer more data at the same time. However, as speed increases, skew among the channels becomes critical and causes data transmission errors. Skew affects the data magnitude, phase, and waveform symmetry between differential signals. Its common causes are different routing lengths, different local termination impedances due to process variations, different sampling edges among channels, and crosstalk [1]-[2]. In addition, each channel requires a 50 Ω termination resistor to eliminate signal reflection [3], so multi-channel circuits consume more power. Finally, more channels require more pads, which prevents the chip area from shrinking as technology scales down.

A different approach has been developed to solve all of these problems at high data rates: the Serializer/Deserializer (SerDes). Simply stated, a SerDes chip is a circuit that converts parallel data into serial data or vice versa [4]. It is widely used in everyday applications such as Gigabit Ethernet systems, wireless network routers, storage applications and fiber-optic communication systems.
Serial data transmission has several advantages over multi-channel transmission. First, the data-skew problem is eliminated. Because parallel data is converted to serial data within the chip, only one data stream goes through the channel. Thus, channel-to-channel interference no longer exists, as shown in Fig. 1-1. Also, data timing can be controlled more easily within the chip than across lossy channels.

![Diagram of data skew in multi-channel transmission](image1)

![Diagram of data skew elimination with SerDes](image2)

Second, chip area can be reduced, and as a result, the cost can be decreased. As mentioned above, the number of pads in a chip becomes the limiting factor when the technology scales down. This is because the pad area does not scale by the same factor, due to thermal and mechanical bridging issues between conductors [5]. Although the transistors shrink, the chip pads must stay almost the same size; therefore, the chip area does not decrease, and more area may be wasted, as shown in Fig. 1-2 (a). However, the number of pads can be decreased after converting parallel data into serial data, and thus the chip area can be significantly decreased, as shown in Fig. 1-2 (b).
The area reduction makes the yield per wafer higher and thus the cost per chip lower. In addition, the reduced pin count makes the board design simpler and less expensive. Third, power consumption can be reduced. As mentioned above, each channel needs a 50 Ω termination, which consumes considerable power in multi-channel circuits. With serial output data, only one 50 Ω termination is required, and thus power consumption is reduced.

Like all other techniques, SerDes design also comes with some challenges, which include low clock noise, wide bandwidth and low jitter. The clock noise directly affects the output data jitter, so low noise clock design is a critical task in any SerDes design. As the data is converted from parallel to serial, the output data rate increases significantly. High speed brings difficulties to circuit designers, such as inter-symbol interference (ISI). To solve this ISI issue, wide bandwidth is required. Currently, a 10 Gb/s data rate is widely used in the industry; a 40 Gb/s data rate is being researched in universities and academic institutes. Broadband techniques such as inductive peaking, capacitive degeneration, $f_T$ doubling and Cherry-Hooper amplification have been developed to increase the circuit bandwidth [6]-[9].

High speed SerDes design was chosen as this dissertation's project. The goal of this research was to explore a more efficient structure in order to achieve a higher data rate under current technology with satisfactory performance. This includes the following main criteria: data rate, power consumption and jitter performance. The initial target data rate was 40 Gb/s with better power consumption and jitter performance than the current state-of-the-art SerDes. Next, the goal was to push the data rate higher using common technologies. Then the performance of each component was made to satisfy specific criteria, as described in the following chapters.

The applications of SerDes chips range from short-haul SerDes, such as chip-to-chip and board-to-board data transmissions, to long-haul SerDes such as synchronous optical networking (SONET), which is implemented over distances exceeding 100km. SONET is a standard set by the American National Standards Institute (ANSI), which sets industry standards for U.S. communications. An equivalent standard called synchronous digital hierarchy (SDH) has been set by the International Telecommunications Union (ITU). Table 1-1 shows the SONET/SDH hierarchy. Higher data rates may be covered in the future.

The design in this dissertation belongs to the short-haul SerDes category and contains a transmitter and a receiver, as shown in Fig. 1-3. The blocks in the dashed box were implemented in this project. The transmitter (TX) converts parallel data into serial data, whereas the receiver (RX) converts serial data back into parallel data. Necessary coding and decoding circuits such as 8B/10B and 64B/66B were added before the transmitters and after the receivers to avoid errors in data recovery [10].

Table 1-1 SONET/SDH hierarchy

<table>
<thead>
<tr>
<th>Optical Level</th>
<th>SDH Equivalent</th>
<th>Bit Rate</th>
</tr>
</thead>
<tbody>
<tr>
<td>OC-1</td>
<td>-</td>
<td>51.840 Mb/s</td>
</tr>
<tr>
<td>OC-3</td>
<td>STM-1</td>
<td>155.520 Mb/s</td>
</tr>
<tr>
<td>OC-12</td>
<td>STM-4</td>
<td>622.080 Mb/s</td>
</tr>
<tr>
<td>OC-48</td>
<td>STM-16</td>
<td>2488.320 Mb/s</td>
</tr>
<tr>
<td>OC-192</td>
<td>STM-64</td>
<td>9953.280 Mb/s</td>
</tr>
<tr>
<td>OC-768</td>
<td>STM-256</td>
<td>39813.120 Mb/s</td>
</tr>
</tbody>
</table>
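
The bit rates in Table 1-1 are integer multiples of the OC-1 rate, so OC-n carries n × 51.840 Mb/s. The short sketch below regenerates the table column from that relation; the function name `oc_rate_mbps` is chosen here for illustration only.

```python
from fractions import Fraction

# OC-1 base rate in Mb/s, held as an exact fraction to avoid rounding drift.
OC1_MBPS = Fraction(5184, 100)


def oc_rate_mbps(n: int) -> float:
    """Return the bit rate of SONET level OC-n in Mb/s (n times the OC-1 rate)."""
    if n < 1:
        raise ValueError("OC level must be a positive integer")
    return float(n * OC1_MBPS)


if __name__ == "__main__":
    # Levels listed in Table 1-1.
    for level in (1, 3, 12, 48, 192, 768):
        print(f"OC-{level}: {oc_rate_mbps(level):.3f} Mb/s")
```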

Figure 1-3 Block diagram of the 16-bit SerDes
This design allows for an effective tradeoff in the performance criteria of voltage-controlled oscillators (VCOs), multiplexers/demultiplexers (MUXs/DEMUXs), phase-locked loops (PLLs), etc. The critical components in our SerDes design were the phase-locked loop (PLL) in the transmitter and the clock and data recovery circuit (CDR) in the receiver. The goals of this project also included the implementation of a high performance PLL with low phase noise and wide tuning range, and the implementation of a CDR that would satisfy all performance requirements including jitter transfer, jitter generation and jitter tolerance. As explained later, a novel LC-ring oscillator was implemented in the transmitter PLL and in the receiver CDR to achieve both wide tuning range and low phase noise. A quarter-rate MUX/DEMUX was implemented to relax the design constraints, such as power consumption and clock frequency.

1.2 Serializer/Deserializer (SerDes) Overview

Fig. 1-4 shows the history and projected future of the mean data rate in high speed I/O links. Many different interfaces are represented in Fig. 1-4, such as SONET, SATA, XAUI and PCIe. Several points can be learned from Fig. 1-4. First, current SerDes circuits concentrate on data rates below 10 Gb/s. Second, the data rate doubles every 2 to 3 years from 1995 to 2010. After 2010, the rate of growth slows down, which indicates a technology breakpoint between 10 and 20 Gb/s. At this breakpoint, physical limitations become the dominant factor preventing the data rate from increasing much beyond 10 Gb/s. As speed goes higher, the parasitic capacitance and inductance introduced by packaging limit the circuit bandwidth and thus the circuit speed.
To implement the SerDes chip, different technologies can be used depending on the speed requirement. Fig. 1-5 shows the wireless communication data rate achievements for different technologies.

Figure 1-4 Mean data rate for high speed I/O links, data from ITRS 2007

Figure 1-5 Wireless communication application spectrums, data from ITRS 2007

From Fig. 1-5, we have a general view of the frequency range for different technologies, from SiC MESFET to InP HBT, HEMT, and GaAs MHEMT. CMOS technology covers frequencies below 28GHz, and SiGe BiCMOS technology covers higher frequencies, up to around 77GHz. This division in usage is caused by the different technology characteristics. In CMOS technology, the low carrier mobility in PMOS transistors prevents their use in high speed operations. Rail-to-rail swing is another factor limiting the circuit speed. The single-ended signals in CMOS technology also make it less robust to noise than the differential signals used in high speed circuits. Parasitic influence is another factor. In addition, the smaller transconductance of the MOS transistors causes lower $f_T$, hence insufficient circuit bandwidth for high speed operations [11]. In BiCMOS technology, however, the bipolar transistors have the advantage of a larger transconductance, resulting in higher cutoff frequencies, higher bandwidths, easier realization of differential signals (making the circuit immune to common mode noise), and smaller signal swings, as in ECL [11]. At speeds above 77GHz, GaAs and InP materials are used, which are often not a feasible choice for industrial applications because of their higher cost.

In light of the cost and the performance advantages, two technologies were chosen for this research: 0.18μm SiGe BiCMOS technology and 65nm CMOS technology.

The Jazz Semiconductor 0.18μm SiGe BiCMOS technology that we used achieves a wide frequency range because of its high $f_T$ and inherently low noise. However, the power consumption and cost of SiGe BiCMOS technology are higher than those of CMOS technology. The IBM 65nm CMOS technology we used has lower power consumption, a higher level of integration and lower cost. For this advanced CMOS technology, circuit speed can be increased above that of CMOS technologies with larger feature sizes because of the reduced transistor size, resulting in fewer parasitic influences and a higher $f_T$ due to the smaller channel length.

The SerDes chip in this project contains two parts, as shown in Fig. 1-3: the transmitter and the receiver. The transmitter converts the parallel 16-bit input to a serial output, while the receiver converts the serial stream back into a parallel 16-bit output. Because of the high data rate at the transmitter output, sufficient bandwidth is required for the last 2:1 MUX, necessitating broadband techniques. A novel multi-phase VCO was implemented in the PLL and CDR to relax the design requirements, such as the frequency and power consumption. Both the MUX and the DEMUX are controlled by multi-phase clocks to further relax the design constraints.
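
The parallel-to-serial and serial-to-parallel conversion described above can be illustrated with a purely behavioral sketch. It ignores clocking, the MUX tree and all analog effects, and the bit order (least significant bit first) is an assumption made here for illustration, not something stated in the text.

```python
from typing import List

WIDTH = 16  # parallel word width of the SerDes in this project


def serialize(words: List[int], width: int = WIDTH) -> List[int]:
    """Flatten parallel words into one serial bit list (LSB first, an assumption)."""
    bits = []
    for word in words:
        if not 0 <= word < (1 << width):
            raise ValueError("word does not fit in the parallel width")
        for i in range(width):
            bits.append((word >> i) & 1)
    return bits


def deserialize(bits: List[int], width: int = WIDTH) -> List[int]:
    """Regroup a serial bit list into parallel words using the same bit order."""
    if len(bits) % width != 0:
        raise ValueError("bit stream length is not a multiple of the word width")
    words = []
    for start in range(0, len(bits), width):
        word = 0
        for i, bit in enumerate(bits[start:start + width]):
            word |= (bit & 1) << i
        words.append(word)
    return words


if __name__ == "__main__":
    data = [0xA5F0, 0x0001, 0xFFFF]
    stream = serialize(data)
    print(len(stream), "serial bits")
    print([hex(w) for w in deserialize(stream)])
```

A round trip through `serialize` and `deserialize` returns the original words, which is the property the transmitter and receiver pair must preserve.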

1.2.1 Transmitter

As introduced above, the transmitter acts as a multiplexer circuit which converts the low speed parallel data into high speed serial data. It consists of a MUX, a PLL, which delivers the clock signals to the entire transmitter, and a control logic circuit, which distributes the clock signal to each block. The block diagram of the transmitter in our design is shown in the top dashed box in Fig. 1-3. The transmitter input is a 16-bit parallel data stream following the encoding block. The PLL generates high speed multi-phase clock signals based on an external reference clock. Those multi-phase clocks go through a control circuit, providing the controlling clocks for the 16:1 MUX.

When the data comes into the serializer, it is not synchronized with the clock. Normally, the data is pre-sampled so it can be synchronized with the internal clocks. The data format used in this design is non-return-to-zero (NRZ), a simple two-level square wave, chosen for its simplicity and its compatibility with conventional digital signal techniques. A high voltage level represents logic “1”, and a low voltage level represents logic “0”. Random NRZ data has the advantage of being more bandwidth efficient [6], [12]. Its disadvantage is the constant DC level that appears during a long run of “1”s or “0”s. Such a constant level makes it difficult to extract and recover data and clock information from the received signal, because the data contains no transitions. Thus, an encoder is always used before the data comes into the multiplexer to avoid long strings of “1”s or “0”s.
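
The run-length concern with NRZ data can be made concrete with a small sketch that measures the longest run of identical bits and counts the transitions available for clock recovery. The helper names are hypothetical and the bit patterns are arbitrary illustrations.

```python
from itertools import groupby
from typing import Sequence


def longest_run(bits: Sequence[int]) -> int:
    """Length of the longest run of identical consecutive bits (0 for an empty stream)."""
    return max((len(list(group)) for _, group in groupby(bits)), default=0)


def transition_count(bits: Sequence[int]) -> int:
    """Number of 0->1 or 1->0 transitions, the edges a CDR can lock onto."""
    return sum(1 for a, b in zip(bits, bits[1:]) if a != b)


if __name__ == "__main__":
    alternating = [1, 0] * 8          # a transition on every bit
    long_run = [1] * 12 + [0] * 4     # a single transition in 16 bits
    for name, pattern in (("alternating", alternating), ("long run", long_run)):
        print(name, longest_run(pattern), transition_count(pattern))
```

The second pattern carries the same number of bits as the first but offers only one edge, which is the situation line encoding is meant to prevent.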

In the 8B/10B encoding, each byte is assigned a 10-bit code. The 10-bit code contains four “1”s and six “0”s, five “1”s and five “0”s, or six “1”s and four “0”s [36]. This prevents long runs of “1”s or “0”s and helps to synchronize the clock. To maintain the DC balance, a running disparity calculation is used to keep the number of “0”s transmitted the same as the number of “1”s transmitted. Because of the increased number of bits, the line rate also has to increase. For example, a 1 Gb/s payload results in a line rate of 1.25 Gb/s. In Gigabit Ethernet, this rate is then reduced using PAM-5, a five-level code, to achieve a lower bandwidth than is possible with a three-level code such as MLT-3 [13].
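
The two quantities in this paragraph, the coding overhead and the disparity of a code word, reduce to simple arithmetic. The sketch below is not an 8B/10B encoder: it contains no code table, the function names are hypothetical, and the bit patterns in the usage section are arbitrary examples of the allowed weights rather than actual 8B/10B code words.

```python
from typing import Iterable, Sequence


def line_rate_gbps(payload_gbps: float, data_bits: int = 8, code_bits: int = 10) -> float:
    """Line rate needed to carry the payload after block coding (10 bits sent per 8 data bits)."""
    return payload_gbps * code_bits / data_bits


def disparity(code: Sequence[int]) -> int:
    """Number of ones minus number of zeros in one code word."""
    return 2 * sum(code) - len(code)


def has_allowed_weight(code: Sequence[int]) -> bool:
    """True if a 10-bit word has four, five or six ones, i.e. disparity -2, 0 or +2."""
    return len(code) == 10 and disparity(code) in (-2, 0, 2)


def running_disparity(codes: Iterable[Sequence[int]]) -> int:
    """Accumulated ones-minus-zeros over a sequence of code words."""
    return sum(disparity(code) for code in codes)


if __name__ == "__main__":
    print(line_rate_gbps(1.0))  # 1 Gb/s payload
    heavy = [1, 1, 1, 0, 1, 0, 1, 0, 1, 0]   # six ones (illustrative pattern only)
    light = [0, 0, 0, 1, 0, 1, 0, 1, 0, 1]   # four ones (illustrative pattern only)
    print(disparity(heavy), disparity(light), running_disparity([heavy, light]))
```

Following a word of disparity +2 with a word of disparity -2 returns the running total to zero, which is the balancing behavior the running disparity calculation enforces.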

The transmitter should have good jitter generation performance so that, even after channel attenuation, the data can still be recognized by the receiver. This requires a low noise PLL, wide bandwidth MUXs, and other careful considerations, such as special layout routing and tolerance of process, voltage, and temperature (PVT) variations.

A low noise PLL can be realized only through a careful choice of structure. In this project, a novel multi-phase VCO was implemented to achieve both wide tuning range and low phase noise. This VCO combined a ring structure with an LC-tank structure. The LC-tank readily achieved low phase noise because of the high quality factor of the inductor. However, the tuning range of the LC-tank was restricted by the limited tuning of the varactors [6]. In this project, wider frequency tuning was achieved by controlling the current variation within the ring structure. Detailed explanations are supplied in Chapter 2. With this multi-phase PLL, a novel 4:1 MUX was developed to relax the clock frequency requirement. In a traditional MUX, the clock frequency must be at least half the data rate. In our MUX, the clock rate only needs to be one fourth of the data rate. Thus, 40 Gb/s data needs only a 10GHz clock instead of a 20GHz clock. This frequency reduction through structure optimization makes high speed applications easier to realize in current semiconductor processes with reasonable clock frequencies.
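The clock-rate arithmetic above can be written as a one-line helper. This is only a sketch of the relationship stated in the text; the function name `clock_ghz` and its parameter `bits_per_clock_period` are assumptions introduced for illustration (2 for a half-rate MUX, 4 for a quarter-rate MUX).

```python
def clock_ghz(data_rate_gbps: float, bits_per_clock_period: int) -> float:
    """Clock frequency needed when `bits_per_clock_period` bits are sent per clock cycle."""
    if bits_per_clock_period <= 0:
        raise ValueError("bits_per_clock_period must be positive")
    return data_rate_gbps / bits_per_clock_period


half_rate = clock_ghz(40, 2)      # traditional MUX: 20 GHz for 40 Gb/s
quarter_rate = clock_ghz(40, 4)   # quarter-rate MUX: 10 GHz for 40 Gb/s
```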

The last 2:1 MUX needs to have a wide bandwidth because of its high data rate. The bandwidth should be larger than half the data rate, which is the highest fundamental frequency of the NRZ data and occurs when the data switches on every bit. Insufficient bandwidth will cause inter-symbol interference (ISI), which makes clock and data recovery difficult [6], [14]. A wider bandwidth decreases the rise and fall times of the transmitted serial data, thus widening the horizontal and vertical eye openings of the serializer output’s eye diagram.
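As a rough numerical illustration of this bandwidth requirement, the sketch below relates the data rate to the fundamental frequency of an alternating NRZ pattern and estimates the rise time as a fraction of the bit period. It assumes a single-pole response, for which the 10% to 90% rise time is approximately 0.35 divided by the 3-dB bandwidth; the MUX in this design is not necessarily single-pole, and the function names are hypothetical.

```python
def nrz_fundamental_ghz(data_rate_gbps: float) -> float:
    """Fundamental frequency of a 1010... NRZ pattern: half the bit rate."""
    return data_rate_gbps / 2


def rise_time_ps(bandwidth_ghz: float) -> float:
    """10%-90% rise time of a single-pole system (assumption), in picoseconds."""
    return 0.35 / bandwidth_ghz * 1e3


def rise_time_fraction_of_bit(data_rate_gbps: float, bandwidth_ghz: float) -> float:
    """Rise time expressed as a fraction of one bit period (unit interval)."""
    bit_period_ps = 1e3 / data_rate_gbps
    return rise_time_ps(bandwidth_ghz) / bit_period_ps
```

Under this assumption, a 40 Gb/s stream passed through a 20GHz bandwidth spends about 70% of each bit period in transition, while doubling the bandwidth halves that fraction, which is why extra bandwidth opens the eye.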

1.2.2 Receiver

The purpose of the receiver in a SerDes is to convert the received high speed transmitter’s serial output data back into parallel output data. After the decoding circuit, this parallel data will be the same as the input parallel data and will be processed by
subsequent circuits. The receiver in this project consists of the DEMUX circuit and the clock and data recovery (CDR) circuit, as shown in the bottom dashed box of Fig. 1-3.

The DEMUX is used to transfer the serial data into parallel output data. Traditionally, the DEMUX needs the clock frequency to be at least half of its input data rate [6], [15], [16]. In our proposed DEMUX circuit, the clock frequency only needs to be one fourth of its input data rate. For example, a 20GHz clock is required for a 1:4 DEMUX for 40 Gb/s input in a traditional 1:4 DEMUX. However, with the proposed 1:4 DEMUX, only a 10GHz clock is needed for a 40 Gb/s data rate. This frequency reduction is realized by the quadrature phase control clock. The proposed multi-phase VCO is also used in the receiver section to generate the required quadrature phase clocks. More details will be supplied in Chapter 3.

The CDR is a critical component in the receiver because it extracts clock edges from the serial data and then uses this extracted clock signal to retime the serial data and control the DEMUX. CDR circuits are commonly classified by the ratio between the VCO frequency and the input data rate. For example, a full rate CDR means that the data rate and the VCO frequency are the same. A half rate CDR means the clock rate is half the data rate. Other fractional rates, such as 1/4 and 1/8, are also used. Lower-rate CDRs relax the design of the VCOs, especially at high data rates. However, CDRs at much lower rates add complexity and power consumption to the circuit design, which may outweigh the advantages. A quarter-rate CDR together with our developed multi-phase VCO was implemented. The CDR with its multi-phase VCO in this project also helps to reduce clock loading compared with the widely used Alexander structure, in which the clock has at least four times the loading [6]. Load reduction makes the multi-phase CDR more attractive for high speed applications.

Three important parameters of the CDR are carefully considered in Chapter 3. These
are jitter transfer, jitter tolerance and jitter generation [6]. In industry, there are different
SONET standards of jitter transfer and jitter tolerance based on various data rates, as
shown in Table 1-1. The goal of our design was to satisfy all the requirements of the
SONET standard.

Jitter transfer describes the amount of jitter the CDR transfers from its input to its
output. To satisfy the jitter transfer requirement, the bandwidth of the CDR transfer
function should be less than the standard mask bandwidth, so that the transferred jitter
within the pass band is smaller than the standard restriction.

The jitter tolerance means how much input data jitter the CDR can tolerate with the
data remaining recoverable. The maximum tolerable jitter decreases as the jitter
frequency increases. To satisfy the jitter tolerance requirement, the jitter tolerance curve
should be above the standard mask. For example, Fig. 1-6 shows the jitter transfer and
jitter tolerance masks of the OC-192 Standard. An acceptable jitter transfer curve should
have a bandwidth of less than 8MHz in order to satisfy the standard requirement. A
satisfactory jitter tolerance curve should lie in the area above the standard jitter tolerance
mask, so that the tolerable jitter meets the standard requirement.
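The two pass/fail criteria described above can be sketched as simple checks. The 8MHz default is the OC-192 jitter transfer bandwidth quoted in the text; the mask and measurement values in the example are placeholders and are not the OC-192 mask, and the function names are assumptions made for illustration.

```python
def passes_jitter_transfer(cdr_bandwidth_hz: float, mask_bandwidth_hz: float = 8e6) -> bool:
    """Jitter transfer passes if the CDR bandwidth is below the mask bandwidth."""
    return cdr_bandwidth_hz < mask_bandwidth_hz


def passes_jitter_tolerance(measured_ui: dict, mask_ui: dict) -> bool:
    """Both arguments map jitter frequency (Hz) to amplitude (UI peak-to-peak).

    The measured curve passes if it lies above the mask at every mask frequency.
    """
    return all(measured_ui[freq] > limit for freq, limit in mask_ui.items())


# Placeholder numbers for illustration only: NOT the OC-192 mask.
example_mask = {1e4: 10.0, 1e6: 1.0, 1e8: 0.1}
example_curve = {1e4: 25.0, 1e6: 2.0, 1e8: 0.3}
```

A real check would interpolate the mask between its corner frequencies; comparing only at the listed frequencies keeps the sketch short.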

Jitter generation represents the jitter caused by the CDR itself, i.e., the output jitter if
the input contained no jitter. Methods to attenuate such jitter generation are introduced in
Chapter 3 including charge pump current matching, VCO phase noise reduction, adding
coupling capacitors between power supplies, and isolating the critical parts.
1.3 State of the Art

As mentioned in the beginning, the goal of this research was to explore the performance edges of the SerDes circuit under current technologies such as CMOS and BiCMOS technologies. The survey of state-of-the-art performance by current SerDes circuits and their components will guide our research.

In 2001, the SerDes chip reached 10 Gb/s with 0.25µm technology [17]. Optical communication devices under the OC-192 Standard, which applies to a 10 Gb/s data rate, are now in mass production, and the technique is very mature. Speeds of 20 Gb/s and even 50 Gb/s were thought to be at the cutting edge of the maximum potential of 0.18µm technology [18]. However, old technologies severely limited the data rate and performance. The increase in power consumption as data rates rose made high speed applications impractical. Thus, advanced technologies and better circuit structures were necessary for better power efficiency and data performance.

Today, many publications refer to the SONET OC-768 Standard, which has a transmission speed of up to 39.813 Gb/s. It is the next generation of the OC-192 Standard, which has a transmission speed of up to 9.953 Gb/s. Many exciting tape-out results have been published. Some of them use sub-0.13µm CMOS technology to achieve a 40 Gb/s data rate [16], [19], [37]-[40]. Some of them use older technologies (above 0.13μm CMOS), which yield a 20 Gb/s rate [20], [21]. Some of them use well-developed BiCMOS technology with its high $f_T$ to achieve up to 140GHz data rates [22], [23], [41]-[45]. The MUX/DEMUX in [19] saved 33.5% power compared with that in [16] because of the lower power supply utilized in [19]. However, the structures in [16], [19]-[23] still require a higher clock frequency, namely half of the data rate. If the clock frequency can be further decreased, then we believe that the power consumption will be reduced even more, and the power efficiency will be higher. Power efficiency is a very important parameter in today’s industrial products. This is a goal of our proposed MUX/DEMUX structure, which can save more power while still maintaining good performance because of the reduced demand on the VCO frequency. We claim that the data rate can be boosted up to 60 Gb/s based on the proposed structure.

Table 1-2 lists the references mentioned above by year, structure, technology, bit rate and power. As the table shows, 40 Gb/s has been widely researched within the last five years in technologies ranging from SiGe BiCMOS to CMOS, mostly below 0.13μm. The older SiGe technology achieved the same speed as the advanced CMOS technology but at the cost of greater power consumption due to the higher supply voltage. Advanced CMOS technology has good low power characteristics because of the low supply voltage. We believe that achieving low power with old technology and higher speed with advanced technology will be a trend in future SerDes design and was a worthwhile goal for this project.

To realize a high performance SerDes design, one critical component, the PLL, must be considered very carefully. This is because the PLL not only supplies all the control clocks in the chip, but also plays an important role in data quality. The PLL has been widely used in many fields and has been developed over a very long time. Many publications regarding PLL designs can be found based on different structures and operating frequencies, ranging from several MHz to above 10GHz. Some of them were implemented with ring oscillators, such as those in [24] and [25]. Some of them were implemented with LC-tank oscillators as described in [26] and [27]. Those PLLs with different types of VCOs have achieved very good performance, either in tuning range, such as [25], or in noise performance, such as [27]. Higher speed PLLs have exceeded 50GHz and 60GHz center frequencies, as shown in [28] and [29], but with degraded power consumption and phase noise performance. Moreover, research continues on trading off performance metrics, including speed, noise, power consumption and frequency tuning range. The PLL in [25] reached both a higher frequency and a wider tuning range than that in [24] while sacrificing phase noise and power consumption performance. The PLL in [26] reached both a wider tuning range and lower power consumption than [27], but at the cost of worse phase noise performance. Thus, how to better trade off all the parameters becomes a challenging topic that we explore in this research. We supply our answers with our proposed LC-ring oscillators in the later chapters.

Table 1-3 lists several PLL papers published in the past five years with different technologies. A wider tuning range is often associated with higher phase noise. Higher clock frequencies also come with larger phase noise.

<table>
<thead>
<tr>
<th>Reference, year</th>
<th>Technology</th>
<th>Center frequency</th>
<th>Tuning range</th>
<th>Phase noise (1MHz offset)</th>
<th>Jitter (RMS)</th>
<th>Power</th>
</tr>
</thead>
<tbody>
<tr>
<td>[25] 2004</td>
<td>90nm SOI CMOS</td>
<td>9.6 GHz-12.8 GHz</td>
<td>35%</td>
<td>-85.3 dBc/Hz</td>
<td>1.33 ps</td>
<td>1.7V/195 mW</td>
</tr>
<tr>
<td>[24] 2005</td>
<td>90nm CMOS</td>
<td>10 GHz</td>
<td>N/A</td>
<td>-104 dBc/Hz</td>
<td>1 ps</td>
<td>1.2V/132 mW</td>
</tr>
<tr>
<td>[27] 2005</td>
<td>0.25µm SiGe</td>
<td>9.75 GHz-10.6 GHz</td>
<td>8.35%</td>
<td>-104 dBc/Hz</td>
<td>N/A</td>
<td>3.3V/N/A</td>
</tr>
<tr>
<td>[85] 2005</td>
<td>BJT, 46 GHz $f_T$</td>
<td>5.09 GHz</td>
<td>8.4%</td>
<td>-101 dBc/Hz at 100 kHz</td>
<td>N/A</td>
<td>3.3V/N/A</td>
</tr>
<tr>
<td>[86] 2005</td>
<td>BJT, 46 GHz $f_T$</td>
<td>10.18 GHz</td>
<td>8.4%</td>
<td>-95 dBc/Hz</td>
<td>N/A</td>
<td>3.3V/N/A</td>
</tr>
<tr>
<td>[28] 2008</td>
<td>0.25µm SiGe</td>
<td>26.2 GHz</td>
<td>6.6%</td>
<td>-89 dBc/Hz</td>
<td>N/A</td>
<td>2.5V/400 mW</td>
</tr>
<tr>
<td>[29] 2008</td>
<td>90nm CMOS</td>
<td>49 GHz-55 GHz</td>
<td>11.5%</td>
<td>-95 dBc/Hz</td>
<td>N/A</td>
<td>1.8V/60mW</td>
</tr>
</tbody>
</table>

A calibration circuit was developed in this project to compensate for the phase mismatches caused by PVT variations and by tier-to-tier interconnections in three dimensional (3D) processes; such compensation is necessary for constructing multi-phase clocks. With phase calibration, lower data jitter can be achieved because the random jitter mainly comes from the PLL. State-of-the-art calibration circuits have improved from timing resolutions of ten picoseconds and above (20ps in [30], 10ps in [31], 29.6ps in [32]) to resolutions of 5ps and below (5ps in [33], 4ps in [34], 1ps in [35]). Further reduction of the timing resolution will benefit high speed calibrations.

Although many papers on SerDes circuits under the OC-768 Standard have been published, there is still some time before industrial mass production will occur. We believe that day will come very soon, and we hope this thesis will be a good reference for future higher speed and higher performance SerDes implementations.

1.4 Contributions to the Field

The advantages of high speed serial links over parallel transmission promise the SerDes chip a very bright future. How to solve the challenges in SerDes design has therefore become a very worthwhile topic. This project focused on the development of circuit structures to solve these challenges.

First, a novel multi-phase VCO was implemented to achieve both low phase noise and wide tuning range. Currently two different types of oscillators are widely used: ring oscillators and LC-tank oscillators. The ring structure can easily achieve a wide frequency tuning range by changing its number of VCO delay buffers. However, the phase noise of a ring oscillator is high because of its lack of passive devices. The LC-tank oscillator, on the contrary, generates low phase noise because of its high quality factor. However, the tuning range of LC-tank oscillators is narrow due to the limited tuning of
the varactors. In our VCO design, the two different types of oscillators are combined with each other to realize both wide tuning and low phase noise. Wide tuning makes the VCO tolerate more process, voltage and temperature (PVT) variations. Low phase noise guarantees good jitter performance. In addition, this hybrid structure makes it easy to realize the multi-phase clocks without any additional components.

Second, a quarter-rate MUX/DEMUX structure is implemented to relax the design constraints at high operation speeds. Compared with the traditional MUX/DEMUX structure, the proposed one requires the clock frequency to be at only 1/4 of its data rate instead of at half of its data rate or higher. This structure becomes more attractive as the data rate increases because higher frequency VCOs (above several tens of GHz) are very difficult to implement with current widely used processes, such as CMOS and BiCMOS technologies. Some VCOs exceeding 50GHz have been published ([28]-[29]); however, either they consume too much power, or routing is difficult at such high speeds. Also, at such high speeds, clock jitter becomes more significant and affects the data jitter. In addition, the influences of the PVT variations become more difficult to predict. The proposed quarter-rate structure proves to be a more efficient way to achieve high speed rates with affordable VCO design.

Third, a novel calibration circuit was developed based on the time-to-digital converter (TDC) circuit to compensate for the phase mismatches among the multi-phase clocks in the PLL. A new technology, three dimensional fully depleted silicon on insulator (3D FDSOI) CMOS technology, was used for this calibration circuit design. Because of the 3D stack structure, the analog PLL and the digital calibration circuit were separated into two different tiers, thus eliminating the noise coupling between the analog and digital portions. Furthermore, the TDC calibration circuit successfully compensated for the phase mismatches caused by the process variation, the 3D vertical interconnection, and the temperature issues in this stack structure. The proposed TDC calibration circuit achieved a timing resolution as fine as 2ps.
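To give a feel for what a 2ps timing resolution means for multi-phase clocks, the sketch below computes the ideal spacing between adjacent phases and the residual error left after a correction that can only be applied in steps of one resolution unit. It assumes an ideal uniform quantizer, which is a simplification and not a description of the calibration circuit in Chapter 4; the 10GHz clock in the usage note and all function names are hypothetical.

```python
def ideal_phase_spacing_ps(clock_ghz: float, n_phases: int = 4) -> float:
    """Ideal time between adjacent phases of an n-phase clock, in picoseconds."""
    return 1e3 / clock_ghz / n_phases


def quantized_correction_ps(error_ps: float, resolution_ps: float = 2.0) -> float:
    """Nearest correction available when steps are multiples of the resolution."""
    return round(error_ps / resolution_ps) * resolution_ps


def residual_error_ps(error_ps: float, resolution_ps: float = 2.0) -> float:
    """Phase error left after applying the quantized correction."""
    return error_ps - quantized_correction_ps(error_ps, resolution_ps)
```

For a quadrature clock at 10GHz, the ideal spacing is 25ps, and under this idealized model the residual mismatch is bounded by half the resolution, i.e. 1ps for a 2ps step.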

1.5 Thesis Organization

In this thesis, the design of the high speed SerDes chip is explained in detail from system level to transistor design, and finally to the layout design. The solutions to these design challenges are included in separate chapters.

Chapter 2 specifies the design of the transmitter in this project. To solve the challenge of low noise clock generation, we proposed a novel multi-phase LC-ring oscillator. This oscillator achieved both wide tuning range and low phase noise. A quarter-rate 4:1 multiplexer was also implemented to significantly relax the VCO design constraints and decrease the power consumption. A delay matching circuit was used in the clock path to align the data with the clock. Inductive peaking was used to achieve broadband operation for high speed operation.

Chapter 3 describes the design of the receiver including the proposed demultiplexer and the clock and data recovery (CDR) circuit with the multi-phase VCO we developed. The quarter-rate 1:4 DEMUX we developed significantly decreased the required clock frequency, and reduced the power consumption due to the reduction in the number of devices and the lower frequency requirement of the clock.

Chapter 4 includes the design of a time-to-digital calibration circuit to improve the jitter performance of the multi-phase PLL. A 3D FDSOI process was used with this calibration circuit. The advantages of the 3D FDSOI process are explained in detail. The 3D process helps separate the analog PLL function from the digital calibration function through separate tiers in the 3D structure, eliminating the noise coupling issues present due to use of the common substrate in the 2D process. The TDC calibration circuit successfully compensated for the phase mismatches caused by the process variations, the vertical interconnections and the temperature influence in the 3D stack structure.

In Chapter 5, conclusions are drawn and future work is discussed.
CHAPTER 2

TRANSMITTER

As shown in Fig. 2-1, a serializer typically consists of an input driver, an encoding block, a multiplexer (MUX), and a phase-locked loop (PLL). The encoding block is used to avoid long runs of consecutive “1”s or “0”s. For example, a common encoding block is an 8B/10B encoder, which maps each byte to a 10-bit code containing four “1”s and six “0”s, five “1”s and five “0”s, or six “1”s and four “0”s [36]. This encoding prevents a sequence of too many consecutive “1”s or “0”s, assisting the clock extraction and the data recovery in the receiver. The PLL generates the clock signals for the MUX, and the PLL’s performance directly determines the output data quality. Its jitter is transferred to the receiver side, thus affecting the recovered data quality. As a consequence, a low-phase-noise PLL is required. In addition, due to the process, voltage and temperature (PVT) variations, a wide tuning range is required so that the PLL can remain in lock across the PVT variations.

In this chapter, a novel multi-phase VCO combining the ring structure and the LC tank was implemented, realizing both the wide tuning range and low phase noise. The phase-locked loop with this proposed VCO achieved a better figure of merit (FOM) than some previously published work. A quarter-rate 4:1 MUX was also implemented. Compared to the typical 4:1 MUX, which requires the clock rate to run at half the data rate, our proposed structure required the clock to run at only 1/4 of the data rate. This significantly relaxes the design constraints for the high speed clock. In addition, the
power consumption can be decreased due to the lower requirement of the clock speed, thereby realizing a better power-speed tradeoff.

The structure of this chapter is as follows. Section 2.1 introduces the proposed 16:1 multiplexer design, focusing on the last 4:1 MUX because of its highest speed. Section 2.2 describes the design details of the proposed phase-locked loop with our specially developed LC-ring oscillator. This includes the phase detector, charge pump, loop filter, VCO and frequency divider. The VCO frequency tuning range and phase noise analyses are supplied. The stability issues are also considered. Section 2.3 briefly introduces the pseudo-random bit generator design. Section 2.4 shows the simulation and measurement results of the transmitter.

![Figure 2-1 Block diagram of the Serializer.](Image)

**2.1 16:1 Multiplexer**

The currently well-developed 0.18μm SiGe technology offers great potential for implementing high speed transmitters and receivers [22], [23], [41]-[45]. For the same network capacity, a higher data rate will process more information and cost less for the product suppliers. Especially in optical communications, a higher data rate will transfer more information within a limited number of channels. A widely used implementation of the multiplexer is a serializer circuit in which parallel data is transformed into a serial stream. In short haul SerDes applications, such as chip-to-chip data transmission in multi-chip systems, parallel data transfer may result in clock skews among the data streams, which are eliminated once the data is converted into a serial stream. As the rate of the parallel data bits increases, the speed of the output serial data increases even more. This high serial data rate requires a high-speed PLL, whose performance depends mostly on the device performance of the technology being employed. Large $f_T$ transistors are necessary for high speed [22], [23], [41]-[46]. This is because a higher $f_T$ means lower parasitic capacitances for the same circuit gain, yielding a larger slew rate and wider bandwidths.

The bipolar device used in this work had a maximum $f_T$ of 160 GHz, which we believe makes it possible for the circuit to operate up to 40 Gb/s. The high speed clock is another obstacle for high speed SerDes design. For example, a traditional half rate MUX with a 40 Gb/s data rate needs a 20GHz clock at the last stage. At such a high speed, the VCO clock period may be comparable to the clock jitter, which makes the high speed design less robust than one with a lower speed VCO. The jitter of the high speed clock may move the clock edges to the wrong positions, thus causing the wrong data to be sampled.

In this project, we considered two MUX designs: one was a 16:1 MUX at 36 Gb/s using 0.18µm SiGe BiCMOS technology, and the other was a 4:1 MUX at the higher speed of 60 Gb/s with a quadrature phase clock using 65nm CMOS technology. As mentioned in Chapter 1, advanced technologies, such as those below 90nm, make higher speed operation achievable because of the lower intrinsic parasitic capacitances. The 65nm CMOS technology was thus used to explore the limits in speed, power and jitter performance. Most significantly, the structure optimization explored in this project was aimed at improving high speed operation. This chapter concentrates on how the structure developed in this project achieved better performance under stricter design constraints.

Compared to the traditional MUX, in which a clock of 30 GHz is needed to achieve a 60 Gb/s data rate, our quarter-rate MUX needed only a 15 GHz clock, without additional dividers, to achieve a data rate of 60 Gb/s. This made it much easier for the VCO to achieve good performance because the VCO frequency was significantly decreased. Moreover, this quadrature phase clock decreased the power consumption because it needed fewer components than the traditional MUX. Another advantage of the multi-phase clock is that we can align the data with the clock at different phases, thus decreasing the number of delay components. Inductive peaking was also used in the last 2:1 MUX because the highest data rate is present at the last stage. This inductive peaking expanded the bandwidth to about twice that without inductive peaking [47].

The 16:1 multiplexer in this thesis included a 16:4 MUX and a 4:1 MUX. Fig. 2-2(a) shows the block diagram of the proposed 16:1 MUX. Since different bandwidths are required for the 16:4 MUX and the last 4:1 MUX, we may use different structures to relax the design complexity. The first four 4:1 MUXs work at 10 Gb/s. With the SiGe HBT BiCMOS technology or the 65nm CMOS technology, this data rate is easy to generate with enough bandwidth. The traditional 2:1 MUX already has a 28GHz bandwidth with 65nm CMOS technology.

As shown in Fig. 2-2 (a), the first four 4:1 MUXs were implemented with the traditional structure since the input data speed was only 3.75 Gb/s. Figure 2-2 (b) shows the data transfer diagram of the 4:1 MUX in the first stage. The “Clk” signal in Fig. 2-2 (a) is from the PLL output. However, the last 4:1 MUX generates the highest output data rate of 60 Gb/s, which is impossible to achieve with the traditional structure in 65nm CMOS technology. Thus, a quarter-rate 4:1 MUX was implemented as shown in Fig. 2-3, in which a 15GHz clock frequency was required instead of 30GHz.

Figure 2-2 Block diagram of the proposed 16:1 MUX and the data transfer of the 4:1 MUX in the first stage.

As shown in Fig. 2-3, the first stage consisted of four MS latches, which were used to pre-sample the data with the same clock signal CLKI. The first four MS latches were also used to hold the data until the following 2:1 MUXs finished the selection. Because the data rate of the last 2:1 MUX was the highest, the bandwidth requirement of that 2:1 MUX was the most critical, and its bandwidth needed to be larger than half the data rate to avoid signal attenuation. Therefore, inductive peaking was used to increase the bandwidth of the last 2:1 MUX. Delay component I matched the clock delay from node “clk” to node “Q” in the first four MS latches, so that the data and clock signal arrived at the first two 2:1 MUXs at the same time. Delay component I plus a 2:1 MUX with fixed inputs matched the clock delay from the first two 2:1 MUXs to the last 2:1 MUX.

In this project, current mode logic (CML) was implemented for the high speed circuits. Fig. 2-4 shows three different logic gates with CML implementation we used in our design. There are several advantages of CML over the CMOS logic.

First, CML steers a constant current between its branches, so its outputs have a limited voltage swing instead of the full rail-to-rail swing of CMOS logic. This makes it easier for CML to achieve higher speed.

Second, CML can easily generate differential signals because of its differential implementation. This differential structure has many advantages over the single-ended structure in CMOS logic, such as common mode noise rejection, easier inversion of the output simply by switching the output nodes, and higher output swing. In CMOS logic, additional inverters are needed to generate the differential signals. These inverters introduce a time delay between the two differential signals. Thus, the point at which the differential signals cross will shift away from the center point of the data. This duty-cycle distortion will result in CDR errors.

Third, CMOS CML logic can be easily integrated with bipolar circuits in BiCMOS technology.

However, CML has two drawbacks. One is the static power consumption caused by the constant tail current $I_{ss}$. The other is the increased area required by the differential structure and the current sources.

The schematics of the first two 2:1 MUXs and the last 2:1 MUX are shown in Fig. 2-5. With 65nm technology, the PLL clock frequency could achieve 15GHz and the data rate could thus be increased to 60 Gb/s. The bandwidth of the high speed 2:1 MUX was
expanded from 28GHz to 60GHz with inductive peaking, as shown in Fig. 2-6. Since the inductor quality factor was not a critical parameter in the inductive peaking circuit and since the inductance was what we were most concerned about, we were able to choose the inductor with an appropriate value and minimum area from the technology library.
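The mechanism behind this bandwidth extension can be illustrated with the textbook shunt-peaking model, in which an inductor $L$ in series with the load resistor $R$ drives the output capacitance $C$, giving a load impedance $Z(s) = (R + sL)/(1 + sRC + s^2LC)$. The sketch below solves for the 3-dB bandwidth of this model relative to the plain RC load. The model, the ratio $k = L/(R^2C)$, and the function name are assumptions made for illustration; the circuit in Fig. 2-5 (b) may differ from this idealized load.

```python
import math


def shunt_peaking_bandwidth_ratio(k: float) -> float:
    """3-dB bandwidth of the shunt-peaked load relative to the plain RC load.

    k = L / (R**2 * C); k = 0 means no inductor.
    With x = w*R*C, |Z/R|**2 = (1 + (k*x)**2) / ((1 - k*x**2)**2 + x**2).
    Setting this to 1/2 gives k**2 * y**2 + b*y - 1 = 0 with y = x**2.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    if k == 0:
        return 1.0
    b = 1 - 2 * k - 2 * k * k
    y = (-b + math.sqrt(b * b + 4 * k * k)) / (2 * k * k)
    return math.sqrt(y)
```

With $k \approx 0.71$ this simple model gives an extension of about 1.85 times, so it should be read as an illustration of why peaking roughly doubles the bandwidth rather than as a reproduction of the simulated 28GHz to 60GHz result.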

(a) Schematics of the first two 2:1 MUX

(b) Schematic of the last 2:1 MUX with inductive peaking [47]

Figure 2-5 Schematics of the three 2:1 MUXs in the last 4:1 MUX
Figure 2-6 Bandwidth comparison of the 2:1 MUXs shown in Fig. 2-5 (a) and (b).

Compared to the structure in [44], the proposed structure in Fig. 2-3 needed only one constant frequency for all blocks instead of multiple frequency dividers to generate divided clocks. Thus, lower power consumption and fewer components could be achieved with the proposed structure. Take our 65nm version as an example. In the traditional 4:1 MUX, a 30 GHz clock would be required for the last 2:1 MUX to achieve the output data rate of 60 Gb/s. However, with our proposed structure in Fig. 2-3, quadrature phase clocks at the single frequency of 15GHz would be enough to achieve a 60 Gb/s data rate. This advantage is more significant in higher speed circuits, where faster VCOs will be more difficult to implement with a traditional structure.

Fig. 2-7 shows the data transfer diagram for the last 4:1 MUX. As shown in Fig. 2-7, input data A1-A4 had the rate of 15 Gb/s. A1 and A3 were sampled by CLKI’s rising edge, generating Q1 and Q3, while A2 and A4 were sampled by CLKI’s falling edge, generating Q2 and Q4. Then, CLKI controlled the top 2:1 MUX to generate M1 from Q1 and Q2, and \( \overline{\text{CLKI}} \) controlled the bottom 2:1 MUX to generate M2 from Q3 and Q4. Finally, with accurate control of CLKQ, the last 2:1 MUX generated the highest speed serial data, which was at four times the input data rate.
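The selection scheme can be mimicked with a purely behavioral sketch that divides each clock period into four quarters defined by the levels of two quadrature clocks. The select polarities, the resulting output bit order, and the function name are assumptions chosen for illustration; the actual ordering is defined by Fig. 2-7, and latch timing and delays are ignored.

```python
def quarter_rate_mux(words):
    """Serialize (q1, q2, q3, q4) tuples, one tuple per clock period.

    Assumed polarities: CLKI is high in quarters 0-1, CLKQ is high in
    quarters 1-2 (a 90 degree lag). The top MUX passes q1 when CLKI is high,
    the bottom MUX (driven by inverted CLKI) passes q3 when CLKI is low,
    and the last MUX passes M2 when CLKQ is high.
    """
    serial = []
    for q1, q2, q3, q4 in words:
        for quarter in range(4):
            clk_i = quarter in (0, 1)
            clk_q = quarter in (1, 2)
            m1 = q1 if clk_i else q2          # top 2:1 MUX, select = CLKI
            m2 = q3 if not clk_i else q4      # bottom 2:1 MUX, select = inverted CLKI
            serial.append(m2 if clk_q else m1)
    return serial
```

Each latched bit appears exactly once per clock period, so four bits leave the MUX per cycle of a clock running at one quarter of the output data rate.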

Figure 2-7 Data transfer diagram of the last 4:1 MUX

As mentioned above, another important component in this 4:1 MUX was the delay component as shown in Fig. 2-3. Q1-Q4 were triggered at the rising edge of CLKI, so that there was a delay from the data sampling to the output equal to the delay from the clock’s rising edge to the output of the MS latch, a time period we will denote as \( T_{C-Q} \). Thus, Delay Component I needed to generate a delay equaling \( T_{C-Q} \). Fig. 2-8 shows the schematic of the delay component we developed in this project. This connection ensured
the same output capacitance as that in a D latch except that the two data input nodes were fixed to a constant value to model the stable input of the MS latch.

![Figure 2-8 Schematic of the delay component I](image)

Unfortunately, the capacitance mismatch between states of the D latch caused mismatched delay, as shown in Fig. 2-4 (b). If the sampled data of the MS latch was logic “low,” then the delay component generated the same amount of delay as that in the MS latch. If the sampled data was logic “high,” then because the delay component had already fixed two inputs to logic “low,” the capacitance at the output nodes was slightly different from that in the MS latch due to the fact that one gate was “OFF”, as shown in Table 2-1.

<table>
<thead>
<tr>
<th></th>
<th>Output node voltage</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>High</td>
</tr>
<tr>
<td><strong>D latch</strong></td>
<td>2 drains (OFF) + 1 gate (ON)</td>
</tr>
<tr>
<td><strong>Delay component I</strong></td>
<td>2 drains (OFF) + 1 gate (OFF)</td>
</tr>
</tbody>
</table>

Table 2-1 Device comparison between the D latch and the delay component I
2.2 Phase-Locked Loop

Multi-phase phase-locked loops (PLLs) have been widely used in high speed I/O circuits such as Serializer/Deserializer (SerDes) [48], [49] and time-interleaved analog-to-digital converters (ADCs) to generate multi-phase clock signals [50], [51]. The operation of these SerDes and ADCs is multiplexed with the multi-phase clock signals to achieve throughputs beyond multi-gigabit samples per second.

However, multi-phase PLLs have the following two issues. The first one is that the phase noise in clock signals must be small enough to guarantee a wide opening in the eye diagram. The second one is that the PLLs must have wide tuning ranges to tolerate process variation in deep-submicron technologies. In high-speed SerDes and ADC systems these two issues become more significant. The jitter in multi-phase PLLs is often comparable to the clock periods. The deep-submicron processes used for high-speed systems generally have more process variations than previous generations.

In this thesis, we present a low power, low phase noise 9GHz multi-phase PLL designed with Jazz Semiconductor’s 0.18μm silicon germanium (SiGe) BiCMOS technology. The SiGe process was selected because of such desired properties as large gain, high speed, low noise, and high breakdown voltages. We concentrated on jitter reduction and tuning properties of the voltage-controlled oscillator (VCO) in the PLL, since the VCO is a major source for phase noise. Our main contribution to the PLL was this VCO design.

Contemporary VCOs are typically either “ring oscillators” or “LC tank oscillators.” The ring oscillator has a wide tuning range but comes with high phase noise due to the lack of passive elements. On the other hand, the LC tank oscillator offers low phase noise but with a narrow tuning range, due to the limited tuning range of varactors. In this paper, a novel LC-ring oscillator was implemented to achieve both wide tuning range and low phase noise. This proposed VCO combined the feed-forward interpolated (FFI) ring structure with the LC tank, thus achieving both advantages of those two types of VCOs through an architectural approach.

A traditional PLL is a feedback system that operates on the phase difference between a periodic input signal and the feedback signal, and it operates by causing the output phase and frequency to be aligned with those of the input signal [52]. This is shown in Fig. 2-9. The phase detector (PD) measures the phase difference with its averaged output being linearly proportional to this phase difference. The low pass filter (LPF), called the loop filter, contains a negative impedance amplifier (NIA) charge pump and an RC filter. The limited bandwidth of the LPF eliminates undesired signals from the charge pump and controls the running frequency of the VCO. The VCO then generates a periodic waveform whose frequency is linearly dependent on its control (input) voltage. A typical feedback divider is a pulse swallow frequency divider, which can be programmed to bring the output waveform to the desired frequency [14].

According to different orders of the loop filter, a PLL can be categorized into different types as shown in Table 2-2.
Table 2-2 Different PLL types with different loop filter orders [53]

<table>
<thead>
<tr>
<th></th>
<th>PLL Type I</th>
<th>PLL Type II</th>
<th>PLL Type III</th>
</tr>
</thead>
<tbody>
<tr>
<td>Order of low pass filter</td>
<td>Without low pass filter</td>
<td>First order RC low pass filter</td>
<td>Second order RC low pass filter ($R$, $C_1$, $C_2$)</td>
</tr>
</tbody>
</table>

Type I PLL has no loop filter. Type II PLL consists of a first order loop filter, as shown in Table 2-2. Type III PLL consists of a second order loop filter, etc. Type I PLLs can eliminate only the phase error but cannot eliminate the frequency error at steady state. Type II PLLs have zero phase and frequency errors at the steady state. However, they may suffer from the disturbances on the VCO control line caused by charge pump switching, thus deteriorating the PLL's phase noise. Type III PLL can resolve the problem of the type II PLL because of the additional capacitor in the loop filter which attenuates any disturbances in the VCO control line [52]-[54]. Higher order PLLs are rarely used because of the stability issues caused by the additional poles. Therefore, Type III PLL was chosen in this project, and it is widely used nowadays.

2.2.1 Voltage Controlled Oscillator (VCO)

2.2.1.1 VCO Structure

In general, two different kinds of VCOs are widely used in PLLs, as shown in Fig. 2-10.
One is the ring oscillator VCO as shown in Fig. 2-10 (a), in which different numbers of RC delay buffers form the ring to achieve different oscillation frequencies. The advantages of the ring oscillator VCO are its simple structure and wide tuning range. However, its noise performance is worse because it lacks passive resonant components and has a lower quality factor, which makes it unattractive in high speed applications. The other commonly used oscillator is the LC tank oscillator as shown in Fig. 2-10 (b). The inductor and capacitor form a resonator. Since an ideal inductor and capacitor do not dissipate energy, the resonator can theoretically achieve a very high quality factor, thus delivering better noise performance than the ring oscillator. Its high frequency selectivity also makes it a good choice for low noise applications. In addition, high frequencies require small inductances, which consume only a small area. With spiral inductors, the phase noise can be as much as 20dB lower than that of ring oscillators. Thus, LC-tank oscillators are mostly used in high speed and low noise applications. However, LC tank oscillators achieve narrower tuning ranges of only several hundred MHz, because the tuning range of reverse-biased varactors is limited by oscillation conditions and power constraints. More varactors or larger varactor ranges are necessary for wider tuning ranges. Unfortunately, starting the oscillation of such an LC tank then requires a larger $g_m$ (and thus a higher phase noise) and more tail current, which can conflict with the power consumption requirements of the design.

Thus, there are always challenges for VCO design, including noise, tuning range, and power consumption. In the past, several approaches were proposed to combine these two architectures to achieve both a wide tuning range and low phase noise. For example, the ring oscillator and LC filter tanks were combined in a single VCO [55]. However, the tuning range of the VCO relies heavily on the size of its varactors. If the varactor size is considerably smaller than the VCO output capacitance, the tuning range will be substantially limited. Kim et al. [56] combined the LC inductor with the cascaded ring oscillator structure to achieve a phase noise of -132dBc/Hz. Although this oscillator achieved a 10% tuning range with on-chip varactors, the tuning range would have been less if the varactor capacitance had been smaller. In our project, two multi-phase LC-ring oscillators were designed with center frequencies at 9GHz and 15GHz respectively.

The multi-phase ring oscillator has been widely used. In the proposed design, the LC tanks are combined with the FFI ring VCO. Instead of using varactors, the VCO relies on the FFI architecture to adjust the tuning range. A pair of inductors in each delay buffer is used to achieve low phase noise. More importantly, the noise filtering performance is better because of the cascaded ring structure. Theoretically, if $N$ identical oscillators are properly cascaded, the effect of noise filtering will be realized and the output noise power will be reduced by a factor of $N^2$ [56]. Compared to the ring structure in [48], which has a phase noise of only -90dBc/Hz at 1MHz offset, the phase noise of this VCO achieves -110dBc/Hz at 1MHz offset.

Although the FFI VCO proposed in this paper had the same topology as the traditional resistor-pair based FFI VCO [1], the tuning mechanisms were significantly different. Previous FFI VCOs have a tuning range of 50%. The VCO output period changes between 1/4T and 1/8T, with T being the delay of a single delay buffer. The VCO we developed changed the output frequency by adjusting the current vectors.

As shown in Fig. 2-11, the proposed VCO consisted of four stages connected similarly to the FFI VCO shown in reference [48].

Figure 2-11 Block diagrams of the proposed VCO and its tuning structure: (a) block diagram of the eight-phase hybrid LC-ring oscillator; (b) $\alpha_1=0$, $\alpha_2=1$; (c) $\alpha_1=1$, $\alpha_2=0$

The oscillator frequency can be changed by varying the DC differential control signals $Ctl$ and $\overline{Ctl}$. When the voltage of $Ctl$ is significantly higher or lower than that of $\overline{Ctl}$, the oscillator turns into either a four-stage ring oscillator or two separate two-stage ring oscillators.

We developed the delay buffers $S_1 - S_4$ based on reference [48] shown in Fig. 2-12. The resistor pair was replaced with an LC tank to decrease phase noise and increase the output swing. $T_1 - T_4$ are implemented with bipolar transistors to achieve lower input referred noise because of the higher gain of the bipolar transistors. $T_5$ and $T_6$ are also bipolar transistors. This is because the switching of the VCO control signal is very slow and settles to a DC value in the locked condition. The bipolar transistors have lower flicker noise at lower speed.

![Schematic of the proposed VCO delay buffer](image-url)

Figure 2-12 Schematic of the proposed VCO delay buffer
A potential problem with this structure is that the two separate two-stage ring oscillators may fall out of synchronization and have random phase offsets between them. For example, if $T_5$ is fully turned off, $T_3$ and $T_4$ cannot accept the incoming signal from the previous stage. The two-input buffer then becomes a one-input buffer. Thus, stages $S_1$ and $S_2$ and stages $S_3$ and $S_4$ are decoupled. As a consequence, the phase offsets among the four outputs cannot be maintained. To solve this problem, two resistors ($R_2$) were added across $T_5$ and $T_6$ respectively. When $T_5$ or $T_6$ was switched off, a small amount of current could still flow through the resistor $R_2$ to prevent the branch from being fully turned off. Therefore, $\alpha_1$ and $\alpha_2$ can be close to 1 but can never equal 1.

2.2.1.2 VCO Phase Noise Analysis

From Fig. 2-11 (a), the open loop transfer function of the proposed LC-ring VCO can be derived as:

$$H_{\text{loop}}(j\omega) = \frac{-(\alpha_1)^4 + (\alpha_2)^4}{1 + (\alpha_2)^2 T^2(j\omega) + (\alpha_1)^2 T^2(j\omega)}, \tag{2-1}$$

where $T(j\omega)$ is the transfer function of the delay buffer shown in Fig. 2-13:

$$T(j\omega) = G_m Z_T = G_m \left[ sL \parallel (sC)^{-1} \parallel R_p \right] = G_m \cdot \frac{j\omega L}{1 - LC\omega^2 + \dfrac{j\omega L}{R_p}}, \tag{2-2}$$

where $Z_T$ is the LC tank impedance in Fig. 2-13.
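
As a quick numerical check of Equation (2-2), the following Python sketch evaluates the buffer transfer function with complex arithmetic. The element values are hypothetical and were chosen only to place the tank resonance near 9GHz; they are not the values used in the design. At $\omega = 1/\sqrt{LC}$ the parallel tank reduces to $R_p$, so the gain should equal $G_m R_p$ with zero phase.

```python
import cmath
import math

def buffer_gain(omega, gm, L, C, Rp):
    """T(jw) = Gm * (jwL || 1/(jwC) || Rp), following Eq. (2-2)."""
    s = 1j * omega
    z_tank = (s * L) / (1 + s * L / Rp + L * C * s * s)  # s*s = -omega^2
    return gm * z_tank

# Hypothetical element values (assumptions for illustration only).
L = 0.5e-9      # inductance in henry
C = 0.625e-12   # capacitance in farad
Rp = 300.0      # equivalent parallel tank resistance in ohm
gm = 10e-3      # buffer transconductance in siemens

w0 = 1 / math.sqrt(L * C)
T0 = buffer_gain(w0, gm, L, C, Rp)
print(f"f0 = {w0 / (2 * math.pi) / 1e9:.2f} GHz")
print(f"|T(jw0)| = {abs(T0):.3f}, phase = {math.degrees(cmath.phase(T0)):.3f} deg")
```

Below resonance the tank looks inductive and the phase of $T(j\omega)$ is positive; above resonance it is negative. This is the phase shift that the tuning analysis later relies on.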

Figure 2-13 The simplified VCO buffer model
If we assume $T(j\omega) = A(\omega)\exp[j\phi(\omega)]$, then:

$$A(\omega) = \frac{G_m L\omega}{\sqrt{(1-LC\omega^2)^2 + \left(\frac{L}{R_p}\omega\right)^2}}, \tag{2-3}$$

$$\phi(\omega) = \tan^{-1}\left(\frac{1-LC\omega^2}{\frac{L}{R_p}\omega}\right). \tag{2-4}$$

If $Q = R_p\sqrt{\frac{C}{L}}$, the noise shaping function of $T(j\omega)$ is [57]:

$$\left|\frac{Y}{X} [j(\omega_0 + \Delta \omega)]\right|^2 = \frac{1}{4\left[\frac{\omega_0}{2} \sqrt{\left(\frac{dA}{d\omega}\right)^2 + \left(\frac{d\phi}{d\omega}\right)^2}\right]^2} \left(\frac{\omega_0}{\Delta \omega}\right)^2 = \frac{1}{4Q^2} \left(\frac{\omega_0}{\Delta \omega}\right)^2. \tag{2-5}$$

When $\alpha_1 \approx 1$ and $\alpha_2 \approx 0$, the loop function $H_{\text{loop,4stage}}$ of the four-stage VCO is:

$$H_{\text{loop,4stage}}(j\omega) = \left(\frac{jG_m L\omega}{1-LC\omega^2 + j\omega L/R_p}\right)^4, \tag{2-6}$$

$$A_4(\omega) = \left(\frac{G_m L\omega}{\sqrt{(1-LC\omega^2)^2 + \left(\frac{L}{R_p}\omega\right)^2}}\right)^4, \tag{2-7}$$

$$\phi_4(\omega) = 4\tan^{-1}\left(\frac{1-LC\omega^2}{\frac{L}{R_p}\omega}\right). \tag{2-8}$$

Thus, the noise shaping function is:

$$\left|\frac{Y}{X} [j(\omega_0 + \Delta \omega)]\right|^2 = \frac{1}{16} \cdot \frac{1}{4Q^2} \left(\frac{\omega_0}{\Delta \omega}\right)^2. \tag{2-9}$$

Comparing Equations (2-5) and (2-9), we found that the noise power spectrum of the four-stage LC-ring VCO corresponding to Fig. 2-11 (c) is 1/16 of that in a single LC tank. Thus, the proposed LC-ring structure had better noise shaping over a single LC tank oscillator. Similarly, when $\alpha_1 \approx 0$ and $\alpha_2 \approx 1$, the noise shaping function of the two-stage LC-ring VCO as shown in Fig. 2-11 (b) was 1/4 of the one in a single LC tank.
However, the four stages in the LC-ring VCO are never completely decoupled. Thus, the noise power is always between 1/16 and 1/4 of that in a single LC tank and never reaches either extreme value.
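
To put these ratios in decibels, the sketch below evaluates the noise shaping function of Equation (2-5) with a $1/N^2$ scaling for $N$ coupled stages, as stated in the text. The function name and the choice of a 9GHz carrier with a 1MHz offset are illustrative assumptions; $Q = 9.8$ is the tank quality factor quoted in the tuning range analysis.

```python
import math

def noise_shaping(q, f0, df, n_stages=1):
    """|Y/X|^2 at offset df from f0 (Eq. 2-5), scaled by 1/N^2 for N coupled stages."""
    return (1.0 / (4.0 * q * q)) * (f0 / df) ** 2 / n_stages ** 2

q, f0, df = 9.8, 9e9, 1e6   # Q from the text; carrier and offset chosen for illustration
single = noise_shaping(q, f0, df)
for n in (2, 4):
    improvement_db = 10 * math.log10(single / noise_shaping(q, f0, df, n))
    print(f"{n}-stage ring: {improvement_db:.2f} dB below a single LC tank")
```

The two boundary cases differ by $10\log_{10}4 \approx 6$dB, which bounds how much the phase noise can move across the control voltage range in this idealized model.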

Thus, to achieve low phase noise in the proposed VCO, three points need to be carefully considered. First, the inductor must be chosen with a quality factor $Q$ as high as possible; it is selected from the list of inductors the technology supplies. Second, the $g_m$ of the coupling MOSFETs must be small while still satisfying the startup condition of the oscillator. This follows from transistor noise theory. As explained in [52], the noise current spectral density of a MOSFET is $\overline{i_n^2} = 4kT\gamma g_m$, where $k$ is Boltzmann's constant, $T$ is the temperature, $\gamma$ is typically 2/3 for long channel devices and larger for short channel devices, and $g_m$ is the transconductance of the transistor. Thus, lower $g_m$ generates lower noise current. However, to satisfy the Barkhausen criteria, the loop gain in Equation (2-1) must reach 1 for the oscillation to start. Therefore, $g_m$ should be high enough to guarantee that the total loop gain is larger than 1, both when $\alpha_1 \approx 0$, $\alpha_2 \approx 1$ and when $\alpha_1 \approx 1$, $\alpha_2 \approx 0$. Our design typically had a loop gain between 2 and 3 to allow for process variations. Third, the tail current must be increased until the VCO output saturates. This is because the noise voltage caused by the tail current is proportional to the square root of the current, while the output swing is proportional to the tail current. Thus the output signal power increases faster than the noise power induced by the tail current, and the net effect is that the relative noise is reduced. However, once the tail current exceeds the value at which the output swing saturates, the output signal power stops increasing while the tail current noise power keeps increasing. Therefore, the relative total noise begins to increase.

2.2.1.3 VCO Tuning Range Analysis

Fig. 2-14 shows the phase relationship at various points within the proposed multi-phase VCO at the two sets of alpha values, and Fig. 2-15 shows the simplified diagram of Fig. 2-14 and its phasor diagram under the two sets of alpha values.

If we assume the frequency tuning is close to linear because of appropriate linearity resistors, then from Fig. 2-15,

$$I_T = \alpha_1 I_{T1} + \alpha_2 I_{T4} + I_{M1}, \tag{2-10}$$

where $I_T$ denotes the tank current. $I_{T1}$, $I_{T4}$ and $I_{M1}$ are the currents through transistors $T_1$, $T_4$ and $M_1$, and $I_{T1}$ and $I_{T4}$ have the same amplitude but different phases.

Figure 2-15 The simplified phasor diagram of the proposed VCO under the two boundary conditions

As shown in Fig. 2-15 (a), there is a constant 45° phase offset among $I_{T1}$, $I_{M1}$ and $I_{T4}$. Thus, when $\alpha_1 \approx 1$ and $\alpha_2 \approx 0$ the angle $\theta$ between $I_{T1}$ and $I_{M1}$ is:

$$\theta = \tan^{-1} \left( \frac{|I_{T1}| \cos(\pi/4)}{|I_{M1}| + |I_{T1}| \sin(\pi/4)} \right). \tag{2-11}$$

Similarly, when $\alpha_1 \approx 0$ and $\alpha_2 \approx 1$ as shown in Fig. 2-15 (b), $\theta$ is:

$$\theta = \tan^{-1} \left( \frac{|I_{T4}|}{|I_{M1}|} \right). \tag{2-12}$$

From Equations (2-11) and (2-12), $\theta$ is related to the amplitude ratio of $I_{T1}/I_{M1}$ or $I_{T4}/I_{M1}$. When $\alpha_1 \approx 1$ and $\alpha_2 \approx 0$,

$$I_{T1} = g_{m,T_1} \cdot V_A = \frac{qI_C}{kT} \cdot V_A, \tag{2-13}$$

where $I_C$ is the DC current through $T_1$, which is $I_{SS1}/2$.

$$I_{M1} = g_{m,M_1} \cdot V_{out} = \frac{2I_D}{V_{gs,M_1} - V_{th}} \cdot V_{out}, \tag{2-14}$$

where $I_D$ is the DC current through $M_1$, which is $I_{SS2}/2$. Thus, if the VCO output swing remains constant,

$$\left| \frac{I_{T1}}{I_{M1}} \right| = \frac{qI_{SS1}}{2kT} \cdot \frac{|V_A|}{2I_D} \cdot \frac{V_{gs,M_1} - V_{th}}{V_{out} \cdot 4kT/q} \tag{2-15}$$

According to [6], $V_{out}$ needs to be aligned with $I_{M1}$. The phase shift introduced by the LC tank should cancel $\theta$:

$$\frac{\pi}{2} - \tan^{-1} \left( \frac{L \omega_{osc}}{R_p(1-LC\omega_{osc}^2)} \right) = - \theta. \tag{2-16}$$

Thus, the oscillation frequency $\omega_{osc}$ is:
\[
\omega_{\text{osc}} = \frac{1}{\sqrt{LC}} \cdot \frac{\tan \theta + \sqrt{\tan^2 \theta + 4Q^2}}{2Q}. \tag{2-17}
\]

If \( \omega_{\text{osc1}} \) and \( \omega_{\text{osc2}} \) indicate the oscillation frequencies under the two alpha conditions as shown in Fig. 2-15 (a) and Fig. 2-15 (b), then the tuning range of the proposed VCO is:

\[
\omega_{\text{osc2}} - \omega_{\text{osc1}} = \frac{1}{\sqrt{LC}} \left( \frac{\tan \theta_2 + \sqrt{\tan^2 \theta_2 + 4Q^2}}{2Q} - \frac{\tan \theta_1 + \sqrt{\tan^2 \theta_1 + 4Q^2}}{2Q} \right). \tag{2-18}
\]

In order to get the maximum frequency tuning, \( I_{T1}/I_{M1} \) must be as large as possible. However, this ratio is limited by the phase noise performance and resonance requirement. Q is the quality factor of the LC tank defined as \( R_p \sqrt{C/L} \). According to the above equations and the parameters used in the design (\( Q=9.8, V_{gs} - V_{th} = 0.232V, I_{SS1}/I_{SS2} = 1.27, \) and \( 1/\sqrt{LC} = 8.5GHz \)), the frequency can be tuned from 8.75GHz to 9.77GHz with a center frequency at 9.1GHz. This 11.1% tuning range can be improved even further by increasing the ratio \( I_{SS1}/I_{SS2} \). However, large tail currents may saturate the output swing and deteriorate the VCO phase noise. Thus, there exists a trade-off among power consumption, tuning range, and phase noise.
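
The dependence of the oscillation frequency on $\theta$ can be explored with a short Python sketch. It uses the $1/(2Q)$ normalization that appears in Equation (2-18), so that $\theta = 0$ returns the tank resonance, and it treats the quoted $1/\sqrt{LC} = 8.5$GHz as a frequency in Hz. The function names are illustrative; the inverse function follows algebraically from the same expression and is not taken from the original text.

```python
import math

def f_osc(tan_theta, q, f_tank):
    """Oscillation frequency for a given tan(theta), per Eqs. (2-17)/(2-18)."""
    return f_tank * (tan_theta + math.sqrt(tan_theta ** 2 + 4 * q * q)) / (2 * q)

def tan_theta_for(f_target, q, f_tank):
    """Inverse of f_osc: tan(theta) = Q * (r - 1/r), with r = f_target / f_tank."""
    r = f_target / f_tank
    return q * (r - 1 / r)

Q, F_TANK = 9.8, 8.5e9            # values quoted in the text
for f in (8.75e9, 9.77e9):        # reported tuning range limits
    t = tan_theta_for(f, Q, F_TANK)
    print(f"f = {f / 1e9:.2f} GHz needs tan(theta) = {t:.3f} "
          f"(theta = {math.degrees(math.atan(t)):.1f} deg)")
```

Because $\tan\theta$ is set by current ratios such as $|I_{T4}|/|I_{M1}|$ in Equation (2-12), the sketch makes the trade-off visible: a wider tuning range requires a larger interpolation current relative to the cross-coupled pair.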

2.2.2 Phase Detector

An ideal phase detector (PD) generates an output signal whose dc voltage depends linearly on the phase difference between the two periodic input signals. When there is no input signal, the phase detector should maintain the same output as when the signals are exactly in phase. As shown in Fig. 2-16, a standard 3-state phase frequency detector is used in our PLL [52]. The pulse widths of the output signals “Up” and “Dn” are proportional to the total delay of the AND gate and the delay from “RST” to “Q” in the MS latch.

![MS Latch Diagram](image)

Figure 2-16 Structure of the 3-state phase detector

However, in a typical CMOS or current mode logic (CML) AND gate, the two input-to-output delays are different due to the asymmetry in the AND gate circuit. This asymmetry will lead to a narrow pulse with its width being proportional to the difference between these two delays. As a consequence, the charge pump will be misled as it charges/discharges the LPF, which then erroneously changes the VCO frequency. To avoid this problem, we implemented a one-level AND gate shown in Fig. 2-17, which provides symmetric loading and matched input levels for both latches.

![One-Level AND Gate Diagram](image)

Figure 2-17 Schematic of the proposed one-level AND gate
This one-level structure ensures a uniform delay from both of the two inputs to the output of the AND circuit, making possible the simultaneous reset of both latches, thus eliminating any timing problems. The delay component is added to eliminate the dead zone of the phase detector [52].

2.2.3 Charge Pump and Low Pass Filter

The charge pump with a loop filter is shown in Fig. 2-18, which includes a negative impedance amplifier (NIA) charge pump and a second order RC loop filter [84]. The RC filter is placed between the two differential output nodes $X$ and $Y$. $M_5 - M_8$ form two groups of differential current sources. The cross-coupled pair, $M_1$ and $M_2$, is used to compensate the current leakage through the $R_1$ pair [58]. Nodes $X$ and $Y$ float when the output pull-down negative resistance equals the pull-up resistance in magnitude. Therefore, the loop filter becomes a perfect integrator when these two resistances cancel each other.

![Figure 2-18 Schematic of the NIA charge pump with a second order loop filter](image)

Figure 2-18 Schematic of the NIA charge pump with a second order loop filter

The equivalent pull-down negative resistance between node $X$ and node $Y$ can be calculated as

$$R_{dn} = -2(R_2 + 1/g_m), \tag{2-19}$$

where $g_m$ is the transconductance of $M_1$ and $M_2$.

The equivalent pull-up resistance between node $X$ and node $Y$ is

$$R_{up} = 2R_1. \tag{2-20}$$

If $R_{up} = -R_{dn}$, then

$$g_m = 1/(R_1 - R_2). \tag{2-21}$$

At this point, the total impedance between the differential output nodes $X$ and $Y$ will be infinite. This also indicates the current through the capacitors can be in either direction while still keeping the differential center voltage of the output nodes unchanged. This is an important advantage of this type of charge pump/loop filter.
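
The cancellation condition can be illustrated numerically. The sketch below adds the pull-up and pull-down conductances of Equations (2-19) and (2-20) in parallel and confirms that the net conductance between $X$ and $Y$ vanishes when $g_m$ satisfies Equation (2-21). The resistor values are hypothetical and serve only as an example.

```python
def net_conductance(r1, r2, gm):
    """Conductance between X and Y: R_up = 2*R1 in parallel with R_dn = -2*(R2 + 1/gm)."""
    r_up = 2.0 * r1
    r_dn = -2.0 * (r2 + 1.0 / gm)
    return 1.0 / r_up + 1.0 / r_dn   # parallel conductances add

def gm_for_cancellation(r1, r2):
    """Transconductance required by Eq. (2-21); only meaningful for R1 > R2."""
    if r1 <= r2:
        raise ValueError("R1 must exceed R2 for a positive g_m")
    return 1.0 / (r1 - r2)

R1, R2 = 2000.0, 1000.0              # hypothetical resistor values in ohm
gm = gm_for_cancellation(R1, R2)
print(f"required g_m = {gm * 1e3:.2f} mS, net conductance = {net_conductance(R1, R2, gm):.3e} S")
print(f"with g_m halved: net conductance = {net_conductance(R1, R2, gm / 2):.3e} S")
```

A residual positive conductance, as in the second case, means the loop filter leaks and behaves as a lossy integrator rather than a perfect one.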

The values of $R$, $C_1$, $C_2$ were determined by the PLL transient response requirements and stability considerations. This will be explained in a later section.

Fig. 2-19 shows how the output of the loop filter changes according to different lead-lag inputs at the phase detector. In Fig. 2-19 (a), the reference leads the feedback. Then, the output voltage of the loop filter will increase to adjust the VCO frequency to be faster. When the reference lags behind the feedback as shown in Fig. 2-19 (b), the output of the loop filter will decrease to adjust the VCO frequency to be slower.

2.2.4 Integer-N Frequency Divider

A pulse swallow frequency divider [14] was used in this design as shown in Fig. 2-20. The divider consisted of a prescaler (divide-by-\(N/(N+1)\)), a program counter (divide-by-\(P\)) and a swallow counter (divide-by-\(S\)). The frequency relationship between input and output is

\[ f_{in} = (NP+S)f_{out}. \tag{2-22} \]

Figure 2-19 Phase detector and charge pump outputs with different input signals.

Figure 2-20 Structure of the programmable integer-N divider [14].

It should be noted that a buffer is usually interposed between the VCO and the prescaler to isolate the former from the switching noise in the latter.
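
The division ratio in Equation (2-22) can be verified by counting input cycles. In the sketch below, the prescaler divides by $N+1$ for the first $S$ prescaler pulses of each output period and by $N$ for the remaining $P-S$ pulses. The settings shown (an 8/9 prescaler with $P=11$, $S=2$) are hypothetical and were picked only because they give a total ratio of 90.

```python
def input_cycles_per_output(n, p, s):
    """Count input cycles consumed during one output period of a pulse swallow divider."""
    if not 0 <= s <= p:
        raise ValueError("the swallow count S must satisfy 0 <= S <= P")
    cycles = 0
    for k in range(p):                     # the program counter counts P prescaler pulses
        cycles += (n + 1) if k < s else n  # divide by N+1 until S pulses are swallowed
    return cycles

# Hypothetical settings: 8/9 prescaler, P = 11, S = 2.
print(input_cycles_per_output(8, 11, 2))   # prints 90
```

Changing $S$ by one changes the overall ratio by one, which is what makes the divider programmable in integer steps.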

2.2.5 PLL Components Parameter Determinations

The specifications of the proposed PLL are as follows.

1) PLL phase margin was targeted to 60° for best trade-off between the settling time and the overshoot [54];

2) PLL output frequency was set to be 9GHz for the 36 Gb/s rate and 15 GHz for the 60 Gb/s rate. Best tradeoffs among power, speed and output jitter were needed compared to some previously published work;

4) PLL phase noise was designed to be lower than -100dBc/Hz to meet SONET jitter specifications [59];

5) Frequency tuning range was targeted to be larger than 10% to tolerate more process variations than some previously published work [85], [86].

Each component was designed based on the above specifications and applied to 0.18μm BiCMOS technology. The same system analysis method can also be implemented in 65nm CMOS PLL.

For the continuous time-domain analysis to apply, the PLL bandwidth must be much smaller than the reference frequency, normally less than $f_{\text{ref}}/10$ [53], [54]. Therefore, we chose the PLL loop bandwidth to be less than 4MHz and the frequency divider ratio to be N=90. As shown in Fig. 2-21, the phase detector compares the phases of its two inputs and then generates pulses as a function of the phase difference. These pulses control the charge pump to generate the charge or discharge current through the loop filter, and the VCO control signal is thus generated.

Figure 2-21 PLL diagram showing how the VCO control voltage reflects the phase error
If we assume the phase of the reference and the frequency divider output are $\theta_{ref}$ and $\theta_{div}$, and the charge pump current is $I_{cp}$, then the transfer function from the phase detector to the charge pump can be written as:

$$H_{PD}(s) = \frac{I_{out}}{\theta_{ref} - \theta_{div}} = \frac{I_{cp}}{2\pi}. \tag{2-23}$$

The transfer function reflecting $I_{out}$ to the VCO control voltage is the impedance of the loop filter as shown in Fig. 2-18. Thus,

$$H_{LF}(s) = \frac{V_{ctl}}{I_{out}} = K_L \frac{1+s/\omega_z}{s(1+s/\omega_p)}, \tag{2-23}$$

where $K_L = 1/(C_1+C_2)$, $\omega_z = 1/RC_1$, and $\omega_p = (C_1+C_2)/(RC_1C_2)$.

The VCO transfer function is that of an integrator:

$$H_{VCO}(s) = \frac{K_{VCO}}{s}, \tag{2-24}$$

where $K_{VCO}$ is the VCO gain. Integration transfers frequency into phase.

The PLL model in the frequency domain is shown in Fig 2-22.

![Diagram of PLL model](image)

Figure 2-22 The proposed PLL model in the frequency domain.

The open loop transfer function is:

$$G_{OL}(s) = H_{PD}(s) \cdot H_{LF}(s) \cdot H_{VCO}(s) \cdot \frac{1}{N} = \frac{I_{cp}K_{VCO}}{2\pi N(C_1+C_2)} \cdot \frac{1+s/\omega_z}{s^2(1+s/\omega_p)}. \tag{2-25}$$

The closed loop transfer function is:

$$H_{CL}(s) = \frac{G_P(s)}{1+G_{OL}(s)} = \frac{H_{PD}(s)H_{LF}(s)H_{VCO}(s)}{1+H_{PD}(s)H_{LF}(s)H_{VCO}(s)\frac{1}{N}}.$$

The natural frequency is:

\[ \omega_n = \sqrt{\frac{I_{CP} K_{VCO}}{2\pi N (C_1 + C_2)}} \]

and the unity gain bandwidth is \( \omega_u \approx \frac{\omega_n^2}{\omega_z} \). Evaluating the phase of Equation (2-25) at \( \omega_u \), the phase margin of the PLL is:

\[ \Phi_M = \arctan\left(\frac{\omega_u}{\omega_z}\right) - \arctan\left(\frac{\omega_u}{\omega_p}\right). \tag{2-27} \]

Fig. 2-23 shows the simulated PLL step response and frequency response. As shown in Fig. 2-23 (a), the PLL settling time is approximately 1.5\( \mu \)s. Fig. 2-23 (b) shows that the PLL phase margin is 69°.
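
As a rough cross-check of this kind of result, the phase margin can be computed directly from Equation (2-25) by locating the unity-gain frequency numerically and reading the phase of the open loop gain there. The Python sketch below does this by bisection. All loop values are hypothetical (only $N = 90$ is taken from the text), and $K_{VCO}$ is assumed to be expressed in rad/s per volt, so the printed numbers are not expected to reproduce Fig. 2-23.

```python
import cmath
import math

def open_loop_gain(f, icp, kvco, n, r, c1, c2):
    """G_OL at frequency f (Hz), following Eq. (2-25); kvco is in rad/s per volt."""
    s = 2j * math.pi * f
    wz = 1.0 / (r * c1)
    wp = (c1 + c2) / (r * c1 * c2)
    k = icp * kvco / (2 * math.pi * n * (c1 + c2))
    return k * (1 + s / wz) / (s * s * (1 + s / wp))

def unity_gain_frequency(params, f_lo=1e3, f_hi=1e9):
    """Bisect on a log scale for |G_OL| = 1; |G_OL| falls monotonically with f."""
    for _ in range(200):
        f_mid = math.sqrt(f_lo * f_hi)
        if abs(open_loop_gain(f_mid, **params)) > 1.0:
            f_lo = f_mid
        else:
            f_hi = f_mid
    return math.sqrt(f_lo * f_hi)

def phase_margin_deg(params):
    """Phase margin = 180 degrees + phase of G_OL at the unity-gain frequency."""
    fu = unity_gain_frequency(params)
    return 180.0 + math.degrees(cmath.phase(open_loop_gain(fu, **params)))

# Hypothetical loop values (assumptions for illustration), with N = 90 as in the text.
params = dict(icp=100e-6, kvco=2 * math.pi * 1e9, n=90, r=5e3, c1=150e-12, c2=10e-12)
fu = unity_gain_frequency(params)
print(f"unity-gain frequency = {fu / 1e6:.2f} MHz, "
      f"phase margin = {phase_margin_deg(params):.1f} deg")
```

With $C_1 = 15C_2$ the pole sits a factor of 16 above the zero, which caps the achievable phase margin at roughly 62°; a larger capacitor ratio is needed for more margin.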

2.2.6 PLL Noise Analysis

The noise sources in a PLL mainly consist of five types: reference noise, phase detector/charge pump noise, loop filter noise, VCO noise, and divider noise, as shown in Fig. 2-24.

Figure 2-23 Simulation results of the PLL step response and frequency response: (a) PLL step response; (b) PLL frequency response

Figure 2-24 Block diagram of the PLL with the noise sources

The noise transfer functions from each PLL component to the PLL output are shown as follows.

\[
\Phi_{\text{out}, \text{VCO}} = \left(\frac{1}{1+G(s)}\right)^2 \cdot \Phi_{\text{noise}, \text{VCO}} \quad (2-28)
\]

\[
\Phi_{\text{out}, \text{LF}} = \left[\frac{H_{\text{LF}}(s)}{1+G(s)}\right]^2 \cdot \Phi_{\text{noise}, \text{LF}} \quad (2-29)
\]

\[
\Phi_{\text{out}, \text{PD\&CP}} = \left[\frac{H_{\text{PD\&CP}}(s)}{1+G(s)}\right]^2 \cdot \Phi_{\text{noise}, \text{PD\&CP}} \quad (2-30)
\]

\[
\Phi_{\text{out, div}} = \left[\frac{H_{\text{PD}}(s)H_{\text{LF}}(s)H_{\text{VCO}}(s)}{1+G(s)}\right]^2 \cdot \Phi_{\text{noise, div}} \quad (2-31)
\]

\[
\Phi_{\text{out, ref}} = \left[\frac{H_{\text{PD}}(s)H_{\text{LF}}(s)H_{\text{VCO}}(s)}{1+G(s)}\right]^2 \cdot \Phi_{\text{noise, ref}} \quad (2-32)
\]

From Equation (2-28), we see that the VCO noise transfer function is a high pass filter while other noise transfer functions are all low pass filters. Since we know from the previous section that the PLL functions like a low pass filter, the noise within the PLL bandwidth is determined by the reference, phase detector, charge pump and frequency divider, while the noise out of the PLL bandwidth is determined by the VCO. Thus, there exists a tradeoff between the PLL bandwidth and the PLL noise: a wide PLL bandwidth can reject more VCO noise while it passes more input referred noise and reference noise; a narrow PLL bandwidth can reject more input referred noise and reference noise while it passes more VCO noise. In addition, a wider bandwidth achieves faster PLL settling. Since the reference noise and VCO noise are the two main noise sources in PLL, we must design our PLL according to the reference noise quality, VCO noise quality and the settling time requirement.
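
The high pass and low pass behavior can be seen numerically with a generic open loop gain of the form in Equation (2-25). In the sketch below, $|1/(1+G)|^2$ is the shaping applied to VCO noise and $|G/(1+G)|^2$ is the shaping applied to reference-path noise with the constant factor $N$ left out. The gain constant and the zero and pole frequencies are hypothetical values chosen for illustration.

```python
import math

def loop_gain(f, k, fz, fp):
    """Open loop gain G(s) = k*(1 + s/wz) / (s^2 * (1 + s/wp)), the form of Eq. (2-25)."""
    s = 2j * math.pi * f
    return k * (1 + s / (2 * math.pi * fz)) / (s * s * (1 + s / (2 * math.pi * fp)))

def vco_noise_gain(f, k, fz, fp):
    """|1/(1+G)|^2: high pass shaping of VCO noise, as in Eq. (2-28)."""
    return abs(1 / (1 + loop_gain(f, k, fz, fp))) ** 2

def input_noise_gain(f, k, fz, fp):
    """|G/(1+G)|^2: low pass shaping of reference-path noise (factor N omitted)."""
    g = loop_gain(f, k, fz, fp)
    return abs(g / (1 + g)) ** 2

K, FZ, FP = 6.9e12, 212e3, 3.4e6   # hypothetical loop constants
for f in (1e3, 1e6, 1e8):
    print(f"{f:9.0e} Hz: VCO path {10 * math.log10(vco_noise_gain(f, K, FZ, FP)):8.1f} dB, "
          f"input path {10 * math.log10(input_noise_gain(f, K, FZ, FP)):8.1f} dB")
```

Far below the loop bandwidth the VCO contribution is strongly suppressed and the input path passes with unity gain; far above it the roles reverse.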

According to [59]-[62], the noise of each component in the PLL can be expressed as follows:

\[ \Phi_{\text{noise,ref}}^2 = K_0 + \frac{K_1}{f} + \frac{K_2}{f^2} + \frac{K_3}{f^3}, \quad (2-33) \]

\[ \Phi_{\text{noise,VCO}}^2 = K_4 + \frac{K_5}{f^2} + \frac{K_6}{f^3}, \quad (2-34) \]

\[ \Phi_{\text{noise,PD\&CP}}^2 = K_7 + \frac{K_8}{f}, \quad (2-35) \]

\[ \Phi_{\text{noise,div}}^2 = K_9 + \frac{K_{10}}{f}, \quad (2-36) \]

where \( K_0 - K_{10} \) can be determined by the component simulation results [59], [62]. The extracted values are:

\[ K_0 = 10^{-17}, \quad K_1 = 10^{-13.5}, \quad K_2 = 10^{-10.5}, \quad K_3 = 10^{-8.5}. \]

\[ K_4 = 10^{-15}, \quad K_5 = 10^0, \quad K_6 = 2000. \]

\[ K_7 = 7 \times 10^{-25}, \quad K_8 = 1.5 \times 10^{-26}. \]

\[ K_9 = 10^{-16.4}, \quad K_{10} = 5 \times 10^{-13.4}. \]
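
The polynomial noise models are straightforward to evaluate. The following sketch codes Equations (2-33) and (2-34) with the coefficients listed above and reports the open loop reference and VCO noise at a few offsets. The units of the coefficients are not stated in the text, so the decibel conversion is a plain $10\log_{10}$ of the model value and should be read as an assumption.

```python
import math

K_REF = (10 ** -17, 10 ** -13.5, 10 ** -10.5, 10 ** -8.5)   # K0..K3, Eq. (2-33)
K_VCO = (10 ** -15, 10 ** 0, 2000.0)                         # K4..K6, Eq. (2-34)

def ref_noise(f):
    """Reference noise model: K0 + K1/f + K2/f^2 + K3/f^3."""
    k0, k1, k2, k3 = K_REF
    return k0 + k1 / f + k2 / f ** 2 + k3 / f ** 3

def vco_noise(f):
    """Open loop VCO noise model: K4 + K5/f^2 + K6/f^3."""
    k4, k5, k6 = K_VCO
    return k4 + k5 / f ** 2 + k6 / f ** 3

def to_db(x):
    return 10 * math.log10(x)

for f in (1e3, 1e5, 1e6, 1e7):
    print(f"{f:8.0e} Hz: ref {to_db(ref_noise(f)):7.1f} dB, VCO {to_db(vco_noise(f)):7.1f} dB")

# Offset at which the 1/f^3 and 1/f^2 terms of the VCO model are equal.
print(f"VCO flicker corner = {K_VCO[2] / K_VCO[1]:.0f} Hz")
```

In this model the VCO noise exceeds the reference noise at every listed offset, which is consistent with the emphasis placed on the VCO design in this chapter.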

Figure 2-25 Noise generated in the loop filter

For the noise injected by the low pass filter, the noise source is the resistor thermal noise as shown in Fig. 2-25. The noise voltage is given by

\[ V_R^2 = 4kTR, \quad (2-37) \]

where \( k \) is Boltzmann's constant and \( T \) is the temperature in Kelvin. This noise voltage is modeled as a voltage source in series with \( R \). Thus, the output noise of the loop filter caused by the resistor thermal noise can be calculated as follows:

\[ \Phi_{\text{noise,LF}}^2 = V_R^2 \cdot \left| \frac{\frac{1}{sC_2}}{\frac{1}{sC_2} + \frac{1}{sC_1} + R} \right|^2 \quad (2-38) \]

If we substitute Equation (2-33) through Equation (2-38) into Equation (2-28) through Equation (2-32), the phase noise of each component and the total PLL phase noise are shown in Fig. 2-26.

![Fig. 2-26 Simulation results of PLL component phase noise in Matlab.](image)

From Equation (2-28) through Equation (2-32), we know that the reference noise can be minimized by decreasing the divider factor \( N \). The noise contributed by the phase detector and charge pump can be minimized by increasing the phase detector gain, for example, by increasing the charge pump current. The noise contributed by the VCO can be minimized by increasing the loop gain, since the transfer function from the VCO noise to the PLL output noise is approximately inversely proportional to the loop gain. All the noise contributions are also correlated to the PLL bandwidth, and there exist tradeoffs between the noise and the PLL bandwidth, as mentioned earlier.

2.3 Linear Feedback Shift Register (LFSR)

A pseudorandom bit sequence (PRBS) generator is used to assist on-chip testing. Normally, the linear feedback shift register (LFSR) is a PRBS generator. In this project, a 16-bit PRBS was generated with a 12-bit LFSR in which 4 bits were repeated so that all sequences could be detected. The period is thus $2^{12}-1$ bits/cycle. The speed of the LFSR is 1/16 of the output data rate. The LFSR coefficients we used were \{12, 9, 8, 5\} for the four required XOR gates. Fig. 2-27 shows the schematic of the standard 12-bit LFSR [63]. Q1 – Q12 are the twelve outputs, and the last four outputs are chosen twice to form the 16 data bits. A detailed introduction is given in Appendix A.
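
A behavioral model helps when checking the generator against simulation. The Python sketch below implements a Fibonacci LFSR in which the listed coefficients are treated as 1-based stage numbers XORed into the feedback; this tap convention, the shift direction, and the ordering of the 16-bit word (Q1 to Q12 followed by Q9 to Q12 again) are assumptions, since the schematic in Fig. 2-27 is not reproduced here. The period is measured rather than assumed.

```python
from itertools import islice

def lfsr_states(taps, nbits, seed=1):
    """Yield successive states of a Fibonacci LFSR (taps are 1-based stage numbers)."""
    if seed == 0:
        raise ValueError("the all-zero state is a lock-up state")
    state, mask = seed, (1 << nbits) - 1
    while True:
        yield state
        feedback = 0
        for t in taps:
            feedback ^= (state >> (t - 1)) & 1
        state = ((state << 1) | feedback) & mask

def lfsr_period(taps, nbits, seed=1):
    """Number of shifts until the seed state recurs."""
    states = lfsr_states(taps, nbits, seed)
    first = next(states)
    for count, state in enumerate(islice(states, 1 << nbits), start=1):
        if state == first:
            return count
    raise RuntimeError("state did not recur; check that stage nbits is a tap")

def word16(state):
    """Form a 16-bit word from a 12-bit state by using the last four outputs twice."""
    q = [(state >> i) & 1 for i in range(12)]   # Q1..Q12
    return q + q[8:]                            # assumed ordering: Q1..Q12, Q9..Q12

print(lfsr_period((4, 3), 4))          # small sanity case, prints 15
print(lfsr_period((12, 9, 8, 5), 12))  # measured period for the coefficients in the text
```

A measured period of $2^{12}-1$ for the second call would confirm that the coefficient set is maximal length under the assumed convention.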

![Figure 2-27 Schematic of the 12-bit LFSR](image)

2.4 Experimental Results and Layouts

All the components of the PLL were simulated with Jazz Semiconductor’s 0.18μm SiGe BiCMOS technology. The measurement instruments were: Tektronix 11801C oscilloscope, Rohde & Schwarz SML01 signal generator, NAVY version of the FSP40 spectrum analyzer, and GGB 10 channel RF probes.

Fig. 2-28 shows the simulated VCO frequency vs. the control voltage and VCO phase noise vs. the control voltage in Cadence. As shown in Fig. 2-28 (a), the frequency tuning is more linear when R1 is larger. However, the phase noise of the VCO does not change linearly with the control voltage ($Ctl - \overline{Ctl}$). When the differential control voltage is raised from -0.6V to 0.6V, the phase noise is approximately 7dB higher. The phase noise difference between the four-stage VCO and the two-stage VCO is $10\log_{10}4 = 6.02$dB.

Figure 2-28 Simulation results of the VCO tuning and the VCO control voltage vs. the VCO phase noise: (a) the simulated VCO output frequency vs. control voltage ($Ctl - \overline{Ctl}$); (b) simulation of the phase noise vs. control voltage ($Ctl - \overline{Ctl}$) when $R1$ is 350.

Fig. 2-29 shows the simulated phase noise of the proposed VCO. The phase noise is -110dBc/Hz at 1 MHz offset.

Fig. 2-30 shows the measured spectrum and output waveform of the PLL output. Due to the limited bandwidth of the spectrum analyzer, the PLL output was divided by 4 so that the spectrum could be displayed. The actual PLL frequency was 2.24GHz×4=8.96GHz.

Fig. 2-31 shows the measured PLL output phase noise. The measurement result is -118.64dBc/Hz at 1MHz offset, and the rms jitter is 246.3fs.

Fig. 2-32 shows the measured phase noise versus power supply and temperature variations. As shown in Fig. 2-32, the phase noise performance is best in the range 3.25V-3.5V. When the power supply was lower than 3.2V it was hard to start the oscillation. Voltages larger than 3.65V caused excess current noise. The lowest phase noise was -121dBc/Hz at approximately 35°C in our measurement.
(a) The PLL divide-by-4 output

(b) The power spectrum density of the divide-by-4 PLL output

Figure 2-30 Spectrum and output waveform of the proposed PLL: center frequency is 2.24GHz×4=8.96GHz
All devices shown in Fig. 2-12 have breakdown voltages higher than 3.3V for safe operation.

<table>
<thead>
<tr>
<th>Settings</th>
<th>PHASE NOISE</th>
<th>Spot Noise</th>
</tr>
</thead>
<tbody>
<tr>
<td>Signal Frequency</td>
<td>2.24 GHz</td>
<td>Evaluation from 1 kHz to 10 MHz</td>
</tr>
<tr>
<td>Signal Level</td>
<td>10 dBm</td>
<td>Residual PM 0.199 dB</td>
</tr>
<tr>
<td>Analyzer Mode</td>
<td></td>
<td>Residual FM 11.24 kHz</td>
</tr>
<tr>
<td></td>
<td></td>
<td>RMS Jitter 0.2463 ps</td>
</tr>
</tbody>
</table>

Figure 2-31 The measured phase noise of the PLL: -118.64dBc/Hz at 1MHz offset

Figure 2-32 The measured PLL phase noise vs. the power supply voltage

Table 2-3 shows the comparison in performance of the proposed multi-phase PLL with some previously published PLLs.

<table>
<thead>
<tr>
<th>Table 2-3 PLL performance comparison</th>
</tr>
</thead>
<tbody>
<tr>
<td>Technology</td>
</tr>
<tr>
<td>90nm CMOS</td>
</tr>
<tr>
<td>Tuning range</td>
</tr>
<tr>
<td>Center frequency</td>
</tr>
<tr>
<td>Phase noise (at 1MHz offset)</td>
</tr>
<tr>
<td>Power (mW)</td>
</tr>
<tr>
<td>RMS jitter</td>
</tr>
<tr>
<td>FOM</td>
</tr>
</tbody>
</table>

A figure of merit (FOM) is used to compare the performance, which is a function of the power consumption, center frequency, and phase noise. It is defined as:

$$FOM = L(\Delta f) - 10\log \left[ \frac{1\,\text{mW}}{P} \cdot \left(\frac{f_0}{\Delta f}\right)^2 \right],$$

(2-39)

where $L(\Delta f)$ is the PLL’s phase noise at $\Delta f$ offset, $P$ is the PLL’s power consumption and $f_0$ is the PLL’s center frequency [64]. As shown in Table 2-3, compared with three other previously published PLLs ([25], [85] and [86]), our PLL has the best phase noise performance and FOM. The PLL in [25] has a wider tuning range because of its ring structure. However, its phase noise is the worst among the four PLLs. The phase noise listed for our PLL is -106dBc/Hz rather than the measured -118.64dBc/Hz, because the measurement was taken after the divide-by-4 divider, which lowers the phase noise by $20\log 4 \approx 12$dB; this 12dB must be removed from the measured result to refer it to the full PLL output frequency.
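
The divider correction and Equation (2-39) can be evaluated with a few lines of Python. This is a sketch for illustration only: the function names are hypothetical, the divider is assumed ideal (a phase noise change of exactly $20\log N$), and the power value used below is a placeholder rather than a figure taken from Table 2-3.

```python
import math


def divider_phase_noise_shift_db(n):
    """Ideal phase-noise reduction (dB) when a frequency is divided by n."""
    return 20 * math.log10(n)


def pll_fom(phase_noise_dbc_hz, f0_hz, offset_hz, power_mw):
    """FOM per Eq. (2-39): L(df) - 10*log10[(1 mW / P) * (f0 / df)^2]."""
    ratio = (1.0 / power_mw) * (f0_hz / offset_hz) ** 2
    return phase_noise_dbc_hz - 10 * math.log10(ratio)


measured_dbc_hz = -118.64        # measured at the divide-by-4 output (Fig. 2-31)
at_pll_output = measured_dbc_hz + divider_phase_noise_shift_db(4)
placeholder_power_mw = 100.0     # hypothetical value, not taken from Table 2-3
fom = pll_fom(at_pll_output, 8.96e9, 1e6, placeholder_power_mw)
print(round(at_pll_output, 2), round(fom, 1))
```

With this definition a more negative FOM is better, since lower phase noise, lower power and a higher center frequency all reduce the value.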

Fig. 2-33 shows the simulated and measured eye diagrams of the final transmitter output with Jazz Semiconductor’s 0.18μm SiGe BiCMOS technology. The design achieved a 36 Gb/s data rate at a 9GHz clock frequency with a 3.4V power supply. The simulated peak-to-peak jitter was 8.7ps and the measured peak-to-peak jitter was 10ps. The measured eye opening was 0.64UI in the horizontal direction and 245mV in the vertical direction.

In the simulation, the inductor was first verified in Sonnet Tool with the extracted inductance and parasitic parameters included. Then this model with the new parameters was imported back into Cadence for the simulation. There is a difference of approximately 2ps in the peak-to-peak jitter between the simulated and the measured results. The discrepancy was caused by noise introduced in the measurement. As will be discussed in Chapter 5, the oscilloscope needs a triggering signal for its input data. In this measurement, the triggering signal we used was a noisy high frequency signal from the PLL output, which had 246fs rms jitter. This jitter was also transferred into the data jitter. In addition, the cable itself introduced 2 to 5 mV noise.

There are several methods to improve the measured jitter performance. First, we can decrease the frequency of the triggering signal to achieve low noise triggering and use cables with wider bandwidth. This can enlarge the eye opening in both horizontal and vertical directions. Second, adding decoupling capacitors either on chip or off chip between the power supply and the ground can decrease the power supply noise. This can enlarge the vertical eye opening. Finally, adding shunt capacitors between the gate/base of the current source can decrease the tail current noise and widen the eye opening.

Table 2-4 shows the simulated power consumption of the data and clock paths in the proposed 16:1 MUX. The total power consumption of the 16:1 MUX is 1.01W at 3.4V power supply voltage.
(a) Simulated eye diagram of the transmitter output at 36 Gb/s rate: the maximum horizontal eye opening is 18.8ps and the maximum vertical eye opening is 265mV, the peak-to-peak jitter is 8.7ps

(b) Measured eye diagram of the transmitter output at 36 Gb/s rate: the maximum horizontal eye opening is 18ps and the maximum vertical eye opening is 245mV, the peak-to-peak jitter is 10ps

Figure 2-33 Simulated and measured eye diagrams of the transmitter output at 36 Gb/s data rate.
Table 2-4 Simulated power consumption of the data path and the clock path in the proposed 16:1 MUX

<table>
<thead>
<tr>
<th>Simulated power</th>
<th>Data path</th>
<th>Clock path</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>0.70W</td>
<td>0.31W</td>
</tr>
</tbody>
</table>

Fig. 2-34 shows the layout of each part in the transmitter with 0.18μm BiCMOS technology.

(a) 16:4 MUX: 480μm×280μm

Figure 2-34 (b) 4:1 MUX: 440μm×300μm
2.5 PLL with 65nm CMOS Technology

The alternative technology we used in this project for the SerDes design was 65nm CMOS technology. A 60 Gb/s 4:1/1:4 MUX/DEMUX was implemented with this technology. As mentioned in Chapter 1, besides the high $f_T$ BiCMOS technology, 65nm CMOS is also an attractive technology because of its low power consumption made possible by the low supply voltage. The power supply of the core VCO can be 2V with 65nm CMOS technology implementing the same LC-ring structure.

The design targets of this PLL were the same as those given at the beginning of Section 2.2.5, except that the reference frequency was set to 58.6MHz and the frequency divider factor was set to 256. In addition to the system level configuration steps described in previous sections, other factors also need careful consideration.

First, unlike the 0.18μm BiCMOS technology in which a 3.4V power supply was used, the lower power supply in 65nm CMOS technology made it difficult to achieve three different voltage levels with only source followers in a current-mode logic (CML) MS latch, as shown in Fig. 2-35. This MS latch structure assumes input data (D), clock (Clk), and reset (Rst) inputs. In BiCMOS technology, we easily achieved three levels: 3.4V, 2.5V and 1.6V for the above three level signals with \( V_{th} = 0.9V \) difference between the levels. However, in CMOS technology, the voltage difference between the three signals can be as small as the overdrive voltage \( V_{d_{sat}} \), which typically is only approximately 200mV-300mV. Such low voltage level shifting cannot be generated by the source follower because \( V_{gs} \) in the source follower consumes the voltage of \( (V_{th} + V_{d_{sat}}) \), which is approximately 0.8V if the NMOS threshold voltage is 0.5V and the overdrive voltage is set to 0.3V. The 2V power supply is not enough to achieve three different voltage levels. A new level shifter circuit was thus developed to achieve the small voltage level shifting of 300mV. Fig. 2-36 shows the structure of this level shifter.

![Figure 2-35 Schematic of the three-level MS latch](image-url)
As shown in Fig. 2-36, the maximum output swing can be calculated as:

\[ V_{\text{out,swing}} = V_{\text{out,max}} - V_{\text{out,min}} = (V_{dd} - I_{ss}R_1) - (V_{dd} - I_{ss}R_1 - I_{ss}R_2) = I_{ss}R_2. \] (2-40)

The shifted voltage from the input to the output is \( V_{\text{shift}} = I_{ss}R_1 \). Thus, by correctly choosing \( I_{ss}, R_1 \) and \( R_2 \), both the required voltage shifting value and the output swing can be satisfied.
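
As a worked illustration of Equation (2-40), the two resistors can be sized from the required shift and swing once a tail current is chosen. The sketch below is only an example: the function names are hypothetical, and the 1mA tail current and 400mV swing are assumed values chosen for the arithmetic, not values from the design; only the 300mV shift and the 2V supply come from the text.

```python
def size_level_shifter(v_shift, v_swing, i_ss):
    """Return (R1, R2) in ohms from V_shift = Iss*R1 and V_swing = Iss*R2."""
    r1 = v_shift / i_ss
    r2 = v_swing / i_ss
    return r1, r2


def output_levels(v_dd, i_ss, r1, r2):
    """Return (V_out_max, V_out_min) as used in Eq. (2-40)."""
    v_max = v_dd - i_ss * r1
    v_min = v_dd - i_ss * (r1 + r2)
    return v_max, v_min


i_ss = 1e-3                                  # assumed tail current (A)
r1, r2 = size_level_shifter(0.3, 0.4, i_ss)  # 300 mV shift, assumed 400 mV swing
v_max, v_min = output_levels(2.0, i_ss, r1, r2)
print(r1, r2, v_max, v_min)
```

Because both quantities scale with $I_{ss}$, the ratio $R_1/R_2$ fixes the ratio of shift to swing, and $I_{ss}$ can then be chosen from speed and power considerations.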

The second factor we need to consider is the design of the frequency divider, since the VCO output frequency is now 15GHz, which is higher than the 8.9GHz in the BiCMOS technology. The frequency divider here is composed of divide-by-two stages, each of which is an MS latch with inverting feedback. Thus, the first MS latch, which is connected to the VCO output, must satisfy the highest speed requirement. A traditional MS latch was implemented as shown in Fig. 2-37 and was tested to see whether it could operate at such a high speed.
Fig. 2-38 shows the simulated waveforms of the divide-by-256 divider at each stage. The traditional flip-flop divider works well after careful sizing of the input differential pair.

Figure 2-38 Simulation waveforms of the frequency divider with 15GHz VCO output
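
The frequency seen at each stage of the divider chain follows directly from the cascade of divide-by-two stages. The short sketch below assumes eight ideal divide-by-two stages (since $2^8 = 256$) and uses a hypothetical function name; it only illustrates why the first stage sets the speed requirement.

```python
def divider_stage_frequencies(f_in_hz, n_stages):
    """Output frequency of each cascaded divide-by-two stage."""
    freqs = []
    f = f_in_hz
    for _ in range(n_stages):
        f /= 2.0
        freqs.append(f)
    return freqs


stages = divider_stage_frequencies(15e9, 8)  # 15 GHz VCO, divide-by-256
for i, f in enumerate(stages, start=1):
    print(f"stage {i}: {f / 1e6:.3f} MHz")
```

The last stage output is 15GHz/256, which is approximately the 58.6MHz reference frequency, while only the first stage has to accept the full 15GHz input.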
The third factor that needed to be considered was the safe operation of the cross-coupled pair in the VCO buffer as shown in Fig. 2-12. Since the highest output voltage in Fig. 2-12 reached \((V_{dd} + V_{swing}/2)\), those two coupled NMOS transistors may enter the breakdown region which can cause failure. Thus, NMOS transistors with high breakdown voltage were used here. This issue was also considered in the 0.18\(\mu\)m BiCMOS technology in which NMOS transistors with high breakdown voltage were used.

Similar to Section 2.4, the experimental results with the 65nm CMOS technology are shown as follows.

Fig. 2-39 shows the Matlab simulation waveforms of the phase margin of the PLL in which the phase margin is 63 degrees.

![Bode Diagram]

Figure 2-39 Simulated phase margin of the PLL with Matlab
Fig. 2-40 shows the simulation waveforms of the PLL while settling. The settling time in the Matlab simulation is approximately 3μs.

![Step Response Diagram](image)

Figure 2-40 PLL step response with Matlab

Fig. 2-41 shows the phase noise of the LC-ring VCO at 15GHz. The phase noise is -109.3dBc/Hz at 1MHz offset.

![Phase Noise Diagram](image)

Figure 2-41 Simulation waveform of the VCO phase noise
Fig. 2-42 shows the eye diagram of the final 60 Gb/s data output.

Figure 2-42 Simulated eye diagram of the 4:1 MUX at 60 Gb/s rate.

The maximum horizontal eye opening is 12.1 ps and the maximum vertical eye opening is 230mV. The data’s peak-to-peak jitter is 4.8ps.
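
To compare eye openings across data rates, it helps to express them in unit intervals (UI), where one UI is the bit period. The following sketch uses hypothetical function names and simply applies UI = 1/(data rate) to the numbers quoted in this chapter.

```python
def unit_interval_ps(data_rate_gbps):
    """Bit period in picoseconds for a data rate given in Gb/s."""
    return 1000.0 / data_rate_gbps


def eye_opening_ui(opening_ps, data_rate_gbps):
    """Horizontal eye opening (or jitter) expressed as a fraction of one UI."""
    return opening_ps / unit_interval_ps(data_rate_gbps)


print(round(unit_interval_ps(60), 2), round(eye_opening_ui(12.1, 60), 3))
print(round(unit_interval_ps(36), 2), round(eye_opening_ui(18.0, 36), 3))
```

At 60 Gb/s the 12.1ps opening corresponds to roughly 0.73UI, and at 36 Gb/s the measured 18ps opening corresponds to roughly 0.65UI, in line with the 0.64UI quoted in Section 2.4.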

At such high operation frequency, the inductor model from the library needed to be verified with special RF design tools such as the Sonnet Tool. The simulation was thus based on the inductor models, which were extracted from the Sonnet Tool.

Table 2-5 shows the simulated power consumption of the data and the clock paths. The total power consumption of the 4:1 MUX excluding the PLL is 31mW.

Table 2-5 Simulated power consumption of the data path and the clock path in the proposed 4:1 MUX with 65nm CMOS technology

<table>
<thead>
<tr>
<th></th>
<th>Data path</th>
<th>Clock path</th>
</tr>
</thead>
<tbody>
<tr>
<td>Simulated power consumption</td>
<td>24.8mW</td>
<td>6.2mW</td>
</tr>
</tbody>
</table>
Finally, Fig. 2-43 shows the layout of the 4:1 MUX and the PLL with 65nm CMOS technology. The area of the 4:1 MUX is \(0.42\text{mm} \times 0.16\text{mm}\) and the area of the PLL is \(0.54\text{mm} \times 0.6\text{mm}\).

Figure 2-43 Layouts of 4:1 MUX and PLL with 65nm CMOS technology
2.6 Discussion and Future Work

2.6.1 Discussion

In this chapter, two methods for solving the low noise clock design and low power design were described.

A novel multi-phase VCO combining the ring oscillator and the LC tank oscillator was implemented, realizing both wide tuning range and low phase noise. The phase-locked loop with the proposed VCO achieved a better figure of merit (FOM) than some previously published work. Compared to a single LC-tank oscillator in [86], the phase noise in the proposed VCO achieved a value that was 20dB lower. This performance improvement was realized by the cascade structure of four LC-tank oscillators. Through Equation (2-9), the proposed LC-ring oscillator achieved a phase noise of 12dB lower than a single LC-tank oscillator. Compared to the three-stage ring oscillator in [25], this work achieved a phase noise of 30dB lower. Although the VCO tuning range in this work is narrower than that in [25], the FOM is still better. Wider tuning range can be achieved with the sacrifice of the phase noise. Thus, there always exists a trade-off between the tuning range and the phase noise depending on the system requirements.

A quarter-rate 4:1 MUX was also implemented, realizing lower power consumption and relaxing the design constraints of the high speed clock and high speed frequency divider. Moreover, due to the high output data rate, such as 60 Gb/s in this project, a 30GHz clock is needed with the traditional 4:1 MUX. This high speed clock requires inductive peaking with current 65nm CMOS technology. Thus, the quarter-rate structure avoided the use of inductive peaking in the clock buffer and saved the chip area.
Because of the lower clock frequency requirement with the proposed quarter-rate structure, a data rate of 80 Gb/s could potentially be reached in the future with 65nm CMOS technology.

2.6.2 Future Work

2.6.2.1 Line Encoding

As mentioned above, the line encoding is a very important part in the serializer since it ensures the DC balance of the output data and also improves the data recovery performance in the deserializer. The line encoding for Gb/s data rates, either 8B/10B or 64B/66B encoding, can be implemented for the transmitter in future designs. Appendix B gives an introduction to 8B/10B and 64B/66B encoding. Each bit in the encoded 10-bit symbol should be a function of the input 8 bits. The final expression contains only the operations of addition, multiplication and inversion. The circuit then can be generated through AND/NAND gates, OR/NOR gates, and the inverters.

2.6.2.2 Inductor Modeling

As mentioned above, what we care about in the inductive peaking circuit is the inductance instead of the quality factor of the inductor. This is different from the LC tank in which the quality factor needs to be as large as possible. Therefore, a custom-designed inductor, even one with a lower quality factor than the foundry inductor, can realize the same bandwidth extension in the inductive peaking circuit. The significant advantage of replacing the foundry model with the designed inductor is the area reduction.
2.6.2.3 Speed Enhancing

With 65nm CMOS technology and future advanced technology, the data speed may be further increased to above 80 Gb/s. This is mainly determined by the process parameters such as the cut-off frequency of the transistors and the parasitic capacitances.

2.6.2.4 Flip-Chip Bonding

The pads used in this project are wire-bond pads. Pads of this kind require long connections between the chip and the external circuits in the chip packaging, as shown in Fig. 2-44. These long connections introduce large parasitic inductance and capacitance, thus limiting the speed of the circuits in the chip.

Currently, another type of chip bonding, flip-chip bonding, is used for high speed I/O circuits, as shown in Fig. 2-45. Unlike the wire bonding in Fig. 2-44, the flip-chip bonding connects the chip pads and the matching pads on the external circuit face to face. Therefore, the interconnection between the chip and the external circuit becomes much shorter than that in the wire bonding. The shorter interconnection with flip-chip bonding
extends the bandwidth, compared to the wire bonding. This could be part of our future work, considering the chip packaging of our high speed I/O circuit.

Figure 2-45 Diagram of the flip-chip bonding
CHAPTER 3

RECEIVER

In contrast to serializers, a deserializer typically contains a demultiplexer (DEMUX), a clock and data recovery (CDR) circuit, a decoder and the output buffer as shown in Fig. 3-1. The DEMUX acts as the reverse of the MUX which converts the serial data into parallel data. The CDR is the most critical component in the receiver side, because it extracts the clock information from the received data and then uses that extracted clock to recover the data. Therefore, the design of the CDR is the most challenging step in the receiver, especially in high speed SerDes where the jitter performance is critical.

In this chapter, a quarter-rate CDR, which takes advantage of our previously developed multi-phase LC-ring VCO, was implemented to overcome the frequency limitations in traditional CDRs. Detailed analysis of the CDR design process is supplied. A novel 1:4 DEMUX was realized. Based on the proposed multi-phase VCO, this 1:4 DEMUX also requires the clock rate to be only 1/4 of the CDR’s input data rate, which is half of the clock rate in the typical 1:4 DEMUX. This relaxes the design constraints of the high speed clock. In addition, with the proposed 1:4 DEMUX structure, approximately 1/3 of the power consumption in the typical structure can be saved because of the reduction in the number of components.

The structure of this chapter is as follows: Section 3.1 introduces the pre-emphasis of the receiver, focusing on the broadband techniques. Those techniques help to improve the quality of the received data [6]. Section 3.2 describes the design of the 1:16 demultiplexer
which includes the first 1:4 DEMUX and the following 4:16 DEMUX. The 4:16 DEMUX is formed by four 1:4 DEMUXs. Section 3.3 describes the design of the CDR circuit in this project. Detailed explanations of the critical design concerns, such as jitter generation, jitter transfer, and jitter tolerance, are supplied. Section 3.4 gives the simulation results and the full layout including all the components.

Figure 3-1 Block diagram of the deserializer

3.1 Pre-Emphasis Amplifier

As mentioned in Chapter 2, the transmitter generates the high speed data and the receiver extracts the clock information from this high speed serial input data. Thus, the quality of the received high speed data is critical, because it determines whether the CDR can successfully extract the information from the data. If the data attenuates severely after the channel or if the jitter performance is poor, it will be very difficult for the receiver to recover the data. Two main aspects will affect the jitter performance of the received data: the clock duty cycle jitter and the data dependent jitter.

If jitterless data enters the transmitter, it may contain jitter by the time it comes out, due to the duty cycle of the transmitter’s clock. The clock duty cycle jitter is shown in Fig. 3-2. In Fig. 3-2 (a), the clock duty cycle is 50%, so the data widths are equal to each other, and the final eye opening is the widest. However, as shown in Fig. 3-2 (b), a clock duty cycle which is not 50% will cause unequal data widths and the eye opening will be limited by the narrowest data. Therefore, the widest eye opening only happens when the clock duty cycle is 50%. Any noise which causes the clock duty cycle to deviate from 50% will narrow the data eye opening.

![Data, Clk, TX output diagrams for different clock duty cycles](image)

Figure 3-2 TX output data with different duty cycle of the clock signal

The data dependent jitter or inter-symbol interference (ISI) is caused by the limited bandwidth of the channel [6]. As shown in Fig. 3-3, different data patterns may have different responses through the channel due to the limited channel bandwidth. For example, if long “0”s or “1”s come after long “1”s or “0”s, as shown in Fig. 3-3 (a), the output has enough time to discharge or charge to the lowest or highest level. However, if long “1”s or “0”s are followed by frequently switched data such as “0101…” as shown in Fig. 3-3 (b), then the output cannot fully charge or discharge to the same value as shown in Fig. 3-3 (a), because the narrow channel bandwidth limits the rise and fall times. Thus, the eye opening will become narrower with frequently changed data.
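
The pattern dependence described above can be reproduced with a very simple channel model. The sketch below assumes a single-pole (RC-like) channel, which is our simplification and not the channel used in this work; the function name and the choice of a bit period equal to one time constant are also assumptions made for illustration.

```python
import math


def bit_end_levels(bits, t_bit, tau, v_start=0.0):
    """Channel output at the end of each bit for a single-pole channel.

    Within one bit the output relaxes exponentially toward the bit's
    ideal level (1.0 or 0.0) with time constant tau.
    """
    decay = math.exp(-t_bit / tau)
    levels = []
    v = v_start
    for b in bits:
        target = 1.0 if b else 0.0
        v = target + (v - target) * decay
        levels.append(v)
    return levels


pattern = [1] * 8 + [0, 1, 0, 1]   # long run of ones, then alternating data
levels = bit_end_levels(pattern, t_bit=1.0, tau=1.0)
print([round(v, 3) for v in levels])
```

After the long run of ones the output is close to the full high level, whereas the ones inside the alternating section end well below it, which is the eye closure attributed to ISI in Fig. 3-3.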

In this project, the pre-emphasis amplifier and inductive peaking were utilized to enlarge the bandwidth and attenuate the ISI issue.
As described in the transmitter section above, there are two methods that can be used to widen the bandwidth. The first one is inductive peaking, which increases the bandwidth without deteriorating the circuit gain. This method enlarges the eye opening by pulling the original small eye levels $V_{H1}$ and $V_{L1}$ to the larger eye levels $V_H$ and $V_L$, as shown in Fig. 3-4 (a). However, the disadvantage of this method is the overshoot the inductive peaking brings to the circuit, which can itself cause jitter in the data. In addition, the inductor consumes more area. The second method is using feedback to achieve a wider bandwidth while sacrificing the circuit gain as shown in Fig. 3-4 (b). A widely known circuit is the Cherry-Hooper amplifier which increases the bandwidth by significantly decreasing the output impedance, thus moving the poles to a higher frequency [6]. This method will bring the highest and the lowest eye levels $V_H$ and $V_L$ in the eye diagram to the smallest eye levels $V_{H1}$ and $V_{L1}$. This method saves more area than the inductive peaking but with smaller gain. In this project, inductive peaking was used on the transmitter side, as we attempted to minimize attenuation of the signal as it passed through the transmitter and then from the transmitter to the receiver. The Cherry-Hooper amplifier was used on the receiver side to save area because of the elimination of the inductors. The combination of the two methods can also be used for higher data rate operation.

(a) Inductive peaking technique and its effect on the eye diagram

(b) Feedback technique and its effect on the eye diagram

Figure 3-4 Two methods to broaden the circuit bandwidth and their effects on the eye diagram

3.2 1:16 Demultiplexer

This 1:16 DEMUX contains one 1:4 DEMUX followed by four 1:4 DEMUXs as shown in Fig. 3-5. Each of the five 1:4 DEMUXs in the effective 1:16 DEMUX has the same structure.
In the traditional and widely used 1:4 DEMUX design shown in Fig. 3-6 ([15], [16]), the clock frequency is half the data rate. That is, if we assume the input data rate is 60 Gb/s from the transmitter output, the clock frequency in the first 1:2 DEMUX must be 30GHz, and the clock frequency in the second stage 1:2 DEMUXs is 15GHz after a frequency divider. Therefore, we need three 1:2 DEMUXs and 30 GHz clock and, of course, the necessary frequency dividers. This will cost significant power with a current mode logic (CML) structure in high speed circuits and will create difficulties for the CDR design due to such a high speed clock. In addition, the frequency divider at such a high speed is very difficult to implement.
As mentioned in Chapter 2, a quarter-rate MUX was implemented to significantly relax the clock frequency and the power consumption requirements. The same idea can be used in the receiver design. With our developed multi-phase VCO, we can implement a quarter-rate CDR with 15GHz to recover the 60 Gb/s data. Fig. 3-7 shows the structure of the proposed 1:4 DEMUX with the recovered multi-phase clock. In this proposed 1:4 DEMUX, two clock signals, which operate at 15 GHz and have 90 degree phase differences, are used. As shown in Fig. 3-7, only two 1:2 DEMUXs are needed with the quadrature phase clock to achieve a 1:4 DEMUX. No frequency dividers are required. A 15 GHz VCO is easier to implement than 30GHz VCO in the current BiCMOS and CMOS technologies. The total power consumption of the 1:16 DEMUX can be decreased significantly with five such 1:4 DEMUXs. Thus the design constraints can be significantly relaxed.

One potential problem of the proposed 1:4 DEMUX is the loading mismatches between Clk0 and Clk90, as shown in Fig. 3-7 (a). To solve this problem, two redundant D latches were added to Clk0 to balance the loading of Clk0 and Clk90.
Fig. 3-8 shows the data transfer diagram of the proposed 1:4 DEMUX. If the input data stream is D0, D1, D2, ..., the output data of the top 1:2 DEMUX is controlled by the rising edge of Clk0 and the output of the bottom 1:2 DEMUX is controlled by the rising edge of Clk90, which lags Clk0 by 90 degrees. These two clock signals are both from the proposed quarter-rate CDR. Because Clk0 and Clk90 have a 90 degree phase difference, the outputs of the top 1:2 DEMUX and of the bottom 1:2 DEMUX also have a 90 degree phase shift. We added two D latches which are controlled by Clk90 after the top 1:2 DEMUX. This clock signal will then shift A1’ and A2’ 90 degrees to the right and thus all four outputs (A1, A2, A3 and A4) will align in the end. As shown in Fig. 3-7 (a), we see that at first, A1’ and A2’ lead A3 and A4 by 90 degrees. However, after being re-sampled by Clk90, A1’ and A2’ are shifted to A1 and A2, which then align with the rising edge of Clk90, thus becoming aligned with A3 and A4.

![Figure 3-8 Data transfer diagram of the proposed 1:4 DEMUX](image)

As explained in the previous paragraph, this novel 1:4 DEMUX requires two quadrature phase clocks (Clk0 and Clk90), which are easily generated with our previously developed multi-phase VCO. According to the above operation, the following 4:16 DEMUX needs only a 15GHz × 1/4 = 3.75GHz clock. Thus, a divide-by-four circuit is required for the extracted clock before it reaches the 4:16 DEMUX. One thing we need to be careful of is the output data sequence. As shown in Fig. 3-8, the outputs need to be re-routed so that the final output sequence corresponds to the input data sequence.
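
The data flow through the proposed 1:4 DEMUX and the need for output re-routing can be illustrated with a short behavioral model. This sketch makes several assumptions that are not confirmed by the text: that each 1:2 DEMUX samples on both edges of its clock, that the sampling order within one clock period is Clk0 rising, Clk90 rising, Clk0 falling, Clk90 falling, and that the function names are hypothetical. The exact assignment of A1 to A4 is given by Fig. 3-8, not by this model.

```python
def demux_1to4(serial_bits):
    """Behavioral 1:4 DEMUX using two quadrature quarter-rate clocks.

    One clock period spans four bits. Assumed sampling instants:
      Clk0 rising  -> bit 4k,     Clk90 rising  -> bit 4k+1,
      Clk0 falling -> bit 4k+2,   Clk90 falling -> bit 4k+3.
    """
    n = len(serial_bits) // 4 * 4          # drop an incomplete last group
    bits = serial_bits[:n]
    top = (bits[0::4], bits[2::4])         # top 1:2 DEMUX, clocked by Clk0
    bottom = (bits[1::4], bits[3::4])      # bottom 1:2 DEMUX, clocked by Clk90
    return top, bottom


def reorder(top, bottom):
    """Re-route the four streams so they follow the original bit order."""
    return [top[0], bottom[0], top[1], bottom[1]]


def serialize(streams):
    """Interleave parallel streams back into one serial sequence."""
    out = []
    for group in zip(*streams):
        out.extend(group)
    return out


data = list(range(16))                     # stand-in for D0, D1, D2, ...
top, bottom = demux_1to4(data)
print(top, bottom)
print(serialize(reorder(top, bottom)) == data)
```

In this model the top DEMUX carries the even-numbered bits and the bottom DEMUX the odd-numbered bits, so the four outputs must be interleaved across the two DEMUXs to restore the original order.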
Fig. 3-9 shows the schematic of the D latch we used in the 1:4 DEMUX with BiCMOS technology. The same structure is used in the 65nm CMOS with all devices replaced by MOSFETs. We used the CML structure because of its high speed. The source follower helps to increase the circuit loading capability because of its low output impedance, and the two diode connected transistors are added to decrease the voltage across the two current sources to prevent them entering the breakdown region, guaranteeing the safe operations of the devices in the circuit.

![Figure 3-9 Schematic of the D latch](image)

Since the input data rate can reach as fast as 60 Gb/s, a high speed buffer is necessary before the first 1:4 stage to compensate for the signal attenuation suffered during the transmission between the transmitter and the receiver. In consideration of the high speed of the data, the buffer is implemented with a Cherry Hooper amplifier to extend the bandwidth as shown in Fig. 3-4 (b).
Fig. 3-10 shows the output driver circuit of the deserialized data. The tail current is increased stage by stage through three stages from the DEMUX output to the pad. This takes into consideration the driving ability of each stage. The last stage in the output driver is an open collector differential pair. One branch is connected with a 50Ω internal resistor to match the impedance of the cable used for testing the chip. The other branch is connected with the pad in an open-collector configuration as shown in Fig. 3-10. An 8mA tail current is required for the last stage to achieve at least a 400mV output swing. The current source in the second stage is set to 3mA, and that in the first stage is set to 1.2mA. The component values are listed in Table 3-1. Generally, the probe and connecting wires will introduce some resistance, so R₃ is set to be a little larger than 50Ω.

![Figure 3-10 Single ended output pad driver circuit](image)

<table>
<thead>
<tr>
<th>Resistor value</th>
<th>Tail current value</th>
</tr>
</thead>
<tbody>
<tr>
<td>\( R_1=350\Omega \)</td>
<td>\( I_{SS1}=1.2\text{mA} \)</td>
</tr>
<tr>
<td>\( R_2=140\Omega \)</td>
<td>\( I_{SS2}=3\text{mA} \)</td>
</tr>
<tr>
<td>\( R_3=70\Omega \)</td>
<td>\( I_{SS3}=8\text{mA} \)</td>
</tr>
</tbody>
</table>
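
As a rough check of the values above, the single-ended swing of a fully switched differential stage can be estimated as the tail current times the load resistance. The sketch below relies on that first-order assumption, ignores parasitics and any parallel loading, and uses a hypothetical function name; the pairing of each resistor with each tail current follows the row order of the table.

```python
def cml_swing_mv(tail_current_ma, load_ohm):
    """First-order swing estimate: mA times ohm gives mV."""
    return tail_current_ma * load_ohm


# Last stage driving a 50 ohm cable with the 8 mA tail current.
print(cml_swing_mv(8.0, 50.0))

# Resistor and tail-current pairs in the order listed in the table.
for load_ohm, current_ma in [(350.0, 1.2), (140.0, 3.0), (70.0, 8.0)]:
    print(load_ohm, current_ma, round(cml_swing_mv(current_ma, load_ohm), 1))
```

Under this estimate, 8mA into a 50Ω load gives the 400mV swing quoted in the text, and the first two stages each produce approximately 420mV.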

Fig. 3-11 shows the schematic of the current sources in the output driver. A Widlar current source is implemented, and all the transistors are sized to integer multiples of a
basic size so that only one MOSFET needs to be drawn, and all larger MOSFETs are composed of that basic unit in the layout. This can alleviate device mismatches due to process variations.

![Figure 3-11 Schematic of the current source](image)

3.3 Clock and Data Recovery (CDR)

As mentioned in the previous section, quadrature phase clocks are required in the proposed 1:16 DEMUX. Thus, the clock and data recovery (CDR) circuit with a multi-phase VCO was implemented to generate the required clock signals for the receiver. Our previously implemented LC-ring oscillator was used in the CDR.

The CDR circuit is widely used in receiver designs. The objective of the CDR is to extract the clock information from the unknown data and use this extracted clock to recover the data. Ideally, the recovered data should be an exact copy of the input data. However, due to errors such as a less than optimum sampling instant, the recovered data may not precisely be equal to the input data. The difference is represented as the Bit Error Rate (BER) [6]. The optimum sampling instant is midway between when the data signal
rises and when it falls. This is when the BER is lowest and when the SNR is highest. The relationship between BER and SNR for binary signals with additive white Gaussian voltage noise is shown below, where \( \text{erfc}(*) \) denotes the complementary error function [65]. The BER decreases quickly as the SNR increases.

\[
BER = \frac{1}{2} \text{erfc} \left( \frac{\text{SNR}}{2\sqrt{2}} \right).
\]  

(3-1)
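
Equation (3-1) can be evaluated numerically to see how steeply the BER falls with SNR. The sketch below uses Python's standard `math.erfc`; the function name `ber_from_snr` is hypothetical, and the SNR is taken as the linear (not dB) ratio that appears in Equation (3-1).

```python
import math


def ber_from_snr(snr):
    """BER per Eq. (3-1): 0.5 * erfc(SNR / (2*sqrt(2))), SNR as a linear ratio."""
    return 0.5 * math.erfc(snr / (2.0 * math.sqrt(2.0)))


for snr in (6, 8, 10, 12, 14):
    print(snr, f"{ber_from_snr(snr):.2e}")
```

With this expression an SNR of about 14 corresponds to a BER on the order of $10^{-12}$, a commonly used target for serial links.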

In addition, as mentioned in Section 3.1, the data quality also affects the recovered data. Data with the lowest jitter will, of course, make the design of the recovery circuit easier.

Fig. 3-12 shows the structure of the CDR in this project. It is a closed-loop CDR, which is based on the PLL structure in our transmitter. The phase detector compares the phase and frequency of the input data with that of the internal clock and generates an error signal. This error signal then controls the charge pump to generate the corresponding charge or discharge current. The loop filter will then transfer the charge or discharge current into a voltage signal to adjust the VCO frequency. In this structure, the outputs of the phase detector contain the four demuxed data streams.

![CDR Structure](image)

Figure 3-12 CDR structure in this project

CDRs operating at data rates of tens of gigabits per second pose many challenges, such as high speed phase detection, clock jitter, speed limitation, power consumption, and low noise VCO design. In addition, the jitter performance of the CDR determines the quality of the recovered data [6]. The CDR lock acquisition time is also an important parameter, indicating how fast the CDR can lock to the data if the data rate changes.

3.3.1 Phase Detector

Several different types of phase detectors are used now: analog phase detectors, binary phase detectors, multi-rate phase detectors, etc. [66], [67].

The analog phase detector shows a linear relationship between the input phase difference and the output error signal as shown in Fig. 3-13. Normally, the "Up" signal represents the phase difference between the data and the internal clock, and a fixed signal named "Dn" is generated to compare with "Up". In the locked state, "Up" is equal to "Dn". The advantage of this type of phase detector is the linear dependence which causes only small fluctuations in the voltage control signal when the CDR is in the locked state. The disadvantage of the analog phase detector is its sensitivity to process variations.

The binary phase detector generates a fixed error signal no matter what the phase difference is, as shown in Fig. 3-14. A commonly used binary phase detector is the
Alexander phase detector as shown in Fig. 3-15 [68], [6]. Through the sampling of three different points, two XOR outputs determine whether the data leads the clock or the data lags behind the clock. When the CDR is locked, Q₃ and Q₄ become “0”s, thus both Up and Dn become “0”s. This is because when one of the two differential pairs of the XOR is zero, the output of the XOR will be zero.

![Pulse width diagram](image)

**Figure 3-14 Output of the binary phase detector vs. phase difference**

There are three advantages for the Alexander phase detector. The first one is the automatic data retiming at Q₁ and Q₂. No other circuits are needed to retime the data. The second advantage is the better tracking property because of its tri-state output: high, low and zero. The additional zero state represents the locking condition and generates no charge/discharge current to the VCO control line. Therefore, the VCO frequency remains constant when there are no data transitions. This is very important to avoid loss of lock when there are continuous “1”s or continuous “0”s at the CDR input. The third advantage is the absence of the charge pump. Another circuit, a voltage to current (V/I) converter, is added to generate the corresponding error current to the loop filter. The elimination of the charge pump makes it possible for the Alexander phase detector to work at high speed. However, as shown in Fig. 3-15, the loading of the clock is a limiting factor for the high speed application of the Alexander phase detector. Some solutions for this loading problem have been implemented in multi-rate phase detectors.

Multi-rate phase detectors are more attractive than Alexander phase detectors for high speed CDRs. The main idea of this type of phase detector is to use multi-phase clocks at lower clock frequencies. Currently, there are half-rate, quarter-rate, 1/8-rate CDRs, etc. [6], [69]-[71]. A lower rate may lead to a more complicated design and consume more power. The advantage of the multi-rate CDR is the reduction of the VCO frequency, which relaxes the design in high speed applications. In addition, multi-phase structures significantly decrease the loading of each clock, which is very attractive for high speed operation [72]. Some concerns do exist with multi-rate CDRs, such as phase mismatches among the clock phases, circuit complexity, and the potential for extra power consumption in the other logic gates that are used to generate the phase errors. Phase mismatch is a critical factor for multi-rate CDRs, because the mismatch directly affects the sampling positions. For example, in the half-rate phase detector, which uses both in-phase and quadrature phase clocks to sample the data, the rising and falling edges of the in-phase clock and the rising edge of the quadrature phase clock are used to generate the three sampling points. In the locked state, the quadrature phase clock samples right at the center of the data and its rising edge is equally distant from the rising and falling edges of the in-phase clock, as shown in Fig. 3-16. However, phase mismatches between the in-phase and the quadrature phases will cause the phase difference to differ from 90 degrees. Then the zero phase error will not occur when the quadrature phase clock is at the center of the data. Therefore, the VCO control voltage will not remain constant when the CDR is locked, and the CDR will dither around its locked state.
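
To give a feeling for how a phase mismatch translates into a shifted sampling position, the short sketch below converts a phase error in degrees into picoseconds and into a fraction of the bit period. The function names and the example numbers (40 Gb/s data, 20 GHz half-rate clock, 5 degree mismatch) are assumptions chosen for illustration and are not taken from this design.

```python
def sampling_offset_ps(phase_error_deg: float, clock_hz: float) -> float:
    """Timing shift (in ps) of a clock edge caused by a phase error in degrees."""
    period_ps = 1e12 / clock_hz
    return phase_error_deg / 360.0 * period_ps


def offset_in_ui(phase_error_deg: float, clock_hz: float, data_rate_bps: float) -> float:
    """The same timing shift expressed as a fraction of one bit period (UI)."""
    ui_ps = 1e12 / data_rate_bps
    return sampling_offset_ps(phase_error_deg, clock_hz) / ui_ps


# Hypothetical half-rate case: 40 Gb/s data, 20 GHz clock, 5 degree I/Q mismatch.
shift_ps = sampling_offset_ps(5.0, 20e9)
shift_ui = offset_in_ui(5.0, 20e9, 40e9)
print(f"{shift_ps:.3f} ps = {shift_ui:.4f} UI")
```

Because the clock period spans several bit periods in a multi-rate CDR, a given phase error in degrees costs proportionally more of the UI than it would in a full-rate design.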

![Figure 3-16 Signal waveforms of the half-rate CDR](image)

3.3.2 Charge Pump

The charge pump in the CDR is used to convert the phase error into a charge or discharge current, similar to that in a PLL. This charge or discharge current will then generate a VCO control voltage through the low pass filter. However, there exists a significant difference between the charge pump in a CDR and the charge pump in a PLL. As mentioned in Chapter 2, in the PLL, the phase error signal has the same frequency as that of the reference signal, which is normally a very low frequency compared with the VCO frequency. In the CDR, the phase error signal has a very high frequency because of the high speed of the input data. Therefore, a standard CMOS logic implementation is not applicable in the high speed CDR. Another type of charge pump, the current steering charge pump, is commonly used in high speed CDRs [73]. The current mode logic (CML) implementation was selected in this project.

Fig. 3-17 shows the differential version of the standard current steering charge pump. The differential structure has many advantages over the single ended structure, such as enhanced common mode noise rejection, even order harmonic elimination and, especially, compatibility with the differential VCO control signals in this project. However, the structure represented in Fig. 3-17 still suffers from mismatches because of the asymmetrical loading of the charge pump output. The differential output structure can solve this problem with a totally symmetrical structure.

Figure 3-17 A typical current steering charge pump

The charge pump structure in Chapter 2, as shown in Fig. 2-18, is still used in the CDR. This type of current steering charge pump achieves the symmetrical structure for all the input signals and eliminates the current mismatches caused by the asymmetrical structure. Also as mentioned in Chapter 2, with accurate matching between the pull-up and pull-down resistance, this charge pump can form a nearly ideal integrator, by making the current into the loop filter equal to the current out of the loop filter.

### 3.3.3 Loop Filter

The loop filter functions as the current-to-voltage converter in the CDR as it does in the PLL. It determines the poles and zeros of the system. The first order loop filter can solve the stability issues by introducing a zero. However, it still suffers from glitches in the VCO control line caused by the current switching of the charge pump. These glitches generate noise in the VCO control line and deteriorate the performance of the CDR. Therefore the second order loop filter is used in our CDR to further attenuate the glitches in the VCO control line. We do this by adding an additional capacitor in parallel with the first order loop filter as shown in Fig. 3-18. The capacitor and resistor values are determined by the jitter transfer and the jitter tolerance requirements.

The loop filter in Fig. 3-18 is a differential circuit, which includes $C_1$, $R$, and $C_2$ in differential configuration. Here, we do not combine the two $C_1$ and the two $C_2$ together because of the necessity to maintain symmetry in the layout, as shown in Fig. 3-19. Fig. 3-19 shows the typical layout of a Metal-Insulator-Metal (MIM) capacitor. The top metal is asymmetrical with the bottom metal because they are in different layers, and the metals’ specifications are different. Thus, the differential signals see different loadings if a single MIM capacitor is connected between them. However, this asymmetrical loading problem can be eliminated by placing two MIM capacitors, face to face or back to back, as shown in Fig. 3-20. In Fig. 3-20, the loading of the two differential signals is totally symmetrical, thus improving the current matching of the charge pump.
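
As a rough illustration of how the added capacitor shapes the filter, the sketch below computes the zero and the high-frequency pole of a series $R$-$C_1$ branch in parallel with $C_2$, using the standard single-ended textbook expressions. The function name and the component values are hypothetical and are not the values used in this design.

```python
import math


def loop_filter_corner_freqs(r_ohm: float, c1_f: float, c2_f: float):
    """Zero and non-zero pole (in Hz) of (R in series with C1) in parallel with C2.

    Z(s) = (1 + s*R*C1) / (s*(C1 + C2) * (1 + s*R*C1*C2/(C1 + C2)))
    """
    f_zero = 1.0 / (2 * math.pi * r_ohm * c1_f)
    f_pole = (c1_f + c2_f) / (2 * math.pi * r_ohm * c1_f * c2_f)
    return f_zero, f_pole


# Hypothetical values chosen only to illustrate the pole/zero spacing.
f_z, f_p = loop_filter_corner_freqs(r_ohm=1e3, c1_f=100e-12, c2_f=2e-12)
print(f"zero at {f_z / 1e6:.2f} MHz, pole at {f_p / 1e6:.2f} MHz")
```

The ratio of the pole to the zero frequency equals $1 + C_1/C_2$, so a small $C_2$ keeps the extra pole far above the zero and leaves the stabilizing effect of the zero largely intact.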

3.3.4 VCO

In our transmitter design, a multi-phase VCO was implemented to achieve the quarter-rate operation. The same VCO can still be used in our CDR to achieve a quarter-rate CDR that relaxes the design limitations. The detailed description of the proposed LC-ring VCO can be found in Chapter 2. Four clock phases are necessary for this quarter-rate CDR: 0°, 90°, 180°, and 270° [6]. With our developed VCO shown in Fig. 2-15, eight phases were generated, which have a 45° phase difference between one another. Therefore, all eight clocks were used in the CDR to further decrease the number of components and the loading of each clock signal.
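
The timing implied by eight phases at quarter rate can be worked out numerically. The sketch below is only arithmetic on the numbers quoted in this work (60 Gb/s data, quarter-rate clock, eight phases); the function name is an assumption, and the sketch does not say how each phase is assigned inside the CDR.

```python
def clock_phases(data_rate_bps: float, n_phases: int = 8, rate_divisor: int = 4):
    """Return the clock frequency and a list of (phase in degrees, delay in ps)."""
    f_clk = data_rate_bps / rate_divisor
    period_ps = 1e12 / f_clk
    step_ps = period_ps / n_phases
    phases = [(k * 360.0 / n_phases, k * step_ps) for k in range(n_phases)]
    return f_clk, phases


f_clk, phases = clock_phases(60e9)
ui_ps = 1e12 / 60e9
print(f"clock = {f_clk / 1e9:.0f} GHz, UI = {ui_ps:.2f} ps")
for deg, delay in phases:
    print(f"{deg:5.1f} deg -> {delay:6.2f} ps")
```

At 60 Gb/s the clock runs at 15 GHz and adjacent 45° phases are half a bit period apart, so the eight edges fall twice per UI across the four bits covered by one clock period.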

### 3.3.5 Jitter Transfer

Jitter transfer describes the change in the jitter characteristics as the data goes from the CDR’s input to its output. It is the same as the closed-loop transfer function from the input to the output in the PLL described in Chapter 2 [6]. The CDR circuits in optical communications must satisfy stringent jitter specifications. In optical standards, the jitter is often described in terms of the bit period called “unit interval” (UI). A jitter of 0.15UI means the jitter is within 15% of the bit period. Fig. 3-21 shows the jitter transfer curve of the OC-192 standard.

Two important properties in optical receivers regarding the jitter transfer need to be considered. The first one is the CDR bandwidth, which must be small enough to attenuate additional jitter components. For example, the -3dB bandwidth of the OC-192 standard shown in Fig. 3-21 requires that all CDR bandwidths must be smaller than 8MHz to attenuate the jitter components above 8MHz. The other concern is jitter peaking as shown in Fig. 3-22. This jitter peaking is caused by the pole and zero of the loop filter [6]. In the long-haul SerDes, even a small amount of jitter can be amplified by tens of thousands of repeaters, deteriorating the received data. Therefore, the long-haul SerDes has more stringent requirements than the short-haul SerDes.

The loop bandwidth is also related to the acquisition ability of the CDR. For example, if the input data contains slow jitter, then the output must follow the data change. This is easy to achieve even with a narrower loop bandwidth. However, if the input data contains high speed jitter, then the output may not be able to track the data change, thus leading to a higher BER. In this case, a wider bandwidth is helpful for tracking the data. Another method is to pre-filter the data with a low pass filter so that the high-frequency jitter can be eliminated before the receiver. Thus, a CDR with a narrower bandwidth can still be applicable to high-frequency jitter.

As described in Section 2.2.5 and Equation (2-26), the jitter transfer function of the CDR can be written as:

\[
H(s) = \frac{2\zeta\omega_n s + \omega_n^2}{s^2 + 2\zeta\omega_n s + \omega_n^2}, \tag{3-2}
\]

where

\[
\omega_n = \sqrt{\frac{I_{CP}K_DK_{VCO}}{2\pi C_1}}, \tag{3-3}
\]

\[
\zeta = \frac{R}{2} \sqrt{\frac{I_{CP}K_DK_{VCO}C_1}{2\pi}}. \tag{3-4}
\]

Here, the data transition density \( K_D \) is assumed to be 1, which is the case of the highest transition density such as 010101... [90]. At the -3dB frequency,

\[
\frac{4\zeta^2\omega_n^2\omega_{3dB}^2 + \omega_n^4}{(-\omega_{3dB}^2 + \omega_n^2)^2 + (2\zeta\omega_n\omega_{3dB})^2} = \frac{1}{2}. \tag{3-5}
\]

Then,

\[
\omega_{3dB}^2 = \omega_n^2\left[(2\zeta^2 + 1) + \sqrt{(2\zeta^2 + 1)^2 + 1}\right]. \tag{3-6}
\]

Thus, if we assume \( 2\zeta^2 \gg 1 \), then Equation (3-6) can be simplified as \( \omega_{3dB}^2 \approx \omega_n^2[2\zeta^2 + 2\zeta^2] \), thus,

\[
\omega_{3dB} \approx 2\zeta\omega_n = \frac{R I_{CP}K_{VCO}}{2\pi}. \tag{3-7}
\]

This shows that by assuming \( 2\zeta^2 \gg 1 \), the -3dB bandwidth becomes independent of \( C_1 \).

Thus, from Equations (3-4) and (3-7), we see the method to decrease the -3dB bandwidth by a factor of \( N \) while maintaining the same \( \zeta \) is to decrease \( R \) by \( N \) and increase \( C_1 \) by \( N^2 \).
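
The scaling rule above can be checked numerically. The sketch below assumes the standard second-order model \( H(s) = (2\zeta\omega_n s + \omega_n^2)/(s^2 + 2\zeta\omega_n s + \omega_n^2) \), with \( \omega_n \) proportional to \( 1/\sqrt{C_1} \) and \( \zeta \) proportional to \( R\sqrt{C_1} \), which is the dependence the scaling rule relies on. The function names and all component values are hypothetical, including the VCO gain.

```python
import math


def loop_params(r_ohm, c1_f, icp_a, kvco, kd=1.0):
    """Natural frequency (rad/s) and damping factor of the second-order loop model."""
    wn = math.sqrt(icp_a * kd * kvco / (2 * math.pi * c1_f))
    zeta = 0.5 * r_ohm * math.sqrt(icp_a * kd * kvco * c1_f / (2 * math.pi))
    return wn, zeta


def mag_squared(w, wn, zeta):
    """|H(jw)|^2 for H(s) = (2*zeta*wn*s + wn^2) / (s^2 + 2*zeta*wn*s + wn^2)."""
    num = (2 * zeta * wn * w) ** 2 + wn ** 4
    den = (wn ** 2 - w ** 2) ** 2 + (2 * zeta * wn * w) ** 2
    return num / den


def w3db_exact(wn, zeta):
    """Positive solution of |H(jw)|^2 = 1/2 for the model above."""
    a = 2 * zeta ** 2 + 1
    return wn * math.sqrt(a + math.sqrt(a * a + 1))


# Hypothetical values, chosen only for illustration (kvco in rad/s per volt).
R, C1, ICP, KVCO = 2e3, 100e-12, 125e-6, 2 * math.pi * 1e9
N = 2.0

wn, zeta = loop_params(R, C1, ICP, KVCO)
wn_s, zeta_s = loop_params(R / N, C1 * N ** 2, ICP, KVCO)

print(f"zeta: {zeta:.3f} -> {zeta_s:.3f}")
print(f"2*zeta*wn: {2 * zeta * wn:.3e} -> {2 * zeta_s * wn_s:.3e} rad/s")
print(f"exact/approx bandwidth ratio: {w3db_exact(wn, zeta) / (2 * zeta * wn):.4f}")
```

With these numbers the damping factor is unchanged and the approximate bandwidth \( 2\zeta\omega_n \) drops by exactly \( N \); the exact -3dB frequency of the model lies within a few percent of the approximation.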

To analyze the jitter peaking, let’s first examine Equation (3-2). There are two poles and one zero. The zero occurs at

\[
\omega_z = \frac{-\omega_n}{2\zeta}, \tag{3-8}
\]

and the poles occur at

\[
\omega_{p1,2} = \left(-\zeta \pm \sqrt{\zeta^2 - 1}\right)\omega_n. \tag{3-9}
\]

Fig. 3-22 shows the Bode plot of the magnitude of the closed-loop transfer function with zero and poles indicated. If we assume \( \omega_{p2} \gg \omega_{p1} \), then we can consider that the transfer function is flat between \( \omega_{p1} \) and \( \omega_{p2} \) [6].
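
The relative positions of the zero and the two poles can be checked with a few lines of Python. The sketch assumes the standard second-order form \( H(s) = (2\zeta\omega_n s + \omega_n^2)/(s^2 + 2\zeta\omega_n s + \omega_n^2) \) with \( \zeta > 1 \) and works with magnitudes; the function name and the value \( \zeta = 5 \) are illustrative assumptions.

```python
import math


def zero_and_poles(zeta: float, wn: float = 1.0):
    """Magnitudes of the zero and the two real poles (requires zeta > 1)."""
    wz = wn / (2 * zeta)
    root = math.sqrt(zeta ** 2 - 1)
    wp1 = wn * (zeta - root)  # low-frequency pole, just above the zero
    wp2 = wn * (zeta + root)  # high-frequency pole
    return wz, wp1, wp2


wz, wp1, wp2 = zero_and_poles(5.0)
print(f"wz = {wz:.5f}, wp1 = {wp1:.5f}, wp2 = {wp2:.5f} (in units of wn)")
print(f"wp1/wz = {wp1 / wz:.5f}, 1 + 1/(4*zeta^2) = {1 + 1 / (4 * 5.0 ** 2):.5f}")
print(f"wp2/wp1 = {wp2 / wp1:.1f}")
```

For \( \zeta = 5 \) the first pole sits only about 1% above the zero, while the second pole is almost two decades higher, which is why the response can be treated as flat between the two poles.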

If jitter, \( J \), is defined as the magnitude of the jitter transfer function, then jitter in dB is \( 20 \log J \). The value of \( J \) in the peak region shown in Fig. 3-22 we shall call \( J_p \). In dB this becomes

\[
20 \log J_p = 20 \log \omega_{p1} - 20 \log \omega_z. \tag{3-10}
\]

Then [74]

\[
J_p = \frac{\omega_{p1}}{\omega_z} \approx 1 + \frac{1}{4\zeta^2}. \tag{3-11}
\]

Since \( 4\zeta^2 \gg 1 \), we get

\[
20 \log J_p \approx \frac{8.686}{4\zeta^2} = \frac{2.172}{\zeta^2}. \tag{3-12}
\]

Thus, to make sure that \( 20 \log J_p < 0.1dB \), \( \zeta \) must be larger than 4.66. Plugging Equation (3-4) into (3-12), we get:

\[
20 \log J_p \approx \frac{8.686}{4\zeta^2} = \frac{2\pi \cdot 8.686}{R^2 I_{CP}K_DK_{VCO}C_1} < 0.1. \tag{3-13}
\]

Therefore, to get small jitter peaking, we need to increase the denominator of Equation (3-13). However, as shown by Equation (3-7), \( R \) needs to be lowered to achieve a small bandwidth, so \( C_1 \) needs to be increased by more than the factor by which \( R^2 \) is reduced.
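
The damping requirement can be reproduced numerically. The sketch below computes the minimum \( \zeta \) from the approximation \( 2.172/\zeta^2 < 0.1 \) and, as a cross-check, sweeps the magnitude of an idealized second-order transfer function \( H(s) = (2\zeta\omega_n s + \omega_n^2)/(s^2 + 2\zeta\omega_n s + \omega_n^2) \) to find its actual peak. The function names and the sweep range are assumptions made for the example.

```python
import math


def peaking_db_approx(zeta: float) -> float:
    """Asymptotic estimate of the jitter peaking in dB: 8.686 / (4 * zeta^2)."""
    return 8.686 / (4 * zeta ** 2)


def peaking_db_numeric(zeta: float, points: int = 20000) -> float:
    """Peak of 20*log10|H(jw)| found by sweeping w/wn from 1e-3 to 1e3."""
    best = 0.0
    for i in range(points + 1):
        w = 10 ** (-3 + 6 * i / points)  # normalized frequency w/wn
        num = 4 * zeta ** 2 * w ** 2 + 1
        den = (1 - w ** 2) ** 2 + 4 * zeta ** 2 * w ** 2
        best = max(best, 10 * math.log10(num / den))
    return best


zeta_min = math.sqrt(2.172 / 0.1)
approx_db = peaking_db_approx(zeta_min)
numeric_db = peaking_db_numeric(zeta_min)
print(f"zeta_min = {zeta_min:.2f}")
print(f"approximate peaking = {approx_db:.3f} dB, swept peaking = {numeric_db:.3f} dB")
```

In this idealized model the swept peak comes out slightly below the asymptotic estimate, because the second pole begins to roll off before the plateau between the poles is fully reached, so the \( \zeta > 4.66 \) rule is on the safe side.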

### 3.3.6 Jitter Tolerance

Jitter tolerance specifies how much input jitter a CDR can tolerate without increasing the bit error rate (BER) [6]. Fig. 3-23 shows the jitter tolerance mask for the OC-192 standard. From the mask, we see that the CDR must be able to track the data when the jitter frequency is below 2.4 kHz with the maximum jitter magnitude at 15UI, which means a 1.5ns jitter magnitude. As jitter frequency increases, the tolerable jitter magnitude decreases. At jitter frequencies above 4MHz, the CDR must track the data with jitter magnitude no larger than 0.15UI, which is only 15ps. Higher jitter amplitude can be tolerated at lower jitter frequencies since low frequency jitter can be tracked more easily by the CDR. High frequency jitter makes it difficult for the CDR to track the data when the jitter amplitude is large. Above a specific jitter frequency, the jitter can no longer be tracked by the CDR. Therefore, the jitter tolerance curve becomes a constant value above a certain frequency. In practice, high frequency jitter tolerance is limited by the setup and hold times of the sampling register for the data stream in combination with the phase error under steady state conditions [6].
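
The conversion between UI and absolute time used above is simple arithmetic, sketched below. The function name is an assumption, and a nominal 10 Gb/s is used for OC-192 to match the rounded numbers in the text (the exact line rate is 9.95328 Gb/s).

```python
def ui_to_ps(jitter_ui: float, data_rate_bps: float) -> float:
    """Convert a jitter amplitude in unit intervals to picoseconds."""
    return jitter_ui * 1e12 / data_rate_bps


low_freq_ps = ui_to_ps(15.0, 10e9)    # low-frequency corner of the mask
high_freq_ps = ui_to_ps(0.15, 10e9)   # high-frequency floor of the mask
exact_rate_ps = ui_to_ps(0.15, 9.95328e9)
print(f"15 UI   -> {low_freq_ps:.0f} ps")
print(f"0.15 UI -> {high_freq_ps:.1f} ps (nominal), {exact_rate_ps:.2f} ps (exact rate)")
```

The same 0.15UI floor corresponds to a proportionally smaller absolute jitter at higher data rates, which is one reason the requirement becomes harder to meet as the rate increases.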

![OC-192 Jitter tolerance mask](image)

To determine the jitter tolerance curve of our CDR, a criterion must be met first. The maximum phase error must not exceed 0.5UI. Otherwise, the BER will increase [6]. For this, we get the following condition

$$\phi_{in} - \phi_{out} < 0.5\text{UI}. \tag{3-14}$$

Thus,

$$\phi_{in}(1 - H(s)) < 0.5\text{UI}, \tag{3-15}$$

where $H(s)$ is the transfer function from $\phi_{in}$ to $\phi_{out}$. Therefore,

$$\phi_{in} < \frac{0.5\text{UI}}{1 - H(s)}. \tag{3-16}$$

Thus, the jitter tolerance with respect to UI can be described as:

$$J_T = \frac{0.5}{1 - H(s)}. \tag{3-17}$$

Since the loop filter in the CDR is a second order loop filter, the system is actually a third order system. Here an approximation was made to use the second order system for frequency domain analysis. Plugging Equation (3-2) into (3-17), we get:

$$J_T = \frac{1}{2} \frac{s^2 + 2\zeta \omega_n s + \omega_n^2}{s^2}. \tag{3-18}$$

There are two poles at zero and two zeros. Based on the definition of jitter tolerance, any jitter tolerance curve above the standard mask will satisfy the requirement. The higher the position of the curve, the more jitter tolerance can be achieved.

As shown in the previous analysis, the two poles in Equation (3-2) are the same as the two zeros in Equation (3-18). We can see that the jitter tolerance curve will shift up and to the right as shown in Fig. 3-24 when $\omega_n$ is increased. This improves the jitter tolerance performance. However, increasing $\omega_n$ will increase the jitter transfer bandwidth, which may degrade the jitter transfer performance. Therefore, there is always a trade-off between the jitter transfer and jitter tolerance.
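
The effect of $\omega_n$ on the tolerance curve can be seen by evaluating the magnitude of Equation (3-18) at $s = j\omega$. The sketch below does this for the second-order approximation only; the function name and the example values (a natural frequency of 1 MHz or 2 MHz and $\zeta = 5$) are hypothetical.

```python
import math


def jitter_tolerance_ui(f_hz: float, fn_hz: float, zeta: float) -> float:
    """|J_T(j*2*pi*f)| in UI for J_T = 0.5 * (s^2 + 2*zeta*wn*s + wn^2) / s^2."""
    u = fn_hz / f_hz  # ratio wn / w
    return 0.5 * math.sqrt((1 - u * u) ** 2 + (2 * zeta * u) ** 2)


zeta = 5.0
for f in (1e4, 1e5, 1e6, 1e7, 1e8):
    jt_1 = jitter_tolerance_ui(f, 1e6, zeta)
    jt_2 = jitter_tolerance_ui(f, 2e6, zeta)
    print(f"f = {f:8.0e} Hz: J_T = {jt_1:10.3f} UI (fn = 1 MHz), {jt_2:10.3f} UI (fn = 2 MHz)")
```

The curve falls with frequency and flattens toward 0.5UI far above the natural frequency, and the curve for the larger natural frequency lies above the other at every frequency, at the cost of a wider jitter transfer bandwidth.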

Some other observations need to be made. Let's still take the example of the OC-192 standard. As the PLL can track the jitter within its loop bandwidth, the CDR must have a loop bandwidth larger than 4MHz so that the CDR can track the jitter with the largest achievable jitter amplitude based on the jitter tolerance curve. In addition, from the jitter transfer curve, the CDR loop bandwidth must be less than the -3dB bandwidth so that the jitter above $\omega_{3dB}$ can be attenuated. Thus, the CDR loop bandwidth is always chosen between the jitter transfer bandwidth and the jitter tolerance bandwidth, which in the OC-192 standard is between 4MHz and 8MHz. In this project, the OC-768 standard was taken as the reference, and this is shown in Fig. 3-25.

![OC-768 Jitter tolerance mask](image)

Figure 3-25 Jitter tolerance mask of the OC-768 standard

### 3.3.7 Jitter Generation

Jitter generation represents the jitter caused by the CDR itself, assuming the input data contains no jitter [6]. Similar to the PLL in Chapter 2, there are several jitter sources in the CDR.

The first source is the ripple on the VCO control voltage. This is caused by the net current injection from the charge pump even when the CDR is locked. The mismatch between the charge and the discharge current also generates a ripple when the CDR is locked [6], [14], [52]. Since the VCO may have a high gain at high frequency, the phase and frequency errors resulting from the ripple may be significant, resulting in a large jitter at the VCO output. Thus, better current matching in the charge pump is important to reduce the ripple, which in turn will reduce jitter. The differential structure helps to achieve the best current matching, and the negative impedance structure further helps to achieve a nearly ideal integrator for better current matching.

A second source of jitter internal to the CDR is the data coupling to the VCO through the phase detector and the retiming circuits. In a PLL, we know that this ripple is helpful to some extent for the elimination of the phase detector dead zone [52], which is caused by the delay component in the phase detector.

The third source of jitter is the VCO’s phase noise due to the device’s electronic noise. An equation describing the relationship between the closed-loop jitter of the PLL and its bandwidth, the VCO phase noise, and center frequency is derived in [6].

$$
\Delta T_{PLL} = \frac{1}{\sqrt{2\pi f_u}} \sqrt{S_\theta(\Delta \omega) \frac{\Delta \omega}{\omega_0}},
$$

where $f_u$ is the PLL loop bandwidth, $S_\theta(\Delta \omega)$ is the VCO phase noise at $\Delta \omega$ offset and $\omega_0$ is the oscillation frequency. Several methods to improve the VCO phase noise were mentioned in Chapter 2, such as choosing (1) a high Q inductor, (2) a low $g_m$ coupling MOSFET while still satisfying the oscillation startup condition, (3) a large tail current while not making the output saturate and (4) a high output swing [14]. The loop bandwidth is determined by the jitter transfer and jitter tolerance values as mentioned above.

A fourth source of internal jitter is the power supply noise. This can be generated by the noise from the external power supplies or the noise coupled from other circuits through the parasitic capacitors of the PMOS in the current mirrors [52]. One method to attenuate the power supply noise is to add a capacitor between the power supply conductor and the substrate to filter the noise to the ground.

In addition, the noise coupling between the digital circuits and the analog RF circuits through the substrate also introduces jitter to the CDR data. Some methods have been used to attenuate this problem, such as isolating the analog and digital circuits. The use of silicon-on-insulator (SOI) technology also provides good immunity to noise coupling [75].

3.3.8 The CDR in this Project

Fig. 3-26 shows the CDR block diagram in this project. The CDR is based on the structure in [72]. With our proposed LC-ring VCO, eight-phase clocks with low phase noise can be achieved. The V/I converter in this project, using a negative impedance amplifier (NIA) circuit, can achieve better charge pump current matching and form a better integrator with resistance matching between pull-up and pull-down resistances, as explained in Chapter 2. Four of these V/I converters are connected together as shown in Fig. 3-27.

A significant advantage of this structure compared with the Alexander based CDR is the clock loading. As shown in Fig. 3-15, the clock signal in an Alexander phase detector has to drive at least four D flip-flops, which limits the circuit bandwidth significantly. However, in Fig. 3-26, each clock drives only one D flip-flop, which relaxes the bandwidth limitation [72]. Thus, together with the V/I converter, the CDR in this project achieves high speed more easily.

![Figure 3-26 The CDR implementation in this project [72]](image)

Unlike the PFD, in which the outputs become zero at the locking state, the pulse width of the XOR gate's output is at least one data period in the locked state. This width can still generate the corresponding output current through the V/I converter. Therefore, the dead zone issue in the PFD will not happen in the XOR and V/I converter.

To determine the component parameters, both OC-192 and OC-768 were used as references in our design to determine the jitter transfer and the jitter tolerance. From Equations (3-2), (3-3), (3-13) and (3-18), the jitter mask can be determined.

Through calculations and simulation adjustment, the parameter values of the CDR components are as follows:

\[ R = 1.5 \Omega, \quad C_1 = 80 \text{pF}, \quad C_2 = 1.5 \text{pF}, \quad \text{and charge pump current } I_{cp} = 125 \mu\text{A}. \]

### 3.4 Simulation Results and Layouts

Based on our previously implemented VCO, we have the value of \( K_{VCO} \) from Chapter 2. Then, the determination of other components is based on the power consumption, the capacitor area, and the PLL jitter performance. Of course, further adjustments were required considering the parasitic effects.

Fig. 3-28 and Fig. 3-29 show the simulated jitter transfer and the jitter tolerance curves of the CDR in this project respectively. The jitter peaking is 0.09dB as shown in Fig. 3-28. This satisfies the jitter peaking requirement which is less than 0.1dB. The jitter tolerance curve is above the OC-768 and the OC-192 jitter tolerance masks as shown in Fig. 3-29, thus satisfying the jitter tolerance requirement of both the OC-768 and the OC-192 standards.

Fig. 3-30 shows the acquisition time of the CDR, which is around 25ns. This acquisition time is also determined by the loop bandwidth: a higher loop bandwidth achieves a faster acquisition time.

Fig. 3-31 shows the simulated eye diagrams of the extracted clock and the demuxed data. The recovered data is 15 Gb/s with peak-to-peak jitter of 3.8ps. The extracted clock is 15GHz with peak-to-peak jitter of 3.3ps.

Figure 3-30 Acquisition of the CDR

Figure 3-31 Eye diagram of the recovered data and clock: the data rate is 15 Gb/s and the clock frequency is 15GHz

Fig. 3-32 shows the simulated waveforms of an input-output pair of the SerDes. The output follows the input after a certain amount of delay, which is caused by the transmitter and receiver.

![Simulated waveforms of the input and output signals of the proposed SerDes](image)

Figure 3-32 Simulated waveforms of the input and output signals of the proposed SerDes

Table 3-2 shows the comparison of this work with some published work. The 3.6W with 0.18μm BiCMOS technology is the measurement result. The 92mW with 65nm CMOS technology is the simulated result, which includes 31mW for the 4:1 MUX and 61mW for the 1:4 DEMUX/CDR. With our proposed quarter-rate structure, the speed and power achieved the best tradeoff compared to those published works. The previous 0.18μm BiCMOS 16:1 design consumed more power than ours, and our CMOS version consumed less power and achieved higher speed than the published CMOS designs.

Table 3-2 A MUX/DEMUX comparison of this work with some published designs

<table>
<thead>
<tr>
<th>Reference (year, venue)</th>
<th>Structure</th>
<th>Technology</th>
<th>Bit rate</th>
<th>Supply voltage / Power</th>
</tr>
</thead>
<tbody>
<tr>
<td>[23] 2003 ISSCC</td>
<td>16:1 MUX/DEMUX</td>
<td>0.18µm SiGe</td>
<td>43 Gb/s</td>
<td>-5.2V / N/A</td>
</tr>
<tr>
<td>[41] 2003 JSSC</td>
<td>4:1 MUX/DEMUX</td>
<td>0.18µm SiGe</td>
<td>43 Gb/s</td>
<td>3.6V /2.3 W</td>
</tr>
<tr>
<td>[44] 2004 ISSCC</td>
<td>16:1 MUX/DEMUX</td>
<td>0.18µm SiGe</td>
<td>39.8-43 Gb/s</td>
<td>-5.2/1.8V /11.6 W</td>
</tr>
<tr>
<td>[45] 2005 JSSC</td>
<td>4:1 MUX</td>
<td>0.2µm SiGe</td>
<td>54 Gb/s</td>
<td>4.8V /2.95 W</td>
</tr>
<tr>
<td>[16] 2005 ISSCC</td>
<td>4:1 MUX/DEMUX</td>
<td>90nm CMOS</td>
<td>40 Gb/s</td>
<td>1.2V /194 mW</td>
</tr>
<tr>
<td>[21] 2006 ISSCC</td>
<td>1:4 DEMUX</td>
<td>0.13µm CMOS</td>
<td>20 Gb/s</td>
<td>1.2V /210 mW</td>
</tr>
<tr>
<td>[43] 2007 ISSCC</td>
<td>1:16 DEMUX</td>
<td>90nm CMOS</td>
<td>43 Gb/s</td>
<td>1.2V /910 mW</td>
</tr>
<tr>
<td>[20] 2007 MTT</td>
<td>1:2 DEMUX</td>
<td>0.18µm CMOS</td>
<td>20 Gb/s</td>
<td>2V /150 mW</td>
</tr>
<tr>
<td>[19] 2008 ICCCAS</td>
<td>2:1 MUX/DEMUX</td>
<td>0.13µm CMOS</td>
<td>50 Gb/s</td>
<td>1-1.5V /40 mW-129 mW</td>
</tr>
<tr>
<td>[37] 2009 JSSC</td>
<td>128:1 MUX/DEMUX</td>
<td>65nm CMOS</td>
<td>40 Gb/s</td>
<td>1.0V /1.46 W</td>
</tr>
<tr>
<td>[38] 2009 JSSC</td>
<td>32:1 MUX/DEMUX</td>
<td>0.13µm CMOS</td>
<td>40 Gb/s</td>
<td>1.45V /3.6 W</td>
</tr>
<tr>
<td>[40] 2009 SVLSI</td>
<td>4:1 MUX</td>
<td>90nm CMOS</td>
<td>40 Gb/s</td>
<td>1.5V /325 mW</td>
</tr>
<tr>
<td><strong>This work</strong></td>
<td>4:1 MUX/DEMUX</td>
<td>65nm CMOS</td>
<td>60 Gb/s</td>
<td>2.0V /92 mW</td>
</tr>
<tr>
<td><strong>This work</strong></td>
<td>16:1 MUX/DEMUX</td>
<td>0.18µm SiGe</td>
<td>36 Gb/s</td>
<td>3.4V /3.6 W</td>
</tr>
</tbody>
</table>
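
One way to compare rows of Table 3-2 on a common footing is the energy per bit, obtained by dividing the power by the bit rate. This metric is not reported in the table itself; the sketch below derives it from a few of the listed entries, and the dictionary keys are only labels chosen for the example. Because the designs differ in multiplexing ratio and in what the quoted power includes, the comparison is only approximate.

```python
# (bit rate in b/s, power in W), taken from Table 3-2
designs = {
    "This work, 65nm CMOS (simulated)": (60e9, 92e-3),
    "This work, 0.18um SiGe (measured)": (36e9, 3.6),
    "[16] 90nm CMOS": (40e9, 194e-3),
    "[41] 0.18um SiGe": (43e9, 2.3),
}


def energy_per_bit_pj(bit_rate_bps: float, power_w: float) -> float:
    """Energy spent per transmitted bit, in picojoules."""
    return power_w / bit_rate_bps * 1e12


energy = {name: energy_per_bit_pj(rate, power) for name, (rate, power) in designs.items()}
for name, pj in energy.items():
    print(f"{name:36s} {pj:7.2f} pJ/bit")
```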

Fig. 3-33 shows the layout of the 4-bit SerDes at 60 Gb/s data rate with IBM 65nm CMOS technology. The chip area is 1.75mm×2.00mm.

Fig. 3-34 shows the chip photo of the 16-bit SerDes with Jazz Semiconductor’s 0.18µm BiCMOS technology. The chip area is 2.235mm×2.092mm.

Figure 3-33 Chip layout of the 4:1 MUX/DEMUX with IBM 65nm CMOS technology: layout area is 1.75mm×2.00mm

Figure 3-34 Chip photo of the 16-bit SerDes with Jazz Semiconductor 0.18μm BiCMOS technology: the chip area is 2.235mm×2.092mm.

3.5 Discussion and Future Work

3.5.1 Discussion

In this chapter, a novel quarter-rate 1:4 DEMUX was implemented to reduce the power consumption and relax the design constraints of the VCO. The proposed low noise multi-phase VCO was implemented with a previously published CDR [72] to further increase the speed from 40 Gb/s [72] to 60 Gb/s. The V/I converter constructed with the negative impedance amplifier (NIA) was also used to achieve better current matching, thus lowering the noise in the data and the clock.

Compared to the traditional 1:4 DEMUX, the proposed 1:4 DEMUX saved about 1/3 the number of components and eliminated the high speed frequency divider. The clock frequency was also reduced to only 1/4 the data rate. A potential issue of this proposed 1:4 DEMUX is the asymmetrical loading of the quadrature phase clock signals, which may cause duty cycle distortion in the recovered data. Therefore, redundant D flip-flops were added to match the load of the quadrature clocks.

3.5.2 Future Work

3.5.2.1 Line Decoding

Line decoding will have to be undertaken in consideration of the encoding block in the serializer. 8B/10B or 64B/66B decoding can be implemented in the next step. For 8B/10B, each bit of the 8-bit decoded data is expressed as a logic combination of the 10 bits of encoded data. The circuit implementation will involve the use of AND/NAND gates, OR/NOR gates, and inverters.

3.5.2.2 Speed Enhancing

Increasing the speed of the receiver input will result in more difficulties for data and clock recovery because the CDR may not be able to track the data at a very high data speed. Higher speed, such as 80 Gb/s for the next generation SONET standard, will be attractive as the data capacity requirements increase. With our proposed quarter rate structure, lower power consumption and relaxed VCO design constraints can be achieved.

CHAPTER 4

TIME-TO-DIGITAL CALIBRATION

As mentioned in Chapter 3, the jitter performance of the clock is a critical factor determining the output data jitter. Larger jitter may cause phase mismatches among the multi-phase clocks such as the quadrature phase clocks in this project. This phase mismatch will narrow the eye opening of the output data and create difficulties for the data recovery circuit, thus deteriorating the BER performance of the SerDes.

In this chapter, a novel calibration circuit was developed based on the time-to-digital converter (TDC) circuit to compensate for the phase mismatches among the multi-phase clocks in the PLL. A new technology, a Three-Dimensional Fully-Depleted Silicon on Insulator (3D FDSOI) CMOS technology, was used for this calibration circuit design. This 3D FDSOI process has many advantages such as high integration density per unit area, short interconnections, elimination of noise coupling through the substrate, no latch-up, and reduced crosstalk. Because of the 3D stack structure, the analog PLL and the digital calibration circuit can be separated into two different tiers, thus eliminating the noise coupling between the analog circuit and the digital circuit. Furthermore, the TDC calibration circuit successfully compensated for the phase mismatches caused by process variation, the 3D vertical interconnections, and the temperature issues in this stack structure. The proposed TDC calibration circuit achieved a timing resolution as fine as 2ps.

The structure of this chapter is as follows. Section 4.1 focuses on the top structure of the TDC embedded PLL. Section 4.2 describes the detailed structure of the novel TDC. Two operations in this TDC are analyzed. Section 4.3 supplies the simulation results and the layout views with MIT Lincoln Lab’s 0.15μm 3DSOI CMOS process.

4.1 Technology Overview

With the development of conventional CMOS technology, scaling becomes more and more difficult and expensive. From older CMOS technologies such as 0.25μm CMOS to current advanced technologies such as 40nm and 22nm CMOS, the technology keeps scaling down following Moore’s law. More and more transistors can be integrated in a unit chip area. This high integration boosts the performance of a single chip through the consolidation of more functions. However, there is no guarantee about how long Moore’s law will last. Once the channel length becomes shorter than about 5nm, short-channel effects become severe: electrons may pass from the source to the drain spontaneously, so that the transistor no longer functions correctly. Many people are currently working to find other materials to replace silicon, but no better replacement has been found yet. In addition, any new material could have a prohibitive cost. Therefore, the 3D process becomes a more and more attractive technology. Gordon Moore et al. had foreseen the future implementation of 3D technology.

In contrast with scaling down in the 2D process, the 3D process uses wafer stacking to achieve high integration instead of transistor shrinking. Through the vertical stacking of wafers, more transistors can be integrated within a certain chip area. Furthermore, the short delay time of the 3D process is another significant advantage that makes it very valuable in VLSI design. As we know, timing problems are a critical issue in VLSI circuit design. The critical path covers the longest distance, which severely limits the circuit performance. In high speed wireless or wireline designs, such as RF or SerDes designs, long distance routing of high speed signals is avoided as much as possible to minimize signal attenuation. However, as the number of data bits or the circuit integration increases, long distance routing is inevitable in the 2D process. Taking advantage of the 3D structure, interconnections in the circuit can be routed vertically through the contacts between different tiers.

![3DOGC Opens to This Metal Level](image)

Figure 4-1 Profile structure of the 3D process [89]

Fig. 4-1 shows the profile of the 3D process. The vertical routing decreases the wire length significantly. In addition, large scale circuits can be separated into multiple tiers face to face, further reducing the routing length of the critical signals [76]-[78]. This is very important in current high speed circuit design, because of the parasitic effects of wire.

In this specific SerDes design, the 3D process provided another advantage of noise coupling reduction. As we know, in mixed-signal circuit design, both the analog RF part and the digital part share the same substrate. This brings problems to high speed operation, in which the switching noise in the digital part is coupled to the analog part through the finite resistance of the substrate. In the 3D process, the analog RF part and the digital part can be built on different tiers so that the noise through the substrate can be decoupled. Therefore, the noise coupling issue between analog and digital circuits in the 2D process can be eliminated in the 3D process.

In this project, MIT Lincoln Lab’s (MITLL) 3D fully depleted silicon-on-insulator (FDSOI) 0.15µm CMOS process is used. Besides the advantages of the 3D process over the 2D process described above, the SOI process also introduces more advantages over traditional CMOS technology. Fig. 4-2 shows the profiles of the standard SOI process and the CMOS process. There are several advantages of the SOI process over the traditional CMOS process.

![Profiles of standard CMOS and SOI CMOS processes](image)

Figure 4-2 Profiles of standard CMOS and SOI CMOS processes

First, the high-resistivity buried oxide isolation layer shown in Fig. 4-2 (b) reduces the crosstalk and helps the integration of high quality on-chip inductors in RF circuits. This makes high frequency and microwave applications realizable with the SOI process. In addition, as the result of the insulating layer, all the side wall capacitances of the diffusion area have been greatly reduced. There is no longer a bottom junction capacitance, and only channel capacitances exist.

Second, the SOI process eliminates latch-up problems, which makes it a promising technology for high speed and low power design. Latch-up is defined as the generation of a low impedance path in the CMOS process between the power supply rail and the ground rail due to the interaction of parasitic npn and pnp bipolar transistors. These parasitic BJTs create positive feedback loops, by which a small current can be amplified to virtually short the power rail and the ground rail, causing excessive current flow and even permanent device damage [52]. Currently there are some remedies such as the use of an epitaxial layer and other layout techniques which have lessened the severity of latch-up problems [52], [80]. Reliability concerns persist with latch-up, especially in I/O circuits, since its packing density also increases with decreasing feature sizes and spaces. However, in the SOI process, devices are fully wrapped in an insulator instead of the shared substrate as shown in Fig. 4-2. Therefore, no positive feedback loop is formed, and the latch-up problem is totally removed.

Third, the SOI process reduces noise coupling and substrate current in mixed-signal design. In CMOS technology as shown in Fig. 4-3 (a), the switching noise in digital circuits caused by CMOS switching activity can be coupled to the analog circuits through the substrate and deteriorate the quality of the analog signals. For example, noise can be added to the input of an Op Amp through parasitic capacitors of the MOSFETs. However, in the SOI process as shown in Fig. 4-3 (b), each device is implanted within an isolated region, so the coupling between digital and analog circuits disappears and the noise coupling issue is eliminated.

Furthermore, there are some other advantages for the SOI process, such as high temperature compatibility, since the junction leakage current is significantly reduced; radiation hardness due to the reduction in exposed silicon volume; and smart power integration, which achieves high voltage operations easily without any increase in process complexity due to the insulation inherent in the SOI stack [75].

As mentioned above, the vertical interconnections between different tiers in the 3D process shorten the wire length and increase the operation speed. The influence of this tier-to-tier interconnection on circuit performance becomes dominant. For example, for the PLL design in our project, the phase relationship between the in-phase clock and the quadrature phase clock was critical. The tier-to-tier interconnection introduces noise to the clock signals and causes phase mismatches. Also, due to the 3D stack structure, the temperature influence is always a concern. A phase calibration circuit is necessary in this 3D process to compensate for the phase mismatches caused by the above sources.

The proposed phase calibration circuit was implemented based on the TDC circuit. The TDC structure presented in this paper replaced the traditional long Vernier delay line and achieved as high as 2ps timing resolution. A feedback structure was implemented into the TDC to program and stabilize the timing resolution. The frequency of the multiphase PLL in this project is 3.8GHz. Simulation results show that the clock jitter was decreased from 5.98ps to 3.58ps with the proposed TDC phase calibration circuit.

The phase-locked loop (PLL) is critical in today’s high speed transmitter/receiver design. New techniques, such as multi-phase clocks, can be used to relax the clock speed required of the PLL without reducing data rates, as mentioned in Chapter 2 and Chapter 3. In multi-phase structures, each phase in the multi-phase clock must maintain a certain offset from the others. This inevitably creates a new issue called phase mismatch. For example, there must be a 90 degree phase difference between the in-phase and the quadrature phase clock signals. However, this offset often cannot be maintained because of process, voltage, and temperature (PVT) variations. The resulting offset error degrades the transmitter output data, narrowing both the horizontal and the vertical eye opening under high speed operation.
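As a quick worked relation (added here for illustration; it follows from the definition of phase rather than from the original text), a phase offset $\phi$ on a clock of frequency $f$ corresponds to a time offset $\Delta t$. For the 3.8GHz quadrature clocks used later in this chapter, the ideal 90 degree spacing is:

$$\Delta t = \frac{\phi}{360^\circ}\cdot\frac{1}{f}, \qquad \phi = 90^\circ,\ f = 3.8\,\text{GHz} \;\Rightarrow\; \Delta t = \frac{1}{4}\times 263.2\,\text{ps} \approx 65.8\,\text{ps}.$$

Read the other way, a 2ps timing error at this frequency corresponds to roughly $2/263.2 \times 360^\circ \approx 2.7^\circ$ of phase mismatch.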

In the 2D process, phase calibration circuits have been published with different timing resolutions as shown in Table 4-1.

<table>
<thead>
<tr>
<th>Reference</th>
<th>[30]</th>
<th>[31]</th>
<th>[32]</th>
<th>[33]</th>
<th>[34]</th>
<th>[35]</th>
</tr>
</thead>
<tbody>
<tr>
<td>Timing resolution</td>
<td>20ps</td>
<td>10ps</td>
<td>29.6ps</td>
<td>5ps</td>
<td>4ps</td>
<td>1ps</td>
</tr>
</tbody>
</table>

Vernier delay line (VDL) TDCs have been widely implemented. However, a VDL depends strongly on matching among its delay components and suffers from PVT variations, which makes it less reliable for high speed phase calibration. A new structure has been developed to solve the PVT variation issues brought on by Vernier delay lines [79]. However, this structure may have PVT issues of its own.

In this dissertation, a novel TDC-based calibration circuit with high timing resolution and wide tuning range is presented to compensate for the phase mismatches caused by tier-to-tier interconnections and also by PVT variations. Improved timing resolution means higher calibration accuracy for higher speed. Compared with the Vernier delay line (VDL) TDC, each Vernier delay line in this project was replaced by one VCO which was used to set the timing resolution. A feedback loop was implemented to further decrease the influence of PVT variations in the timing resolution. Therefore, the new TDC is more reliable for high speed clock calibration and more immune to PVT variations than traditional TDCs. The calibrated clocks in this paper are quadrature phase clocks.

4.2 The Proposed Phase Calibration Circuit

Fig. 4-4 shows the block diagram of a multiphase PLL [15] and the waveforms of its multiphase clocks. Even with careful layout, phase mismatches among those clocks still exist as shown in Fig. 4-4 (b). A calibration approach has been introduced in previously published work [82]: one phase from the PLL output is chosen as the reference to calibrate the other phases, and a phase comparator is used to determine the phase difference. However, the noise introduced by this phase comparator may cancel its calibration effects. In this paper, the TDC circuits are implemented as phase mismatch detectors, as shown in Fig. 4-5. The TDC is designed using CML logic, and its digital implementation helps reduce noise.

(a) Structure of a multiphase PLL

(b) Waveform of a multiphase PLL and its existing phase mismatch

Figure 4-4 Structure of a multi-phase PLL and its existing phase mismatch

As shown in Fig. 4-5, there are two different loops in the PLL. One is the original PLL loop which functions as the multiphase clock generator. The other one is the calibration loop to adjust the phase mismatches among the multiphase clocks. In this project, only the in-phase and the quadrature phase clocks are chosen for calibration, since they are used in the transmitter. We can summarize this process of phase calibration with two steps: (1) calibration reference settling down and (2) phase mismatch adjustment.

First, the quadrature phase clocks from the PLL go through a sampling circuit to generate the new clocks, \( f_0' \), \( f_{90}' \) and \( f_{180}' \), which have the same phase differences but longer periods. The reason longer period clocks are needed is explained in Section 4.2.1.4. Second, TDC1 generates the phase difference between \( f_0' \) and \( f_{90}' \), and TDC2 generates the phase difference between \( f_{90}' \) and \( f_{180}' \), where \( f_{180}' \) is simply the wired inverse of \( f_0' \). The comparator then compares the two TDC outputs and generates a digital number representing the mismatch between the in-phase and quadrature phase clocks. This comparison result goes through a loop filter and generates the corresponding current to the charge pump. Through the PLL loop filter, a correction signal adjusts the phase of the quadrature clocks. This adjustment may take the PLL out of lock, in which case the PLL loop is re-activated. The result is an adjusted control voltage, which eventually eliminates the phase mismatch.
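The two-step calibration described above can be pictured with a very simplified behavioral sketch. It is not the circuit itself: the loop filter, charge pump, and PLL dynamics are collapsed into a fixed, hypothetical correction step (`step_fs`), the TDC is modeled as a floor quantizer, and all function names and numeric values are illustrative assumptions.

```python
def tdc_count(delay_fs: int, resolution_fs: int) -> int:
    """Quantize a time offset into TDC counts (floor convention, an assumption)."""
    return delay_fs // resolution_fs


def calibrate(d0_90_fs: int, half_period_fs: int, resolution_fs: int,
              step_fs: int, max_iter: int = 1000) -> int:
    """Nudge the f90 edge until TDC1 and TDC2 report the same count.

    d0_90_fs       : initial time offset between the f0 and f90 edges
    half_period_fs : offset between f0 and f180 (fixed, since f180 is the wired inverse)
    step_fs        : hypothetical correction applied per iteration
    Returns the f0-to-f90 offset after calibration.
    """
    d = d0_90_fs
    for _ in range(max_iter):
        count1 = tdc_count(d, resolution_fs)                   # TDC1: f0 to f90
        count2 = tdc_count(half_period_fs - d, resolution_fs)  # TDC2: f90 to f180
        if count1 == count2:
            break
        # Comparator decision: "1" when TDC1 > TDC2 (shrink the offset), else "0" (grow it)
        d += -step_fs if count1 > count2 else step_fs
    return d


# Illustrative numbers: 3.8GHz clock (half period about 131.579 ps), 2 ps resolution
print(calibrate(70_000, 131_579, 2_000, 250), "fs")
```

In this sketch the correction step is chosen smaller than the TDC resolution so that the loop can settle inside one quantization bin; the residual mismatch is then bounded by the timing resolution.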

This structure can be applied to quadrature phase calibration, eight phase calibration, or other multi-phase calibration mechanisms. Compared with the calibration structure in [82], the proposed TDC embedded PLL directly adjusts the PLL clocks to be evenly spaced instead of adding delay components to compensate for the phase mismatches. The advantage of directly adjusting the PLL phase mismatches is that the PLL characteristics can be predicted more clearly and easily. The PVT variations introduced by added compensating parts are also avoided. In addition, the calibrated PLL clocks can be easily implemented in the 3D process without adding compensation parts in each tier.

4.2.1 Structure of the Proposed TDC

Fig. 4-6 shows the traditional TDC with a Vernier delay line structure and its operation process. Clk1 and Clk2 are two signals from which we want to generate the phase difference through the Vernier delay line. As shown in Fig. 4-6 (b), the period of Clk2 is slightly longer than that of Clk1 and Ci samples Di (i = 1 → n) after every delay increment, T1 and T2. The timing resolution is defined as $\Delta T = T_2 - T_1$. Since $T_2 > T_1$, the sampling point will shift slightly to the right as shown in Fig. 4-6 (b). Eventually, the sampling value will jump from “1” to “0”, and at this time the data and clock of that D flip-flop is considered to be aligned. In Fig. 4-6 (b), the data and clock rising edge is aligned after 5 delay increments. Finally, the product of the digital number and the timing resolution represents the phase difference between Clk1 and Clk2.

(a) Structure of a typical TDC with Vernier delay line

(b) The input and output data of the typical TDC

Figure 4-6 A typical TDC with Vernier delay line

The critical issue for this structure is that PVT issues cause variations among those delay increments. As mentioned above, the phase difference is related to the timing resolution. However, this result is based on the assumption that the delay increments in each delay line are equal to each other, which is difficult to guarantee at the design stage because of the process, temperature, and voltage variations. Thus, this structure may be more suitable for low speed application and under less critical conditions.

Previous papers have mentioned this issue and an alternative method has been developed to eliminate the PVT variations [79]. In [79], the several delay increments of the Vernier delay line are replaced with a single VCO so that the PVT variations of the Vernier delay line are eliminated. However, the timing resolution, which is the period difference of the newly introduced VCOs, is still affected by PVT variations, because they make the VCO frequency unstable.

Fig. 4-7 shows the proposed TDC structure. A feedback loop with preset frequency is incorporated into the TDC to stabilize the VCO frequency in the TDC, achieve wide tuning range, and alleviate the PVT variations in the VCO.

As shown in Fig. 4-7, the proposed TDC consists of two parts: the timing resolution determining loop (TRDL) and the phase comparison (PC) block. The TRDL loop determines the period of VCO1 and VCO2, which is T1 and T2, and thus the TDC timing resolution ΔT. The PC part will use the generated T1 and T2 to detect the phase difference between two quadrature clocks and then store that phase difference into a 6-bit counter. The timing resolution can be adjusted through programmable preset numbers.
Fig. 4-7 (b) shows the whole calibration structure. One TDC generates the phase difference between $f_0$ and $f_{90}$. The other one generates the phase difference between $f_{90}$ and $f_{180}$. $f_0$ is in the PLL loop which functions as the calibration reference, and $f_{180}$ is the wired inverse of $f_0$. So, when the PLL is locked, $f_0$ and $f_{180}$ will be exactly 180 degrees apart.
Then, the generated phase difference from the two TDCs will be stored in two 6-bit counters. The output of the two 6-bit counters then goes into a comparator [83] which is followed by a low pass filter. The differential filter output then goes into the PLL’s charge pump/loop filter to adjust the VCO control signal.

4.2.1.1 Timing Resolution Determine Loop (TRDL)

This TRDL is a phase-locked loop with a programmable preset reference. The counter block generates a number which is determined by the VCO frequency. In this project, a 10-bit counter is used. The compare block in the TRDL determines the difference between the counter output and the preset number and then feeds back the comparison result through a loop filter to adjust the TDC VCO frequency. $T_1$ and $T_2$ can be stabilized through those two feedback loops, and a wide tuning range can be achieved by those two ring oscillators. We also call this a timing resolution determination PLL (TRDPLL). The VCO in the TRDL loop is a ring oscillator with tuning varactors that change the buffer delay, thus adjusting the VCO oscillation frequency. This is shown in Fig. 4-8.

![Schematic of the proposed VCO in the TRDL loop](image)

Figure 4-8 Schematic of the proposed VCO in the TRDL loop
To explain how to set the timing resolution, let’s take an example. Suppose the counting period is set to $t$ ns, and the preset number is set to $N_1$ and $N_2$ separately for the “Preset1” and “Preset2” blocks. Then, when the TRDPLL is in lock, the timing resolution can be determined as:

$$\Delta T = T_2 - T_1 = \frac{t}{N_2} - \frac{t}{N_1} = \frac{N_1 - N_2}{N_1 N_2} t. \tag{4-1}$$

Thus, if $t=200$ns, $N_1=1000$, $N_2=990$, then $\Delta T \approx 2$ps. The frequencies of VCO1 and VCO2 are 5GHz and 4.95GHz, which are easy to achieve. $t$ can be set to an integer multiple of the reference clock period, which is precisely determined.
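The arithmetic of Eq. (4-1) can be checked with a few lines of Python. This is only a numerical illustration of the equation; the function name `timing_resolution` is hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def timing_resolution(t_ns, n1, n2):
    """Return (T1, T2, delta_T) in ns for counting period t and presets N1, N2."""
    t = Fraction(t_ns)
    t1 = t / n1          # period of VCO1 when the loop is locked
    t2 = t / n2          # period of VCO2 when the loop is locked
    return t1, t2, t2 - t1


t1, t2, dt = timing_resolution(200, 1000, 990)
print("T1 =", float(t1) * 1e3, "ps")       # 200 ps, i.e. 5 GHz
print("T2 =", float(t2) * 1e3, "ps")       # about 202.02 ps, i.e. 4.95 GHz
print("delta T =", float(dt) * 1e3, "ps")  # about 2.02 ps
```

The exact result is $\Delta T = 1/495$ ns, which is the "2ps" figure quoted in the text after rounding.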

Here, large $N_1$ and $N_2$ with small $t$ will definitely increase the timing resolution, but the frequency of the TDC VCO will also increase greatly, which makes this method more dependent on the process. Therefore, we choose a relatively large $t$ and small $N_1$ and $N_2$ to both achieve high timing resolution and to relax the TDC VCO requirements.

4.2.1.2 Phase Comparison (PC)

The preset timing resolution in TRDL part will then be applied to a PC unit, to determine the phase difference between the two quadrature clocks, $f_0$ and $f_{90}$. Those two clocks control two “1 to 2” selection blocks, and two different operations.

If the clock signal is high, VCO1 and VCO2 are connected to the D latch, and VCO2 samples VCO1 through the D latch. Because the frequency of VCO2 is lower than that of VCO1, the output of the D latch is zero for the first several periods and then jumps from low to high. At that point, the time difference between Clk1 and Clk2 equals the TDC resolution multiplied by the value of the 6-bit counter, as shown in Fig. 4-9. After that, the output holds its previous value until the following comparison is completed. The frequency of VCO1 and VCO2 must be much higher than the frequency of the comparison clocks, so that there is sufficient time for the comparison while the clock signal is high. This operation is called the phase difference determination process. Signals VTune1 and VTune2 are held during this operation to keep the VCO frequency stable.
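The counting behavior of the phase comparison can be illustrated with a small edge-by-edge model. This is a sketch under stated assumptions, not the circuit: it assumes the faster oscillator (period T1) starts later than the slower one (period T2) by the offset being measured, that the counter advances once per sample period while the latch output is still low, and that the counter saturates at 63 (6 bits). The function name `vernier_count` is hypothetical.

```python
def vernier_count(offset_fs: int, t1_fs: int, t2_fs: int, max_count: int = 63) -> int:
    """Count sample periods until the fast oscillator catches up with the slow one.

    offset_fs : time by which the fast oscillator starts after the slow one
    t1_fs     : period of the fast oscillator (VCO1)
    t2_fs     : period of the slow oscillator (VCO2), must exceed t1_fs
    """
    if t2_fs <= t1_fs:
        raise ValueError("t2_fs must be larger than t1_fs")
    count = 0
    for k in range(1, max_count + 1):
        fast_edge = offset_fs + k * t1_fs  # k-th edge of the later-starting fast oscillator
        slow_edge = k * t2_fs              # k-th edge of the earlier-starting slow oscillator
        if fast_edge <= slow_edge:         # fast edge has caught up: latch flips, counter holds
            break
        count += 1
    return count


# Periods from the Eq. (4-1) example, rounded to femtoseconds; offset is illustrative
n = vernier_count(65_800, 200_000, 202_020)
print(n, "counts, measured offset =", n * (202_020 - 200_000) / 1000, "ps")
```

Each sample period closes the gap by $\Delta T = T_2 - T_1$, so the count multiplied by $\Delta T$ approximates the offset to within one timing resolution step.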

![Waveform Diagram]

**Figure 4-9** The input and output waveforms of the TDC block

![Schematic Diagram]

**Figure 4-10** Schematic of the proposed D flip-flop with holding and resetting

After the output of the D flip-flop jumps from "0" to "1", the D flip-flop keeps its value at "1" so that the counter maintains its output. The counter output will represent only the phase difference and will not keep increasing after the phase difference is generated. Fig. 4-10 shows the schematic of the proposed D flip-flop with both resetting and holding controls. The holding is achieved by connecting the output back to the first D latch so that if the output is “1”, the first D latch will enter the latch mode and be isolated from the inputs. After the reset, the D flip-flop will then sample the input again until it senses “1” at the output, thus achieving a holding status after the output of the D flip-flop jumps to “1”. Fig. 4-11 shows the schematic of the 6-bit counter. It is a synchronous structure with carry-in and carry-out. It consists of 6 discrete parts, and each one has a D flip-flop, an XOR gate and an AND gate. The D flip-flop together with the first XOR gate actually functions as a J-K flip-flop with both J and K tied to logical “1”.
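The counter structure described above (one D flip-flop, one XOR gate, and one AND gate per bit, with a carry chain) can be sketched at the logic level as follows. The bit ordering (index 0 as the least significant bit), the `enable` input acting as the first carry-in, and the function names are assumptions made for illustration.

```python
def counter_step(q, enable=1):
    """Apply one clock edge to a synchronous counter built from D flip-flops.

    q is a list of bits with q[0] as the LSB (an assumed ordering).
    Each stage computes d = q XOR carry_in and carry_out = q AND carry_in,
    so a bit toggles only when all lower bits are 1.
    """
    carry = enable
    next_q = []
    for bit in q:
        next_q.append(bit ^ carry)  # XOR gate feeding the D input
        carry = bit & carry         # AND gate producing the carry-out
    return next_q


def to_int(q):
    """Convert an LSB-first bit list to an integer."""
    return sum(bit << i for i, bit in enumerate(q))


state = [0] * 6
for _ in range(31):
    state = counter_step(state)
print(to_int(state))  # 31
```

With six stages the count wraps after 64 clock edges, which sets the largest phase difference the TDC can represent for a given timing resolution.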

If the control signal is low, the TRDL loop is on and the timing resolution is refreshed. Therefore, through the periodical alternating of those two operations, the timing resolution can be stabilized very well, and the effects of PVT variations can be strongly alleviated.

Finally, the comparison results feed back to the PLL loop filter which then adjusts the multiphase clocks.

4.2.1.3 The Comparator

In this project, the output of TDC2 is set as the reference of the comparator as shown in Fig. 4-7 (b). This comparator compares its two inputs, whose bits are either “1” or “0”, and generates a comparison result which is “1” if the output of TDC1 is larger than that of TDC2, and “0” if the output of TDC1 is smaller than that of TDC2. Since a comparator output of “1” means that the phase difference between $f_0$ and $f_{90}$ is larger than that between $f_{90}$ and $f_{180}$, the frequency of the VCO needs to be reduced to alleviate this difference. The frequency of the VCO needs to be increased if the comparator output is “0”. As shown in Fig. 4-7 (b), the comparator is a 6-bit array.

Fig. 4-12 shows the structure of the 6-bit comparator [84]. A1 – A6 are the counter outputs from TDC1; and B1 – B6 are the counter outputs from TDC2. The device size ratios of the current sources are W/L for the first 6 tail sources and W/L for the last tail unit. The device ratios of the bottom five sources are 2W/L. Thus, the current produced at each branch is progressively divided by a factor of “2,” which makes it compatible with the binary weight of the preceding TDC outputs.
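The binary-weighted comparison can be illustrated with a short behavioral sketch. It models only the idea that each bit contributes a current weighted by a power of two and that the two summed currents are compared; it does not reproduce the transistor sizing. The bit ordering (index 0 as the least significant bit), the unit current, the output for equal inputs, and all names are assumptions.

```python
def to_bits(value: int, width: int = 6):
    """Return an LSB-first list of bits for an unsigned integer."""
    return [(value >> k) & 1 for k in range(width)]


def weighted_current(bits, unit: int = 1) -> int:
    """Sum binary-weighted branch currents: bit k contributes unit * 2**k."""
    return sum(unit * (1 << k) * bit for k, bit in enumerate(bits))


def compare(a_bits, b_bits) -> int:
    """Return 1 if word A is larger than word B, otherwise 0.

    The text does not specify the output for equal inputs; 0 is assumed here.
    """
    return 1 if weighted_current(a_bits) > weighted_current(b_bits) else 0


# TDC1 reports 33 counts, TDC2 reports 32 counts
print(compare(to_bits(33), to_bits(32)))  # 1
```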

![Figure 4-12 Schematic of the 6-bit comparator](image)

4.2.1.4 Sampling Circuit

As explained in Section 4.2.1.2, the two signals, \( f_0 \) and \( f_{90} \), must stay at a high level while the phase difference is being generated, as shown in Fig. 4-9. However, the PLL VCO output frequency is comparable with the VCO frequency in the TDC, which does not leave enough time for the comparison to finish. Thus, we need a signal with a much longer period than the PLL outputs that still reflects the phase difference between the PLL outputs. One possible method is to add a divider like that in the PLL loop. However, the divider may also introduce phase mismatches, so a more stable signal is preferred. Among the circuit signals, the PLL input reference signal is the most stable one since it is generated externally by a crystal.

![Diagram of sampling circuit]

Figure 4-13 Schematic and waveforms of the sampling circuit

Fig. 4-13 shows the schematic and the waveforms of the sampling circuit. The reference signal is first sampled by the quadrature clocks from the PLL. Thus, the new outputs, \( f'_0 \), \( f'_{90} \) and \( f'_{180} \), are synchronized with \( f_0, f_{90} \) and \( f_{180} \) and keep the same phase differences. The sampled outputs now have long enough periods for the phase comparison.
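The following sketch illustrates why resampling the slow reference preserves the phase spacing of the fast clocks. It assumes a cascaded arrangement (the reference is retimed by the 0 degree clock, that result by the 90 degree clock, and that result by the 180 degree clock), which is one plausible reading of the description and not necessarily the schematic of Fig. 4-13. Times are in integer picoseconds, and the 264ps period and 3ps mismatch are illustrative values.

```python
def next_rising_edge(t: int, offset: int, period: int) -> int:
    """First rising edge strictly after time t of a clock with edges at offset + k*period."""
    k = (t - offset) // period + 1
    return offset + k * period


def sample_reference(t_ref: int, period: int, off90: int, off180: int):
    """Retime a reference edge at t_ref through the 0, 90 and 180 degree clocks in cascade."""
    e0 = next_rising_edge(t_ref, 0, period)         # edge of f0'
    e90 = next_rising_edge(e0, off90, period)       # edge of f90'
    e180 = next_rising_edge(e90, off180, period)    # edge of f180'
    return e0, e90, e180


# 264 ps clock period, 90 degree clock late by 3 ps (66 + 3), 180 degree clock ideal
for t_ref in (10, 100, 1000, 5003):
    e0, e90, e180 = sample_reference(t_ref, 264, 69, 132)
    print(t_ref, e90 - e0, e180 - e90)
```

Whatever the arrival time of the reference edge, the spacing between the retimed edges equals the spacing between the fast clock phases, while the retimed signals repeat only at the reference rate.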

4.3 Simulation Results and Layouts

Fig. 4-14 shows the simulation waveforms of the proposed TDC. The top two waveforms are the sampled waveforms of the reference by the quadrature phase clocks at 3.8GHz, and the phase difference between them is 65.8ps. The third and fourth waveforms are the outputs of VCO1 and VCO2. The preset timing resolution is $\Delta T = T_2 - T_1 = 356.6ps - 354.5ps = 2.1ps$. The bottom waveform is the D latch output. We can see that after 31 sample periods the D latch output becomes high, which means the measured time difference is $31\Delta T = 65.1ps$, which has an error of only 1.01%.

Figure 4-14 Simulation waveforms of the proposed TDC, from top to bottom: VCO1, VCO2 and D latch output
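As a worked check of these numbers (added for illustration, assuming the counter reports the whole number of resolution steps that fit in the phase difference), the quantization can be written as:

$$n = \left\lfloor \frac{65.8\,\text{ps}}{2.1\,\text{ps}} \right\rfloor = 31, \qquad n\,\Delta T = 31 \times 2.1\,\text{ps} = 65.1\,\text{ps}, \qquad \frac{65.8 - 65.1}{65.8} \approx 1.1\%.$$

With the rounded values quoted in the text, this is in line with the roughly 1% error reported. In general the measurement error of such a TDC is bounded by one resolution step, here $2.1\,\text{ps}$ or about 3.2% of the 65.8ps phase difference.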
Fig. 4-15 shows the activating and deactivating of the VCO in the TRDL loop in the proposed TDC. When the comparison signals, which are the quadrature phase clocks, jump to a high voltage level, the VCO signal passes to the D flip-flop as shown in Fig. 4-15. In this process, the VCO is transparent to the D flip-flop. However, when the quadrature clocks jump to a low voltage level, the VCO is disconnected from the following D flip-flop. Fig. 4-15 shows that the VCO output is a constant low voltage level in this case.

Figure 4-15 VCO output in the TRDL loop under activating and deactivating conditions.

Fig. 4-16 shows the eye diagram of the PLL output (a) before and (b) after the TDC was embedded. The frequency of the multiphase PLL is 3.8GHz. The jitter of the eye diagram decreases from 5.98ps to 3.58ps after the TDC circuit is implemented. We believe we can expand the calibrated PLL speed to 10GHz, or even higher, by increasing the two preset numbers $N_1$ and $N_2$. According to (4-1), the timing resolution can then be higher, which will allow calibration of higher speed multiphase PLLs.

(a) PLL eye diagram without calibration circuit: jitter is 5.98ps

(b) PLL eye diagram with calibration circuit: jitter is 3.58ps

Figure 4-16 Simulated eye diagram of the multi-phase PLL output with and without the calibration circuit.

Due to the limited resolution of this TDC, the phase mismatch will be more difficult to decrease as the frequency increases, because higher timing resolution is needed for high speed clocks. There also exists a tradeoff between the jitter performance and the power consumption. Since lower jitter requires higher timing resolution to accurately calculate the phase difference, more counters are needed, which causes more power dissipation.

Table 4-2 shows the peak-to-peak jitter of the PLL output with and without the proposed calibration circuit. The PLL center frequency is 3.8GHz. As shown in Table 4-2, the jitter improvement becomes more significant as temperature increases, although the absolute peak-to-peak jitter becomes worse as the temperature increases. This proves that our proposed TDC calibration circuit successfully compensated for the mismatches caused by the temperature change.

Table 4-2 Jitter performance comparison at different temperatures with and without the proposed calibration circuit

<table>
<thead>
<tr>
<th>Temperature (°C)</th>
<th>22</th>
<th>27</th>
<th>32</th>
<th>37</th>
<th>42</th>
<th>47</th>
<th>52</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>$J_{peak-to-peak without TDC}$</strong></td>
<td>5.1ps</td>
<td>5.98ps</td>
<td>7.7ps</td>
<td>9.53ps</td>
<td>14.63ps</td>
<td>21ps</td>
<td>28.6ps</td>
</tr>
<tr>
<td><strong>$J_{peak-to-peak with TDC}$</strong></td>
<td>3.2ps</td>
<td>3.58ps</td>
<td>4.26ps</td>
<td>4.8ps</td>
<td>6.72ps</td>
<td>8.97ps</td>
<td>11.24ps</td>
</tr>
<tr>
<td><strong>Jitter improvement</strong></td>
<td>37.2%</td>
<td>40%</td>
<td>44.7%</td>
<td>49.6%</td>
<td>54.1%</td>
<td>57.3%</td>
<td>60.7%</td>
</tr>
</tbody>
</table>
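The improvement row of Table 4-2 can be reproduced from the two jitter rows. The short script below is only a consistency check of the tabulated values; the variable and function names are illustrative.

```python
temps_c = [22, 27, 32, 37, 42, 47, 52]
jitter_without_ps = [5.1, 5.98, 7.7, 9.53, 14.63, 21.0, 28.6]
jitter_with_ps = [3.2, 3.58, 4.26, 4.8, 6.72, 8.97, 11.24]


def improvement_percent(before_ps: float, after_ps: float) -> float:
    """Relative jitter reduction in percent."""
    return 100.0 * (before_ps - after_ps) / before_ps


improvements = [improvement_percent(b, a)
                for b, a in zip(jitter_without_ps, jitter_with_ps)]

for temp, value in zip(temps_c, improvements):
    print(f"{temp} C: {value:.1f}% improvement")
```

The computed values agree with the table to within rounding and rise steadily with temperature.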

Fig. 4-17 shows the layout of the proposed TDC embedded PLL. The TDC circuit is in tier 2 and the PLL is in tier 1. The total chip size is 1.4mm x 1.0mm.

4.4 Discussion and Future Work

4.4.1 Discussion

As described above, a time-to-digital (TDC) calibration circuit was implemented with MIT Lincoln Lab’s 3D FDSOI technology. This calibration circuit attenuated the phase mismatches caused by the process and temperature influences of the 3D FDSOI process. The 3D process also helps eliminate the noise coupling issue through the substrate by separating the analog PLL and the digital calibration circuit into different tiers.

A VCO was used to replace the Vernier delay line, eliminating the delay mismatch issues. A timing resolution determine loop (TRDL) was also implemented to set up and stabilize the timing resolution. Compared to the TDCs in [30]-[34], the proposed structure in this project achieved a higher timing resolution, up to 2ps, because of the larger counting number in the TRDL loop. Higher timing resolution requires more power consumption. Therefore, a tradeoff exists between the calibration performance and the power consumption.

The same approach can be extended to any high speed 3D integrated circuit where timing is critical.

4.4.2 Future Work

Phase mismatches between the quadrature phase clocks were compensated with the TDC calibration circuit in this project. Moreover, the proposed structure can be extended to compensate for the phase mismatch in other multi-phase clocks, such as eight-phase clocks. This may involve more power consumption since more TDCs are required. How to realize low power becomes a critical issue as the number of compensated clocks increases. Some components, such as the TRDL loop, may be shared during the operation in order to save power.

CHAPTER 5

CHIP FABRICATION AND MEASUREMENT

This chapter describes the design flow of this SerDes project from the parameter determination at the system level to the chip testing. The equipment setup of the chip testing is then described, and the methods of measuring the required performance are also explained.

5.1 Chip Design Flow

The chip design flow in this project consists of several steps as shown in Fig. 5-1.

First, the flow started with the system specification determination and system-level simulation using Matlab, as described in Chapter 2 and Chapter 3, for the transmitter and receiver parts and then for the whole SerDes system. This included noise performance, circuit bandwidth, jitter transfer, jitter tolerance, jitter generation, etc.

Second, the transistor-level design was conducted based on the system level specifications. This included the transistor sizing and the transistor-level simulation with Cadence design tools. The devices' physical limitations such as the breakdown voltage, the burning current, and the operation speed limitation, also were considered in this step.

Third, the chip layout was drawn after the transistor-level design was done. This layout drawing had to follow the design rules from the foundry, such as the line width and spacing, reliability considerations to avoid breakdown and burning of the devices, and the electrostatic discharge (ESD) protection. The tools we used were Calibre and Assura design tools.

[Flowchart, in order: System Specification & Simulation (noise performance, circuit bandwidth, jitter transfer, jitter tolerance, jitter generation, etc.); Transistor-Level Design (transistor sizing, simulation, physical limitation requirements, etc.); Layout Design (DRC, LVS, ESD protection, etc.); Post Layout Simulation (PEX, RCX extraction); Fabrication in Foundry (JAZZ, IBM, MIT Lincoln Lab); Chip Testing (instruments setup, waveform recording and analysis)]

Figure 5-1 Design flow of this project

Fourth, the layout parameters including the parasitic parameters had to be extracted after all layout design rules were passed, and the post layout simulation was then conducted. If the post layout simulation results did not satisfy the design requirements, we had to go back to the transistor-level design and adjust the transistor size. Then we had to redraw the layout and redo the post layout simulation, until the post layout simulation satisfied the design requirements.

After all of the above steps were completed, the chip was sent to the foundry for fabrication. This step normally takes three months.

Finally, the foundry sent the fabricated chip back to us, so we could build the testing environment for the chip measurement.

5.2 Equipment Setup

Table 5-1 lists the testing equipment we used in this measurement. The oscilloscope was used to measure the output waveforms and eye diagram. The function generator was used to generate the reference clock for the PLL. The spectrum analyzer was used to measure the power spectrum and phase noise of the PLL output. The ten-channel RF probe was used to connect the chip pads to the cables, which in turn connected to the instruments.

Fig. 5-2 shows the diagram of the equipment setup of the SerDes chip testing. The arrow represents the signal flow direction.

The chip was first placed on the probe station with an electron-microscope, which was used to observe the connections between the pads and the RF probe. The chip was then fixed on the probe station by the two pairs of RF probes by connecting the probes to the chip pads. The cables were then connected to the instruments and sources as shown by the solid arrows in Fig. 5-2.

Table 5-1 Testing equipment and specifications

<table>
<thead>
<tr>
<th>Equipment</th>
<th>Model</th>
<th>Specifications</th>
</tr>
</thead>
<tbody>
<tr>
<td>Oscilloscope</td>
<td>Tektronix 11801C</td>
<td>DC to 50GHz</td>
</tr>
<tr>
<td>Function Generator</td>
<td>Rohde &amp; Schwarz SML01</td>
<td>9 kHz to 1.1 GHz/2.2 GHz/3.3 GHz</td>
</tr>
<tr>
<td>Spectrum Analyzer</td>
<td>Rohde &amp; Schwarz FSP40</td>
<td>9 kHz to 40 GHz</td>
</tr>
<tr>
<td>Power Supply</td>
<td>Agilent E3631A</td>
<td>0 - 6 V / 0 - 5A and 0 - 20 V /0 - 20 V / 0 - 1A</td>
</tr>
<tr>
<td>Ten-Channel RF Probe</td>
<td>GGB MCW-8-2880</td>
<td>DC to 40GHz</td>
</tr>
</tbody>
</table>

Figure 5-2 Equipment setup of the chip testing in this project

5.3 Testing Methods

The power supply provided a negative voltage with the ground as the highest value. In this project, a negative voltage of -3.4V was used. The signal generator’s output was connected with the PLL reference input to supply the sinusoidal reference signal at 100MHz frequency with proper swing requirement, which was above 400mV in this project. The power supply was turned on at the lowest level and increased slowly to avoid circuit damage in the chip due to any sudden jumps in the power supply.

To measure the spectrum and phase noise of the PLL output, the output of the PLL was connected with the spectrum analyzer through the cable as shown in Fig. 5-2. The spectrum could be observed by choosing the spectrum option of the spectrum analyzer. The power spectrum waveform was shown in the screen with an adjustable center position and scales. The phase noise was observed by choosing the noise analysis function of the spectrum analyzer, and the phase noise curve was shown with adjustable scales. The measurement results are displayed in Fig. 2-30 and Fig. 2-31.

To measure the eye diagram of the transmitter output, the TX output pad was connected to the oscilloscope through a cable, as shown in Fig. 5-2. Because the PRBS data was generated on-chip from the PLL clock, the PLL output also had to be connected to the oscilloscope as the triggering signal so that the trigger was synchronized with the measured data. In this project, the 2.24 GHz PLL output was used as the triggering signal. The measured eye diagram is shown in Fig. 2-33 (b).

6.1 Conclusion

This material is based upon work supported by the National Science Foundation under Grant No. 0702109.

In this thesis, a high speed SerDes chip with a novel quarter-rate multiplexer/demultiplexer (MUX/DEMUX) and LC-ring oscillator was implemented with 0.18μm BiCMOS technology and 65nm CMOS technology.

The quarter-rate MUX/DEMUX not only relaxes the VCO design requirements but also reduces the circuit power consumption. The developed LC-ring oscillator achieved both a wide tuning range (11%) and low phase noise (-110 dBc/Hz at 1 MHz offset). Using the high-$f_t$ SiGe BiCMOS technology, a 16:1 MUX/DEMUX at a 36 Gb/s data rate was implemented with a measured peak-to-peak jitter of 10 ps. With the low-power 65 nm CMOS technology, a 60 Gb/s data rate was realized with a simulated peak-to-peak jitter of 4.8 ps. The power consumption of the proposed 4:1 MUX/DEMUX was as low as 92 mW (31 mW for the MUX and 61 mW for the DEMUX) with the developed quarter-rate structure, in which the PLL frequency is only 1/4 of the data rate.

In addition, a phase calibration circuit was developed in a three dimensional fully depleted silicon on insulator (3D FDSOI) CMOS process to compensate for the phase mismatches among the multi-phase PLL clocks, which are caused by process variations, the tier-to-tier interconnections, and the temperature influence in the 3D process. The simulation results show that the peak-to-peak jitter at a 3.8 GHz PLL frequency decreased from 5.98 ps to 3.58 ps when the proposed calibration circuit was added.

The SerDes implemented in this project achieved a good tradeoff among power consumption, data rate, and jitter performance.

6.2 Discussion

As mentioned at the end of each chapter, future work includes the design of line encoding/decoding, further high speed implementation of the SerDes, and the low power implementation of the TDC with a larger number of multi-phase clocks.

The work completed in this project has realized a high speed SerDes with relaxed design constraints. Power saving is always a goal in industrial applications, and even in high speed circuits a high data rate may not be required all the time. For example, online HDTV needs a high data rate because of the large volume of data, whereas a lower data rate is enough for online news in text format. The clock frequency requirements in these two cases are different, and a higher clock frequency means more power consumed. Therefore, developing a structure that can adjust its own speed to the application requirements would be a worthwhile project. A reconfigurable SerDes of this kind could achieve higher power efficiency.

In the SerDes chip, all parallel data passes through the multiplexers, and the number of MUXs grows with the number of data bits. The same holds for the DEMUX. If the clock frequency does not change, the number of MUX/DEMUXs determines the power dissipation of the chip. Therefore, one idea is to add enable/disable signals to each MUX/DEMUX so that, when a lower data capacity is required, part of the MUX/DEMUX can be disabled to save power. The enable/disable signals can be a digital word programmed according to the input requirements. This method saves power while keeping the same clock frequency in both the large and small data capacity cases. Another idea is to keep the whole MUX/DEMUX working all the time while changing the clock frequency. This method may require a VCO with a very wide tuning range, for which a ring structure may be preferred. A third idea is to combine the two.

We believe that with proper structure configurations based on the above ideas, SerDes with higher data rate and better power efficiency can be realized in the near future.

LIST OF REFERENCES

[87] RD1012 - 8b/10b Encoder/Decoder.

A linear feedback shift register (LFSR) is a pseudorandom bit sequence (PRBS) generator used to assist testing in serial link designs and wireless communication systems. An m-bit LFSR can step through at most $2^m-1$ different states. After the last of these $2^m-1$ states, the LFSR begins the same cycle again from the first state. Over one maximum-length cycle, the output contains nearly equal numbers of “1”s and “0”s ($2^{m-1}$ ones and $2^{m-1}-1$ zeros). Because the LFSR’s running frequency is limited only by the propagation delay of the D flip-flops and of the XOR gates between them, it can easily reach high speed. An LFSR is therefore well suited to generating high frequency pseudorandom patterns.

There are two different implementations of the LFSR: the Fibonacci implementation and the Galois implementation. They can be distinguished by their respective feedback path implementations.

Fig. A-1 shows the structure of the Fibonacci implementation. The outputs of the D flip-flops are weighted by binary tap coefficients, and their modulo-2 sum is fed back to the input. Modulo-2 addition means 1+0=1, 0+1=1, 0+0=0 and 1+1=0, which is exactly the function of an XOR gate. As shown in Fig. A-1, $a_1$ to $a_{m-1}$ represent the feedback coefficients of each path: the feedback path is connected if $a_i$ is 1 and disconnected if $a_i$ is 0.

Fig. A-2 shows the structure of the Galois implementation, which is the reciprocal of the Fibonacci implementation. Whereas the Fibonacci structure places all the taps in the feedback path, the Galois structure places all the taps in the feedforward path, with no taps in the feedback path. Each flip-flop input is therefore preceded by at most one XOR gate, so the Galois implementation achieves higher speed than the Fibonacci one. For this reason, the LFSR used in this project is a Galois implementation.
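
The difference between the two structures can be sketched in a few lines of Python. This is only an illustrative software model, not the circuit used in this project: the function names, the shift direction, and the bit ordering of the integer `state` (bit i-1 holds the output of flip-flop i) are assumptions made for the example. The tap set (3, 1) stands for $A(x) = 1 + x + x^3$.

```python
def fibonacci_step(state, taps, m):
    """One clock of a Fibonacci LFSR: XOR all tapped outputs and shift the result in."""
    feedback = 0
    for t in taps:                        # every tap sits in the feedback path
        feedback ^= (state >> (t - 1)) & 1
    return ((state << 1) | feedback) & ((1 << m) - 1)


def galois_step(state, taps, m):
    """One clock of a Galois LFSR: the last output is XORed in between the stages."""
    last = (state >> (m - 1)) & 1         # output of the last flip-flop
    state = (state << 1) & ((1 << m) - 1)
    if last:
        mask = 1                          # a_0 = 1: feedback into the first stage
        for t in taps:
            if t < m:
                mask |= 1 << t            # one XOR gate between stage t and stage t+1
        state ^= mask
    return state


def cycle(step, seed, taps, m):
    """List the states visited from `seed` until the LFSR returns to it."""
    states = [seed]
    state = step(seed, taps, m)
    while state != seed and len(states) < (1 << m):
        states.append(state)
        state = step(state, taps, m)
    return states


fib = cycle(fibonacci_step, 0b001, (3, 1), 3)
gal = cycle(galois_step, 0b001, (3, 1), 3)
print(len(fib), len(gal))   # both structures visit 2**3 - 1 = 7 states
```

In this model the Fibonacci step needs the XOR of all tapped bits before the shift can complete, whereas each bit of the Galois step depends on at most one XOR, which mirrors the speed argument made above.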

As shown in Fig. A-1 and Fig. A-2, any m-bit LFSR can be represented as a polynomial of variable x:

\[ A(x) = a_0 + a_1x + a_2x^2 + a_3x^3 + \cdots + a_{m-1}x^{m-1} + a_mx^m. \tag{A-1} \]

As mentioned above, an m-bit LFSR can generate a sequence with a maximum length of \(2^m-1\) bits. Choosing values of \(a_0\) to \(a_m\) that reach this maximum is therefore the most important step in designing an LFSR.

As explained before, the values of \(a_0\) to \(a_m\) indicate whether the corresponding feedback paths are connected. For example, a 3-bit LFSR with \(A(x) = 1 + x + x^3\) has the configuration shown in Fig. A-3. A simplified representation of this polynomial is \{3, 1\}, which means that there are feedback connections at tap 3 and tap 1. To determine whether this combination produces the maximum-length sequence, some rules need to be followed. A polynomial of the form of Equation (A-1) is primitive if it is a factor of \((x^n + 1)\) for \(n = 2^m - 1\) but is not a factor of \((x^p + 1)\) for any \(p < 2^m - 1\). An LFSR represented by a primitive polynomial generates the maximum-length sequence.

Based on the above definition, the polynomial \(A(x) = 1 + x + x^3\) is a primitive polynomial of \((x^7 + 1)\) because

\[
x^7 + 1 = (x + 1)(x^3 + x^2 + 1)(x^3 + x + 1) \tag{A-2}
\]

under modulo-2 arithmetic, and it does not divide \((x^p + 1)\) for any \(p < 7\). The polynomial \(A(x) = 1 + x^2 + x^3\) is also primitive. There are therefore two different configurations with which a 3-bit LFSR can generate the maximum-length sequence.
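
The factorization in Equation (A-2) and the condition on \(p\) can be checked numerically. The sketch below is only an illustration: it stores a polynomial as an integer whose bit i is the coefficient of \(x^i\), and the helper names `gf2_mul`, `gf2_mod` and `smallest_p` are hypothetical names chosen for this example.

```python
def gf2_mul(a, b):
    """Multiply two polynomials over GF(2); bit i of each integer is the coefficient of x^i."""
    product = 0
    while b:
        if b & 1:
            product ^= a          # modulo-2 addition is XOR
        a <<= 1
        b >>= 1
    return product


def gf2_mod(a, b):
    """Remainder of a divided by b over GF(2) (b must be nonzero)."""
    while a.bit_length() >= b.bit_length():
        a ^= b << (a.bit_length() - b.bit_length())
    return a


def smallest_p(poly):
    """Smallest p >= 1 for which poly divides x^p + 1 (poly needs a constant term of 1)."""
    p = 1
    while gf2_mod((1 << p) | 1, poly) != 0:
        p += 1
    return p


X_PLUS_1, P1, P2 = 0b11, 0b1101, 0b1011            # x+1, x^3+x^2+1, x^3+x+1
print(bin(gf2_mul(gf2_mul(X_PLUS_1, P1), P2)))     # 0b10000001, i.e. x^7 + 1
print(smallest_p(P1), smallest_p(P2))              # 7 7: both are primitive for m = 3
print(smallest_p(0b1001))                          # 3: x^3 + 1 is not primitive
```

A degree-3 polynomial is primitive exactly when `smallest_p` returns \(2^3 - 1 = 7\); the last line shows a counterexample that already divides \((x^3 + 1)\).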

Table A-1 lists tap combinations that correspond to primitive polynomials for LFSRs of different lengths. From this table, \{12, 9, 8, 5\} can be chosen for the 12-bit LFSR in this project.
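
Whether a given tap set is maximum-length can be checked by brute force: clock the register from a nonzero seed and count the steps until the seed recurs. The sketch below assumes a Fibonacci-style update and a hypothetical helper name `lfsr_period`; it is a software check only and says nothing about the circuit implementation. The all-zero seed is excluded because it maps to itself.

```python
def lfsr_period(taps, seed=1):
    """Clock a Fibonacci-style LFSR from `seed` and count the steps until it repeats."""
    m = max(taps)                          # register length = highest tap
    mask = (1 << m) - 1
    state, steps = seed, 0
    while True:
        feedback = 0
        for t in taps:
            feedback ^= (state >> (t - 1)) & 1
        state = ((state << 1) | feedback) & mask
        steps += 1
        if state == seed:                  # the update is invertible, so the seed always recurs
            return steps


# Entries taken from Table A-1: measured period next to the target 2**m - 1.
for taps in [(3, 2), (4, 3), (5, 3), (6, 5), (7, 6), (7, 4)]:
    print(taps, lfsr_period(taps), 2 ** max(taps) - 1)

# Tap set quoted above for the 12-bit LFSR; compare the result with 2**12 - 1 = 4095.
print(lfsr_period((12, 9, 8, 5)))
```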

The initial value of the LFSR needs to be set carefully. As shown in Fig. A-1 and Fig. A-2, if the outputs of the D flip-flops are all initialized to “0”, every XOR output is also “0”, so the LFSR stays locked in the all-“0” state, which is undesired. Thus, the all-zero state must be avoided at start-up.

Table A-1 The polynomial coefficients of the LFSR with different numbers of bits

<table>
<thead>
<tr>
<th>Bit number of LFSR</th>
<th>Polynomial coefficients</th>
</tr>
</thead>
<tbody>
<tr>
<td>3</td>
<td>{3, 2}</td>
</tr>
<tr>
<td>4</td>
<td>{4, 3}</td>
</tr>
<tr>
<td>5</td>
<td>{5, 3}, {5, 4, 3, 2}, {5, 4, 3, 1}</td>
</tr>
<tr>
<td>6</td>
<td>{6, 5}, {6, 5, 4, 1}, {6, 5, 3, 2}</td>
</tr>
<tr>
<td>7</td>
<td>{7, 6}, {7, 4}, {7, 6, 5, 4}, {7, 6, 5, 2}, ..., {7, 6, 5, 4, 3, 2}, {7, 6, 5, 4, 2, 1}, ...</td>
</tr>
<tr>
<td>8</td>
<td>{8, 7, 6, 1}, {8, 7, 5, 3}, {8, 7, 3, 2}, ..., {8, 7, 6, 5, 4, 2}, {8, 7, 6, 5, 2, 1}, ...</td>
</tr>
<tr>
<td>...</td>
<td>...</td>
</tr>
<tr>
<td>12</td>
<td>{12, 9, 8, 5}, {12, 9, 7, 6}, ..., {12, 11, 10, 9, 8, 4}, ..., {12, 11, 10, 9, 8, 7, 6, 3}, ..., {12, 11, 10, 9, 8, 7, 5, 4, 3, 2}, ...</td>
</tr>
<tr>
<td>...</td>
<td>...</td>
</tr>
<tr>
<td>31</td>
<td>{31, 28}, {31, 25}, ..., {31, 30, 29, 28}, {31, 30, 29, 25}, ..., {31, 30, 29, 28, 27, 23}, {31, 30, 29, 28, 27, 12}, ...</td>
</tr>
<tr>
<td>32</td>
<td>{32, 31, 30, 10}, {32, 31, 29, 1}, ..., {32, 31, 30, 29, 28, 22}, {32, 31, 30, 29, 28, 19}, ...</td>
</tr>
<tr>
<td>...</td>
<td>...</td>
</tr>
</tbody>
</table>

In our implementation, shown in Fig. 2-28, an inverter is added in the last stage to eliminate the all-zero lock-up: if the LFSR enters the all-zero state, the inverter generates a "1" after one clock cycle and brings the LFSR out of the locked state.

APPENDIX B

LINE ENCODING

In the SerDes chip, line encoding is necessary to assist clock and data recovery (CDR). Random input data may contain long runs of consecutive “1”s or “0”s. Such a run presents a constant DC value at the CDR input and causes the CDR to lose tracking. To avoid this problem, a line encoding technique is introduced. It converts each input symbol, with its arbitrary mix of “1”s and “0”s, into a code symbol with equal or nearly equal numbers of “1”s and “0”s, which maintains the DC balance and guarantees frequent data transitions.

Many different encoding schemes have been developed, such as 4B/5B, 8B/10B, and 64B/66B. 4B/5B was developed first and is now used only in lower rate fiber optic communications. 8B/10B is currently widely used in 10 Gb/s data transmission systems. Table B-1 shows some examples of 8B/10B encoding.

Table B-1 Examples of 8B/10B encoding

<table>
<thead>
<tr>
<th>8-bit data</th>
<th>10-bit symbol</th>
</tr>
</thead>
<tbody>
<tr>
<td>00000000</td>
<td>1001110100</td>
</tr>
<tr>
<td>11111111</td>
<td>1010110001</td>
</tr>
<tr>
<td>00000011</td>
<td>1100011011</td>
</tr>
</tbody>
</table>

The 8B/10B encoding is built from the 5B/6B and 3B/4B encodings, as shown in Fig. B-1. The easiest way to ensure DC balance is to use the same number of “1”s and “0”s in the 6B and 4B codes. However, there are not enough of these neutral states (equal numbers of “1”s and “0”s) to represent all the possible input states. For example, the 6-bit symbol has only 20 neutral states, fewer than the 32 possible states of the 5-bit symbol, and the 4-bit symbol has only 6 neutral states, fewer than the 8 possible states of the 3-bit symbol. Thus, symbols with nonzero disparity are also used in the 8B/10B encoding.
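
The counts of neutral states quoted above can be reproduced by enumeration. The following is a minimal sketch; the helper name `neutral_words` is hypothetical.

```python
from itertools import product


def neutral_words(n):
    """All n-bit words that contain the same number of 1s and 0s (n must be even)."""
    return ["".join(bits) for bits in product("01", repeat=n)
            if bits.count("1") == n // 2]


print(len(neutral_words(6)), 2 ** 5)   # 20 neutral 6-bit words for 32 five-bit inputs
print(len(neutral_words(4)), 2 ** 3)   # 6 neutral 4-bit words for 8 three-bit inputs
```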

In a 4-bit code, the difference between the numbers of “1”s and “0”s is +2, 0 or -2, so a 6-bit code with the opposite disparity value (-2, 0, or +2) is chosen to cancel it and maintain the DC balance. The 8B/10B encoding is thus realized by pairing 3B/4B and 5B/6B codes so that the total disparity value equals 0 whenever both codes are unbalanced. If only one of the two codes is unbalanced, the 10-bit symbol keeps a disparity of +2 or -2, and the version that offsets the running disparity left by the preceding symbols is transmitted. Fig. B-2 shows an example of how this works. The symbol “000” has a 4-bit code of “0100” (disparity value is -2) or “1011” (disparity value is +2), and the symbol “00000” has a 6-bit code of “100111” (disparity value is +2) or “011000” (disparity value is -2). Therefore, two combinations ensure the DC balance: “100111 0100” and “011000 1011”.
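
The pairing in this example can be reproduced with a short disparity calculation. The sketch below is illustrative only; `disparity` and `balanced_pairs` are hypothetical helper names, and the candidate codes are the ones quoted above for “00000” and “000”.

```python
def disparity(bits):
    """Number of 1s minus number of 0s in a bit string."""
    return bits.count("1") - bits.count("0")


def balanced_pairs(six_bit_codes, four_bit_codes):
    """Return the 6B/4B pairings whose disparities cancel."""
    return [(six, four) for six in six_bit_codes for four in four_bit_codes
            if disparity(six) + disparity(four) == 0]


pairs = balanced_pairs(["100111", "011000"], ["0100", "1011"])
print(pairs)   # [('100111', '0100'), ('011000', '1011')]
```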

![Figure B-1 Scheme of the 8B/10B encoding](image)

![Figure B-2 8B/10B encoding of the symbol “000 00000”](image)

Table B-2 shows part of the 8B/10B encoding list together with the corresponding 3B/4B and 5B/6B encoding lists. The full table can be found in [87].

### Table B-2 8B/10B encoding lists

#### (a) 3B/4B encoding list

<table>
<thead>
<tr>
<th>3B Decimal</th>
<th>3B Binary</th>
<th>4B Binary</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>000</td>
<td>0100 or 1011</td>
</tr>
<tr>
<td>1</td>
<td>001</td>
<td>1001</td>
</tr>
<tr>
<td>2</td>
<td>010</td>
<td>0101</td>
</tr>
<tr>
<td>3</td>
<td>011</td>
<td>0011 or 1100</td>
</tr>
<tr>
<td>4</td>
<td>100</td>
<td>0010 or 1101</td>
</tr>
<tr>
<td>5</td>
<td>101</td>
<td>1010</td>
</tr>
<tr>
<td>6</td>
<td>110</td>
<td>0110</td>
</tr>
<tr>
<td>7</td>
<td>111</td>
<td>0001 or 1110 or 1000 or 0111</td>
</tr>
</tbody>
</table>

#### (b) 5B/6B encoding list

<table>
<thead>
<tr>
<th>5B Decimal</th>
<th>5B Binary</th>
<th>6B Binary</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>00000</td>
<td>100111 or 011000</td>
</tr>
<tr>
<td>1</td>
<td>00001</td>
<td>011101 or 100010</td>
</tr>
<tr>
<td>2</td>
<td>00010</td>
<td>101101 or 010010</td>
</tr>
<tr>
<td>3</td>
<td>00011</td>
<td>110001</td>
</tr>
<tr>
<td>4</td>
<td>00100</td>
<td>110101 or 001010</td>
</tr>
<tr>
<td>5</td>
<td>00101</td>
<td>101001</td>
</tr>
<tr>
<td>6</td>
<td>00110</td>
<td>011001</td>
</tr>
<tr>
<td>7</td>
<td>00111</td>
<td>111000 or 000111</td>
</tr>
<tr>
<td>8</td>
<td>01000</td>
<td>111001 or 000110</td>
</tr>
<tr>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<td>30</td>
<td>11110</td>
<td>011110 or 100001</td>
</tr>
<tr>
<td>31</td>
<td>11111</td>
<td>101011 or 010100</td>
</tr>
</tbody>
</table>

#### (c) 8B/10B encoding list

<table>
<thead>
<tr>
<th>Code</th>
<th>8B Binary</th>
<th>10B Binary</th>
</tr>
</thead>
<tbody>
<tr>
<td>D0.0</td>
<td>000 00000</td>
<td>100111 0100 or 011000 1011</td>
</tr>
<tr>
<td>D1.0</td>
<td>000 00001</td>
<td>011101 0100 or 100010 1011</td>
</tr>
<tr>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<td>D0.1</td>
<td>001 00000</td>
<td>100111 1001 or 011000 1001</td>
</tr>
<tr>
<td>D1.1</td>
<td>001 00001</td>
<td>011101 1001 or 100010 1001</td>
</tr>
<tr>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<td>D2.5</td>
<td>101 00010</td>
<td>101101 1010 or 010010 1010</td>
</tr>
<tr>
<td>D3.5</td>
<td>101 00011</td>
<td>110001 1010</td>
</tr>
<tr>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<td>K28.0</td>
<td>000 11100</td>
<td>001111 0100 or 110000 1011</td>
</tr>
<tr>
<td>K28.1</td>
<td>001 11100</td>
<td>001111 1001 or 110000 0110</td>
</tr>
<tr>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
</tbody>
</table>

As shown in Table B-2, some symbols are encoded as control characters, normally called K-characters. There are twelve of them, reserved for bit alignment, data type specification, and other control functions [88]. The code name Dx.y in Table B-2 (c) is widely used to represent an 8-bit data symbol, where x is the decimal value of the five least significant bits and y is the decimal value of the three most significant bits. The same naming convention applies to the K-characters (Kx.y).
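
The naming convention can be expressed as a small helper. This is a sketch only: `code_name` is a hypothetical function, and it does not check whether the requested K-character is one of the twelve valid control characters.

```python
def code_name(byte, control=False):
    """Name an 8-bit symbol as Dx.y (data) or Kx.y (control)."""
    x = byte & 0b11111          # decimal value of the five least significant bits
    y = (byte >> 5) & 0b111     # decimal value of the three most significant bits
    return f"{'K' if control else 'D'}{x}.{y}"


print(code_name(0b10100010))                 # D2.5
print(code_name(0b00111100, control=True))   # K28.1
```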

As described above, 8B/10B encoding achieves DC balance very well. However, its overhead may prevent it from being used in higher speed applications, because the line rate must be 1.25 times the data rate. Thus, another coding scheme called 64B/66B encoding is also used because of its lower overhead: its line rate is only about 1.03 times the data rate.
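
As a worked example, the two factors follow directly from the ratio of coded bits to data bits. The 10 Gb/s payload in the second line is only an assumed figure for illustration:

\[ \frac{10}{8} = 1.25, \qquad \frac{66}{64} = 1.03125 \approx 1.03 \]

\[ 10\ \text{Gb/s} \times 1.25 = 12.5\ \text{Gb/s}, \qquad 10\ \text{Gb/s} \times 1.03125 = 10.3125\ \text{Gb/s} \]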

Fig. B-3 shows the 64B/66B scheme. The first two bits of the 66-bit symbol determine the frame type. If the first two bits are “01”, they are followed by 64 bits of scrambled data. If the first two bits are “10”, the following 8 bits determine the type of the remaining 56 bits [88]. Although the 64B/66B encoding has lower overhead, its structure is more complex.
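
The framing rule can be sketched as a small parser. This is a simplified illustration under stated assumptions: the block is passed as a string of 66 characters, `classify_block` and its dictionary keys are hypothetical names, and descrambling of the payload is not modeled.

```python
def classify_block(block):
    """Split a 66-bit block, given as a string of '0'/'1', by its 2-bit header."""
    if len(block) != 66 or set(block) - {"0", "1"}:
        raise ValueError("expected a string of 66 bits")
    header, body = block[:2], block[2:]
    if header == "01":
        # data frame: 64 scrambled data bits
        return {"kind": "data", "payload": body}
    if header == "10":
        # control frame: 8-bit type field followed by the remaining 56 bits
        return {"kind": "control", "type": body[:8], "payload": body[8:]}
    raise ValueError("header must be '01' or '10'")


frame = classify_block("10" + "00011110" + "0" * 56)
print(frame["kind"], frame["type"], len(frame["payload"]))   # control 00011110 56
```

Because the two valid headers are “01” and “10”, every 66-bit block contains at least one transition, while “00” and “11” can be rejected as framing errors.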