Clocked Storage Elements for High-Performance and Low-Power Systems. ICCD 2001 Tutorial. Vojin G. Oklobdzija, University of California Davis, http://www.ece.ucdavis.edu/acsel; Integration Corp., Berkeley, CA 94708, http://www.integration-corp.com
Outline • Importance of Clocked Storage Elements (CSE) • Basic Definitions • Difference between Latch and Flip-Flop • Timing and Power metrics • Representative designs used in High-Performance Microprocessors • Comparison • Conclusion, New Directions and Some novel designs Prof. V.G. Oklobdzija, University of California
Importance of Clocked Storage Elements (CSE)
Trends in high-performance systems: Higher clock frequency
Power vs. Year: High-end growing at 25% / year; RISC at 12% / year; x86 at 15% / year; Consumer (low-end) at 13% / year
Predictions (Source: Shekhar Borkar, Intel)
Recent Interest in Clocked Storage Elements • Trends in high-performance systems: – Higher clock frequency: 1.8 GHz Pentium 4 (4 GHz logic presented) – More transistors on chip (214 million, ISSCC 2001) • Consequences: – Increased Flip-Flop overhead relative to cycle time – Pipeline depth of 20 or more – Cycle time 10 - 20 FO4 delays, F-F overhead 3 - 4 FO4
Courtesy: Doug Carmean, Hot-Chips-13 presentation
Processor Frequency Trend (Source: S. Borkar, Intel) • Frequency doubles each generation • Number of gates per clock drops by 25%
Pentium 3 uArchitecture: three pipeline stages, each consisting of logic (delay 0.6 ns) followed by a register (delay 0.3 ns). The total delay from pipeline stage to pipeline stage is 0.9 ns. The maximum clock rate for this design is 1.1 GHz.
The Pentium 4 Depends on Pipelines: six pipeline stages, each consisting of logic (delay 0.4 ns) followed by a register (delay 0.16 ns). The total delay from pipeline stage to pipeline stage is 560 ps. This design, with twice the stages, has a maximum clock rate of 1.8 GHz. As the design is broken into more pipeline stages, the logic in each stage has less delay, and the registers between stages consume a higher percentage of the delay, causing diminishing returns. At some point the costs of adding more stages, such as branch prediction, leave only a very marginal return. The only way out of this bottleneck is a faster register. This is one reason why the P4 is not significantly faster than a slower-clocked P3 for many applications.
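The clock-rate arithmetic on the two pipeline slides can be reproduced with a short Python sketch. The first part uses the per-stage delays quoted on the slides. The second part is an illustration only: it assumes a fixed total logic delay of 2.4 ns (six stages of 0.4 ns) and a fixed register delay of 0.16 ns, and splits the logic into more and more stages to show the diminishing returns described above. The function names and the stage counts are hypothetical, not taken from the tutorial.

```python
def max_clock_ghz(logic_ns: float, register_ns: float) -> float:
    """Maximum clock rate in GHz for one stage with the given delays in ns."""
    return 1.0 / (logic_ns + register_ns)


def register_share(logic_ns: float, register_ns: float) -> float:
    """Fraction of the stage delay spent in the register."""
    return register_ns / (logic_ns + register_ns)


# Per-stage delays quoted on the slides.
p3_ghz = max_clock_ghz(0.6, 0.3)    # 0.9 ns per stage
p4_ghz = max_clock_ghz(0.4, 0.16)   # 0.56 ns per stage
print(f"P3-style stage: {p3_ghz:.2f} GHz")
print(f"P4-style stage: {p4_ghz:.2f} GHz")

# Illustration (assumed numbers): fixed total logic, fixed register delay.
TOTAL_LOGIC_NS = 2.4   # assumption: 6 stages x 0.4 ns of logic
REGISTER_NS = 0.16     # assumption: register delay does not improve

for stages in (3, 6, 12, 24):
    logic_ns = TOTAL_LOGIC_NS / stages
    print(
        f"{stages:2d} stages: {max_clock_ghz(logic_ns, REGISTER_NS):.2f} GHz, "
        f"register share {register_share(logic_ns, REGISTER_NS):.0%}"
    )
```

Under these assumptions each doubling of the stage count yields less than a doubling of the clock rate, because the register delay is paid in full at every stage.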
Courtesy: Doug Carmean, Hot-Chips-13 presentation
Why the Interest in Clocked Storage Elements? • Higher impact of storage element delay • High speed requires low CSE pipeline overhead: 3 FO4 or less • Logic embedding property • Limits on performance • FF delays of 10 ps - 100 ps • Higher impact of clock skew • Ability to control both edges of the clock • Higher power consumption • >100 W for recent processors • Clock system burns up to 40%, storage elements up to 20% of total power • Battery-powered applications
Basic Definitions
Clock Signals • Clocks are defined as pulsed, synchronizing signals that provide the time reference for the movement of data in a synchronous digital system. • The clocking in a digital system can be either single-phase or multi-phase (usually two-phase). • The clocking strategy depends on, and is largely influenced by, the choice of the CSE: latch or flip-flop
Clock Signal Uncertainty • Effects on cycle time: – maximum delay restriction – violation of setup time • May cause race: – minimum delay restriction – violation of hold time • Uncertainty is: Jitter, Skew, and Duty Cycle
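The minimum delay restriction (race condition) can be sketched in Python using the common textbook form of the hold constraint for an edge-triggered path: the earliest new data must not arrive before the hold time, widened by clock uncertainty, has elapsed. This form, the function names, and the numbers are assumptions for illustration and are not taken from the slides.

```python
def hold_slack_ps(clk_q_min: float, logic_min: float,
                  hold: float, skew: float) -> float:
    """Hold slack in ps: fastest data arrival minus (hold time + skew).

    Assumed textbook constraint: clk_q_min + logic_min >= hold + skew.
    """
    return clk_q_min + logic_min - hold - skew


def has_race(clk_q_min: float, logic_min: float,
             hold: float, skew: float) -> bool:
    """True if the path violates the minimum delay restriction."""
    return hold_slack_ps(clk_q_min, logic_min, hold, skew) < 0


# Hypothetical numbers, all in picoseconds.
print(hold_slack_ps(clk_q_min=50, logic_min=20, hold=30, skew=25))
print(has_race(clk_q_min=50, logic_min=0, hold=30, skew=25))
```

Note that the clock period does not appear in this check, so a race caused by a hold-time violation cannot be removed by slowing the clock.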
Jitter • Uncertainty in consecutive edges of a periodic signal • Caused by temporal noise events • Quantified as: – cycle-to-cycle or short-term jitter, tJS – long-term jitter, tJL
Clock Skew • Time difference between temporally-equivalent or concurrent edges of two periodic signals • Caused by spatial noise events
Clocking Strategies • Single-phase clocking and single latch machine • Edge-triggered clocking and Flip-Flop based machine
Clocking Strategies • Two-phase clocking and two-phase latch machine with single latch • Two-phase clocking and two-phase latch machine with double latch
Delay Restrictions • Clock defines hard boundaries for edge-triggered design • Clock boundaries are soft for level-sensitive clocking, and they are: – Tolerant of clock edge uncertainty – Tolerant of uncertainty in data arrival • Timing slack can voluntarily be passed forward • Time can forcefully be borrowed *Taken from Hamid Partovi’s ISSCC-2000 GHz Processor Design Workshop presentation
Single-Phase Clocking, Single Latch: Timing Constraints
Two-Phase Clocking with Two-Phase Double Latch
Two-Phase Clocking with One-Phase Double Latch: some people erroneously refer to this clocking arrangement as a “negative edge Flip-Flop”.
Difference between Latch and Flip-Flop
Difference between Latch and Flip-Flop • Flip-Flop: after the transition of the clock, data cannot change • Latch: is “transparent”
Flip-Flop and M-S Latch Arrangement: how can one recognize the difference without knowing what is inside the “black-box”?
F-F and M-S Latch: Difference (Experiment)
F-F and M-S Latch: Structural Difference (figure panels: Flip-Flop, M-S Latch; figure label: No Clock)
Flip-Flop vs. Latch • Flip-Flop: – Edge sensitive – Easier to use as frequency increases – Robustness to duty cycle – Simpler logic timing requirements – Fits into CAD tools • Latch: – Level sensitive – May consume less power for the operation – Better clock skew/jitter characteristics – More difficult clock requirements
Flip-Flop: Example HLFF (Partovi)
Pulse-Based Flip-Flops* *Taken from Hamid Partovi’s ISSCC-2000 GHz Processor Design Workshop presentation
Flip-Flop: Example SAFF, DEC Alpha 21264 (figure labels: D=0, pulse, D=1)
Requirements in the Flip-Flop Design • Small Clk-Output delay, narrow sampling window • Low power • Small clock load • High driving capability (increased levels of parallelism): – Typical load ranges from 3-4 FO4 to 15-25 FO4 – High drive should be achieved by inserting inverters and following “logical effort” rules, starting with a minimal-size CSE • Symmetry: balanced D-Q and D-Q-not delays • Integration of logic into the flop • Multiplexed or clock scan • Cross-talk insensitivity: dynamic/high-impedance nodes are affected
Timing and Power metrics
Delay • The sum of setup time U and Clk-Q delay is the only true measure of performance with respect to system speed • $T = T_{\text{Clk-Q}} + T_{\text{Logic}} + T_{\text{Setup}} + T_{\text{Skew}}$
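The cycle-time relation on this slide can be turned around to ask how much of the cycle is left for logic once the CSE overhead and skew are paid. The Python sketch below does that; the function names and the FO4 values are hypothetical, chosen only to fall inside the ranges quoted earlier in the tutorial (cycle time of 10 - 20 FO4, F-F overhead of 3 - 4 FO4).

```python
def min_cycle_time(clk_q: float, logic: float,
                   setup: float, skew: float) -> float:
    """T = T_Clk-Q + T_Logic + T_Setup + T_Skew (all in the same unit)."""
    return clk_q + logic + setup + skew


def logic_budget(cycle: float, clk_q: float,
                 setup: float, skew: float) -> float:
    """Delay left for logic after Clk-Q, setup, and skew are subtracted."""
    return cycle - (clk_q + setup + skew)


# Hypothetical values in FO4 delays.
CYCLE_FO4 = 15
CLK_Q_FO4 = 2
SETUP_FO4 = 1
SKEW_FO4 = 1

budget = logic_budget(CYCLE_FO4, CLK_Q_FO4, SETUP_FO4, SKEW_FO4)
print(f"Logic budget: {budget} FO4 of a {CYCLE_FO4} FO4 cycle")
print(f"Check: T = {min_cycle_time(CLK_Q_FO4, budget, SETUP_FO4, SKEW_FO4)} FO4")
```

With these assumed numbers, 4 of the 15 FO4 in the cycle are unavailable to logic, which is why the sum of setup time and Clk-Q delay is treated as the performance figure for the storage element.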
Delay vs. Setup/Hold Times
Timing Characteristics
Timing parameters, details: the best point to pick on the delay curve is the minimum D-Q delay
Simulation Condition and Testbench • Power: – Data activity dependence as a FF characteristic – Consumption with 50% (30%) activity adopted as a figure of merit – Dissipation of driving inverters is part of total power consumption
Simulation Condition and Testbench • Timing: – Total FF overhead is setup + clock-to-output time – Circuit optimization targets the D-Q delay, tD-Q – Clock skew robustness obtained from observing the D-Q curve • Power-Delay Product: – Overall performance parameter at fixed frequency
Flip-Flop Performance Comparison Test Bench • Total power consumed: – internal power – data power – clock power • Measured for four cases: – no activity (0000… and 1111…) – maximum activity (0101010…) – average activity (random sequence) • Delay is the minimum D-Q: – Clk-Q + setup time
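Taking the delay as the minimum D-Q can be sketched as a search over a data-to-clock sweep: as data arrives closer to the clock edge the Clk-Q delay grows, and the D-Q delay (data-to-clock offset plus Clk-Q) passes through a minimum. The Python sketch below picks that point. The sweep values are invented for illustration and do not come from the tutorial's simulations; the function name is hypothetical.

```python
from typing import List, Tuple


def min_d_q_point(sweep: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Return the (data_to_clk, clk_to_q) pair with the smallest D-Q delay.

    D-Q delay is taken as data_to_clk + clk_to_q, both in ps.
    """
    return min(sweep, key=lambda point: point[0] + point[1])


# Hypothetical sweep: (data-to-clock offset in ps, measured Clk-Q in ps).
SWEEP = [
    (100, 120),
    (60, 122),
    (40, 128),
    (30, 140),
    (20, 160),
    (10, 220),
]

setup_ps, clk_q_ps = min_d_q_point(SWEEP)
d_q_ps = setup_ps + clk_q_ps
print(f"Minimum D-Q = {d_q_ps} ps at setup {setup_ps} ps, Clk-Q {clk_q_ps} ps")
```

In this invented data set the minimum lies where Clk-Q has already started to degrade, which is the trade the minimum D-Q criterion accepts in exchange for a shorter setup time.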
The sources of internal power consumption
Design & optimization tradeoffs • Opposite goals: – Minimal total power consumption – Minimal delay • Power-Delay tradeoff: – Minimize Power-Delay product (PDPtot)
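A minimal Python sketch of ranking candidate designs by power-delay product follows. PDPtot is computed here as total power times D-Q delay; the design names and all numbers are hypothetical placeholders, not results from the tutorial.

```python
def pdp_fj(power_uw: float, delay_ps: float) -> float:
    """Power-delay product in femtojoules (uW x ps = 1e-18 J = 0.001 fJ)."""
    return power_uw * delay_ps / 1000.0


# Hypothetical designs: name -> (total power in uW, D-Q delay in ps).
DESIGNS = {
    "design_a": (100, 200),
    "design_b": (150, 120),
    "design_c": (80, 300),
}

pdp_by_design = {name: pdp_fj(p, d) for name, (p, d) in DESIGNS.items()}
best = min(pdp_by_design, key=pdp_by_design.get)

for name, pdp in sorted(pdp_by_design.items(), key=lambda item: item[1]):
    print(f"{name}: {pdp:.1f} fJ")
print(f"Lowest PDPtot: {best}")
```

In this made-up example the design with the highest power has the lowest PDPtot because its delay advantage outweighs its power cost, which is the kind of tradeoff a single product metric is meant to expose.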
Clocked Storage Elements in High-Performance Microprocessors
Master-Slave Latches • Positive setup times • Two clock phases: – distributed globally – generated locally • Small penalty in delay for incorporating MUX • Some circuit tricks needed to reduce the overall delay
PowerPC 603 M-S Latch Combination (Gerosa, JSSC 12/94) • Used in PowerPC family • Low-power • High speed • Big clock load • Easily embedded scan function. Our simulations show: • Small internal power consumption • Low-power feedback • Double the clock load compared with other latches • Locally generated second phase (reduces overall clock load)
mC2MOS M-S Latch. Our simulations show: • Small clock load (local clock buffering) • Low-power feedback • Big positive setup time • Robustness to clock slope, unlike classic C2MOS structure. Reference: Y. Suzuki, “Clocked CMOS Calculator Circuitry”, IEEE J. Solid-State Circuits, Dec. 1973
Advanced Flip-Flops