CSE241 VLSI Digital Circuits, Winter 2003
Lecture 07: Timing II
Delay Calculation
[Figure: library delay lookup tables (Cell Rise, Cell Fall, Fall Transition) evaluated at an input transition of 0.1/0.12 ns and an output load of 1.0 pF]
• Fall delay = 0.178 ns
• Rise delay = 0.261 ns
• Fall transition = 0.147 ns
• Rise transition = …
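A minimal sketch of how a delay value is read out of a library lookup table like the one pictured, using bilinear interpolation over input transition and output load. The table values and axis points below are made up for illustration, not the numbers behind this slide.

```python
import bisect

def table_lookup(slews, loads, values, slew, load):
    """values[i][j] is the delay at slews[i], loads[j]; bilinearly interpolate at (slew, load)."""
    i = max(1, min(bisect.bisect_left(slews, slew), len(slews) - 1))
    j = max(1, min(bisect.bisect_left(loads, load), len(loads) - 1))
    fs = (slew - slews[i - 1]) / (slews[i] - slews[i - 1])
    fl = (load - loads[j - 1]) / (loads[j] - loads[j - 1])
    return ((1 - fs) * (1 - fl) * values[i - 1][j - 1] + (1 - fs) * fl * values[i - 1][j]
            + fs * (1 - fl) * values[i][j - 1] + fs * fl * values[i][j])

cell_fall = [[0.10, 0.15, 0.21],   # rows: input transition 0.05, 0.10, 0.20 ns (illustrative)
             [0.12, 0.18, 0.25],   # cols: output load 0.5, 1.0, 2.0 pF (illustrative)
             [0.16, 0.23, 0.31]]
print(table_lookup([0.05, 0.10, 0.20], [0.5, 1.0, 2.0], cell_fall, 0.10, 1.0))  # -> 0.18 ns
```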
PVT (Process, Voltage, Temperature) Derating Actual cell delay = Original delay x KPVT
PVT Derating: Example (min : typ : max triples)
• Proc_var (0.5 : 1.0 : 1.3) → KP = 0.80 : 1.00 : 1.30
• Voltage (5.5 : 5.0 : 4.5) → KV = 0.93 : 1.00 : 1.08
• Temperature (0 : 20 : 50) → KT = 0.80 : 1.07 : 1.35
• KPVT = KP × KV × KT = 0.60 : 1.07 : 1.90
• Cell delay = 0.261 ns → derated delay = 0.157 : 0.279 : 0.496 ns {min : typical : max}
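A quick numeric check of the derating example above, multiplying the per-factor min/typ/max triples elementwise and scaling the 0.261 ns cell delay:

```python
# Values taken from the slide's example.
KP = (0.80, 1.00, 1.30)   # process
KV = (0.93, 1.00, 1.08)   # voltage
KT = (0.80, 1.07, 1.35)   # temperature

KPVT = tuple(p * v * t for p, v, t in zip(KP, KV, KT))
cell_delay = 0.261        # ns, rise delay from the lookup-table example
derated = tuple(round(cell_delay * k, 3) for k in KPVT)

print(tuple(round(k, 2) for k in KPVT))   # ~(0.60, 1.07, 1.90)
print(derated)                            # ~(0.155, 0.279, 0.495) ns; the slide's
                                          # 0.157 : 0.279 : 0.496 uses the rounded K factors
```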
Conservatism of Gate Delay Modeling
• True gate delay depends on input arrival time patterns
• STA will assume that only 1 input is switching
• Will use worst slope among several inputs
[Figure: output waveforms of a gate F for a single switching input A vs. both inputs A and B switching, showing different tpd]
This Class + Logistics
• Reading
• Smith, Chapters 15, 16
• http://vlsicad.ucsd.edu/Presentations/ICCAD00TUTORIAL/
• Possibly: Sarrafzadeh/Wong Chapters 2 (placement), 3 (routing), (4: performance modeling)
• Schedule
• MT will be take-home (and easy), BUT you lose 5% if you don't show up on Thursday (attendance will be taken by Ben)
• Thursday: Surprise guest lecturer on floorplan / placement
• HW #12: Suppose that you want to work on timing edges that are most critical according to some F(slack of the edge, #paths through the edge). How would you modify the STA calculation (longest path in a DAG) so that it also calculates the number of paths through each edge?
Slide courtesy of S. P. Levitan, U. Pittsburgh
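One possible sketch in the spirit of HW #12 (not necessarily the intended solution, and the helper names are made up): the number of paths through an edge (u, v) is (#paths from any source to u) times (#paths from v to any sink), and both counts propagate in the same topological-order pass that STA already uses for its longest-path calculation.

```python
from collections import defaultdict

def edge_path_counts(nodes_topo, edges):
    """nodes_topo: nodes in topological order; edges: list of (u, v) timing edges."""
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)

    fwd = {n: 1 if not pred[n] else 0 for n in nodes_topo}   # paths from sources to n
    for n in nodes_topo:
        for m in succ[n]:
            fwd[m] += fwd[n]

    bwd = {n: 1 if not succ[n] else 0 for n in nodes_topo}   # paths from n to sinks
    for n in reversed(nodes_topo):
        for m in pred[n]:
            bwd[m] += bwd[n]

    return {(u, v): fwd[u] * bwd[v] for u, v in edges}

# Tiny reconvergent example: two source-to-sink paths, one through each branch.
print(edge_path_counts(["a", "b", "c", "d"],
                       [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]))
```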
Buffer Clustering • Hierarchical clustering connecting clock source (= root) to clock sinks (= leaves) of clustering tree • Fanout at each level between 5 and 200 (depends on buffer library) • Often specify a clock topology in the tool as, e.g., (1)-6-8-5: root has 6 children, each of which has 8 children, each of which has 5 (leaf) children → 240 clock sinks • Big question: how to perform the hierarchical buffer clustering? • What makes a “good” cluster? Sylvester / Shepard, 2001
Buffer Clustering by Space Partitioning • Example: Cadence CT-Gen • Pick fanout (e.g., 6-4) • Pick “long axis” of bounding box of sinks • Place buffers at medians (essentially) of chunks of sinks identified by space-partitioning • Why is this good? • Uses (or assumes) min wire; easily routed (Steiner routing); robust to ECOs; … • Why is it bad? • Oversizes drivers; commits to skew which could be avoided Sylvester / Shepard, 2001
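A rough sketch of the space-partitioning idea described above (recursive median splits along the long axis of the sink bounding box). The details are assumed for illustration, not taken from CT-Gen.

```python
def partition_level(sinks, fanout):
    """Split sinks into `fanout` chunks along the long axis; return (buffer_xy, chunk) pairs."""
    xs = [p[0] for p in sinks]
    ys = [p[1] for p in sinks]
    axis = 0 if (max(xs) - min(xs)) >= (max(ys) - min(ys)) else 1   # long axis of bounding box
    ordered = sorted(sinks, key=lambda p: p[axis])
    size = -(-len(ordered) // fanout)          # ceiling division -> chunk size
    clusters = []
    for i in range(0, len(ordered), size):
        chunk = ordered[i:i + size]
        buf = chunk[len(chunk) // 2]           # median sink as an approximate buffer location
        clusters.append((buf, chunk))
    return clusters

def build_tree(sinks, fanouts):
    """Apply one fanout per level, e.g. fanouts = [6, 4]; return nested (buffer, children) tuples."""
    if not fanouts:
        return list(sinks)                     # leaves: the sinks themselves
    return [(buf, build_tree(chunk, fanouts[1:]))
            for buf, chunk in partition_level(sinks, fanouts[0])]

# Example: 24 sinks on a grid, fanout spec 6-4.
sinks = [(x, y) for x in range(6) for y in range(4)]
tree = build_tree(sinks, [6, 4])
```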
Buffer Clustering by Traditional Clustering • Example: SPC, old Cell3 CTS • Pick fanout (e.g., 6) • Find clusters of size 6 • Place buffers at centers or centroids or … of clusters • Recurse • Why is this good? • Can get near-zero skew trees? • Why is this bad? • ECOs; hard to route; more wire(?); difficult algorithms! • HW #13: Propose a hierarchical clustering strategy for buffered clock trees, and explain its pros and cons Sylvester / Shepard, 2001
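For contrast, a minimal sketch of the "traditional clustering" flavor on this slide: greedily group the k nearest sinks, place a buffer at each cluster centroid, and recurse on the buffer locations. This is illustrative only; real CTS algorithms are considerably more sophisticated, and the function names are made up.

```python
def cluster_level(points, k):
    """Group points into clusters of up to k (greedy nearest-neighbor); return (centroid, cluster) pairs."""
    remaining = list(points)
    clusters = []
    while remaining:
        seed = remaining.pop(0)
        remaining.sort(key=lambda p: (p[0] - seed[0]) ** 2 + (p[1] - seed[1]) ** 2)
        group = [seed] + remaining[:k - 1]
        remaining = remaining[k - 1:]
        cx = sum(p[0] for p in group) / len(group)
        cy = sum(p[1] for p in group) / len(group)
        clusters.append(((cx, cy), group))
    return clusters

def build_buffer_levels(points, k):
    """Cluster bottom-up until one cluster of buffers remains; return the per-level clusterings."""
    levels = []
    while len(points) > k:
        level = cluster_level(points, k)
        levels.append(level)
        points = [centroid for centroid, _ in level]
    levels.append(cluster_level(points, k))   # top level, driven by the clock root
    return levels

levels = build_buffer_levels([(x, y) for x in range(5) for y in range(5)], 5)
```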
Outline • Clocking • Storage elements • Clocking metrics and methodology • Clock distribution • Package and useful-skew degrees of freedom • Clock power issues • Gate timing models
Skew Reduction Using Package • Most clock network latency occurs at global level (largest distances spanned) • Latency → skew • With reverse scaling, routing low-RC signals at global level becomes more difficult & area-consuming Sylvester / Shepard, 2001
Skew Reduction Using Package
• Incorporate global clock distribution into the package
• Flip-chip packaging allows for high density, low parasitic access from substrate to IC
• RC of package-level wiring up to 4 orders of magnitude smaller than on-chip wiring
• Global skew reduced
• Lower capacitance → lower power
• Opens up global routing tracks
• Results not yet conclusive
[Figure: µP/ASIC die with solder bumps on a substrate carrying the system clock]
Sylvester / Shepard, 2001
Useful Skew (= cycle-stealing)
• Zero skew: global skew constraint; all skew is bad
• Useful skew: local skew constraints; shift slack to critical paths
[Figure: FF pipeline with alternating slow and fast stages; timing-slack bars compare setup/hold margins under zero skew vs. useful skew]
W. Dai, UC Santa Cruz
Skew = Local Constraint
• D: longest path delay, d: shortest path delay between a launching and a capturing FF
• -d + thold < Skew < Tperiod - D - tsetup
• Below the lower bound: race condition; within the permissible range: safe; above the upper bound: cycle time violation
• Timing is correct as long as the signal arrives in the permissible skew range
W. Dai, UC Santa Cruz
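A minimal sketch of the permissible-skew check above, using illustrative numbers that are not from the slides:

```python
def permissible_skew_range(D, d, t_setup, t_hold, T_period):
    """Return (lower, upper) bounds on capture-vs-launch clock skew: -d + t_hold < skew < T - D - t_setup."""
    return (-d + t_hold, T_period - D - t_setup)

# Illustrative values only (ns).
lo, hi = permissible_skew_range(D=4.0, d=1.0, t_setup=0.2, t_hold=0.1, T_period=6.0)
skew = 0.5
print(lo < skew < hi)   # True -> safe; below lo -> race condition; above hi -> cycle-time violation
```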
Skew Scheduling for Design Robustness
• Design will be more robust if the clock signal arrival time is in the middle of the permissible skew range, rather than on the edge
• Can solve a linear program to maximize robustness = determine prescribed sink skews
• Example (T = 6 ns): three FFs in series with 6 ns and 2 ns combinational paths between them; the schedule "0 0 0" is at the verge of violation, while "2 0 2" gives more safety margin
[Figure: the two schedules annotated on the FF chain with their permissible skew ranges]
W. Dai, UC Santa Cruz
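A minimal LP sketch of the "maximize robustness" idea, assuming SciPy's linprog is available: maximize a common margin m so that every stage's skew sits at least m inside its permissible range. The stage bounds below are illustrative, not the tool's actual formulation.

```python
from scipy.optimize import linprog

# Stages as (launch FF i, capture FF j, lo, hi) permissible-skew bounds; example data only.
stages = [(0, 1, -2.0, 0.0),
          (1, 2, -2.0, 4.0)]
n_ffs = 3

# Variables: clock arrival times t_0..t_2 and margin m; maximize m == minimize -m.
n_vars = n_ffs + 1
c = [0.0] * n_ffs + [-1.0]

A_ub, b_ub = [], []
for i, j, lo, hi in stages:
    # lo + m <= t_j - t_i   <=>   t_i - t_j + m <= -lo
    row = [0.0] * n_vars; row[i], row[j], row[-1] = 1.0, -1.0, 1.0
    A_ub.append(row); b_ub.append(-lo)
    # t_j - t_i <= hi - m   <=>   t_j - t_i + m <= hi
    row = [0.0] * n_vars; row[i], row[j], row[-1] = -1.0, 1.0, 1.0
    A_ub.append(row); b_ub.append(hi)

bounds = [(0.0, 0.0)] + [(None, None)] * (n_ffs - 1) + [(0.0, None)]  # pin t_0 = 0; m >= 0
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print("arrival times:", res.x[:n_ffs], "margin:", res.x[-1])
```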
Potential Advantages of Useful Skew
• Reduce peak current consumption by distributing the FF switch points within the range of permissible skew
• Affords extra margin to increase clock frequency or reduce sizing (= power)
[Figure: clock waveforms comparing simultaneous FF switching under 0-skew vs. staggered switching under U-skew]
W. Dai, UC Santa Cruz
Conventional Zero-Skew Flow
Synthesis → Placement → 0-Skew Clock Synthesis → Clock Routing → Signal Routing → Extraction & Delay Calculation → Static Timing Analysis
W. Dai, UC Santa Cruz
Useful-Skew Flow
Existing Placement → U-Skew Clock Synthesis → Clock Routing → Signal Routing → Extraction & Delay Calculation → Static Timing Analysis
U-Skew Clock Synthesis steps: Permissible range generation → Initial skew scheduling → Clock tree topology synthesis → Clock net routing → Clock timing verification
W. Dai, UC Santa Cruz
Outline • Clocking • Storage elements • Clocking metrics and methodology • Clock distribution • Package and useful-skew degrees of freedom • Clock power issues • Gate timing models
Clock Power • Power consumption in clocks due to: • Clock drivers • Long interconnections • Large clock loads – all clocked elements (latches, FF’s) are driven • Different components dominate • Depending on type of clock network used • Ex. Grid – huge pre-drivers & wire cap. drown out load cap. Sylvester / Shepard, 2001
Clock Power Is LARGE
• P = α · C · Vdd² · f
• Not only is the clock capacitance large, it switches every cycle!
Sylvester / Shepard, 2001
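A quick numeric check of P = α · C · Vdd² · f with illustrative values that are assumed, not from the slides:

```python
alpha = 1.0          # clock net switches every cycle
C = 1e-9             # total clock capacitance [F] (assumed ~1 nF for the whole network)
Vdd = 1.2            # supply voltage [V] (assumed)
f = 1e9              # clock frequency [Hz]
P = alpha * C * Vdd ** 2 * f
print(f"{P:.2f} W")  # ~1.44 W spent on the clock alone under these assumptions
```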
Low-Power Clocking • Gated clocks • Prevent switching in areas of chip not being used • Easier in static designs • Edge-triggered flops in ARM rather than transparent latches in Alpha • Reduced load on clock for each latch/flop • Eliminated spurious power-consuming transitions during latch flow-through (transparency) Sylvester / Shepard, 2001
Clock Area • Clock networks consume silicon area (clock drivers, PLL, etc.) and routing area • Routing area is most vital • Top-level metals are used to reduce RC delays • These levels are precious resources (unscaled) • Power routing, clock routing, key global signals • Reducing area also reduces wiring capacitance and power • Typical #’s: Intel Itanium – 4% of M4/5 used in clock routing Sylvester / Shepard, 2001
Clock Slew Rates • To maintain signal integrity and latch performance, minimum slew rates are required • Too slow – clock is more susceptible to noise, latches are slowed down, setup times eat into timing budget [Tsetup = 200 + 0.33 * Tslew (ps)], more short-circuit power for large clock drivers • Too fast – burns too much power, overdesigned network, enhanced ground bounce • Rule-of-thumb: Trise and Tfall of clock are each between 10-20% of clock period (10% - aggressive target) • 1 GHz clock; Trise = Tfall = 100-200ps Sylvester / Shepard, 2001
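A small worked check of the slew budgeting rules quoted on this slide (the setup-time formula and the 10-20% rule), with units in ps:

```python
def setup_time_ps(t_slew_ps):
    """Setup time as a function of clock slew, per the rule of thumb above: 200 + 0.33 * Tslew."""
    return 200 + 0.33 * t_slew_ps

T_clk = 1000.0                                  # 1 GHz clock -> 1000 ps period
t_slew_target = (0.10 * T_clk, 0.20 * T_clk)    # 10-20% of the period -> 100-200 ps
print(t_slew_target)                            # (100.0, 200.0)
print(setup_time_ps(150.0))                     # e.g. a 150 ps slew costs ~249.5 ps of setup time
```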
Example: Alpha 21264
• Grid + H-tree approach
• Power = 32% of total
• Wire usage = 3% of metals 3 & 4
• 4 major clock quadrants, each with a large driver connected to local grid structures
Sylvester / Shepard, 2001
Alpha 21264 Skew Map Sylvester / Shepard, 2001 Ref: Compaq, ASP-DAC00
Power vs. Skew • Fundamental design decision • Meeting skew requirements is easy with unlimited power budget • Wide wires reduce RC product but increase total C • Driver upsizing reduces latency (→ reduces skew as well) but increases buffer cap • SOC context: plastic package power limit is 2-3 W Sylvester / Shepard, 2001
Clock Distribution Trends • Timing • Clock period dropping fast, skew must follow • Slew rates must also scale with cycle time • Jitter: PLLs get better with CMOS scaling but other sources of noise increase • Power supply noise more important • Switching-dependent temperature gradients • Materials • Cu reduces RC slew degradation, potential skew • Low-k decreases power, improves latency, skew, slews • Power • Complexity, dynamic logic, pipelining → more clock sinks • Larger chips → bigger clock networks Sylvester / Shepard, 2001