ε-Optimal Minimum-Delay/Area Zero-Skew Clock Tree Wire-Sizing in Pseudo-Polynomial Time • Jeng-Liang Tsai, Tsung-Hao Chen, Charlie Chung-Ping Chen (National Taiwan University) • University of Wisconsin-Madison • http://vlsi.ece.wisc.edu
Outline • Background • Motivation and contribution • Literature overview • ClockTune algorithm • Problem formulation • ClockTune algorithm overview • Optimality and complexity analysis • Experimental results • Runtime, memory usage, and optimality • Power/Delay trade-off • Incremental refinement
Motivation • Clock skew imposes a cycle-time penalty • Start with a zero-skew clock tree • Minimizing clock delay reduces system-level skew (Kuh, et al. [DAC ‘90]) • Clock tree is power-hungry (30% in Intel McKinley, 0.18 µm/1 GHz/130 W) • P = f C V^2 • Minimize switching capacitance (wiring area) • Stability affects design convergence • Allow incremental refinement to accommodate local changes • Interconnect delay dominates total delay • Wire-sizing is effective in reducing interconnect delay
Motivation • Zero-skew constraints are non-convex • No known algorithm solves the zero-skew wire-sizing problem optimally in polynomial runtime • Hence, a good clock tree wire-sizing algorithm should • Minimize delay and power • Guarantee optimality and runtime • Have good stability
Contribution • First ε-optimal algorithm for the clock min-delay/power zero-skew wire-sizing optimization problem • Provides the complete (sampled) solution set of delay/power/area trade-offs for design planning • Efficient pseudo-polynomial runtime (6170-branch clock tree in 6 minutes, within 1% of optimal) • Runtime vs. optimality trade-off • Incremental clock re-balancing to speed up design convergence
Literature Overview • “Reliable non-zero skew clock tree using wire width optimization”, Pillage, et al. [DAC ’93] • Iteratively optimizes skew and delay using adjoint sensitivity analysis • Aimed at reliable clock trees under process variation • Deferred-Merge Embedding (DME) algorithm, Kahng, et al. [TCAD ’92] • Bottom-up merging segment construction, top-down embedding • Integrated Deferred-Merge Embedding (IDME) algorithm, Wong, et al. [ISPD ’00] • Handles simultaneous routing, buffer insertion, and wire-sizing • Merging segment set (MSS): a set of line samples of a merging region • No optimality guarantee • The size of the MSS grows exponentially • “Process variation aware clock tree routing”, Lu, et al. [ISPD ’03] • Based on DME/BST
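Problem formulation • min-ZSWS (Zero Skew Wire Sizing) problem • Given a clock routing, choose the wire widths w to minimize clock delay and wiring area, subject to the zero-skew constraints Delay(P_i) = Delay(P_j) for all leaf nodes i and j, where P_i, P_j are the paths from v to leaf nodes i and j • Zero-skew constraints are non-convex constraints • No known algorithm solves the problem optimally in polynomial runtime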
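The zero-skew constraint can be made concrete with a small check. The sketch below is illustrative only and assumes an Elmore delay model with each wire treated as a lumped resistance driving half of its own capacitance plus everything downstream; the slides do not state the delay model. The `Node` class and helper names are hypothetical, and the units of r0 and c0 are assumed to be ohm/square and F/µm^2.

```python
# Assumed Elmore delay model; R0 and C0 are the values quoted in the
# experimental setup, with units assumed to be ohm/square and F/um^2.
R0 = 0.03
C0 = 2e-16

class Node:
    """Hypothetical tree node: the wire from the parent ends at this node."""
    def __init__(self, length=0.0, width=1.0, load=0.0, children=()):
        self.length = length          # wire length in um
        self.width = width            # wire width in um
        self.load = load              # sink capacitance in F (leaves)
        self.children = list(children)

def wire_res(n):
    return R0 * n.length / n.width

def wire_cap(n):
    return C0 * n.length * n.width

def subtree_cap(n):
    """Total capacitance at and below node n (excluding the wire into n)."""
    return n.load + sum(wire_cap(c) + subtree_cap(c) for c in n.children)

def leaf_delays(n, t=0.0):
    """Elmore delay from the root to every leaf below n."""
    if not n.children:
        return [t]
    delays = []
    for c in n.children:
        t_c = t + wire_res(c) * (wire_cap(c) / 2 + subtree_cap(c))
        delays.extend(leaf_delays(c, t_c))
    return delays

def skew(root):
    d = leaf_delays(root)
    return max(d) - min(d)

LOAD = 5e-14  # illustrative 50 fF sink load
balanced = Node(children=[Node(1000, 1.0, LOAD), Node(1000, 1.0, LOAD)])
unbalanced = Node(children=[Node(1000, 1.0, LOAD), Node(1100, 1.0, LOAD)])

# Re-size the longer branch so that its delay matches the shorter one.
target = leaf_delays(balanced)[0]
w_long = R0 * 1100 * LOAD / (target - R0 * C0 * 1100 ** 2 / 2)
resized = Node(children=[Node(1000, 1.0, LOAD), Node(1100, w_long, LOAD)])

print(skew(balanced), skew(unbalanced), skew(resized))
```

Under this assumed model, widening the longer branch trades extra wiring capacitance for lower resistance, which is the degree of freedom the wire-sizing problem optimizes over.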
DC region approach • Clock delay and wiring capacitance are the top concerns • Define f : R^N → R^2 such that • f_Y(w) = Delay(T_v(w)), f_X(w) = Capacitance(T_v(w)) • DC region of node v: the projection of the feasible region onto R^2 • Choose a delay-capacitance (d-c) pair from the DC region • [Figure: feasible region and its projection, the DC region]
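As a rough illustration of the projection, the sketch below samples the DC region of a node with two leaf branches. It assumes an Elmore delay model (not stated on the slide), takes r0, c0, wm and wM from the experimental setup with units assumed to be ohm/square, F/µm^2 and µm, and uses hypothetical function names and branch dimensions. For each sampled width of the first branch it solves for the width of the second branch that gives zero skew, and keeps the resulting (capacitance, delay) pair only if that width is within bounds.

```python
R0, C0 = 0.03, 2e-16          # assumed units: ohm/square, F/um^2
W_MIN, W_MAX = 1.0, 4.0       # wire width bounds in um

def branch_delay(length, width, load):
    """Elmore delay of one wire driving a lumped load (assumed model)."""
    r = R0 * length / width
    c = C0 * length * width
    return r * (c / 2 + load)

def matching_width(length, load, target):
    """Width that makes branch_delay equal target, or None if infeasible."""
    intrinsic = R0 * C0 * length ** 2 / 2      # width-independent part
    if target <= intrinsic:
        return None
    w = R0 * length * load / (target - intrinsic)
    return w if W_MIN <= w <= W_MAX else None

def dc_samples(l1, l2, load1, load2, steps=13):
    """Sampled (capacitance, delay, w1, w2) points with zero skew."""
    points = []
    for k in range(steps):
        w1 = W_MIN + (W_MAX - W_MIN) * k / (steps - 1)
        delay = branch_delay(l1, w1, load1)
        w2 = matching_width(l2, load2, delay)
        if w2 is None:
            continue                            # outside the feasible region
        cap = C0 * (l1 * w1 + l2 * w2) + load1 + load2
        points.append((cap, delay, w1, w2))
    return points

# Illustrative branches: 1000 um and 1100 um, 50 fF sinks.
for cap, delay, w1, w2 in dc_samples(1000, 1100, 5e-14, 5e-14):
    print(f"C = {cap:.3e} F, D = {delay:.3e} s, w1 = {w1:.2f}, w2 = {w2:.2f}")
```

Only some width samples survive, because the zero-skew constraint and the width bounds together cut the feasible region down; the surviving points show lower delay being bought with higher capacitance.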
ClockTune algorithm overview • Phase 1: construct the DC regions bottom-up for every node • Phase 2: embed top-down after a delay/power trade-off is chosen
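The two phases can be sketched as a small dynamic program. This is a simplified illustration under stated assumptions, not the authors' implementation: it assumes an Elmore delay model, samples p delays per node (p >= 2), solves each wire width in closed form instead of sampling widths, and keeps only the lowest-capacitance option per delay sample. All class and function names are hypothetical.

```python
R0, C0 = 0.03, 2e-16          # assumed units: ohm/square, F/um^2
W_MIN, W_MAX = 1.0, 4.0       # wire width bounds in um

class Node:
    def __init__(self, length=0.0, load=0.0, children=()):
        self.length = length          # wire from parent to this node, um
        self.load = load              # sink capacitance in F
        self.children = list(children)
        self.table = {}               # delay sample -> (capacitance, child choices)
        self.width = None             # filled in by embed()

def wire_delay(length, width, cap_below):
    return R0 * length / width * (C0 * length * width / 2 + cap_below)

def solve_width(length, cap_below, wanted):
    """Width whose Elmore wire delay equals `wanted`, or None if out of range."""
    slack = wanted - R0 * C0 * length ** 2 / 2
    if slack <= 0:
        return None
    w = R0 * length * cap_below / slack
    return w if W_MIN <= w <= W_MAX else None

def best_option(child, target):
    """Cheapest (capacitance, child delay, width) meeting `target` at the parent."""
    best = None
    for d_c, (cap_c, _) in child.table.items():
        w = solve_width(child.length, cap_c, target - d_c)
        if w is not None:
            cap = cap_c + C0 * child.length * w
            if best is None or cap < best[0]:
                best = (cap, d_c, w)
    return best

def build(node, p):
    """Phase 1: bottom-up construction of sampled delay/capacitance tables."""
    if not node.children:
        node.table = {0.0: (node.load, [])}
        return
    for c in node.children:
        build(c, p)
    if any(not c.table for c in node.children):
        return
    lo = max(min(d + wire_delay(c.length, W_MAX, cap)
                 for d, (cap, _) in c.table.items()) for c in node.children)
    hi = min(max(d + wire_delay(c.length, W_MIN, cap)
                 for d, (cap, _) in c.table.items()) for c in node.children)
    if hi < lo:
        return
    for k in range(p):
        target = lo + (hi - lo) * k / (p - 1)
        options = [best_option(c, target) for c in node.children]
        if all(o is not None for o in options):
            cap = node.load + sum(o[0] for o in options)
            node.table[target] = (cap, options)

def embed(node, delay):
    """Phase 2: top-down width assignment for the chosen delay sample."""
    _, options = node.table[delay]
    for child, (_, d_c, w) in zip(node.children, options):
        child.width = w
        embed(child, d_c)

def subtree_cap(node):
    return node.load + sum(C0 * c.length * c.width + subtree_cap(c)
                           for c in node.children)

def leaf_delays(node, t=0.0):
    if not node.children:
        return [t]
    out = []
    for c in node.children:
        out += leaf_delays(c, t + wire_delay(c.length, c.width, subtree_cap(c)))
    return out

# Illustrative three-sink tree; dimensions and loads are made up.
inner = Node(800, children=[Node(1000, 5e-14), Node(1100, 5e-14)])
root = Node(children=[inner, Node(2000, 5e-14)])
build(root, 16)
chosen = min(root.table)              # pick the minimum-delay sample
embed(root, chosen)
print(chosen, root.table[chosen][0], leaf_delays(root))
```

Picking `min(root.table)` selects the fastest sampled solution; choosing the entry with the smallest stored capacitance instead would select the lowest-power one, which is where the delay/power trade-off enters between the two phases.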
Optimality analysis • Embeddings that do not fall on the delay samples are omitted • Propagated error • Delay sampling error • Wire width sampling error (detailed in the paper)
Optimality analysis • Error is bounded • d: delay sampling resolution • w: wire width sampling resolution • k, …: constants related to l, r0, c0, wm, wM, … • Generally speaking, the error is reduced by about half when the resolution is doubled • [Figure: error vs. resolution]
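The halving trend matches a generic property of uniform sampling, shown below. This is not the bound derived in the paper; it only illustrates that the worst-case distance from an arbitrary value to its nearest sample is about half the sample spacing, so doubling the number of intervals halves that distance. The function name and the probe count are arbitrary choices.

```python
def worst_snap_error(lo, hi, samples, probes=10001):
    """Largest distance from a probed value to the nearest uniform sample."""
    step = (hi - lo) / (samples - 1)
    worst = 0.0
    for i in range(probes):
        x = lo + (hi - lo) * i / (probes - 1)
        nearest = lo + round((x - lo) / step) * step
        worst = max(worst, abs(x - nearest))
    return worst

coarse = worst_snap_error(0.0, 1.0, 129)   # 128 intervals
fine = worst_snap_error(0.0, 1.0, 257)     # 256 intervals
print(coarse, fine, coarse / fine)
```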
Optimality/runtime trade-off • Controlling the sampling resolution trades off optimality against runtime and memory
Complexity analysis • Runtime • Bottom-up phase takes O(n p max(p,q)) • Top-down phase takes O(n p) • Overall: O(n p max(p,q)) • Memory • O(n p) • where n: number of nodes of the clock tree; p: number of delay samples taken at each node; q: number of wire width samples taken at each level-2 node
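A quick calculation shows how these terms scale with the sampling resolution. The functions below return only the asymptotic terms with constant factors dropped, so the numbers are relative growth indicators, not predicted operation counts. Using n = 6170 is an assumption that treats the reported 6170-branch tree as having that many nodes.

```python
def bottom_up_term(n, p, q):
    """Dominant term of the bottom-up phase, O(n p max(p, q))."""
    return n * p * max(p, q)

def top_down_term(n, p):
    """Dominant term of the top-down phase, O(n p)."""
    return n * p

def memory_term(n, p):
    """Dominant memory term, O(n p)."""
    return n * p

n = 6170  # assumed node count for illustration
for p in (128, 256, 512):
    q = p
    print(p, bottom_up_term(n, p, q), top_down_term(n, p), memory_term(n, p))
```

With p = q, doubling the resolution quadruples the bottom-up term and doubles the memory term, while both stay linear in n; this is the cost side of the optimality/runtime trade-off.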
Experimental setup • ClockTune is implemented in C++ and executed on a 533 MHz Pentium III PC with 128 MB of memory • Benchmarks r1 – r5 from Tsay et al. [ICCAD‘91] • Initial routing generated by the BB+DME algorithm with minimum wire width w = 1 µm • ClockTune uses wm = 1 µm, wM = 4 µm • p: number of delay samples taken at every node • q: number of wire width samples taken at every level-2 node • r0 = 0.03 Ω/□, c0 = 2×10^-16 F/µm^2
Runtime and memory usage • Runtime and memory usage are linear in problem size when p and q are fixed • Within 1% of optimal when p = q = 256 (runtime < 6 minutes, memory ~ 64 MB)
Optimality results • Error below 1% with p = q = 256 • Error is reduced by about half when the resolution is doubled
Power/Delay trade-off • [Figure: delay vs. capacitance for r5, with the minimum-power and minimum-delay solutions marked] • Delay: 5~150 ns • Capacitance: 0.2~1.1 nF • 15:1 delay:power trade-off
Incremental refinement • DC region captures the design space • Enables incremental refinement
Conclusion & Future Work • Provides a zero-skew clock tree wire-sizing algorithm which • Minimizes delay and area ε-optimally • Guarantees pseudo-polynomial runtime and memory usage • Provides delay/power trade-off information to designers • Speeds up design convergence by allowing clock tree re-balancing with minimum changes • Future work • Better delay model • Buffer insertion/sizing capability