An Efficient Surface-Based Low-Power Buffer Insertion Algorithm. Rajeev R. Rao, David Blaauw, Dennis Sylvester, Charles Alpert*, Sani Nassif* Department of EECS, University of Michigan, Ann Arbor, MI IBM Austin Research Laboratory, Austin, TX*
An Efficient Surface-Based Low-Power Buffer Insertion Algorithm Rajeev R. Rao, David Blaauw, Dennis Sylvester, Charles Alpert*, Sani Nassif* Department of EECS, University of Michigan, Ann Arbor, MI IBM Austin Research Laboratory, Austin, TX* {rrrao, blaauw, dennis}@eecs.umich.edu, {alpert, nassif}@us.ibm.com*
Interconnect Trends
• Interconnect power is a major issue
• Huge power consumption in both global and local signal nets
• Repeater counts are increasing drastically
• IBM: 50% of leakage is in inverters/buffers
• Assuming current design styles continue, the projections for the 32nm technology node are dramatic
• 70% of cell count = repeaters
• 65-80% of dynamic power is due to interconnects
• Leakage is increasing exponentially
• Required: optimal repeater usage with the objective of total power minimization
[Figure: % repeater cells in block-level nets (clk-rep, rep, tot-rep) at the 90nm, 65nm, 45nm and 32nm nodes. Source: P. Saxena, ISPD’04]
[Figure: Total Dynamic Power Breakdown. Source: N. Magen, SLIP’04]
Outline
• Introduction
• Delay and Buffer models
• Previous Work
• Proposed Algorithm
• Library characterization
• Generation of different types of candidates
• Merging, Propagation, Snapping
• Results
• Conclusion
Introduction
• Wire RC delay is a quadratic function of wire length
• Segmenting wires decreases delay
• The same idea applies to interconnect tree structures
• Buffers are inserted for delay management
• Additional benefit: buffers/inverters decouple large output loads
[Figure: Driver to receiver, unbuffered. Wire length = 2, wire delay = $(2)^2 = 4$]
[Figure: Driver to receiver with one repeater at the midpoint. Wire length = 1 + 1 = 2, wire delay = $(1)^2 + (1)^2 = 2$]
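The arithmetic in the two figures can be reproduced with a minimal sketch. The function names, the normalised constant `rc` and the optional `buffer_delay` term are assumptions made for illustration; the slide itself only gives the unit-less numbers 4 and 2.

```python
def wire_delay(length: float, rc: float = 1.0) -> float:
    """Distributed-RC wire delay, proportional to length squared.

    rc is a normalised (hypothetical) resistance*capacitance per unit length squared.
    """
    return rc * length ** 2


def segmented_delay(length: float, n_segments: int,
                    rc: float = 1.0, buffer_delay: float = 0.0) -> float:
    """Delay of a wire cut into equal segments with a repeater between segments."""
    seg = length / n_segments
    return n_segments * wire_delay(seg, rc) + (n_segments - 1) * buffer_delay


unbuffered = wire_delay(2.0)             # one wire of length 2
one_repeater = segmented_delay(2.0, 2)   # two wires of length 1, ideal repeater
```

With a non-zero `buffer_delay`, segmentation only pays off while the saved wire delay exceeds the added repeater delay, which is why the number and size of repeaters becomes an optimization problem.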
Elmore Delay model
• Represent the interconnect tree with a lumped RC model
• Assume the binary tree topology is fixed by an initial Steiner tree estimation
• n vertices (branch points) and (n-1) edges (i.e., wires)
• For a wire e connecting vertices (u, v), the Elmore delay is: [equation missing from this transcript], where T(v) is the maximal subtree rooted at v that does not contain buffers
• The total delay from a vertex v to a sink node si is: [equation missing from this transcript]
[Figure source: Digital Integrated Circuits, J. Rabaey]
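The two formulas on the Elmore slide are not preserved in this transcript. A common textbook form, assumed here rather than taken from the slides, is $D(e) = R_e\,(C_e/2 + C(T(v)))$ for a wire $e = (u, v)$ with resistance $R_e$ and capacitance $C_e$, with the delay from a vertex to a sink being the sum of $D(e)$ over the wires on the path. The sketch below evaluates that form on a small unbuffered tree; the `Node` class and its field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    sink_cap: float = 0.0   # load capacitance if this node is a sink
    wire_r: float = 0.0     # resistance of the wire from the parent to this node
    wire_c: float = 0.0     # capacitance of that wire
    children: list = field(default_factory=list)


def subtree_cap(v: Node) -> float:
    """Capacitance of the unbuffered subtree T(v), excluding v's own parent wire."""
    return v.sink_cap + sum(ch.wire_c + subtree_cap(ch) for ch in v.children)


def wire_delay(v: Node) -> float:
    """Assumed Elmore delay of the wire entering v: R_e * (C_e / 2 + C(T(v)))."""
    return v.wire_r * (v.wire_c / 2.0 + subtree_cap(v))


def delay_to_sinks(v: Node, acc: float = 0.0, out=None) -> dict:
    """Accumulate wire delays along each path from v down to every sink."""
    out = {} if out is None else out
    if not v.children:
        out[v.name] = acc
    for ch in v.children:
        delay_to_sinks(ch, acc + wire_delay(ch), out)
    return out


s1 = Node("s1", sink_cap=1.0, wire_r=1.0, wire_c=2.0)
s2 = Node("s2", sink_cap=3.0, wire_r=2.0, wire_c=2.0)
a = Node("a", wire_r=1.0, wire_c=2.0, children=[s1, s2])
root = Node("root", children=[a])
delays = delay_to_sinks(root)
```

This version recomputes `subtree_cap` at every node, which is fine for a small example; a dynamic-programming pass would carry the capacitance upward once.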
Buffer model
• Linear gate delay model used for the buffers
• Assumption: delay is a linear function of output capacitance
• $D_{\text{buffer}} = D_{\text{intrinsic-delay}} + R_{\text{intrinsic-resistance}} \cdot C_{\text{output-load}}$
• Isolation property: buffer devices decouple “downstream” output loads from the parent trees
• Assumption: the Miller effect (“bootstrapping”) due to Cgd is negligible
[Figure: Buffer at node v with input capacitance Cbuf, gate-drain capacitance Cgd and output load Cload. Node v “sees” a downstream load = Cbuf; Cload is “invisible” to v.]
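The linear delay model and the isolation property can be sketched together. The `Buffer` class, its field names and the numeric values are illustrative assumptions; only the delay equation and the "upstream sees Cbuf" behaviour come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Buffer:
    d_intrinsic: float   # intrinsic delay
    r_intrinsic: float   # intrinsic (output) resistance
    c_in: float          # input capacitance, Cbuf on the slide

    def delay(self, c_load: float) -> float:
        """Linear gate delay: D_intrinsic + R_intrinsic * C_output_load."""
        return self.d_intrinsic + self.r_intrinsic * c_load


def insert_buffer(buf: Buffer, downstream_cap: float, downstream_delay: float):
    """Return (capacitance seen upstream, total delay) after buffering the load.

    Isolation property: upstream nodes see only buf.c_in, not downstream_cap.
    """
    return buf.c_in, downstream_delay + buf.delay(downstream_cap)


b = Buffer(d_intrinsic=1.0, r_intrinsic=2.0, c_in=0.5)
cap_seen, total_delay = insert_buffer(b, downstream_cap=4.0, downstream_delay=3.0)
```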
Buffer Insertion Problem
[Figure: Routing tree from the source to the sinks with legal buffer positions; buffer library BufLib = {b1, b2, b3, …}]
• Timing metrics
• Required Arrival Time (RAT)
• Each sink is assigned a given RAT(si) value, and the source is fixed at RAT(so) = 0
• Delay minimization corresponds to maximizing the slack at the source, q(so)
• Subtree Delay (SD)
• SD(si) = RATmax(si) - RAT(si)
• Delay minimization corresponds to minimizing SD(so)
• Advantage: unlike RAT, equations using SD are additive
• Our approach
• Tradeoff surfaces in the 3D space of delay, capacitance and power
• Continuously sized buffer libraries
Previous Work
• L. P. P. P. van Ginneken (VG), ISCAS’90
• Two-phase dynamic programming algorithm
• Backward traversal up the interconnect tree to compute load and delay values
• Forward solution pass to reconstruct the “best” candidate

Function BOTTOM_UP(v)
    If v ∈ sinks { return (Cv, SDv) }
    Else
        /* compute options for subtrees (post-order DFS traversal) */
        BOTTOM_UP( left(v) )
        BOTTOM_UP( right(v) )
        /* merge operation: Cparent = Cleft + Cright, SDparent = max(SDleft, SDright) */
        Join pairs of subtrees by a merge operation
        /* buffer candidate creation */
        Find best cnd among merged cnds to add a buffer
        Add parent wire to both types of cnds
        /* pruning provably inferior candidates */
        Prune inferior cnds from set of cnds
        Store cnd list for node v and return
VG Algorithm
• Candidate format: 2-tuple (Load, Subtree Delay) = (c, s)
• Recursive formulas for two possible cases
• Pruning criterion: (c1, s1) is “better” than (c2, s2) if both the load and the subtree delay are lower, i.e., c1 < c2 and s1 < s2
• Merge operation is linear
• Complexity = $O(n^2)$, where n = number of buffer locations
• Additional objective: minimize buffer count; with this objective the complexity is non-polynomial
[Figure: Candidates (c0, s0) and (c1, s1) for the two cases]
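The bottom-up pass with (c, s) candidates can be sketched in a few lines. This is an illustration under stated assumptions, not the original implementation: the tree is a hypothetical nested dict, there is a single buffer type given as `(d0, r, c_in)`, wire delay uses the common form $r\,(c/2 + C_{down})$, and the merge is a plain cross product rather than the linear-time merge mentioned on the slide. The forward pass that reconstructs buffer locations is omitted.

```python
def add_wire(cands, r, c):
    """Add a parent wire (resistance r, capacitance c) to every candidate."""
    return [(cap + c, sd + r * (c / 2.0 + cap)) for cap, sd in cands]


def merge(left, right):
    """C_parent = C_left + C_right, SD_parent = max(SD_left, SD_right)."""
    return [(cl + cr, max(sl, sr)) for cl, sl in left for cr, sr in right]


def add_buffer(cands, buf):
    """Best buffered candidate: upstream sees only the buffer input cap."""
    d0, r, c_in = buf
    return (c_in, min(sd + d0 + r * cap for cap, sd in cands))


def prune(cands):
    """Keep a candidate only if its delay beats every lower-load candidate."""
    kept, best = [], float("inf")
    for cap, sd in sorted(cands):
        if sd < best:
            kept.append((cap, sd))
            best = sd
    return kept


def bottom_up(node, buf):
    if "sink" in node:
        cands = [node["sink"]]                      # (C_v, SD_v)
    else:
        kids = [bottom_up(ch, buf) for ch in node["children"]]
        cands = kids[0]
        for k in kids[1:]:
            cands = prune(merge(cands, k))
        cands = cands + [add_buffer(cands, buf)]    # buffered + unbuffered
    r, c = node.get("wire", (0.0, 0.0))
    return prune(add_wire(cands, r, c))


tree = {"wire": (1.0, 1.0), "children": [
    {"sink": (4.0, 0.0), "wire": (1.0, 1.0)},
    {"sink": (4.0, 0.0), "wire": (1.0, 1.0)},
]}
result = bottom_up(tree, buf=(1.0, 1.0, 1.0))
```

In this example both the buffered and the unbuffered option survive at the root: the buffered one has lower load but higher subtree delay, so neither dominates the other.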
Previous Work
• Extensions to VG by Lillis et al., ICCAD’95, JSSC’96
• A buffer library B can be used during buffer insertion: complexity = $O(n^2 |B|^2)$
• Simultaneous wire sizing and buffer insertion
• Incorporate signal slew into the buffer delay model
• Dynamic power minimization subject to timing constraints
• Candidate format: 3-tuple (Load, Subtree Delay, Power) = (c, s, p)
• Equate power with effective “total” capacitance
• Assumption: all capacitive values can be linearly mapped onto a polynomially bounded integer domain ($c_{max}$ = max cap value)
• Sophisticated pruning mechanism using orthogonal range queries
• Complexity = $O(n^3 |B| c_{max}^2 \log(n c_{max}))$, based on this assumption
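To see why a third component makes pruning harder, consider a naive dominance filter over (c, s, p) tuples. This is a quadratic-time illustration only; it is not the orthogonal range query structure referred to above, and it uses "no worse in every component and different" as the dominance rule, which is an assumption.

```python
def dominates(a, b):
    """a dominates b if a is no worse in every component and a != b."""
    return a != b and all(x <= y for x, y in zip(a, b))


def pareto_prune(cands):
    """Drop every (c, s, p) candidate that some other candidate dominates."""
    uniq = list(dict.fromkeys(cands))   # remove exact duplicates, keep order
    return [c for c in uniq if not any(dominates(o, c) for o in uniq)]


cands = [
    (1.0, 5.0, 2.0),   # dominated by (1.0, 5.0, 1.0)
    (2.0, 4.0, 2.0),   # kept: lowest delay
    (2.0, 6.0, 3.0),   # dominated by (1.0, 5.0, 2.0)
    (1.0, 5.0, 1.0),   # kept: lowest load and power
]
kept = pareto_prune(cands)
```

With two components, sorting by load leaves a single chain to scan; with three, candidates that tie or trade off on power survive more often, so the candidate lists grow faster.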
Previous Work
• Several approaches in the literature target power minimization in conjunction with buffer insertion. Examples:
• Quadratic programming: Chu et al., TCAD’99
• Lagrangian relaxation: C.-P. Chen et al., TCAD’99
• ClockTune: J.-L. Tsai et al., TCAD’04
• These associate total power with the effective capacitive area of wires + devices
• Area minimization is used as a proxy for power minimization
• This ignores the contribution of static leakage power
• Including this component results in non-polynomial complexity
• Adding extra components to candidates generally leads to exponential complexity for dynamic programming
Contributions of this paper
• Novel “continuous” buffer insertion algorithm with total power minimization
• Inclusive of both dynamic and leakage power
• Generate tradeoff surfaces in the 3D DCP (Delay, Capacitance, Power) space
• User is able to pick any desired point on this 3D surface
• Easy to explore trade-offs between the 3 variables
• Ability to handle arbitrarily large buffer libraries
• Continuously sized cell libraries with numerous buffer sizes
• Capable of snapping to discrete buffer sizes if necessary
• Worst-case polynomial complexity $O(n^2)$
• Similar to the “basic” VG algorithm
Library Characterization
• Buffer library with a set of continuously sized buffers
• Let S = sizing factor of the library. Express delay (db), capacitance (cb) and leakage (lb) in terms of S.
• Determine the empirical fitting constants c0, c1, l0, l1, d0, d1
• The equations combine the discrete buffer sizes to approximate the ideal of continuous buffer sizing
• cb (buffer area): $c_b = c_0 + c_1 S$
• lb (device width): $l_b = l_0 + l_1 S$
• db (linear gate delay model): $d_b = d_0 + d_1 (C_{out}/S)$
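A minimal sketch of how such constants could be fitted by ordinary least squares. The sample sizes and capacitances below are synthetic values generated from known constants purely to exercise the code; they are not measurements from any library, and `fit_line` is a hypothetical helper written in plain Python.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


# Synthetic characterization points (assumed): c_b = 0.5 + 1.5 * S
sizes = [1.0, 2.0, 4.0, 8.0]
caps = [0.5 + 1.5 * s for s in sizes]
c0, c1 = fit_line(sizes, caps)

# The delay model is linear in Cout / S, so fit against that ratio.
# Synthetic points (assumed): d_b = 1.0 + 2.0 * (Cout / S) with Cout = 8
c_out = 8.0
ratios = [c_out / s for s in sizes]
delays = [1.0 + 2.0 * r for r in ratios]
d0, d1 = fit_line(ratios, delays)


def buffer_delay(size: float, load: float) -> float:
    """Evaluate the fitted model at any continuous size S."""
    return d0 + d1 * (load / size)
```

Once fitted, the model can be evaluated at sizes that are not in the discrete library, for example `buffer_delay(3.0, 8.0)`, which is what makes the continuous formulation possible.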
Generation of candidates
• Point Candidate
• Candidate format: 3-tuple (Do, Co, Po)
• A node has a point candidate if and only if there are no buffers in the subtree rooted at that node
• All sinks have point candidates
• Write equations to determine the candidate at u
[Figure: Net with nodes o, u, v, t, wires lw1, lw2, lw3 and buffer positions b1 to b4; the sink o carries the point candidate (D0, C0, P0)]
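The equations for the candidate at u are not preserved in this transcript. One plausible form, stated here as an assumption, adds the Elmore delay of the wire to D, the wire capacitance to C, and a dynamic-power term proportional to the wire capacitance to P. The per-unit constants `r_unit`, `c_unit` and the power factor `k_dyn` are hypothetical parameters.

```python
from typing import NamedTuple


class Point(NamedTuple):
    D: float   # subtree delay
    C: float   # downstream capacitance
    P: float   # power


def add_wire(pt: Point, length: float,
             r_unit: float, c_unit: float, k_dyn: float) -> Point:
    """Propagate a point candidate across an unbuffered wire of the given length."""
    r, c = r_unit * length, c_unit * length
    return Point(
        D=pt.D + r * (c / 2.0 + pt.C),   # assumed Elmore wire delay
        C=pt.C + c,                      # wire capacitance adds to the load
        P=pt.P + k_dyn * c,              # assumed: dynamic power tracks capacitance
    )


sink = Point(D=0.0, C=2.0, P=1.0)
at_u = add_wire(sink, length=2.0, r_unit=1.0, c_unit=1.0, k_dyn=0.5)
```

Because no buffer is involved, the result is again a single point, which matches the statement that an unbuffered subtree has a point candidate.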
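Generation of candidates
• Curve Candidate
• Candidate format: {[Du_min, Du_max], (gi, ki), i ∈ [0, 2]}
• A node has a curve candidate if and only if there is exactly one buffer in the subtree rooted at that node
• Variable: S
[Figure: Net with nodes o, u, v, t, wires lw1, lw2, lw3 and buffer positions b1 to b4; point candidate (D0, C0, P0) and curve candidate (Du, Cu, Pu)]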
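Why a single buffer turns a point into a curve can be illustrated by sweeping the size S. This sketch samples the curve at a few sizes instead of storing it in the closed-form (gi, ki) representation named on the slide, whose exact meaning is not given here. The library constants, the `k_dyn` factor and the way buffer power is composed (leakage plus a dynamic term on the buffer's input capacitance) are assumptions.

```python
def buffer_curve(pt, sizes, lib, k_dyn):
    """Sample (S, D, C, P) after inserting one buffer of size S above point pt.

    pt is (D_u, C_u, P_u); lib holds the fitted constants c0, c1, l0, l1, d0, d1.
    """
    d_u, c_u, p_u = pt
    curve = []
    for s in sizes:
        c_b = lib["c0"] + lib["c1"] * s          # input cap seen upstream
        l_b = lib["l0"] + lib["l1"] * s          # leakage of the buffer
        d_b = lib["d0"] + lib["d1"] * (c_u / s)  # delay driving the load C_u
        curve.append((s, d_u + d_b, c_b, p_u + l_b + k_dyn * c_b))
    return curve


lib = {"c0": 0.0, "c1": 1.0, "l0": 0.0, "l1": 0.5, "d0": 1.0, "d1": 2.0}
curve = buffer_curve((6.0, 4.0, 2.0), [1.0, 2.0, 4.0], lib, k_dyn=0.5)
```

As S grows, delay falls while capacitance and power rise, so every sampled size is a distinct trade-off and none can be pruned on its own.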
Generation of candidates
• Surface Candidate
• C-plane format: {Cv, [Dmin, Dmax], (ki), i ∈ [0, 2]}
• Candidate format: vector<CPlane>
• Variables: S, Du
• For a given S, Cv is fixed; Dv and Pv vary based on Du
• Each C-plane has a “discrete” Cv
[Figure: Net with nodes o, u, v, t, wires lw1, lw2, lw3 and buffer positions b1 to b4; candidates (D0, C0, P0), (Du, Cu, Pu) and (Dv, Cv, Pv); 3D plot with axes Dv, Cv, Pv showing a C-plane]
Generation of candidates
• Similar equations can be written to determine the candidate at t
• Ct is a function of S, but Dt and Pt are functions of Cv, Dv and S
• This yields a new set of C-planes
• Within a C-plane, the lower envelope gives the power-optimal solution
• Propagating a surface candidate yields another surface candidate
[Figure: Net with nodes o, u, v, t, wires lw1, lw2, lw3 and buffer positions b1 to b4; candidates (D0, C0, P0), (Du, Cu, Pu), (Dv, Cv, Pv) and (Dt, Ct, Pt)]
Design Choices
• Wire network is a binary tree
• Zero-length wires, dummy nodes
• Ignore signal polarity on buffers
• Pair of solution sets (similar to Lillis)
• Number of surface candidates per node = 2 (buffered/non-buffered)
• Trade-off between more fine-grained solutions and efficiency
• No impact on optimality or complexity
Merging and Implicit Pruning
• First, merge the left and right candidates
• Compare equal-delay points by checking the 4 combinations of left and right candidates
• Create P/C curves and extract the lower envelope (pruning)
• Translate P/C curves with a fixed D value into P/D curves with fixed C values; this creates the C-planes for 4 different surface candidates
• Next, recombine these 4 surfaces into a single candidate
• Map P/D curves from one C-plane to another using linear interpolation
• For each (D, C) value, pick the lowest power value (pruning)
• Use the composite surface to create the buffered/non-buffered candidate
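The lower-envelope step with linear interpolation can be sketched for curves within one plane. Here each curve is a sorted list of (D, P) samples and the envelope is evaluated on a caller-supplied grid; this sampled representation and the function names are assumptions, not the data structure used in the paper.

```python
def interp(curve, x):
    """Linearly interpolate sorted (x, y) samples; None outside the curve's range."""
    if x < curve[0][0] or x > curve[-1][0]:
        return None
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        if x0 <= x <= x1:
            if x1 == x0:
                return min(y0, y1)
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return curve[0][1]   # single-sample curve, x equals its only abscissa


def lower_envelope(curves, xs):
    """At each grid value keep the lowest power over all curves defined there."""
    env = []
    for x in xs:
        vals = [v for v in (interp(c, x) for c in curves) if v is not None]
        if vals:
            env.append((x, min(vals)))
    return env


curve_a = [(0.0, 4.0), (4.0, 0.0)]   # power falls as allowed delay grows
curve_b = [(0.0, 1.0), (4.0, 3.0)]
envelope = lower_envelope([curve_a, curve_b], [0.0, 1.0, 2.0, 3.0, 4.0])
```

Points of either curve that lie above the envelope are never selected, which is the sense in which taking the envelope prunes implicitly, without comparing candidates one by one.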
Reconstruction and Snapping
• Pair of candidate solutions created for the source
• Any trade-off point in the DCP surface can be picked
• Forward solution pass to reconstruct the tree structure with buffer locations
• Snapping: if the required size is unavailable, the buffer with the nearest size value is chosen
• Problem: discrepancies in D, C, P values
• Solution: local refinements in the C-planes
• Single pass through the RC tree
• Complexity = $O(n^2)$, where n = number of possible buffer locations
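Snapping and the discrepancy it introduces can be shown in a short sketch. The discrete sizes, the load and the delay constants are made-up values for illustration, and the local refinement step in the C-planes is not modelled here.

```python
def snap(size: float, library) -> float:
    """Pick the discrete library size closest to the requested continuous size."""
    return min(library, key=lambda s: abs(s - size))


def buffer_delay(size: float, c_out: float, d0: float = 1.0, d1: float = 2.0) -> float:
    """Linear gate delay model d_b = d0 + d1 * (Cout / S) with assumed constants."""
    return d0 + d1 * (c_out / size)


library = [1.0, 2.0, 4.0, 8.0]        # hypothetical discrete sizes
wanted = 3.5                          # continuous size picked on the surface
chosen = snap(wanted, library)

# Discrepancy in D caused by snapping, for an assumed load of 7.0
delay_error = buffer_delay(chosen, 7.0) - buffer_delay(wanted, 7.0)
```

Snapping up to a larger size lowers the buffer's own delay but raises its input capacitance and power, so the D, C and P values of upstream candidates no longer match what was computed with the continuous size.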
Results
• Benchmarks = C-tree nets
• TSMC 0.13um buffer library
• Number of discrete buffer choices = 9
• Multilinear fitting models using the GNU Scientific Library
[Figure: Example 3D surface]
Results: Comparison
• Implementation of the Lillis algorithm with leakage included
• Pruning is less effective
Conclusion
• Buffer insertion algorithm with total power (Pdyn + Pstat) minimization as the objective
• Generate 3D surfaces in the Delay, Capacitance and Power space
• Ability to explore different types of trade-offs
• Able to handle large buffer libraries with continuous sizes
• Worst-case polynomial complexity