An Efficient Surface-Based Low-Power Buffer Insertion Algorithm

Rajeev R. Rao, David Blaauw, Dennis Sylvester, Charles Alpert*, Sani Nassif*
Department of EECS, University of Michigan, Ann Arbor, MI
*IBM Austin Research Laboratory, Austin, TX
{rrrao, blaauw, dennis}@eecs.umich.edu, *{alpert, nassif}@us.ibm.com

Interconnect Trends
[Figure: % repeater cells in block-level nets (clk-rep, rep, tot-rep) for the 90nm, 65nm, 45nm and 32nm nodes. Source: P. Saxena, ISPD’04]
[Figure: Total dynamic power breakdown. Source: N. Magen, SLIP’04]
• Interconnect power a major issue
• Huge power consumption in both global and local signal nets
• Repeater counts increasing drastically
• IBM: 50% of leakage in inverters/buffers
• Assuming continuation of current design styles, dramatic projections for the 32nm technology node
• 70% of cell count = repeaters
• 65-80% of dynamic power due to interconnects
• Leakage increasing exponentially
• Require: Optimal repeater usage with the objective of total power minimization

Outline
• Introduction
• Delay and Buffer models
• Previous Work
• Proposed Algorithm
• Library characterization
• Generation of different types of candidates
• Merging, Propagation, Snapping
• Results
• Conclusion

Introduction
[Figure: Driver to Receiver, wire length = 2, wire delay ∝ (2)² = 4]
[Figure: Driver to Repeater to Receiver, two segments of length 1, wire delay ∝ (1)² + (1)² = 2]
• Wire RC delay is a quadratic function of wire length
• Segmenting wires decreases delay
• Same idea applicable for interconnect tree structures
• Buffers inserted for delay management
• Additional benefit: Buffers/Inverters decouple large output loads
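The arithmetic in the two figures can be checked with a minimal sketch. It assumes unit resistance and capacitance per unit length and, like the slide, ignores the delay of the repeater itself; the function names are illustrative only.

```python
def wire_delay(length, r=1.0, c=1.0):
    """Delay of a uniform RC wire, proportional to length squared.
    r and c are assumed per-unit-length values (constant factors dropped)."""
    return r * c * length ** 2


def segmented_delay(length, segments, r=1.0, c=1.0):
    """Wire delay after splitting the wire into equal segments.
    The repeater's own delay is ignored, as in the slide's example."""
    seg = length / segments
    return segments * wire_delay(seg, r, c)


unbuffered = wire_delay(2)          # (2)^2 = 4
with_repeater = segmented_delay(2, 2)  # (1)^2 + (1)^2 = 2
print(unbuffered, with_repeater)
```

In practice the repeater adds its own delay, so segmenting only pays off once the wire is long enough.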

Elmore Delay Model
[Figure source: Digital Int. Circuits, J. Rabaey]
• Represent interconnect tree with a lumped RC model
• Assume binary tree topology is fixed with an initial Steiner tree estimation
• n vertices (branch points) and (n-1) edges (i.e., wires)
• For a wire e connecting vertices (u, v) the Elmore delay is: [equation not preserved in the transcript], where T(v) is the maximal subtree rooted at v that does not contain buffers
• The total delay from a vertex v to a sink node si is: [equation not preserved in the transcript]
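The two equations did not survive in this transcript. A common textbook form, assumed here rather than taken from the slide, is D(e) = R_e (C_e/2 + C(T(v))) for a wire e = (u, v), with the delay to a sink being the sum of D(e) over the wires on the path. A minimal sketch of that form on an unbuffered tree follows; the `Node` class and all values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    sink_cap: float = 0.0  # load capacitance if this node is a sink
    # each child entry is (wire_resistance, wire_capacitance, child_node)
    children: list = field(default_factory=list)


def subtree_cap(v):
    """Total capacitance of the unbuffered subtree T(v): sink loads plus wires."""
    return v.sink_cap + sum(c + subtree_cap(ch) for _, c, ch in v.children)


def edge_delay(r, c, child):
    """Assumed Elmore form for one wire: R_e * (C_e / 2 + C(T(child)))."""
    return r * (c / 2.0 + subtree_cap(child))


def sink_delays(v, acc=0.0, out=None):
    """Accumulate edge delays along every root-to-sink path."""
    out = {} if out is None else out
    if not v.children:
        out[v.name] = acc
    for r, c, ch in v.children:
        sink_delays(ch, acc + edge_delay(r, c, ch), out)
    return out


s1 = Node("s1", sink_cap=1.0)
s2 = Node("s2", sink_cap=2.0)
u = Node("u", children=[(1.0, 1.0, s1), (2.0, 1.0, s2)])
root = Node("root", children=[(1.0, 2.0, u)])
delays = sink_delays(root)
print(delays)
```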

Buffer Model
[Figure: buffer at node v with input capacitance Cbuf, gate-drain capacitance Cgd and output load Cload. Node v “sees” a downstream load = Cbuf; Cload is “invisible” to v.]
• Linear gate delay model used for the buffers
• Assumption: Delay is a linear function of output capacitance
• Dbuffer = Dintrinsic-delay + Rintrinsic-resistance * Coutput-load
• Isolation Property: Buffer devices decouple “downstream” output loads from the parent trees
• Assumption: Miller effect (“bootstrapping”) due to Cgd is negligible

Buffer Insertion Problem
[Figure: routing tree from source to sinks with legal buffer positions; buffer library BufLib = {b1, b2, b3, …}]
• Timing Metrics
• Required Arrival Time (RAT)
• Each sink si is given a RAT(si) value and the source is fixed as RAT(so) = 0
• Delay minimization ⇒ Maximize slack at source q(so)
• Subtree Delay (SD)
• SD(si) = RATmax(si) – RAT(si)
• Delay minimization ⇒ Minimize SD(so)
• Advantage: Unlike RAT, equations using SD are additive
• Our approach
• Tradeoff surfaces in 3D space of delay, capacitance and power
• Continuously-sized buffer libraries
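As a small illustration of the SD definition above, the following sketch converts sink RAT values into subtree delays. It reads RATmax as the largest RAT over all sinks, which is an assumption, and the sink names and values are hypothetical.

```python
def subtree_delays(rats):
    """SD(si) = RATmax - RAT(si), with RATmax taken over all sinks (assumed)."""
    rat_max = max(rats.values())
    return {sink: rat_max - rat for sink, rat in rats.items()}


# Hypothetical required arrival times for three sinks.
sink_rats = {"s1": 10.0, "s2": 7.0, "s3": 9.0}
sd = subtree_delays(sink_rats)
print(sd)
```

The most critical sink (smallest RAT) starts with the largest SD, and wire or buffer delays are then simply added to SD on the way up the tree.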

Previous Work
• L. P. P. P. van Ginneken (VG) – ISCAS’90
• Two-phase dynamic programming algorithm
• Backward traversal up the interconnect tree to compute load and delay values (post-order DFS traversal)
• Forward solution pass to reconstruct the “best” candidate
Function BOTTOM_UP (v)
1. If v ∈ sink { return (Cv, SDv) } Else
2. /* compute options for subtrees */
3. BOTTOM_UP( left(v) )
4. BOTTOM_UP( right(v) )
5. Join pairs of subtrees by a merge operation
6. Find best cnd among merged cnds to add a buffer
7. Add parent wire to both types of cnds
8. Prune inferior cnds from set of cnds
9. Store cnd list for node v and return
[Figure: Merge operation. Cparent = Cleft + Cright, SDparent = max(SDleft, SDright)]
[Figure: Buffer candidate creation]
[Figure: Pruning provably inferior candidates]

VG Algorithm
• Candidate Format: 2-tuple (Load, Subtree Delay) = (c, s)
• Recursive formulas for two possible cases
[Figures: candidate (c0, s0) mapped to (c1, s1) in each case]
• Pruning Criteria: (c1, s1) is “better” than (c2, s2) if both load and subtree delay values are lower, i.e., c1 < c2 and s1 < s2
• Merge operation is linear
• Complexity = O(n²) where n = number of buffer locations
• Additional objective: Minimize buffer count ⇒ Complexity is non-polynomial
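The bottom-up pass and the pruning rule can be sketched as follows. This is a simplified illustration and not the authors' implementation: the tree encoding, the single buffer type `BUF`, and the wire formula (the common Elmore form) are assumptions, and the merge is done over all pairs rather than with the linear-time sorted merge of the original algorithm.

```python
BUF = (1.0, 1.0, 1.0)  # hypothetical buffer: (intrinsic delay, resistance, input cap)


def prune(cands):
    """Keep only non-dominated (load, subtree_delay) pairs."""
    kept = []
    for c, s in sorted(cands):           # ascending load, then delay
        if not kept or s < kept[-1][1]:  # higher load must buy lower delay
            kept.append((c, s))
    return kept


def add_wire(cands, r, cw):
    """Wire case: load grows by cw, delay by r * (cw / 2 + c)."""
    return [(c + cw, s + r * (cw / 2.0 + c)) for c, s in cands]


def add_buffer(cands, d0, rb, cb):
    """Buffer case: the best buffered candidate presents load cb upstream."""
    return (cb, min(s + d0 + rb * c for c, s in cands))


def merge(left, right):
    """Cparent = Cleft + Cright, SDparent = max(SDleft, SDright)."""
    return prune([(cl + cr, max(sl, sr)) for cl, sl in left for cr, sr in right])


def bottom_up(node):
    """Post-order traversal returning the pruned candidate list at node."""
    if "sink" in node:
        return [node["sink"]]
    subs = [add_wire(bottom_up(ch), r, cw) for ch, (r, cw) in node["children"]]
    cands = subs[0]
    for other in subs[1:]:
        cands = merge(cands, other)
    if node.get("buffer_ok"):
        cands = cands + [add_buffer(cands, *BUF)]
    return prune(cands)


s1 = {"sink": (1.0, 0.0)}
s2 = {"sink": (2.0, 0.0)}
u = {"buffer_ok": True, "children": [(s1, (1.0, 1.0)), (s2, (2.0, 1.0))]}
root = {"children": [(u, (1.0, 2.0))]}
result = bottom_up(root)
print(result)
```

Here the parent wire is added to each child's list before merging, which is equivalent to steps 5 to 7 of BOTTOM_UP applied one level up. Both surviving candidates at the root are incomparable: one has lower load, the other lower subtree delay.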

Previous Work
• Extensions to VG by Lillis et al. – ICCAD’95, JSSC’96
• A buffer library B can be used during buffer insertion ⇒ Complexity = O(n²|B|²)
• Simultaneous wire sizing and buffer insertion
• Incorporate signal slew into buffer delay model
• Dynamic power minimization subject to timing constraints
• Candidate Format: 3-tuple (Load, Subtree Delay, Power) = (c, s, p)
• Equate power with effective “total” capacitance
• Assumption: All capacitive values can be linearly mapped onto a polynomially-bounded integer domain (cmax = max cap value)
• Sophisticated pruning mechanism using orthogonal range query
• Complexity = O(n³ |B| cmax² log(n cmax)) based on the assumption

Previous Work
• Several approaches presented in the literature target power minimization in conjunction with buffer insertion. Examples:
• Quadratic programming: Chu et al. – TCAD’99
• Lagrangian relaxation: C.-P. Chen et al. – TCAD’99
• ClockTune: J.-L. Tsai et al. – TCAD’04
• Associate total power with effective capacitive area of wires + devices
• Area minimization ⇒ Power minimization
• Ignores the contribution of static leakage power
• Inclusion of this component results in non-polynomial complexity
• Addition of extra components in candidates generally leads to exponential complexity for dynamic programming

Contributions of this paper
• Novel “continuous” buffer insertion algorithm with total power minimization
• Inclusive of both dynamic and leakage power
• Generate tradeoff surfaces in the 3D DCP (Delay, Capacitance, Power) space
• User is able to pick any desired point on this 3D surface
• Easy to explore trade-offs between the 3 variables
• Ability to handle arbitrarily large buffer libraries
• Continuously sized cell libraries with numerous buffer sizes
• Capable of snapping to discrete buffer sizes if necessary
• Worst-case polynomial complexity O(n²)
• Similar to “basic” VG algorithm

Library Characterization
• Buffer library with a set of continuously sized buffers
• Let S = sizing factor of the library. Express delay (db), capacitance (cb) and leakage (lb) in terms of S.
• Determine the fitting constants c0, c1, l0, l1, d0, d1 empirically
• Equations fitted to the discrete buffer sizes approximate the ideal of continuous buffer sizing
• cb (buffer area): cb = c0 + c1*S
• lb (device width): lb = l0 + l1*S
• db (linear gate delay model): db = d0 + d1*(Cout/S)
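A minimal sketch of how the six constants could be fitted from a discrete library. The data values are hypothetical, and a plain one-variable least-squares fit in pure Python stands in for the fitting procedure actually used by the authors.

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


# Hypothetical characterization data for four discrete buffer sizes.
sizes = [1.0, 2.0, 4.0, 8.0]
in_caps = [2.0, 3.5, 6.5, 12.5]   # input capacitance per size
leaks = [0.3, 0.5, 0.9, 1.7]      # leakage per size
cout = 8.0                        # fixed output load used for the delay data
delays = [5.0, 3.0, 2.0, 1.5]     # delay per size at that load

c0, c1 = linear_fit(sizes, in_caps)                     # cb = c0 + c1*S
l0, l1 = linear_fit(sizes, leaks)                       # lb = l0 + l1*S
d0, d1 = linear_fit([cout / s for s in sizes], delays)  # db = d0 + d1*(Cout/S)


def buffer_model(S, c_out):
    """Evaluate (delay, input cap, leakage) for any continuous size S."""
    return d0 + d1 * c_out / S, c0 + c1 * S, l0 + l1 * S


print(buffer_model(3.0, 6.0))
```

Once fitted, `buffer_model` can be evaluated at sizes that are not in the library, such as S = 3 here, which is what makes continuous sizing possible.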

Generation of candidates
[Figure: example net with nodes o, u, v, t, wire segments lw1, lw2, lw3 and buffers b1 to b4; point candidate (D0, C0, P0)]
• Point Candidate
• Candidate Format: 3-tuple (D0, C0, P0)
• Node has point candidate ⇒ there are no buffers in subtree rooted at that node
• All sinks have point candidates
• Write equations to determine candidate at u

Generation of candidates
[Figure: example net with nodes o, u, v, t, wire segments lw1, lw2, lw3 and buffers b1 to b4; (D0, C0, P0) mapped to (Du, Cu, Pu) with variable S]
• Curve Candidate
• Candidate Format: {[Dumin, Dumax], (gi, ki) i=[0,2]}
• Node has curve candidate ⇒ Exactly one buffer in subtree rooted at node

Generation of candidates
[Figure: example net with nodes o, u, v, t, wire segments lw1, lw2, lw3 and buffers b1 to b4; (D0, C0, P0) to (Du, Cu, Pu) to (Dv, Cv, Pv) with variables S and Du; axes Pv, Cv, Dv; C-plane with “discrete” Cv]
• Surface Candidate
• C-plane Format: {Cv, [Dmin, Dmax], (ki) i=[0,2]}
• Candidate Format: vector<CPlane>
• For a given S, Cv is fixed; Dv and Pv vary based on Du

Generation of candidates
[Figure: example net with nodes o, u, v, t, wire segments lw1, lw2, lw3 and buffers b1 to b4; (D0, C0, P0) to (Du, Cu, Pu) to (Dv, Cv, Pv) to (Dt, Ct, Pt)]
• Similar equations can be written to determine candidate at t
• Ct depends only on S, but Dt and Pt depend on Cv, Dv and S
• New set of C-planes
• For each C-plane, the lower envelope gives the power-optimal solution
• Surface candidate → Surface candidate

Design Choices
• Wire network is a binary tree
• Zero-length wires, dummy nodes
• Ignore signal polarity on buffers
• Pair of solution sets (similar to Lillis)
• Number of surface candidates per node = 2 (Buffered/Non-buffered)
• Trade-off between more fine-grained solutions and efficiency
• No impact on optimality or complexity

Merging and Implicit Pruning
• First, merge left and right candidates
• Compare equal delay points by checking 4 combinations of left and right candidates
• Create P/C curves and extract the lower envelope ⇒ Pruning
• Translate P/C curves with fixed D value into P/D curves with fixed C values ⇒ Creation of C-planes for 4 different surface candidates
• Next, recombine these 4 surfaces into a single candidate
• Map P/D curves from one C-plane to another using linear interpolation
• For every (D, C) value, pick the lowest power value ⇒ Pruning
• Use composite surface to create the buffered/non-buffered candidate
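The "pick the lowest power value" step amounts to taking a lower envelope over several curves. A minimal sketch follows, assuming each curve is a piecewise-linear list of (x, power) points and that the envelope is only sampled on a given grid; the exact envelope would also need the crossing points between curves. The function names and data are hypothetical.

```python
def interp(curve, x):
    """Linear interpolation on a sorted list of (x, y); None outside its range."""
    if x < curve[0][0] or x > curve[-1][0]:
        return None
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        if x0 <= x <= x1:
            if x1 == x0:
                return min(y0, y1)
            t = (x - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    return curve[0][1]  # single-point curve


def lower_envelope(curves, xs):
    """At each sample x keep the smallest power among curves defined there."""
    env = []
    for x in xs:
        vals = [v for v in (interp(c, x) for c in curves) if v is not None]
        if vals:
            env.append((x, min(vals)))
    return env


# Two hypothetical power curves over the same axis.
curve_a = [(0, 4), (4, 0)]
curve_b = [(0, 1), (4, 3)]
envelope = lower_envelope([curve_a, curve_b], [0, 1, 2, 3, 4])
print(envelope)
```

Points of a curve that lie above the envelope are never selected, which is why the slide describes this pruning as implicit.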

Reconstruction and Snapping
• Pair of candidate solutions created for source
• Any trade-off point in the DCP surface can be picked
• Forward solution pass to reconstruct the tree structure with buffer locations
• Snapping: If the required size is unavailable, the buffer with the nearest size value is chosen
• Problem: Discrepancies in D, C, P values ⇒ Solution: Local refinements in the C-planes
• Single pass through the RC tree
• Complexity = O(n²) where n = number of possible buffer locations
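The nearest-size rule for snapping can be sketched in a few lines. The library sizes below are hypothetical, and the local refinement of the C-planes mentioned above is not modeled.

```python
def snap_size(required, library_sizes):
    """Return the available size closest to the required continuous size.
    On a tie, the size listed first in library_sizes wins."""
    return min(library_sizes, key=lambda s: abs(s - required))


# Hypothetical library with 9 discrete buffer sizes.
LIBRARY = [1, 2, 3, 4, 6, 8, 12, 16, 24]

print(snap_size(5.3, LIBRARY))
print(snap_size(0.2, LIBRARY))
```

Because the snapped size differs from the continuous optimum, the delay, capacitance and power of the final solution shift slightly, which is the discrepancy the slide refers to.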

Results
• Benchmarks = C-tree nets
• TSMC 0.13µm buffer library
• Number of discrete buffer choices = 9
• Multilinear fitting models using GNU Scientific Library
[Figure: Example 3D surface]

Results: Snapping

Results: Comparison
• Implementation of Lillis algorithm with leakage included
• Pruning less effective

Conclusion
• Buffer insertion algorithm with total power (Pdyn + Pstat) minimization as objective
• Generate 3D surfaces in Delay, Capacitance and Power space
• Ability to explore different types of trade-offs
• Able to handle large buffer libraries with continuous sizes
• Worst-case polynomial complexity
