An Efficient Clustering Algorithm For Low Power Clock Tree Synthesis. Rupesh S. Shelar, Enterprise Microprocessor Group, Intel Corporation, Hillsboro, OR. March 21st, 2007. ISPD 2007, Austin. Outline: Introduction • Problem Formulation • Clustering Algorithm • Experimental Results • Conclusion
Local Clock Capacitance Distribution in a Microprocessor • Interconnects contribute a major portion of the total capacitance • Clocks are the most active nets in the design • Minimizing interconnect capacitance in clocks reduces dynamic power • Distribution generated from several blocks in a microprocessor
Microprocessor Clock Hierarchy • Clock network in a processor: distributed as a grid followed by a tree • Local clock network: CTS solution space • [Figure: clock hierarchy diagram with labels PLL, global clock distribution using multiple spines, tunable grid buffers, clock grid, regional clock buffers (RCBs), local clock buffers (LCBs), to state elements]
Previous Work • Zero skew (unbuffered) trees: Tsay TCAD’93, Boese et al. ASIC’92, Edahiro DAC’93, ’94 • Buffered trees: • Vittal et al., DAC’95: Trades off buffers with wires; unsuitable for controlled implementation of clock gating and delayed clocking • Mehta et al., ICCD’97: Uses dynamic programming based heuristic for clustering • Tsai et al., ICCAD’05: Formulation employing tunable buffers
Clock Tree Synthesis (CTS) • Performed after the placement/sizing of sequentials • Converts logical clock tree into physical one • Flow employed in several microprocessor designs • [Figure: simplified design flow with labels RTL, logic synthesis, physical synthesis, CTS, routing; CTS inputs labeled logical clock tree and sequentials (x, y), sizes; CTS steps labeled clock buffer duplication, routing clock nets, sizing clock buffers]
Clock Buffer Duplication • Given a clock buffer, duplicate it to meet delay, slope, RC, skew constraints • Decides which receivers are driven by the same driver, and the clock tree topology • Applied recursively in reverse topological order • Driven by clustering or partitioning • Often intractable when capacity constraints are specified • Many heuristics available • [Figure: duplication shown for K-stage buffers and K-stage receivers]
Outline • Introduction • Problem Formulation • Clustering Algorithm • Experimental Results • Conclusion
Effect of Clustering on Capacitance • A cluster implies a clock buffer • Interconnect capacitance varies significantly for different solutions even with the same number of clusters • [Figure: 4 placed sequentials clustered three ways, labeled Solution 1, Solution 2, Solution 3]
Clustering Targeting Power • Find the clusters such that total local clock power is minimum • Power in local clock: P_LocalClock = P_Dynamic + P_Leakage • P_Dynamic = P_SequentialCap + P_BufferCap + P_RoutingCap • P_Leakage and P_BufferCap can be shown to be proportional to total cap • P_SequentialCap is fixed for CTS purposes • Reducing P_LocalClock is therefore equivalent to minimizing interconnect cap • Restated: find the clusters such that total interconnect capacitance is minimum
Routing-aware Clustering: Chicken-and-Egg Problem • Routing cap is unknown till the clustering is performed • Clustering cannot be performed till routing cap is known
Problem Simplification • Let’s assume minimum spanning tree (MST) routing estimates • Other candidates: HPWL, Edahiro metric • Data in the paper show that MST and the Edahiro metric are strongly correlated with actual clock tree wirelength • MST possesses a submodularity property suitable for greedy optimization • Can the problem be solved optimally, i.e., can we perform clustering such that the routing cap./overall power is minimum? • Yes, it can be (if capacity constraints are dropped)
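As a rough illustration of two of the routing estimates named on this slide, the sketch below computes an MST length and a half-perimeter wirelength (HPWL) for a set of receiver locations. The Manhattan distance metric, the function names, and the use of Prim's algorithm are assumptions made for the example; the slides do not state how the estimates were computed.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def manhattan(a: Point, b: Point) -> float:
    """Rectilinear distance between two points (assumed metric)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def mst_length(points: List[Point]) -> float:
    """Total edge length of a minimum spanning tree, via O(n^2) Prim."""
    n = len(points)
    if n < 2:
        return 0.0
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known connection of each point to the tree
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        # Pick the point outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], manhattan(points[u], points[v]))
    return total


def hpwl(points: List[Point]) -> float:
    """Half-perimeter of the bounding box enclosing the points."""
    if not points:
        return 0.0
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))
```

For four receivers on the corners of a unit square, the MST estimate is larger than HPWL, which hints at why HPWL can undercount wire for multi-pin clock nets.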
Problem Definition • Given: set of receivers S = {s_1, …, s_n}, their loads (c_{s_i}), and locations (x_{s_i}, y_{s_i}) • Find: a set of clusters S_clusters = {c_1, …, c_m} such that Σ_i (α + MST(c_i)) is minimum • Subject to constraints (or design parameters): • Maximum # of receivers (due to process, routing, etc.) • Maximum load in a cluster (due to library) • Bounding box width/height (to control RC delay and variations in it)
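The three constraints on this slide can be read as a feasibility check on a candidate cluster. The following is a minimal sketch of such a check; the `Receiver` and `Limits` types, their field names, and `is_feasible` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Receiver:
    x: float     # placement location
    y: float
    load: float  # input capacitance seen by the clock buffer


@dataclass(frozen=True)
class Limits:
    max_receivers: int  # fan-out limit
    max_load: float     # total load one buffer may drive
    max_width: float    # bounding box limits
    max_height: float


def is_feasible(cluster: Sequence[Receiver], limits: Limits) -> bool:
    """Return True if the cluster satisfies all three constraints."""
    if not cluster:
        return True
    xs = [r.x for r in cluster]
    ys = [r.y for r in cluster]
    return (
        len(cluster) <= limits.max_receivers
        and sum(r.load for r in cluster) <= limits.max_load
        and max(xs) - min(xs) <= limits.max_width
        and max(ys) - min(ys) <= limits.max_height
    )
```

All three quantities can only grow when two clusters are merged, so a merge that is rejected once never becomes feasible later in a greedy pass.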
Outline • Introduction • Problem Formulation • Clustering Algorithm • Experimental Results • Conclusion
Power-aware Clustering Algorithm • Similar to Kruskal’s MST construction algorithm • Steps in the algorithm: (1) Create complete graph G(S, E, W) (2) Assign each edge its estimated capacitance as the weight (3) Create a trivial solution with each cluster containing one receiver (4) For each edge, in ascending order of weight, merge clusters till the cost function is minimized
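The steps above can be sketched as a Kruskal-style pass with a union-find structure. This is an illustration under stated assumptions, not the implementation described in the talk: `alpha` is treated as a fixed per-cluster term from the cost function, a merge is accepted only while the edge weight is below `alpha` (so that the merge lowers the cost), and only the maximum-receivers constraint is enforced. The function name and signature are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int, float]  # (receiver u, receiver v, estimated capacitance)


def power_aware_clusters(
    n: int, edges: List[Edge], alpha: float, max_receivers: int
) -> Tuple[List[List[int]], float]:
    """Greedy clustering; returns (clusters, total weight of accepted edges)."""
    parent = list(range(n))  # start with one cluster per receiver
    size = [1] * n

    def find(u: int) -> int:
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    wire = 0.0
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        if w >= alpha:
            break  # assumed stopping rule: further merges no longer reduce cost
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # already in the same cluster
        if size[ru] + size[rv] > max_receivers:
            continue  # merge would violate the capacity constraint
        if size[ru] < size[rv]:
            ru, rv = rv, ru
        parent[rv] = ru  # union by size
        size[ru] += size[rv]
        wire += w

    groups: Dict[int, List[int]] = {}
    for r in range(n):
        groups.setdefault(find(r), []).append(r)
    return sorted(groups.values()), wire
```

Under these assumptions the cost of the result is `alpha * len(clusters) + wire`. As a usage example, take four receivers with edge weights 1, 2, 4, 4, 5, 5 on an assumed topology `[(0, 1, 1), (1, 2, 2), (0, 2, 4), (2, 3, 4), (0, 3, 5), (1, 3, 5)]`, `alpha=10`, and `max_receivers=3`: the pass accepts the edges of weight 1 and 2 and rejects every merge that would create a cluster of four.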
Example • Constraint: maximum # of receivers is 3 • [Figure: animation over four frames of a small graph with edge weights 1, 5, 5, 4, 4, 2; legend: a cluster, an edge, the weight] • Power-aware clustering results in clusters with a total MST value of 3, which is optimal in this case
Optimality, Time Complexity of Algorithm • Ensures optimality when no capacity constraints (max. load, # of receivers) are specified • In that case, reduces to the minimum spanning forest problem • Runs in O(n^2 log n) time, where n is the number of receivers • Handles blocks with ~5K sequentials easily • 1.34 seconds for clustering of 1037 sequentials • Run-times are practical and comparable to competitive algorithms • Clock buffer duplication takes minutes on ~5K sequential blocks
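One way to see where the bound on this slide comes from, assuming the edge sort dominates as it does in Kruskal's algorithm (the slides do not give this breakdown):

```latex
\begin{align*}
|E| &= \binom{n}{2} = \frac{n(n-1)}{2} = O(n^2)
  && \text{(complete graph on $n$ receivers)} \\
T_{\text{sort}} &= O(|E| \log |E|) = O(n^2 \log n^2) = O(n^2 \log n)
  && \text{(since $\log n^2 = 2 \log n$)} \\
T_{\text{total}} &= O(n^2) + T_{\text{sort}} + T_{\text{merge}} = O(n^2 \log n)
  && \text{(if $T_{\text{merge}}$ does not exceed $T_{\text{sort}}$)}
\end{align*}
```

Here the $O(n^2)$ term is the cost of building the edge weights, and $T_{\text{merge}}$ is the cost of the merge pass over the sorted edges.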
Outline • Introduction • Problem Formulation • Clustering Algorithm • Experimental Results • Conclusion
Evaluation of Power-Aware Clustering (PoAwCl) • Implemented the clustering algorithm, PoAwCl, in C++ • Incorporated it in the clock buffer duplication step using TCL • Rest of the CTS kept unchanged • Generated clock trees on microprocessor blocks by changing only the clustering/partitioning heuristics • The best of those results is compared with PoAwCl
Results on Clock Trees: Interconnect Capacitance Improvement • 13% average improvement
Results on Clock Trees: Total Capacitance Improvement • 6% average improvement
Results on Clock Trees: Wirelength Improvement • 11% average improvement
Looking at Cluster Pictures • ●, +, *, ▼ denote locations of sequentials; symbols of the same type denote a cluster • The 4 clusters in each case represent 4 clock buffers driving the sequentials in their clusters • [Figure: two cluster plots, one for power-aware clustering and one for clustering aimed at minimizing # of buffers]
Viewing the Routing • Power-aware clustering (on right) results in smaller wirelength
Agenda • Introduction • Motivation • Problem Formulation • Clustering Algorithm • Experimental Results • Conclusion
Conclusion • Power-aware clustering results in a 13% improvement in interconnect cap • Also frees up routing resources by 11%, discounting shielding and spacing of clock wires • Used for other applications such as enable logic (or clock gating) synthesis and trunk routing • Acknowledgment: Intel’s CAD Organization, for providing the source code of the CTS package, which sped up the development