
Global Clustering-Based Performance-Driven Circuit Partitioning

Global Clustering-Based Performance-Driven Circuit Partitioning. Jason Cong University of California Los Angeles cong@cs.ucla.edu. Chang Wu Aplus Design Technologies Los Angeles changwu@aplus-dt.com. Problem Definition.


Presentation Transcript


  1. Global Clustering-Based Performance-Driven Circuit Partitioning
     • Jason Cong, University of California, Los Angeles, cong@cs.ucla.edu
     • Chang Wu, Aplus Design Technologies, Los Angeles, changwu@aplus-dt.com

  2. Problem Definition
     • Problem: k-way circuit partitioning and retiming with balanced area for delay minimization
       • Delay minimization with consideration of cutsize
       • Retiming is performed simultaneously with partitioning for the best possible delay reduction
     • Generic delay model: node delay d_v, intra-block delay d, inter-block delay D, with D > d
     [Figure: delay model showing node delay d_v, intra-block delay d, and inter-block delay D]
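The generic delay model on this slide can be illustrated with a minimal sketch. It assumes a purely combinational DAG (no flip-flops, no retiming), and the function name `circuit_delay`, the argument layout, and the numeric delays are all hypothetical choices for illustration, not taken from the presentation.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def circuit_delay(node_delay, edges, block, d_intra, d_inter):
    """Longest path delay of a combinational DAG under the generic delay model.

    node_delay: dict node -> node delay d_v
    edges:      list of (u, v) pairs
    block:      dict node -> block index of a partitioning solution
    d_intra:    delay d of an edge whose endpoints share a block
    d_inter:    delay D of an edge crossing between blocks (D > d)
    """
    preds = {v: [] for v in node_delay}
    for u, v in edges:
        preds[v].append(u)

    arrival = {}
    # static_order() yields every node after all of its predecessors.
    for v in TopologicalSorter(preds).static_order():
        latest_input = 0
        for u in preds[v]:
            wire = d_intra if block[u] == block[v] else d_inter
            latest_input = max(latest_input, arrival[u] + wire)
        arrival[v] = latest_input + node_delay[v]
    return max(arrival.values())


# Hypothetical three-node chain a -> b -> c with unit node delays.
delays = {"a": 1, "b": 1, "c": 1}
chain = [("a", "b"), ("b", "c")]
one_block = circuit_delay(delays, chain, {"a": 0, "b": 0, "c": 0}, 1, 4)
cut_bc = circuit_delay(delays, chain, {"a": 0, "b": 0, "c": 1}, 1, 4)
```

In this toy chain, cutting the edge between b and c replaces one intra-block delay with the larger inter-block delay, which is why the choice of cut edges matters for circuit delay and not only for cutsize.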

  3. Existing Approaches
     • Clustering-based approaches
       • PRIME: group nodes into clusters with a given area bound
       • Quasi-optimal delay solution with node duplication
       • Huge cutsize (3X)
     • Partitioning-based approaches
       • Partition circuits into k blocks and then iteratively move nodes to improve further
       • Cutsize minimization: hMetis
         • Multi-level partitioning, very fast, excellent cutsize, fair circuit delay
       • Delay minimization: HPM
         • Performance-driven clustering + cutsize-driven partitioning, tradeoff between delay and cutsize

  4. Existing Approaches (cont)
     • Clustering-based approaches
       • Delay optimization with node duplication is optimally solved
       • Node duplication-free clustering is NP-complete, but fairly good results are obtained by resolving duplications heuristically
       • Huge cutsize
     • Partitioning-based approaches
       • Very good cutsize
       • Difficulty with delay minimization: the delay update for each node move is too costly (linear time)
       • hMetis: does not consider delay directly; gradual coarsening is difficult to target for delay
       • HPM: clustering and partitioning are separate; clustering does not know its impact on cutsize, and partitioning has little control over delay

  5. HPM: Combination of Clustering and Partitioning
     • HPM by Cong et al. [DAC99]
       • Clustering followed by partitioning
       • Good balance between delay and cutsize
     • Clustering and partitioning are two completely separate steps
       • Clustering with a very small, fixed area bound (10) on each block: much less than A/K, where A is circuit area
       • Achieves inferior delay compared with clustering under a cluster area bound of A/K (delay is ~23% larger)
       • Achieves larger cutsize than hMetis because the clustering constraints reduce the cutsize reduction capability of partitioning
     • A better solution is needed

  6. Multi-Level Partitioning for Cutsize
     • hMetis by Karypis et al. [DAC97]
       • Gradual coarsening to group tightly connected nodes together
       • Gradual uncoarsening, reducing cutsize by moving clusters
     • Fast algorithm: reduced solution space at each level as many nodes are grouped and moved together
     • Smaller cutsize: a more thorough search is possible in the reduced solution space
     • Hyperedge-based coarsening is very suitable for cutsize
     • Delay is completely ignored

  7. Existing Multi-level Optimization Engine
     • V-shape multi-level optimization used in hMetis
     • Not very suitable for delay minimization
     • Gradual coarsening has difficulty predicting the impact on delay

  8. MLPR: Performance-Driven Multi-Level Partitioning and Retiming
     • K-way partitioning algorithm for performance optimization
       • Retiming is performed during partitioning for the best possible circuit delay
       • Cutsize reduction is also considered
     • MLPR
       • Clustering with area bound of A/K, where A is circuit area
       • Partitioning of clusters into K blocks
       • For level from 1 to log(A/K)
         • Clustering with area bound of A/(K × 2^level)
         • Each cluster is bounded by the block it belongs to
         • Moving clusters to reduce cutsize while preserving circuit delay
       • Final movement of individual nodes for best solution
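The shrinking area bound in the MLPR loop can be made concrete with a small sketch. It assumes the logarithm on the slide is base 2 (which makes the bound reach 1 at the last level) and that a non-integer log(A/K) is rounded down; the function name `area_bound_schedule` is hypothetical.

```python
import math


def area_bound_schedule(total_area, k):
    """Return the cluster area bound at each level, coarsest first.

    Level 0 is the initial global clustering with bound A/K; levels
    1..floor(log2(A/K)) use the bound A / (K * 2**level).
    Assumption: log base 2, rounded down when A/K is not a power of two.
    """
    coarsest = total_area / k
    if coarsest < 1:
        raise ValueError("expected total_area >= k")
    num_levels = math.floor(math.log2(coarsest))
    return [total_area / (k * 2 ** level) for level in range(num_levels + 1)]


# Hypothetical circuit: area A = 64 split into K = 4 blocks.
schedule = area_bound_schedule(64, 4)  # [16.0, 8.0, 4.0, 2.0, 1.0]
```

Halving the bound at each level means the clusters moved during refinement get progressively smaller, down to a bound of 1 at the last level, after which the slide's final individual-node movement step takes over.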

  9. Our Contribution: Global Clustering-Based Multi-Level Optimization Engine
     • Start directly from the coarsest level with global clustering for the best possible delay
     • Clustering-based gradual declustering to increase the freedom for refinement
     • Retiming is considered simultaneously during clustering and partitioning for smaller delay

  10. Global Clustering for Delay Minimization
     • Clustering: group nodes into clusters with area no more than a given bound
       • CLUS by Pan et al. [TCAD98]
       • PRIME by Cong et al. [DAC99]
       • Quasi-optimal clustering with retiming for delay minimization
     • By setting the area bound to A/K, clustering can compute a partitioning solution with quasi-optimal delay
     • Existing coarsening algorithms that consider local node connectivity cannot predict circuit delay
     • Theorem: Let f_c be the circuit delay of a clustering solution. For any partitioning solution P on the clusters, its delay is less than or equal to f_c
       • Clustering can compute an upper bound on circuit delay after partitioning

  11. Global Clustering-Based Optimization Engine
     • Start from the coarsest level with clustering to define a good circuit delay
       • Comparison: coarsening with gradually increased cluster size has difficulty predicting circuit delay after partitioning on clusters
     • Clustering with gradually reduced area bound to decluster at each level
       • Nodes on a critical path will be grouped together and will NOT be split across different partitions
       • Avoid delay increase during partitioning refinement as much as possible
     • Partition-bounded clustering to guarantee consistent solution improvement and algorithm convergence
       • Guarantees a solution at a finer level that is better than the one at a coarser level

  12. Partitioning with Retiming
     • Retiming is considered during clustering and partitioning at each level for the best possible circuit delay
     • Sequential arrival time: a_v = Σ l(e), where l(e) = d_v + d_e − f × w_e for a given target clock period f; d_v is the node delay of v, d_e is the edge delay, and w_e is the number of FFs on edge e from u to v
     • Theorem [Pan98]: if max(a_po) ≤ f, the minimum circuit delay after retiming is no more than f + D
     • Timing analysis in both clustering and partitioning is based on sequential arrival time
     • Binary search to get the minimum clock period after retiming
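The sequential arrival time check and the binary search on this slide can be sketched as follows. This is an illustration under several assumptions that the slide does not state: a_v is taken as the maximum over paths from the primary inputs of the summed l(e), a positive-weight cycle is treated as making the target period f infeasible, all delays are integers, every cycle contains at least one FF, and primary inputs start at their own node delay. The function names and the example circuit are hypothetical.

```python
def sequential_arrival(nodes, edges, node_delay, sources, f):
    """Bellman-Ford style longest-path relaxation with l(e) = d_v + d_e - f*w_e.

    edges is a list of (u, v, d_e, w_e). Returns a dict of arrival times,
    or None if some cycle has positive total l(e) for this f.
    """
    neg_inf = float("-inf")
    a = {v: neg_inf for v in nodes}
    for s in sources:
        a[s] = node_delay[s]  # assumption: inputs arrive after their own delay
    for _ in range(len(nodes)):
        changed = False
        for u, v, d_e, w_e in edges:
            if a[u] == neg_inf:
                continue
            candidate = a[u] + node_delay[v] + d_e - f * w_e
            if candidate > a[v]:
                a[v] = candidate
                changed = True
        if not changed:
            return a
    return None  # still changing after |V| passes: positive cycle


def min_target_period(nodes, edges, node_delay, sources, sinks):
    """Smallest integer f with no positive cycle and max(a_po) <= f."""
    def feasible(f):
        a = sequential_arrival(nodes, edges, node_delay, sources, f)
        return a is not None and max(a[po] for po in sinks) <= f

    lo = 0
    hi = sum(node_delay.values()) + sum(d_e for _, _, d_e, _ in edges)
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo


# Hypothetical circuit: pi -> a -> b -> po, with a feedback edge b -> a.
nodes = ["pi", "a", "b", "po"]
node_delay = {"pi": 0, "a": 3, "b": 3, "po": 0}
edges = [("pi", "a", 0, 0), ("a", "b", 0, 0), ("b", "po", 0, 1), ("b", "a", 0, 1)]
best_f = min_target_period(nodes, edges, node_delay, ["pi"], ["po"])
```

In this example the loop a, b carries a total node delay of 6 and one FF, so any f below 6 leaves a positive cycle and the search settles on f = 6. By the theorem quoted on the slide, the clock period after retiming would then be no more than f + D.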

  13. Test Results
     [Figures: bi-partitioning results; 16-way partitioning results]

  14. Conclusion
     • Global clustering is more suitable for delay minimization
     • The global clustering-based multi-level optimization engine achieves good delay and cutsize
     • Retiming further helps delay reduction
       • Simultaneous retiming with partitioning achieves better results than performing partitioning and retiming separately
       • Retiming is not required by the main algorithm and can be disabled
