Global Clustering-Based Performance-Driven Circuit Partitioning

Jason Cong, University of California, Los Angeles, cong@cs.ucla.edu
Chang Wu, Aplus Design Technologies, Los Angeles, changwu@aplus-dt.com

Problem Definition
• Problem: k-way circuit partitioning and retiming with balanced area for delay minimization
  • Delay minimization with consideration of cutsize
  • Retiming is performed simultaneously with partitioning for the best possible delay reduction
• Generic delay model: node delay, intra-block delay, inter-block delay
  • Node delay: $d_v$
  • Intra-block delay: $d$
  • Inter-block delay: $D$, with $D > d$
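The generic delay model can be illustrated with a small sketch. It assumes a purely combinational DAG (no flip-flops, no retiming), and the function name `circuit_delay` and its parameters are hypothetical, chosen only for this example. Each edge costs the intra-block delay d when both endpoints share a block and the inter-block delay D otherwise.

```python
from collections import defaultdict

def circuit_delay(node_delay, edges, block, d_intra, d_inter):
    """Longest-path delay of a combinational DAG under the generic delay model.

    Hypothetical helper: an edge costs d_intra when both endpoints are in
    the same block and d_inter otherwise; each node adds its own delay.
    """
    succ = defaultdict(list)
    indeg = {v: 0 for v in node_delay}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    # Arrival time at a node output; sources start with their own delay.
    arrival = {v: node_delay[v] for v in node_delay}
    ready = [v for v in node_delay if indeg[v] == 0]
    while ready:  # Kahn-style topological traversal
        u = ready.pop()
        for v in succ[u]:
            edge_delay = d_intra if block[u] == block[v] else d_inter
            arrival[v] = max(arrival[v], arrival[u] + edge_delay + node_delay[v])
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return max(arrival.values())

# A three-node chain a -> b -> c with unit node delays, d = 1 and D = 5.
delays = {"a": 1, "b": 1, "c": 1}
chain = [("a", "b"), ("b", "c")]
one_block = circuit_delay(delays, chain, {"a": 0, "b": 0, "c": 0}, 1, 5)
cut_once = circuit_delay(delays, chain, {"a": 0, "b": 0, "c": 1}, 1, 5)
print(one_block, cut_once)
```

Cutting the chain once replaces one intra-block edge with an inter-block edge, so the delay grows by D - d.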

Existing Approaches
• Clustering-based approaches
  • PRIME: group nodes into clusters with a given area bound
  • Quasi-optimal delay solution with node duplication
  • Huge cutsize (3X)
• Partitioning-based approaches
  • Partition the circuit into k blocks, then iteratively move nodes to improve the solution further
  • Cutsize minimization: hMetis
    • Multi-level partitioning, very fast, excellent cutsize, fair circuit delay
  • Delay minimization: HPM
    • Performance-driven clustering + cutsize-driven partitioning, a tradeoff between delay and cutsize

Existing Approaches (cont)
• Clustering-based approaches
  • Delay optimization with node duplication can be solved optimally
  • Duplication-free clustering is NP-complete, but fairly good results are obtained by resolving duplications heuristically
  • Huge cutsize
• Partitioning-based approaches
  • Very good cutsize
  • Difficulty with delay minimization: updating the delay after each node move is too costly (linear time)
  • hMetis: does not consider delay directly; gradual coarsening is difficult to target at delay
  • HPM: clustering and partitioning are separate; clustering does not know its impact on cutsize, and partitioning has little control over delay

HPM: Combination of Clustering and Partitioning
• HPM by Cong, et al. [DAC99]
  • Clustering followed by partitioning
  • Good balance between delay and cutsize
• Clustering and partitioning are two completely separate steps
  • Clustering uses a very small, fixed area bound (10) on each cluster: much less than A/K, where A is the circuit area
  • Achieves inferior delay compared with clustering under a cluster area bound of A/K (delay is ~23% larger)
  • Achieves larger cutsize than hMetis because the clustering constraints reduce the cutsize-reduction capability of partitioning
• A better solution is needed

Multi-Level Partitioning for Cutsize
• hMetis by Karypis, et al. [DAC97]
  • Gradual coarsening to group tightly connected nodes together
  • Gradual uncoarsening, reducing cutsize by moving clusters
• Fast algorithm: the solution space is reduced at each level because many nodes are grouped and moved together
• Smaller cutsize: a more thorough search is possible in the reduced solution space
• Hyperedge-based coarsening is very suitable for cutsize
• Delay is completely ignored

Existing Multi-level Optimization Engine
• V-shape multi-level optimization used in hMetis
• Not very suitable for delay minimization
  • Gradual coarsening has difficulty predicting its impact on delay

MLPR: Performance-Driven Multi-Level Partitioning and Retiming
• K-way partitioning algorithm for performance optimization
  • Retiming is performed during partitioning for the best possible circuit delay
  • Cutsize reduction is also considered
• MLPR
  • Clustering with area bound of A/K, where A is the circuit area
  • Partitioning of clusters into K blocks
  • For level from 1 to log(A/K)
    • Clustering with area bound of $A/(K \times 2^{level})$
    • Each cluster is bounded by the block it belongs to
    • Moving clusters to reduce cutsize while preserving circuit delay
  • Final movement of individual nodes for the best solution
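The declustering schedule implied by the MLPR outline can be tabulated with a minimal sketch. It assumes the logarithm is base 2 (the slide does not state the base) and unit node areas, so the last bound corresponds to individual nodes. The function name `mlpr_area_bounds` is hypothetical.

```python
import math

def mlpr_area_bounds(total_area, k):
    """Cluster area bound at each level of the outline above.

    Assumption: log is base 2. The first entry is the global clustering
    bound A/K; entry `level` is A / (K * 2**level).
    """
    top = total_area / k
    levels = int(math.log2(top)) if top >= 1 else 0
    return [top] + [total_area / (k * 2 ** level) for level in range(1, levels + 1)]

# Circuit area 64 split 4 ways: the bound halves until it reaches one node.
print(mlpr_area_bounds(64, 4))  # [16.0, 8.0, 4.0, 2.0, 1.0]
```

Halving the bound at each level gives the refinement step progressively smaller clusters to move, which is where the extra freedom for cutsize reduction comes from.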

Our Contribution: Global Clustering-Based Multi-Level Optimization Engine
• Start directly from the coarsest level with global clustering for the best possible delay
• Clustering-based gradual declustering to increase the freedom for refinement
• Retiming is considered simultaneously during clustering and partitioning for smaller delay

Global Clustering for Delay Minimization
• Clustering: group nodes into clusters with area no more than a given bound
  • CLUS by Pan, et al. [TCAD98]
  • PRIME by Cong, et al. [DAC99]
  • Quasi-optimal clustering with retiming for delay minimization
• By setting the area bound to A/K, clustering can compute a partitioning solution with quasi-optimal delay
• Existing coarsening algorithms consider only local node connectivity and cannot predict circuit delay
• Theorem: Let $f_c$ be the circuit delay of a clustering solution. For any partitioning solution P on the clusters, its delay is less than or equal to $f_c$
  • Clustering can compute an upper bound on circuit delay after partitioning
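The intuition behind the theorem can be written out as a short argument. This is a sketch under stated assumptions rather than the authors' proof: the flip-flop placement is held fixed, the clustering solution is evaluated with every cluster as its own block, and the symbols $\delta_C$, $\delta_P$ and $\mathrm{delay}$ are introduced here only for illustration.

```latex
% Delay of a path p when each edge e is charged \delta(e):
\mathrm{delay}_{\delta}(p) = \sum_{v \in p} d_v + \sum_{e \in p} \delta(e)

% Edge cost in the clustering solution C and in a partitioning P of the clusters:
\delta_C(e) = \begin{cases} d & \text{endpoints in the same cluster} \\ D & \text{otherwise} \end{cases}
\qquad
\delta_P(e) = \begin{cases} d & \text{endpoints in the same block} \\ D & \text{otherwise} \end{cases}

% P assigns whole clusters to blocks, so "same cluster" implies "same block".
% With D > d this gives \delta_P(e) \le \delta_C(e) for every edge, hence
\max_{p} \mathrm{delay}_{\delta_P}(p) \;\le\; \max_{p} \mathrm{delay}_{\delta_C}(p) = f_c
```

Merging clusters into a block can only turn inter-block edges into intra-block edges, never the reverse, which is why the clustering delay acts as an upper bound.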

Global Clustering-Based Optimization Engine
• Start from the coarsest level, with clustering to define a good circuit delay
  • Comparison: coarsening with gradually increasing cluster size has difficulty predicting the circuit delay after partitioning on the clusters
• Clustering with a gradually reduced area bound to decluster at each level
  • Nodes on a critical path are grouped together and will NOT be split across different partitions
  • Avoids, as much as possible, delay increases caused by partitioning refinement
• Partition-bounded clustering to guarantee consistent solution improvement and algorithm convergence
  • Guarantees a solution at a finer level that is no worse than at a coarser level
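The partition-bounded constraint can be expressed as a simple validity check. This sketch only tests whether a given clustering respects the constraint; it does not construct one. The function name `is_partition_bounded` and the dictionary-based representation are assumptions made for illustration.

```python
def is_partition_bounded(cluster_of, block_of, area, area_bound):
    """Check that every cluster lies inside one block and fits the area bound.

    Hypothetical helper: cluster_of and block_of map each node to its
    cluster and block; area maps each node to its area.
    """
    cluster_block = {}
    cluster_area = {}
    for node, cluster in cluster_of.items():
        # The first node seen fixes the block of its cluster.
        if cluster_block.setdefault(cluster, block_of[node]) != block_of[node]:
            return False  # cluster spans two blocks
        cluster_area[cluster] = cluster_area.get(cluster, 0) + area[node]
    return all(total <= area_bound for total in cluster_area.values())

blocks = {"a": 0, "b": 0, "c": 1, "d": 1}
unit_area = {"a": 1, "b": 1, "c": 1, "d": 1}
ok = is_partition_bounded({"a": "x", "b": "x", "c": "y", "d": "y"}, blocks, unit_area, 2)
spans = is_partition_bounded({"a": "x", "b": "y", "c": "y", "d": "z"}, blocks, unit_area, 2)
print(ok, spans)  # True False
```

Because no cluster straddles a block boundary, declustering at a finer level starts from exactly the partition found at the coarser level.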

Partitioning with Retiming
• Retiming is considered during clustering and partitioning at each level for the best possible circuit delay
• Sequential arrival time: $a_v = \sum l(e)$, where $l(e) = d_v + d_e - f \times w_e$ for a given target clock period $f$; here $d_v$ is the node delay of $v$, $d_e$ is the edge delay, and $w_e$ is the number of FFs on edge $e$ from $u$ to $v$
• Theorem [Pan98]: if $\max(a_{po}) \le f$, the minimum circuit delay after retiming is no more than $f + D$
• Timing analysis in both clustering and partitioning is based on sequential arrival time
• Binary search to find the minimum clock period after retiming
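The sequential arrival time test and the binary search over the clock period can be sketched as follows. Several details are assumptions not stated on the slide: arrival times are taken as the maximum over paths of the summed edge lengths l(e), primary inputs start at their own node delay, and a positive-length cycle makes a target period infeasible. All function names are hypothetical.

```python
def sequential_arrival(node_delay, edges, inputs, f):
    """Longest-path arrival times with l(e) = d_v + d_e - f * w_e.

    edges holds tuples (u, v, d_e, w_e). Returns None when a cycle has
    positive total length, i.e. no finite arrival times exist for f.
    """
    minus_inf = float("-inf")
    arrival = {v: minus_inf for v in node_delay}
    for v in inputs:
        arrival[v] = node_delay[v]  # assumption: inputs start at d_v
    for _ in range(len(node_delay)):  # Bellman-Ford style relaxation
        changed = False
        for u, v, d_e, w_e in edges:
            if arrival[u] == minus_inf:
                continue
            candidate = arrival[u] + node_delay[v] + d_e - f * w_e
            if candidate > arrival[v] + 1e-9:
                arrival[v] = candidate
                changed = True
        if not changed:
            return arrival
    return None  # still changing after |V| rounds: positive cycle

def min_clock_period(node_delay, edges, inputs, outputs, lo, hi, tol=1e-3):
    """Binary search for the smallest f with max(a_po) <= f; hi must be feasible."""
    def feasible(f):
        arrival = sequential_arrival(node_delay, edges, inputs, f)
        return arrival is not None and max(arrival[v] for v in outputs) <= f
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi

# i -> x -> y -> o, plus a feedback edge y -> x carrying one flip-flop.
delays = {"i": 0, "x": 1, "y": 1, "o": 0}
wires = [("i", "x", 0, 0), ("x", "y", 0, 0), ("y", "x", 0, 1), ("y", "o", 0, 0)]
best_f = min_clock_period(delays, wires, ["i"], ["o"], lo=0.0, hi=10.0)
print(round(best_f, 2))
```

In this example the loop x -> y -> x has total node delay 2 and one flip-flop, so no period below 2 is feasible and the search converges to about 2. By the theorem above, the clock period after retiming is then at most the returned value plus D.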

Test Results
• Bi-partitioning
• 16-way partitioning

Conclusion
• Global clustering is more suitable for delay minimization
• The global clustering-based multi-level optimization engine achieves good delay and cutsize
• Retiming further helps delay reduction
  • Simultaneous retiming and partitioning achieves better results than partitioning followed by separate retiming
  • Retiming is not required by the main algorithm and can be disabled
