Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning

Andrew B. Kahng and Xu Xu, UCSD CSE and ECE Depts. Work supported in part by MARCO GSRC

Outline • Motivation • Performance driven bipartition problem • New bipartitioning algorithm • Experimental results • Conclusion and future work

Partitioning and Performance The hypergraph partitioning problem is to divide the nodes of a hypergraph into roughly equal parts; the traditional objective is to minimize cutsize. In performance-driven partitioning, we also seek to minimize path delay on timing paths.

Previous Work (I) • [Cong et al. ISPD-2002] • Global clustering based algorithm with retiming • Stages: min-delay clustering w/ retiming, min-cutsize clustering, de-clustering and refinement • Reduces delay by 16% while increasing cutsize by 17% • Requires substantial gate replication

Previous Work (II) • [Ababei et al. ICCAD-2002] • Reweighting based method • 14% reduction of delay with 10% increase in cutsize • 139% increase in runtime compared with hMetis [Figure: flow combining global timing analysis, finding critical paths, path based and net based reweighting of the input, and a cutsize oriented partitioner such as hMetis or MLPart]

Motivating Questions • Can we avoid global timing analysis? • Global timing analysis is extremely time-consuming • Can we improve path delay without significantly degrading cutsize? • Need smooth tradeoff between delay and cutsize • Can we reduce implementation overheads? • Previous methods store thousands of critical paths and continuously update them

Outline • Motivation • Performance driven bipartition problem • New bipartitioning algorithm • Experimental results • Conclusion and future work

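Delay Model • Delay = hop_delay + node_delay, accumulated over the hops (cut crossings) and nodes of a path • [Cong et al. ISPD-2002]: hop_delay=5, node_delay=1; a path with 3 hops and 5 nodes has Delay = 3x5 + 5x1 = 20 • [Ababei et al. ICCAD-2002]: hop_delay=Elmore delay, node_delay=constant [Figure: a path of FF nodes and combinational nodes crossing the cut between Part 0 and Part 1; each crossing is a hop]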
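The hop-based delay model can be checked with a few lines of Python. This is a minimal sketch: the function name `path_delay` and its parameter names are illustrative and do not come from the slides. The default constants are the ones the slide attributes to Cong et al.

```python
def path_delay(num_hops, num_nodes, hop_delay=5, node_delay=1):
    """Delay of one path: each cut crossing (hop) and each node adds a fixed cost."""
    return num_hops * hop_delay + num_nodes * node_delay


# The slide's example: 3 hops and 5 nodes.
example_delay = path_delay(num_hops=3, num_nodes=5)
print(example_delay)  # 3*5 + 5*1 = 20
```

With this model, removing a single hop from a path saves as much delay as removing five nodes, which is why the rest of the talk concentrates on hopcount.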

Performance Driven Bipartition Problem • Given: • Hypergraph H=(V,E) • Area balance tolerance s (0<s<1), a parameter that controls the allowable slack in the area constraint • a, a given parameter that captures the tradeoff between cutsize and path delay (hopcount) • Find: • A bipartition (V0|V1) which satisfies the area constraint with tolerance s • and minimizes a(cutsize)+(1-a)(Max_hopcount)
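To see how the parameter a steers the tradeoff, the objective can be evaluated for two candidate bipartitions. This is only a sketch: the function name `objective`, the range check 0 <= a <= 1, and the two candidates with their cutsize and hopcount values are hypothetical and chosen for illustration. The area constraint is not modeled here.

```python
def objective(cutsize, max_hopcount, a):
    """Weighted cost a*cutsize + (1-a)*max_hopcount from the problem statement."""
    if not 0.0 <= a <= 1.0:  # assumed range, suggested by the (1-a) weight
        raise ValueError("a is assumed to lie in [0, 1]")
    return a * cutsize + (1.0 - a) * max_hopcount


# Hypothetical candidates: name -> (cutsize, max_hopcount)
candidates = {"A": (100, 5), "B": (110, 3)}


def best(a):
    """Name of the candidate with the lowest objective for a given a."""
    return min(candidates, key=lambda name: objective(*candidates[name], a))


print(best(0.1), best(0.5))
```

A small a favors the candidate with fewer hops even though its cutsize is larger; a larger a reverses the preference.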

Outline • Motivation • Performance driven bipartition problem • New bipartitioning algorithm • Experimental results • Conclusion and future work

Unidirectional Partition • Path delay is minimized with hopcount = 1 if the partition is unidirectional (“acyclic”), that is, all cuts are in the same direction • Problem: • High cutsize • No unidirectional solution • Can we achieve “locally unidirectional” partition? [Figure: two bipartitions into Part 0 and Part 1, one with max hopcount=5 and one with max hopcount=3]
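The maximum hopcount of a bipartition can be computed without enumerating paths. The sketch below assumes the netlist has been reduced to a directed acyclic graph of driver-to-sink edges (for example, after cutting at FF nodes); that modeling choice, the function name `max_hopcount`, and the four-node chain are illustrative assumptions rather than details from the slides.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def max_hopcount(edges, part):
    """Largest number of cut edges on any directed path of an acyclic graph.

    edges: iterable of (driver, sink) pairs; part: dict node -> 0 or 1.
    """
    preds = {v: [] for v in part}
    for u, v in edges:
        preds[v].append(u)
    best = {}  # best[v] = max cut edges on any path ending at v
    for v in TopologicalSorter(preds).static_order():
        best[v] = max((best[u] + (part[u] != part[v]) for u in preds[v]),
                      default=0)
    return max(best.values(), default=0)


chain = [("a", "b"), ("b", "c"), ("c", "d")]
alternating = {"a": 0, "b": 1, "c": 0, "d": 1}      # cuts go both ways
unidirectional = {"a": 0, "b": 0, "c": 1, "d": 1}   # all cuts go 0 -> 1
print(max_hopcount(chain, alternating), max_hopcount(chain, unidirectional))
```

On the chain, the alternating assignment crosses the cut on every edge, while the unidirectional one crosses it once.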

V-Shaped Nodes • If a combinational node v satisfies: there exist vj, vt in the other part and a path from vj to vt that includes only v, then v is a V-shaped node [Figure: V-shaped node v in one part, with vj and vt in the other part]
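Because the path from vj to vt may contain only v, the definition reduces to a local test: v has a direct fan-in node and a direct fan-out node in the other part. A minimal sketch follows; the function name `is_v_shaped`, the `preds`/`succs` dictionaries, and the `ff_nodes` set are illustrative assumptions about how the netlist is stored.

```python
def is_v_shaped(v, part, preds, succs, ff_nodes=frozenset()):
    """True if combinational node v sits between two nodes of the other part.

    part: dict node -> 0 or 1; preds/succs: dict node -> list of neighbors.
    """
    if v in ff_nodes:  # the definition applies to combinational nodes only
        return False
    other = 1 - part[v]
    has_fanin = any(part[u] == other for u in preds.get(v, ()))
    has_fanout = any(part[w] == other for w in succs.get(v, ()))
    return has_fanin and has_fanout


# vj -> v -> vt, with v alone in Part 1
part = {"vj": 0, "v": 1, "vt": 0}
preds = {"v": ["vj"], "vt": ["v"]}
succs = {"vj": ["v"], "v": ["vt"]}
print(is_v_shaped("v", part, preds, succs))   # v is V-shaped
print(is_v_shaped("vj", part, preds, succs))  # vj has no fan-in at all
```

The test only inspects the immediate neighbors of v, so no global timing information is needed.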

V-Shaped Nodes in Critical Paths Empirical observations from study of partitioning solutions: • there are V-shaped nodes in the partitioning solutions • every V-shaped node is included in many critical paths • every critical path contains several V-shaped nodes For testcase 1: • Number of nets: 16377 • Number of critical paths: 26772 • On average, one critical path contains 27.6 nodes • On average, one critical path contains 3.4 V-nodes • On average, one V-node belongs to 233.7 critical paths

Key Idea: V-Shaped Nodes Elimination • Move V-shaped node “b” to reduce path hopcount • Before the move (Part 0: a, c, f; Part 1: b, d, e): PATH abc hopcount=2, PATH dbc hopcount=1, PATH ebc hopcount=1 • After moving b to Part 0: PATH abc hopcount=0, PATH dbc hopcount=1, PATH ebc hopcount=1
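The hopcounts on this slide can be reproduced by counting part changes along each path. In the sketch below, the part assignment (a, c, f in Part 0 and b, d, e in Part 1 before the move) is read off the figure layout and should be treated as an assumption; the helper name `hopcount` is illustrative.

```python
def hopcount(path, part):
    """Number of consecutive node pairs on the path that lie in different parts."""
    return sum(part[u] != part[v] for u, v in zip(path, path[1:]))


before = {"a": 0, "c": 0, "f": 0, "b": 1, "d": 1, "e": 1}
after = dict(before, b=0)  # move V-shaped node b to Part 0

for path in ("abc", "dbc", "ebc"):
    print(path, hopcount(path, before), hopcount(path, after))
```

The path abc drops from 2 hops to 0, while dbc and ebc keep 1 hop each: the cut edge simply shifts from b-c to d-b and e-b.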

Distance-k V-Shaped Nodes Elimination • k = 2: moving the V2 nodes “b, c” reduces path hopcount from 2 to 0 • Problems with large k: • Cutsize may be greatly increased • Delay of one path is reduced while the delay of other paths is increased [Figure: a and d in Part 0, b and c in Part 1 before the move; all four in Part 0 after the move]

New Gain Function • g(v): traditional FM gain • rj(v): reduction of Vj nodes after moving v • Example: Gain(v)=δ(0)+δ(1) [Figure: node v before and after the move]
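The gain formula itself did not survive in the transcript. One reading that is consistent with the example Gain(v)=δ(0)+δ(1) and with the definitions of g(v) and rj(v) is a weighted sum, Gain(v) = δ(0)·g(v) + δ(1)·r1(v) + … + δ(k)·rk(v). The sketch below implements that reading; the formula and the name `biased_gain` are assumptions, not a confirmed statement of the authors' function.

```python
def biased_gain(g, r, delta):
    """Assumed gain: delta[0]*g + sum over j of delta[j]*r[j-1].

    g: traditional FM gain of the move
    r: list [r1, ..., rk] of reductions in Vj nodes caused by the move
    delta: list [delta(0), ..., delta(k)] of bias weights
    """
    if len(delta) < len(r) + 1:
        raise ValueError("need one weight for g and one per entry of r")
    return delta[0] * g + sum(delta[j] * r_j for j, r_j in enumerate(r, start=1))


# A move with g = 1 that removes one V1 node gives delta(0) + delta(1).
print(biased_gain(g=1, r=[1], delta=[1, 10]))
# A move that worsens cutsize by 1 can still win if it removes a V1 node.
print(biased_gain(g=-1, r=[1, 0], delta=[1, 30, 3]))
```

Under this reading, the weights δ(j) set how much cutsize the partitioner is willing to give up in order to remove a V-shaped node, which is the bias the title refers to.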

Distance-k Unidirectional Algorithm
Calculate initial gains for all nodes and store the gains
Select the node v with maximum gain
/* CLIP-like method: move the cluster that v belongs to */
Reset the gains of all nodes to zero
Move v and update the gains of v and its neighbors
While (some node has not been moved)
    Select one node v with the maximum updated gain
    Move v and update the related gains
Find the point in the move sequence at which the sum of gains is maximum; undo all moves after this point
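The last step of the pass, choosing where to cut the move sequence, can be sketched on its own. The helper name `best_prefix`, the tie-breaking rule (keep the earliest maximum), the option of keeping zero moves when no prefix has positive total gain, and the gain values are all assumptions made for illustration.

```python
from itertools import accumulate


def best_prefix(gains):
    """Return (moves_to_keep, total_gain) for the best prefix of a move sequence.

    gains[i] is the gain recorded when move i was made. Keeping zero moves
    (total gain 0) is allowed, so a pass never makes the solution worse.
    """
    best_len, best_sum = 0, 0
    for i, total in enumerate(accumulate(gains), start=1):
        if total > best_sum:  # strict: the earliest maximum wins
            best_len, best_sum = i, total
    return best_len, best_sum


gains = [3, -1, 4, -2, -5, 1]   # hypothetical gains of six moves
keep, total = best_prefix(gains)
print(keep, total)              # prefix sums are 3, 2, 6, 4, -1, 0
```

Every move after the first `keep` entries of the sequence would then be undone, as in a standard FM pass.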

Outline • Motivation • New bipartitioning algorithm • Experimental results • Conclusion and future work

Experimental Setup • Four industry testcases obtained as LEF/DEF • Model of Ababei et al. (ICCAD-2002) used to calculate delay • Partitioning solutions compared to results of MLPart • strongest multilevel netlist partitioning code • website: http://nexus6.cs.ucla.edu/GSRC/bookshelf/Slots/Partitioning/MLPart • All tests on 600MHz Intel Pentium-III Xeon

Biasing against V1 Nodes vs. MLPart • δ(0)=1, δ(1)=10 • Reduction of delay: 4.5%-24.4%, average: 15.1% • Increase of cutsize: 3.0%-10.0%, average: 4.9% • Increase of runtime: 6.3%-11.4%, average: 9.7% • Using the delay model in Cong et al. ISPD-2002 • Reduction of delay: 4.3%-21.2%, average: 14.7%

Biasing against V2 Nodes vs. MLPart • δ(0)=1, δ(1)=30, δ(2)=3 • Reduction of delay: 8.9%-30.0%, average: 18.7% • Increase of cutsize: 3.1%-7.2%, average: 3.5% • Increase of runtime: 11.9%-15.9%, average: 13.1% • Using the delay model in Cong et al. ISPD-2002 • Reduction of delay: 8.3%-28.7%, average: 17.3%

Outline • Motivation • Performance driven bipartition problem • New bipartitioning algorithm • Experimental results • Conclusions and future work

Conclusions • Simple yet efficient timing-driven partitioning that does not require global timing analysis • Negligible implementation and runtime overhead • Significantly reduces path delay with cutsize and runtime almost the same as leading-edge MLPart • Similar improvements observed with different path delay metrics • Future work • Impact of new partitioner on placement • Efficient methods for biasing δ(k), k>2

Thank you!

Future Work • Impact of new partitioner on placement • Efficient methods for biasing δ(k) k>2

Why Performance Driven Partitioning? • Achieving timing closure becomes increasingly difficult in deep-submicron technologies due to non-ideal scaling of interconnect delay • Routing alone can no longer solve the timing problem, even with aggressive optimizations (buffer insertion, buffer/wire sizing, …) • Timing needs to be addressed at all design stages • Partitioning is a critical step in defining interconnect timing properties, but is traditionally driven by the cutsize objective

Previous Work (I) • With Logic Replication • Retiming • Replication graph • Without Logic Replication • Net based reweighting • Path based reweighting

FM Partitioning and Gain Function • Start with random partition • Move the node with the max gain and lock it • Gain(v) = Reduction of cutsize after moving v • Keep moving until all nodes are locked • Find the best point in the move sequence [Figure: node v in Part 0 before the move and in Part 1 after the move, with Gain(v)=-1]
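The gain definition on this slide can be written directly as "cutsize before minus cutsize after". The sketch below recomputes the cut over the nets that touch v instead of using the incremental bucket structure of a production FM implementation; the function names and the small netlist are hypothetical.

```python
def cutsize(nets, part):
    """Number of nets whose pins lie in both parts."""
    return sum(1 for net in nets if len({part[v] for v in net}) > 1)


def fm_gain(v, nets, part):
    """Reduction of cutsize if v alone is moved to the other part."""
    moved = dict(part)
    moved[v] = 1 - part[v]
    touched = [net for net in nets if v in net]  # only these nets can change
    return cutsize(touched, part) - cutsize(touched, moved)


# Hypothetical netlist: v shares two nets with Part 0 and one with Part 1.
nets = [("a", "v"), ("b", "v"), ("v", "c")]
part = {"a": 0, "b": 0, "v": 0, "c": 1}
print(fm_gain("v", nets, part))  # moving v cuts two nets and uncuts one
print(fm_gain("c", nets, part))  # moving c uncuts its only net
```

A negative gain, as for v here, means the move increases cutsize; FM still allows such moves within a pass and relies on the best-point selection to discard them if they do not pay off.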

Procedure to Calculate rj(v)
Delete all FF nodes and their related edges
In the remaining graph, BFS from v
Foreach level j from 1 to k
    If v is a Vj node before moving, rj’=1 (otherwise 0)
    If v is a Vj node after moving, rj’’=1 (otherwise 0)
    rj=rj’-rj’’, so that rj is positive when the move removes a Vj node
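For k = 1 the procedure needs no real BFS, since only the direct neighbors of v matter. The sketch below covers that case only. Following the definition of rj(v) as a reduction, it returns +1 when the move removes v from the set of V1 nodes and -1 when the move turns v into one; that sign convention, the function names, and the data layout are assumptions. The example reuses the part assignment read off the V-shaped node elimination figure.

```python
def _is_v1(v, part, preds, succs):
    """v has a direct fan-in and a direct fan-out node in the other part."""
    other = 1 - part[v]
    return (any(part[u] == other for u in preds.get(v, ()))
            and any(part[w] == other for w in succs.get(v, ())))


def r1(v, part, preds, succs, ff_nodes=frozenset()):
    """Change in V1 status of v caused by moving v: before minus after."""
    if v in ff_nodes:
        return 0
    # Delete FF nodes and their edges before testing the neighborhood.
    p = {x: [u for u in us if u not in ff_nodes] for x, us in preds.items()}
    s = {x: [w for w in ws if w not in ff_nodes] for x, ws in succs.items()}
    moved = dict(part)
    moved[v] = 1 - part[v]
    return int(_is_v1(v, part, p, s)) - int(_is_v1(v, moved, p, s))


part = {"a": 0, "c": 0, "f": 0, "b": 1, "d": 1, "e": 1}
preds = {"b": ["a", "d", "e"], "c": ["b"]}
succs = {"a": ["b"], "d": ["b"], "e": ["b"], "b": ["c"]}
print(r1("b", part, preds, succs))  # b stops being a V1 node
print(r1("c", part, preds, succs))  # c has no fan-out, so nothing changes
```

This sketch tracks only the status of v itself, as the slide's procedure does; neighbors whose V1 status changes as a side effect of the move are not counted.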

CLIP • Algorithm is reminiscent of CLIP (Deng et al. DAC 1996) in how it induces movement of clusters across the cutline [Figure: node v and its cluster relative to the cutline]

Distance-k V-Shaped Nodes • Distance-k V-shaped nodes (Vk-node): if k combinational nodes vi,1 … vi,k satisfy: • vi,1 … vi,k are in the same part • there exist vj, vt in the other part • there exists a path from vj to vt that passes only through vi,1 … vi,k • then vi,1 … vi,k are distance-k V-shaped nodes [Figure: vi,1 … vi,k in one part, between vj and vt in the other part]

Notation • H(V,E) = circuit hypergraph • V = set of nodes representing components of the circuit • E = set of signal nets • A bipartition (V0|V1) of H(V,E) divides V into two disjoint subsets s.t. V = V0 ∪ V1, which are called Part 0 and Part 1 • A = the total area of all the nodes in V • A0 = the area of all the nodes in V0
