Cost-Sensitive Operation Partitioning for Synthesizing Multicluster Datapath Architectures. Michael Chu, Kevin Fan, Rajiv Ravindran, Scott Mahlke. Advanced Computer Architecture Lab, University of Michigan. Workshop on Application-Specific Processors (WASP-2), December 2, 2003.
Clustered Architectures • Decentralize the architecture to reduce the register file bottleneck • Used in Lx/ST200, TI C6x, Analog Devices TigerSHARC, and others • Goal of our work: automatic synthesis of an application-specific heterogeneous multicluster architecture [Figure: a homogeneous clustered architecture (Cluster 1 and Cluster 2, both 32-bit) beside a heterogeneous clustered architecture (Cluster 1 32-bit, Cluster 2 8-bit); each cluster has its own register file and a set of +, *, -, and << functional units]
Our Approach • Partition operations with both performance and required hardware cost in mind • Maintain performance and reduce cost (bitwidth, FU repertoire) • Previous work has focused on a single basic block and a single cluster [Note ‘91] [Paulin ‘89] [Marwedel ‘90] • Each partition dictates a cluster configuration, which has an associated hardware cost [Figure: two clusters, each with a register file (RF) and two functional units (FU)]
Our Proposed System • Today’s Focus: Cost-Sensitive Operation Partitioning • Input: Application, High-level machine specification: • Number of clusters, number of generic FUs • Output: Multicluster Architecture Description
Cost-Sensitive Operation Partitioning • Builds on Region-Based Hierarchical Operation Partitioning (RHOP) • RHOP is a purely performance-based partitioner with no notion of hardware cost • Weight calculation creates guides for good partitions • Partitioning groups operations into clusters based on the given weights • A cost metric that accounts for gate cost is added to the Graph Partitioning phase [Figure: flow from Region to Weight Calculation to Graph Partitioning, shown on a dataflow graph whose edges carry weights of 1, 8, and 10]
Coarsening Phase • Progressively groups highly related operations together • Continually pairs operations together • Forces partitioner to consider several operations as a single unit • Traditional RHOP: coarsen using edge weights • Cost-centric coarsening can ignore dependence edge criticality [Figure: Coarsened States 1 through 4 of a graph whose operations are marked as narrow or wide bitwidth]
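The pairing step described on this slide can be illustrated with a heaviest-edge matching pass. This is a minimal sketch, not the RHOP implementation: the function name `coarsen_once`, the graph encoding, and the sample weights are all assumptions made for illustration, and it shows only the traditional edge-weight variant, not cost-centric coarsening.

```python
def coarsen_once(groups, edges):
    """One coarsening pass: merge pairs of groups joined by the heaviest edges.

    groups: list of frozensets of operation names (the current coarse nodes)
    edges:  dict mapping frozenset({op_a, op_b}) -> edge weight
    Returns a new list of groups in which each group merged at most once.
    """
    owner = {op: g for g in groups for op in g}
    # Accumulate the total edge weight between each pair of distinct groups.
    between = {}
    for pair, weight in edges.items():
        a, b = tuple(pair)
        ga, gb = owner[a], owner[b]
        if ga != gb:
            key = frozenset({ga, gb})
            between[key] = between.get(key, 0) + weight
    matched, merged = set(), []
    # Visit group pairs from heaviest to lightest connection.
    for key, _ in sorted(between.items(), key=lambda kv: -kv[1]):
        ga, gb = tuple(key)
        if ga not in matched and gb not in matched:
            matched.update((ga, gb))
            merged.append(ga | gb)
    # Groups that found no partner carry over unchanged.
    merged.extend(g for g in groups if g not in matched)
    return merged


# Hypothetical four-operation chain a-b-c-d with made-up edge weights.
edges = {frozenset("ab"): 10, frozenset("bc"): 1, frozenset("cd"): 8}
state = [frozenset(op) for op in "abcd"]
for step in range(2):
    state = coarsen_once(state, edges)
    print(step + 1, [sorted(g) for g in state])
```

In this toy run the first pass pairs the operations joined by the two heavy edges, and the second pass merges the resulting pairs into a single unit, which mirrors the progression of coarsened states.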
Partitioning Phase • Travel back through each of the coarsening steps; at each stage, try refining the partition • est_cycles: performance metric from traditional RHOP • Adds a new cost metric for the cost of the cluster
Cost-Sensitive Refinement • Moves are made when they have positive benefit • When no more moves can be made, the algorithm uncoarsens to the previous coarsened state and tries moving again [Figure: three successive partitions of a graph with narrow- and wide-bitwidth operations: est_cycles = 7, cost 28K; est_cycles = 8, cost 15K; est_cycles = 7, cost 15K]
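A greedy move loop of this kind might look like the sketch below. The slides do not give the benefit formula, so the weighted sum of estimated cycles and cost used here is an assumption, as are the names `refine`, `toy_cycles`, `toy_cost`, the `cycle_weight` parameter, and the toy bitwidths. Uncoarsening is omitted; the sketch covers refinement at a single level only.

```python
def refine(assignment, est_cycles, cluster_cost, num_clusters, cycle_weight=10.0):
    """Move operations between clusters while some move has positive benefit.

    assignment:   dict op -> cluster index (modified in place and returned)
    est_cycles:   callable(assignment) -> estimated schedule length
    cluster_cost: callable(assignment) -> estimated hardware cost
    The combined score below is an assumed stand-in for the real benefit metric.
    """
    def score(a):
        return cycle_weight * est_cycles(a) + cluster_cost(a)

    current = score(assignment)
    improved = True
    while improved:
        improved = False
        for op in list(assignment):
            home = assignment[op]
            for target in range(num_clusters):
                if target == home:
                    continue
                assignment[op] = target          # tentatively move the operation
                trial = score(assignment)
                if trial < current:              # positive benefit: keep the move
                    current, home, improved = trial, target, True
                else:
                    assignment[op] = home        # no benefit: undo the move
            assignment[op] = home
    return assignment


# Toy model (hypothetical): cycles = busiest cluster's operation count,
# cost = sum of the widest operation in each cluster.
WIDTH = {"a": 32, "b": 32, "c": 8, "d": 8}

def toy_cycles(a):
    loads = [0, 0]
    for cluster in a.values():
        loads[cluster] += 1
    return max(loads)

def toy_cost(a):
    widest = [0, 0]
    for op, cluster in a.items():
        widest[cluster] = max(widest[cluster], WIDTH[op])
    return sum(widest)

result = refine({"a": 0, "b": 1, "c": 0, "d": 1}, toy_cycles, toy_cost, 2)
print(result, toy_cycles(result), toy_cost(result))
```

In this toy run the first accepted move lengthens the estimated schedule while lowering cost, and a later move recovers the original schedule length at the lower cost, the same pattern as the three partitions in the figure.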
Multicluster Cost Model • Cost model determines an estimate of the gate cost of clusters • Estimates the minimum cost required to support the partitioned operations • Factors that influence hardware cost: • Register file size/width • Functional Unit (FU) width • FU opcode repertoire • Greedy algorithm used • Ignores dependences between operations • Similar to Rec/Res MII calculations for software pipelined loops [Figure: multiply and add operations of 8 to 32 bits assigned to Int Unit 1 and Int Unit 2 beneath a 32-bit register file, contrasting a high-cost and a low-cost assignment. Total cost of cluster: one 32-bit register file, one 16-bit multiplier/adder, one 32-bit adder]
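A greedy, dependence-free estimate in the spirit of this slide could be sketched as follows. Everything numeric here is hypothetical: the gates-per-bit figures, the linear scaling with width, and the packing order are assumptions chosen for illustration, not the cost model used in the work.

```python
# Hypothetical gates-per-bit figures, for illustration only.
GATES_PER_BIT = {"add": 10, "mul": 100}
RF_GATES_PER_BIT = 50


def estimate_cluster_cost(ops, num_fus, cycles):
    """Estimate the gate cost of one cluster, ignoring dependences.

    ops:     list of (opcode, bitwidth) assigned to the cluster
    num_fus: number of functional units available in the cluster
    cycles:  schedule-length budget, so each FU holds at most `cycles` ops
    Returns (total_gates, per-FU operation lists).
    """
    if not ops:
        return 0, []
    if len(ops) > num_fus * cycles:
        raise ValueError("operations do not fit in the cycle budget")
    # Place costly, wide operations first so they share as few FUs as possible.
    ordered = sorted(ops, key=lambda o: (GATES_PER_BIT[o[0]], o[1]), reverse=True)
    fus = [ordered[i:i + cycles] for i in range(0, len(ordered), cycles)]
    # The register file must be as wide as the widest value in the cluster.
    total = RF_GATES_PER_BIT * max(width for _, width in ops)
    for fu in fus:
        width = max(w for _, w in fu)            # FU width: widest op it executes
        repertoire = {opcode for opcode, _ in fu}  # opcodes the FU must support
        total += sum(GATES_PER_BIT[opcode] * width for opcode in repertoire)
    return total, fus


ops = [("add", 32), ("mul", 16), ("add", 8), ("mul", 16)]
total, fus = estimate_cluster_cost(ops, num_fus=2, cycles=2)
print(total, fus)
```

With these made-up numbers the two multiplies share a 16-bit unit and the adds share a 32-bit adder, so only one narrow multiplier is paid for. Like a resource-constrained MII bound, the estimate counts only how many operations each unit must hold, not the order in which they execute.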
Experimental Methodology • Trimaran toolset: a retargetable VLIW compiler • Evaluated main loop of DSP kernels and selected benchmarks from MediaBench, MiBench and NetBench • Bitwidth information gathered through automatic program analysis • Cost estimates computed using Synopsys design tools at 0.18 µm • 64 registers per cluster
2-Cluster Cost Savings and Performance [Chart: percentage performance loss and cost savings for fft, rls, url, LU, crc, dct, fsed, channel, huffman, blowfish, rawcaudio, rawdaudio, gsmdecode, gsmencode, and the average]
Source of Cost Savings Breakdown [Chart: normalized cost for fft, rls, url, LU, crc, dct, fsed, channel, huffman, blowfish, rawcaudio, rawdaudio, gsmdecode, gsmencode, and the average]
Pareto Charts of Examined Machines • A wide spectrum of machine configurations was examined • Multiple groups often appear with expensive units [Figure: Pareto charts for the fsed kernel and the LU kernel]
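For readers unfamiliar with Pareto charts, the set of non-dominated machines can be extracted as in the sketch below. The function name `pareto_front` and the design points are invented for illustration and are not results from the experiments.

```python
def pareto_front(designs):
    """Return the designs that no other design beats on both cost and cycles.

    designs: list of (name, cost, cycles); lower is better for both metrics.
    """
    front = []
    for name, cost, cycles in designs:
        # A design is dominated if another is no worse on both metrics
        # and strictly better on at least one.
        dominated = any(
            c <= cost and y <= cycles and (c < cost or y < cycles)
            for _, c, y in designs
        )
        if not dominated:
            front.append((name, cost, cycles))
    return sorted(front, key=lambda d: d[1])


# Made-up machine configurations: (name, cost in K gates, estimated cycles).
machines = [("A", 28, 7), ("B", 15, 8), ("C", 15, 7), ("D", 40, 7), ("E", 10, 9)]
print(pareto_front(machines))
```

Only the machines that trade cost against cycles without being beaten on both remain, which is the frontier such a chart highlights.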
Work in Progress • Merging step • How can machine designs for several basic blocks be combined? • Inaccurate cost model • How can a more accurate estimate for the cost be developed? • Space Exploration (external/internal) • Number of clusters and generic FUs are externally spacewalked • Allowable performance increase internally spacewalked • What areas of this space exploration should be external/internal? • Reprogrammability of designed machines
Conclusions • Developed a cost-sensitive method for partitioning operations across clusters • Used this partitioning to define an application-specific low-cost multicluster datapath architecture • Average performance loss and cost savings for two and four cluster machines:
Questions? http://cccp.eecs.umich.edu
4-Cluster Cost Savings and Performance [Chart: percentage performance loss and cost savings for fft, rls, url, LU, dct, crc, fsed, channel, huffman, blowfish, rawcaudio, rawdaudio, gsmdecode, gsmencode, and the average]
Previous Work • Datapath synthesis • Cathedral-III: complete synthesis system from IMEC • Paulin and Knight: force directed scheduling • Sehwa: designed processing pipelines from behavioral specs • PICO: designed application-specific VLIW processors • Bitwidth sensitive datapath synthesis • Valen-C: augmented C language to convey bitwidth information
Weight Calculation Phase • Edge weights • Assigns higher weight to edges likely to increase schedule length when cut • Uses a slack distribution method to assign weights • Node weights • Assigns weights to each operation based on how much it is likely to affect the load of the FUs in the cluster • Higher weights attributed to operations that can • Not changed from Traditional RHOP
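The idea that edges with little slack deserve high weights can be illustrated with a simplified calculation. This sketch is not the slack distribution method itself: it assumes unit latency for every operation, computes per-edge slack from earliest and latest start times, and maps slack to three weight levels. The function name `edge_weights`, the thresholds, and the weight values are assumptions.

```python
def edge_weights(nodes, edges, high=10, mid=8, low=1):
    """Assign a weight to each dependence edge from its scheduling slack.

    nodes: operation names in topological order
    edges: list of (src, dst) dependences
    Assumes every operation takes one cycle.
    """
    asap = {n: 0 for n in nodes}
    for n in nodes:                      # forward pass: earliest start times
        for src, dst in edges:
            if src == n:
                asap[dst] = max(asap[dst], asap[n] + 1)
    length = max(asap.values())
    alap = {n: length for n in nodes}
    for n in reversed(nodes):            # backward pass: latest start times
        for src, dst in edges:
            if dst == n:
                alap[src] = min(alap[src], alap[n] - 1)
    weights = {}
    for src, dst in edges:
        # Extra cycles this edge can absorb before the schedule grows.
        slack = alap[dst] - asap[src] - 1
        weights[(src, dst)] = high if slack == 0 else mid if slack == 1 else low
    return weights


# Hypothetical graph: critical chain a -> b -> c -> e, plus side inputs f and d.
nodes = ["a", "f", "d", "b", "c", "e"]
deps = [("a", "b"), ("b", "c"), ("f", "c"), ("c", "e"), ("d", "e")]
print(edge_weights(nodes, deps))
```

Edges on the critical chain receive the highest weight because cutting them would add an intercluster move to the longest path, while edges with spare slack can be cut more cheaply.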