Cost-Sensitive Operation Partitioning for Synthesizing Multicluster Datapath Architectures

Michael Chu, Kevin Fan, Rajiv Ravindran, Scott Mahlke
Advanced Computer Architecture Lab, University of Michigan
Workshop on Application-Specific Processors (WASP-2), December 2, 2003

Clustered Architectures
[Figure: Homogeneous Clustered Architecture (Cluster 1, 32-bit; Cluster 2, 32-bit) and Heterogeneous Clustered Architecture (Cluster 1, 32-bit; Cluster 2, 8-bit), each cluster with its own register file]
• Decentralize the architecture to reduce the register file bottleneck
• Used in Lx/ST200, TI C6x, Analog Devices TigerSHARC, and others
• Goal of our work: automatic synthesis of an application-specific heterogeneous multicluster architecture

Our Approach
[Figure: two clusters, each drawn as a register file (RF) with two functional units (FU)]
• Partition operations with both performance and required hardware cost in mind
• Maintain performance while reducing cost (bitwidth, FU repertoire)
• Previous work has focused on a single basic block and a single cluster [Note ’91] [Paulin ’89] [Marwedel ’90]
• Each partition dictates a cluster configuration, which has an associated hardware cost

Our Proposed System
• Today’s focus: Cost-Sensitive Operation Partitioning
• Input: application and high-level machine specification
  • Number of clusters, number of generic FUs
• Output: multicluster architecture description

Cost-Sensitive Operation Partitioning
[Figure: operation graph annotated with edge weights of 1, 8, and 10; stages labeled Region, Weight Calculation, Graph Partitioning]
• Builds on Region-Based Hierarchical Operation Partitioning (RHOP)
  • Purely performance-based partitioner with no notion of hardware cost
  • Weight calculation creates guides for good partitions
  • Graph partitioning assigns operations to clusters based on the given weights
• A cost metric that accounts for gate cost is added to the graph partitioning phase

Coarsening Phase
• Progressively groups highly related operations together
  • Repeatedly pairs operations
  • Forces the partitioner to consider several operations as a single unit
• Traditional RHOP: coarsen using edge weights
• Cost-centric coarsening can ignore dependence edge criticality
[Figure: Coarsened States 1 through 4, with operations marked as narrow or wide bitwidth]
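The pairing idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the greedy best-pair-first matching, and the `by_bitwidth` score (which prefers pairing groups of similar width and ignores edge weight) are all hypothetical. Only the general idea of pairing operations step by step, either by edge weight or by a cost-centric criterion, comes from the slide.

```python
def coarsen_once(groups, edges, score):
    """One coarsening step: greedily pair connected groups, best score first.

    groups: list of frozensets of operation names
    edges:  dict mapping (op_a, op_b) -> edge weight
    score:  function(group_a, group_b, summed_edge_weight) -> number
    """
    owner = {op: i for i, g in enumerate(groups) for op in g}
    between = {}
    for (a, b), w in edges.items():
        i, j = sorted((owner[a], owner[b]))
        if i != j:
            between[(i, j)] = between.get((i, j), 0) + w
    # Highest score first; ties broken by group index so the result is stable.
    ranked = sorted(
        between,
        key=lambda p: (-score(groups[p[0]], groups[p[1]], between[p]), p),
    )
    used, merged = set(), []
    for i, j in ranked:
        if i not in used and j not in used:
            used.update((i, j))
            merged.append(groups[i] | groups[j])
    merged.extend(g for k, g in enumerate(groups) if k not in used)
    return merged


def coarsen(ops, edges, score, min_groups=2):
    """Return every coarsened state, from single operations to the coarsest."""
    states = [[frozenset([op]) for op in ops]]
    while len(states[-1]) > min_groups:
        nxt = coarsen_once(states[-1], edges, score)
        if len(nxt) == len(states[-1]):
            break  # nothing left to pair
        states.append(nxt)
    return states


def by_edge_weight(group_a, group_b, weight):
    """Traditional criterion: pair across the heaviest edges."""
    return weight


def by_bitwidth(widths):
    """Hypothetical cost-centric criterion: pair groups of similar bitwidth."""
    def score(group_a, group_b, weight):
        wa = max(widths[op] for op in group_a)
        wb = max(widths[op] for op in group_b)
        return -abs(wa - wb)  # edge weight (criticality) is ignored
    return score


ops = ["a", "b", "c", "d"]
edges = {("a", "b"): 10, ("b", "c"): 1, ("c", "d"): 8}
widths = {"a": 32, "b": 8, "c": 8, "d": 32}

states_by_weight = coarsen(ops, edges, by_edge_weight)
states_by_width = coarsen(ops, edges, by_bitwidth(widths))
```

With edge weights, the critical pairs a-b and c-d are grouped first; with the bitwidth score, the two 8-bit operations b and c are grouped first even though the edge between them has the lowest weight.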

Partitioning Phase
• Travel back through the coarsening steps, trying to refine the partition at each stage
• est_cycles: performance metric from traditional RHOP
• Adds a new metric for the hardware cost of each cluster

Cost-Sensitive Refinement
• Moves are made when they have positive benefit
• When no more moves can be made, the algorithm uncoarsens to the previous coarsened state and tries moving again
[Figure: three partitions of narrow- and wide-bitwidth operations, shown in order: est_cycles = 7, cost 28K; est_cycles = 8, cost 15K; est_cycles = 7, cost 15K]
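A minimal sketch of the move loop, under loudly stated assumptions: the slides do not give the benefit formula, so `benefit` below is a hypothetical equal-weight blend of the relative change in est_cycles and in cost. `toy_evaluate` is a made-up stand-in for both the RHOP est_cycles metric and the gate-cost model, and the uncoarsening loop around `refine` is omitted.

```python
def benefit(before, after, cycle_weight=0.5):
    """Hypothetical benefit: weighted relative gains in est_cycles and cost."""
    cycles0, cost0 = before
    cycles1, cost1 = after
    cycle_gain = (cycles0 - cycles1) / cycles0
    cost_gain = (cost0 - cost1) / cost0
    return cycle_weight * cycle_gain + (1 - cycle_weight) * cost_gain


def refine(assignment, evaluate, clusters=(0, 1)):
    """Apply the best positive-benefit single move until none is left."""
    assignment = dict(assignment)
    while True:
        current = evaluate(assignment)
        best = None
        for op in sorted(assignment):
            for cluster in clusters:
                if cluster == assignment[op]:
                    continue
                trial = {**assignment, op: cluster}
                gain = benefit(current, evaluate(trial))
                if gain > 0 and (best is None or gain > best[0]):
                    best = (gain, op, cluster)
        if best is None:
            return assignment
        assignment[best[1]] = best[2]


# Toy stand-ins for est_cycles and the cost model (made up for this sketch).
WIDTH = {"a": 32, "b": 32, "c": 8, "d": 8}
EDGES = [("a", "b"), ("c", "d")]


def toy_evaluate(assignment):
    """Return (cycles, cost): busiest cluster load plus cut edges, and the
    sum over clusters of the widest operation each one must support."""
    members = {0: [], 1: []}
    for op, cluster in assignment.items():
        members[cluster].append(op)
    load = max(len(ops) for ops in members.values())
    cuts = sum(assignment[u] != assignment[v] for u, v in EDGES)
    cost = sum(
        max((WIDTH[op] for op in ops), default=0) for ops in members.values()
    )
    return load + cuts, cost


start = {"a": 0, "b": 1, "c": 0, "d": 1}
final = refine(start, toy_evaluate)
```

Under this hypothetical formula, both steps shown in the figure count as improvements: going from (7 cycles, 28K) to (8 cycles, 15K) trades one cycle for a large cost reduction, and going from (8 cycles, 15K) to (7 cycles, 15K) recovers the cycle at no extra cost.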

Multicluster Cost Model
• The cost model estimates the gate cost of each cluster
  • Estimates the minimum cost required to support the partitioned operations
• Factors that influence hardware cost:
  • Register file size/width
  • Functional unit (FU) width
  • FU opcode repertoire
• A greedy algorithm is used
  • Ignores dependences between operations
  • Similar to Rec/Res MII calculations for software-pipelined loops
[Figure: multiply and add operations of various bitwidths (8 to 32 bits) assigned to Int Unit 1 and Int Unit 2, ordered from high cost to low cost, with a 32-bit register file. Total cost of cluster: one 32-bit register file, one 16-bit multiplier/adder, one 32-bit adder]
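One way such a greedy, dependence-free estimate could look is sketched below. Everything specific here is an assumption for illustration: the widest-first packing rule, the `estimate_cluster_cost` name and signature, and above all the gate-cost numbers, which are placeholders and not Synopsys results. The slides state only that the estimate is greedy, ignores dependences, and depends on register file width, FU width, and FU opcode repertoire.

```python
# Placeholder gate-cost functions of bitwidth (NOT measured values).
GATE_COST = {
    "add": lambda width: 10 * width,
    "mul": lambda width: 6 * width * width,
}
REG_COST_PER_BIT = 8  # placeholder cost of one register bit


def estimate_cluster_cost(ops, num_fus, cycles, num_regs=64):
    """Greedy estimate of the hardware a cluster needs.

    ops:     list of (opcode, bitwidth) assigned to the cluster
    num_fus: number of generic FU slots available
    cycles:  cycle budget; each FU can issue one operation per cycle
    Returns (list of (fu_width, opcode_set), total_cost).
    """
    if not ops:
        return [], 0
    if len(ops) > num_fus * cycles:
        raise ValueError("operations do not fit in the cycle budget")
    # Widest operations first, so wide work is packed onto as few FUs as
    # possible and the remaining FUs can stay narrow. Dependences are ignored.
    ordered = sorted(ops, key=lambda op: op[1], reverse=True)
    fus = []
    for i in range(0, len(ordered), cycles):
        chunk = ordered[i:i + cycles]
        width = max(w for _, w in chunk)
        repertoire = {opcode for opcode, _ in chunk}
        fus.append((width, repertoire))
    rf_width = max(width for width, _ in fus)
    cost = num_regs * rf_width * REG_COST_PER_BIT
    cost += sum(
        GATE_COST[opcode](width) for width, rep in fus for opcode in rep
    )
    return fus, cost


cluster_ops = [("mul", 16), ("add", 32), ("add", 16), ("add", 32)]
fus, cost = estimate_cluster_cost(cluster_ops, num_fus=2, cycles=2)
```

For this small input the sketch yields one 32-bit adder and one 16-bit multiplier/adder behind a 32-bit register file, the same shape as the cluster in the figure. Simple widest-first packing is not guaranteed to be minimal; for example, it can place a narrow multiply on a wide unit when the wide operations do not fill a whole FU.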

Experimental Methodology
• Trimaran toolset: a retargetable VLIW compiler
• Evaluated the main loop of DSP kernels and selected benchmarks from MediaBench, MiBench, and NetBench
• Bitwidth information gathered through automatic program analysis
• Cost estimates computed using Synopsys design tools at 0.18 µm
• 64 registers per cluster

2-Cluster Cost Savings and Performance
[Chart: percentage performance loss and cost savings per benchmark: fft, rls, url, LU, crc, dct, fsed, channel, huffman, blowfish, rawcaudio, rawdaudio, gsmdecode, gsmencode, and Average]

Source of Cost Savings Breakdown
[Chart: normalized cost per benchmark: fft, rls, url, LU, crc, dct, fsed, channel, huffman, blowfish, rawcaudio, rawdaudio, gsmdecode, gsmencode, and Average]

Pareto Charts of Examined Machines
[Charts: fsed kernel and LU kernel]
• A wide spectrum of machine configurations was examined
• Multiple groups often appear with expensive units

Work in Progress
• Merging step
  • How can machine designs for several basic blocks be combined?
• Inaccurate cost model
  • How can a more accurate cost estimate be developed?
• Space exploration (external/internal)
  • Number of clusters and generic FUs are externally spacewalked
  • Allowable performance increase internally spacewalked
  • What areas of this space exploration should be external/internal?
• Reprogrammability of designed machines

Conclusions
• Developed a cost-sensitive method for partitioning operations across clusters
• Used this partitioning to define an application-specific, low-cost multicluster datapath architecture
• Average performance loss and cost savings for two- and four-cluster machines:

Questions? http://cccp.eecs.umich.edu

Backup Slides

4-Cluster Cost Savings and Performance
[Chart: percentage performance loss and cost savings per benchmark: fft, rls, url, LU, dct, crc, fsed, channel, huffman, blowfish, rawcaudio, rawdaudio, gsmdecode, gsmencode, and Average]

Previous Work
• Datapath synthesis
  • Cathedral-III: complete synthesis system from IMEC
  • Paulin and Knight: force-directed scheduling
  • Sehwa: designed processing pipelines from behavioral specs
  • PICO: designed application-specific VLIW processors
• Bitwidth-sensitive datapath synthesis
  • Valen-C: augmented C language to convey bitwidth information

Weight Calculation Phase
• Edge weights
  • Assigns higher weight to edges likely to increase schedule length when cut
  • Uses a slack distribution method to assign weights
• Node weights
  • Assigns a weight to each operation based on how much it is likely to affect the load of the FUs in the cluster
  • Higher weights attributed to operations that can
• Not changed from traditional RHOP
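To make the edge-weight idea concrete, here is a simplified sketch that derives weights from scheduling slack. It is an assumption-laden illustration, not RHOP's actual slack distribution method: it uses unit latencies, computes slack per edge from ASAP and ALAP times, and maps slack to the weights 10, 8, and 1 (values borrowed from the edge labels in the partitioning figure, with a mapping that is purely hypothetical).

```python
def edge_weights(ops, edges, latency=1):
    """Assign higher weights to edges with less scheduling slack.

    ops:   operation names in topological order
    edges: list of (src, dst) dependence edges
    """
    preds = {op: [] for op in ops}
    succs = {op: [] for op in ops}
    for src, dst in edges:
        succs[src].append(dst)
        preds[dst].append(src)

    # Earliest start times (ASAP), walking forward.
    asap = {}
    for op in ops:
        asap[op] = max((asap[p] + latency for p in preds[op]), default=0)
    length = max(asap.values())

    # Latest start times (ALAP) that keep the same schedule length.
    alap = {}
    for op in reversed(ops):
        alap[op] = min((alap[s] - latency for s in succs[op]), default=length)

    weights = {}
    for src, dst in edges:
        slack = alap[dst] - asap[src] - latency
        # Hypothetical mapping: critical edges are the most expensive to cut.
        weights[(src, dst)] = 10 if slack == 0 else 8 if slack == 1 else 1
    return weights


ops = ["a", "b", "c", "d", "f", "e"]
edges = [("a", "b"), ("b", "c"), ("c", "e"), ("d", "e"), ("a", "f"), ("f", "e")]
weights = edge_weights(ops, edges)
```

Here the chain a, b, c, e is critical, so each of its edges gets the highest weight, while the edge from d to e has two cycles of slack and gets the lowest. Cutting a low-weight edge is unlikely to lengthen the schedule, which is what makes these weights useful guides for the partitioner.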
