Compiler-directed Data Partitioning for Multicluster Processors. Michael Chu and Scott Mahlke, Advanced Computer Architecture Lab, University of Michigan. March 28, 2006.
Multicluster Architectures • Addresses the register file bottleneck • Decentralizes architecture • Compilation focuses on partitioning operations • Most previous work assumes a unified memory [Figure: a processor with one register file and one data memory, next to Cluster 1 and Cluster 2, each with its own register file and I, F, and M units, joined by an intercluster communication network, with Data Mem 1 and Data Mem 2]
Problem: Partitioning of Data • Determine object placement into data memories • Limited by: • Memory sizes/capacities • Computation operations related to data • Partitioning relevant to caches and scratchpad memories [Figure: objects int x[100], struct foo, and int y[100] to be placed in Data Mem 1 (Cluster 1) or Data Mem 2 (Cluster 2)]
Architectural Model • This work focuses on use of scratchpad-like static local memories • Each cluster has one local memory • Each object placed in one specific memory • Data object available in the memory throughout the lifetime of the program [Figure: int x[100], foo, and int y[100] placed in the local memories of Cluster 1 and Cluster 2]
Data Unaware Partitioning • Ignoring data placement loses 30% of performance on average
Our Objective • Goal: Produce efficient code • Strategy: • Partition both data objects and computation operations • Balance memory size across clusters • Improve memory bandwidth • Maximize parallelism [Figure: objects int x[100], int y[100], and struct foo]
First Try: Greedy Approach • Computation-centric partition of data • Place data where computation references it most often • Greedy approach: • Pass 1: Region-view computation partition, then greedy data cluster assignment • Pass 2: Region-view computation repartition with full knowledge of data location
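The greedy data assignment in Pass 1 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `greedy_place`, the per-cluster reference counts, and the object sizes and capacities are all hypothetical. It assumes each object goes to the cluster that references it most often, falling back to the next cluster when the preferred memory is full.

```python
from collections import defaultdict

def greedy_place(access_counts, sizes, capacity):
    """Assign each object to the cluster that references it most, if it fits.

    access_counts: {object: {cluster: reference count}}  (hypothetical profile)
    sizes:         {object: size in bytes}
    capacity:      {cluster: local memory size in bytes}
    """
    used = defaultdict(int)
    placement = {}
    # Visit the most frequently referenced objects first.
    order = sorted(access_counts, key=lambda o: -sum(access_counts[o].values()))
    for obj in order:
        # Try clusters from most to least references to this object.
        prefs = sorted(capacity, key=lambda c: -access_counts[obj].get(c, 0))
        for cluster in prefs:
            if used[cluster] + sizes[obj] <= capacity[cluster]:
                placement[obj] = cluster
                used[cluster] += sizes[obj]
                break
        else:
            raise ValueError(f"no memory has room for {obj}")
    return placement

# Illustrative numbers only (not taken from the talk).
access_counts = {"x": {0: 90, 1: 10}, "y": {0: 60, 1: 40}, "foo": {0: 5, 1: 50}}
sizes = {"x": 400, "y": 400, "foo": 200}
capacity = {0: 512, 1: 1024}
placement = greedy_place(access_counts, sizes, capacity)
print(placement)
```

In this toy input, `y` prefers cluster 0 but is pushed to cluster 1 because `x` already fills most of the smaller memory, which shows how a greedy order can force a less preferred placement.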
Greedy Approach Results • 2 Clusters: • One Integer, Float, Memory, Branch unit per cluster • Relative to a unified, dual-ported memory • Improves on Data Unaware, but room for improvement remains
Second Try: Global Data Partition • Data-centric partition of computation • Hierarchical technique • Pass 1: Global-view for data • Consider memory relationships throughout program • Locks memory operations to clusters • Pass 2: Region-view for computation • Partition computation based on data location [Figure: Global Data Partition, then Regional Computation Partition]
Pass 1: Global Data Partitioning • Determine memory relationships • Pointer analysis & profiling of memory • Build program-level graph representation of all operations • Perform data object memory operation merging: • Respect correctness constraints of the program [Figure: Step 1: Interprocedural Pointer Analysis & Memory Profile; Step 2: Build Program Data Graph; Step 3: Merge Memory Operations; Step 4: METIS Graph Partitioner]
Global Data Graph Representation • Nodes: Operations, either memory or non-memory • Memory operations: loads, stores, malloc callsites • Edges: Data flow between operations • Node weight: Data object size • Sum of data sizes for referenced objects • Object size determined by: • Globals/locals: pointer analysis • Malloc callsites: memory profile [Figure: example objects int x[100], malloc site 1, and struct foo, with size labels 200 bytes, 400 bytes, and 1 Kbyte]
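A small sketch of this graph representation, under stated assumptions: the helper `build_data_graph`, the operation names, and the pairing of sizes with objects are hypothetical (the slide's figure lists the sizes without saying which object each belongs to). Each node's weight is the sum of the sizes of the objects it may reference, so non-memory operations get weight 0.

```python
def build_data_graph(op_refs, flows, object_sizes):
    """Return (node weights, data-flow edges) for a program-level graph.

    op_refs:      {operation: set of objects it may reference}
                  (empty set for non-memory operations)
    flows:        iterable of (producer, consumer) operation pairs
    object_sizes: {object: size in bytes}
    """
    # Node weight = total size of all objects the operation may touch.
    weights = {
        op: sum(object_sizes[obj] for obj in refs)
        for op, refs in op_refs.items()
    }
    # Keep only edges whose endpoints are known operations.
    known = set(op_refs)
    edges = [(src, dst) for src, dst in flows if src in known and dst in known]
    return weights, edges

# Assumed sizes: int x[100] with 4-byte ints is 400 bytes; the rest are guesses.
object_sizes = {"x": 400, "foo": 200, "malloc_site_1": 1024}
op_refs = {
    "load_x": {"x"},
    "add": set(),                          # non-memory operation
    "store_foo": {"foo"},
    "store_ptr": {"x", "malloc_site_1"},   # pointer may target either object
}
flows = [("load_x", "add"), ("add", "store_foo"), ("add", "store_ptr")]
weights, edges = build_data_graph(op_refs, flows, object_sizes)
print(weights)
```

The heavier weight on `store_ptr` reflects that a store through an ambiguous pointer ties two objects to the same node.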
Global Data Partitioning Example [Figure: program data graph over BB1 and BB2 with memory ops and non-memory ops split between Cluster 0 and Cluster 1; objects: int x[100], struct foo, struct bar, malloc site 1, malloc site 2; node labels: 2 objects referenced, 80 Kb; 2 objects referenced, 200 Kb; 1 object referenced, 100 Kb]
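The partitioning step of Pass 1 splits the weighted program data graph so that memory size is balanced and little data flow crosses clusters. The talk uses the METIS graph partitioner for this; the sketch below is not METIS and does not use its API. It is a brute-force stand-in for tiny graphs, with a hypothetical function name, hypothetical edge weights, and an assumed balance tolerance, meant only to show the objective being optimized.

```python
from itertools import product

def bipartition(weights, edges, max_imbalance):
    """Exhaustively find a 2-way split with minimum cut under a balance limit.

    weights:       {node: data size in bytes}
    edges:         list of (node_a, node_b, edge_weight)
    max_imbalance: largest allowed difference in total bytes per side
    Returns (cut, {node: 0 or 1}) or None if no split is balanced enough.
    """
    nodes = sorted(weights)
    best = None
    for bits in product((0, 1), repeat=len(nodes)):
        side = dict(zip(nodes, bits))
        load = [0, 0]
        for n in nodes:
            load[side[n]] += weights[n]
        if abs(load[0] - load[1]) > max_imbalance:
            continue  # violates the memory-balance constraint
        cut = sum(w for a, b, w in edges if side[a] != side[b])
        if best is None or cut < best[0]:
            best = (cut, side)
    return best

# Illustrative graph: A-C and B-D communicate heavily, A-B and C-D lightly.
weights = {"A": 400, "B": 400, "C": 200, "D": 200}
edges = [("A", "C", 5), ("B", "D", 5), ("A", "B", 1), ("C", "D", 1)]
cut, side = bipartition(weights, edges, max_imbalance=0)
print(cut, side)
```

Exhaustive search is exponential in the number of nodes, which is why a multilevel heuristic partitioner is the practical choice for whole-program graphs.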
Pass 2: Computation Partitioning • Observation: Global-level data partition is only half the answer: • Doesn’t account for operation resource usage • Doesn’t consider code scheduling regions • Second pass of partitioning on each scheduling region • Memory operations from first phase locked in place [Figure: BB1 operation graph]
Experimental Methodology • Compared to: • 2 Clusters: • One Integer, Float, Memory, Branch unit per cluster • All results relative to a unified, dual-ported memory
Performance: 1-cycle Remote Access [Chart: results with a Unified Memory reference]
Performance: 10-cycle Remote Access [Chart: results with a Unified Memory reference]
Case Study: rawcaudio [Figure: comparison of Global Data Partition and Greedy Profile-based]
Summary • Global Data Partitioning • Data placement: first-order design principle • Global data-centric partition of computation • Phase-ordered approach • Global-view for decisions on data • Region-view for decisions on computation • Achieves 96% of the performance of a unified memory on partitioned memories • Future work: apply to cache memories
Data Partitioning for Multicores • Adapt global data partitioning for cache memory domain • Similar goals: • Increase data bandwidth • Maximize parallel computation • Different goals: • Reducing coherence traffic • Keep working set ≤ cache size
Questions? http://cccp.eecs.umich.edu
Future Work: Cache Memories • Adapt global data partitioning for cache memory domain • Similar goals: • Increase data bandwidth • Maximize parallel computation • Different goals: • Reducing coherence traffic • Balancing working set
Memory Operation Merging • Interprocedural pointer analysis determines memory relationships • Example: int *x; int foo[100]; int bar[100]; void main() { int *a = malloc(); int *b; int c; if (cond) { c = foo[1]; b = a; } else { c = bar[1]; b = &bar[1]; } *b = 100; foo[0] = c; } [Figure: memory operation nodes: malloc; load “bar”; load “foo”; store “malloc” or “bar”; store “foo”]
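One way to read the merging step: since each object lives in exactly one memory, any memory operations that may reference a common object have to end up in the same cluster. A minimal union-find sketch of that reading follows. The function name `merge_memory_ops` and the operation labels are hypothetical, and the points-to sets are transcribed by hand from the example on this slide rather than computed by a pointer analysis.

```python
def merge_memory_ops(op_objects):
    """Group memory operations that may touch a common object.

    op_objects: {operation: set of objects it may reference}
    Returns a sorted list of sorted operation groups.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Tie every operation to each object it may reference.
    for op, objs in op_objects.items():
        for obj in objs:
            union(("op", op), ("obj", obj))

    groups = {}
    for op in op_objects:
        groups.setdefault(find(("op", op)), set()).add(op)
    return sorted(sorted(g) for g in groups.values())

# Points-to sets for the slide's example; the store through b may hit
# either the malloc'd object or bar.
op_objects = {
    "malloc": {"malloc_site"},
    "load_bar": {"bar"},
    "load_foo": {"foo"},
    "store_b": {"malloc_site", "bar"},
    "store_foo": {"foo"},
}
groups = merge_memory_ops(op_objects)
print(groups)
```

The ambiguous store through `b` is what pulls the malloc site and `bar` into one group, while the `foo` accesses remain free to go to the other cluster.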
Multicluster Compilation • Previous techniques focused on operation partitioning [cite some papers] • Ignore the issue of data object placement in memory • Assume shared memory accessible from each cluster
Phase 2: Computation Partitioning • Observation: Global-level data partition is only half the solution: • Doesn’t properly account for resource usage details • Doesn’t consider code scheduling regions • Second pass of partitioning is done locally on each basic block of the program • Memory operations locked into specific clusters • Uses Region-based Hierarchical Operation Partitioner (RHOP)
Computation Partitioning Example • Memory operations from first phase locked in place • RHOP performs a detailed resource-cognizant computation partition • Modified multi-level Kernighan-Lin algorithm using schedule estimates [Figure: BB1 operation graph with nodes labeled L, S, +, *, and &]
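The idea of partitioning computation around locked memory operations can be illustrated with a much simpler stand-in than RHOP. The sketch below is a greedy single-move refinement, not a multi-level Kernighan-Lin algorithm, and its cost function (cut edges plus a load-imbalance penalty) is an assumed substitute for RHOP's schedule estimates. All names and the example graph are hypothetical.

```python
def refine(ops, edges, assign, locked, balance_weight=1):
    """Move unlocked operations between two clusters while the cost drops.

    ops:    list of operation names
    edges:  list of (producer, consumer) data-flow pairs
    assign: {operation: 0 or 1} initial cluster assignment
    locked: set of operations (e.g. memory ops) that must not move
    """
    def cost(a):
        cut = sum(1 for u, v in edges if a[u] != a[v])     # intercluster moves
        load0 = sum(1 for o in ops if a[o] == 0)
        imbalance = abs(2 * load0 - len(ops))              # op-count skew
        return cut + balance_weight * imbalance

    a = dict(assign)
    improved = True
    while improved:
        improved = False
        for op in ops:
            if op in locked:
                continue  # placement fixed by the data partition
            trial = dict(a)
            trial[op] = 1 - a[op]
            if cost(trial) < cost(a):
                a = trial
                improved = True
    return a

# Two loads locked to different clusters by the data partition; the store is
# locked to cluster 0. Both adds start on cluster 0.
ops = ["L1", "L2", "add1", "add2", "S"]
edges = [("L1", "add1"), ("L2", "add2"), ("add1", "S"), ("add2", "S")]
start = {"L1": 0, "L2": 1, "add1": 0, "add2": 0, "S": 0}
locked = {"L1", "L2", "S"}
result = refine(ops, edges, start, locked)
print(result)
```

Here `add2` migrates to cluster 1 to sit next to its locked load, while `add1` stays with `L1`, so the computation follows the data placement chosen in the first pass.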