Communication Overhead Estimation on Multicores
S. M. Farhad, The University of Sydney
Joint work with Yousun Ko, Bernd Burgstaller, and Bernhard Scholz
Outline
• Motivation
• Multicore trend
• Stream programming
• Profiling communication overhead
• Related works
Motivation
[Figure: number of cores per chip (1 to 512) against year (1975 to 2010). Legend: unicore, homogeneous multicore, heterogeneous multicore. Chips shown range from the 4004 to PicoChip and AMBRIC. Programming models listed alongside: C/C++/Java, stream programming, CUDA, X10, Peakstream, Fortress, Accelerator, Ct, CTM, Rstream, Rapidmind. Courtesy: Scott '08]
Stream Programming Paradigm
• Programs expressed as stream graphs
• Streams: infinite sequences of data elements
• Actors: functions applied to streams
[Figure: Stream → Actor → Stream]
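As a rough illustration of the paradigm (plain Python, not StreamIt), generators can stand in for streams and generator functions for actors. The names `source`, `scale`, and `pairwise_sum` are hypothetical and do not come from the slides.

```python
from itertools import count, islice


def source():
    """Stream: a conceptually infinite sequence of data elements."""
    return count(1)


def scale(stream, factor):
    """Actor: consumes one element per firing and produces one."""
    for item in stream:
        yield item * factor


def pairwise_sum(stream):
    """Actor: consumes two elements per firing and produces one."""
    it = iter(stream)
    for first, second in zip(it, it):
        yield first + second


# A three-node stream graph: source -> scale -> pairwise_sum.
graph = pairwise_sum(scale(source(), 10))
first_three = list(islice(graph, 3))
```

Each actor only sees its input stream, which mirrors the explicit producer/consumer communication that stream programs expose.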
Properties of Stream Programs
• Regular and repeating computation
• Independent actors with explicit communication
• Producer/consumer dependencies
[Figure: example stream graph. AtoD → FMDemod → Splitter → three parallel branches (LPF1 → HPF1, LPF2 → HPF2, LPF3 → HPF3) → Joiner → Adder → Speaker]
StreamIt Language
• An implementation of stream programming
• Hierarchical structure
• Each construct has a single input stream and a single output stream
Constructs:
• filter
• pipeline: each stage may be any StreamIt language construct
• splitjoin: parallel computation between a splitter and a joiner
• feedback loop: built from a splitter and a joiner
Problems in Measuring Communication Overhead
Reasons:
• Multicores are not communication-exposed architectures
• Complex cache hierarchy
• Cache coherence protocols
Consequence:
• The communication cost cannot be measured directly
• Instead, estimate the communication cost by measuring the execution time of actors
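Measuring the Communication Overhead of an Edge
[Figure: two placements of the edge between actors i and k. Left: both actors on Processor 1 (no communication cost). Right: the actors split across Processor 1 and Processor 2 (with communication cost)]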
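One way to read the two placements is as a differential measurement: time the actors once co-located and once split, and attribute the difference to the edge. The sketch below assumes exactly that. The function name, the use of the median over repeated runs, and the clamping at zero are assumptions, not details taken from the slides.

```python
from statistics import median


def estimate_edge_overhead(colocated_times, split_times):
    """Estimate the communication cost of one edge.

    colocated_times: execution times with both actors on one processor.
    split_times: execution times with the actors on different processors.
    Assumption: the difference of the medians approximates the edge cost.
    """
    baseline = median(colocated_times)
    with_communication = median(split_times)
    # Assumption: timing noise can make the difference negative; clamp to 0.
    return max(0.0, with_communication - baseline)


# Hypothetical timings (arbitrary units); the estimate is roughly 2.1.
overhead = estimate_edge_overhead([10.0, 10.2, 9.9], [12.1, 12.0, 12.4])
```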
How to Minimize the Required Number of Experiments
• Pipeline: A → B → C → D → E → F, with edges numbered 1 to 5
• Graph coloring
• Requires 2+1 experiments
[Figure: two placements of the pipeline on Processor 1 and Processor 2. In one, the even edges cross the partition; in the other, the odd edges cross the partition]
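The two pipeline placements can be generated mechanically: walk the pipeline and switch processors whenever the edge being crossed has the chosen parity. This is a minimal sketch under that reading; `pipeline_placement` is a hypothetical helper. The "+1" is interpreted here as a baseline run with all actors on one processor, which is an assumption.

```python
def pipeline_placement(actors, parity):
    """Place pipeline actors on processors 1 and 2.

    Edge j (1-based) connects actors[j - 1] and actors[j].
    parity=1 makes odd-numbered edges cross the partition,
    parity=0 makes even-numbered edges cross it.
    """
    current = 1
    placement = {actors[0]: current}
    for edge_number in range(1, len(actors)):
        if edge_number % 2 == parity:
            current = 2 if current == 1 else 1  # this edge crosses
        placement[actors[edge_number]] = current
    return placement


pipeline = list("ABCDEF")
odd_edges_cross = pipeline_placement(pipeline, parity=1)
even_edges_cross = pipeline_placement(pipeline, parity=0)
```

Every edge crosses the partition in exactly one of the two placements, so two split runs cover all five edges.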
Obs. 1: There is no loop of three actors in a stream graph
[Figure: actors i, k, and l placed on Processor 1 and Processor 2]
Obs. 2: There is no interference of adjacent nodes between edges
[Figure: actors A to F placed on processors P-1 to P-4, shown for the blue-colored edges]
Remove Interference
• Convert to a line graph
• Add interference edges
• Use a vertex coloring algorithm
[Figure: stream graph with actors A to F and edges AB, BC, BD, CE, DE, EF, next to its line graph with nodes AB, BC, BD, CE, DE, EF]
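The three steps above can be sketched as follows for the example graph. The slides do not define interference edges precisely; this sketch assumes two stream edges interfere when an actor of one is adjacent to an actor of the other. The greedy coloring is a stand-in for whichever vertex coloring algorithm the authors use, and both function names are hypothetical.

```python
from itertools import combinations


def conflict_graph(edges):
    """Line graph of the stream graph plus assumed interference edges."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    conflicts = {edge: set() for edge in edges}
    for e, f in combinations(edges, 2):
        shares_actor = bool(set(e) & set(f))  # line-graph adjacency
        # Assumption: interference means adjacent actors across e and f.
        interferes = any(b in neighbours[a] for a in e for b in f)
        if shares_actor or interferes:
            conflicts[e].add(f)
            conflicts[f].add(e)
    return conflicts


def greedy_coloring(conflicts):
    """Give each node the smallest color unused by its colored neighbours."""
    colors = {}
    for node in conflicts:
        used = {colors[n] for n in conflicts[node] if n in colors}
        color = 0
        while color in used:
            color += 1
        colors[node] = color
    return colors


stream_edges = [("A", "B"), ("B", "C"), ("B", "D"),
                ("C", "E"), ("D", "E"), ("E", "F")]
conflicts = conflict_graph(stream_edges)
colors = greedy_coloring(conflicts)
```

Edges that receive the same color can be measured in the same experiment, so the number of colors bounds the number of split runs.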
Processor Leveling Graph
[Figure: stream graph with actors A to F, and the processor leveling graph for the blue-colored edges, with nodes A, {B, C, D, E}, and F]
Coloring the Processor Leveling Graph
[Figure: the leveling graph with nodes A, {B, C, D, E}, and F, colored so that each node is assigned to Processor 1 or Processor 2]
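Judging from the figures, the leveling graph appears to merge actors joined by edges that are not being measured, and the coloring then alternates processors across the measured edges. The sketch below assumes that construction and takes AB and EF as the blue edges. The names `leveling_graph` and `assign_processors` are hypothetical.

```python
from collections import deque


def leveling_graph(edges, measured):
    """Merge actors joined by unmeasured edges; link groups by measured ones."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if (u, v) not in measured:
            parent[root_u] = root_v
    groups = {}
    for actor in list(parent):
        groups.setdefault(find(actor), set()).add(actor)
    links = {root: set() for root in groups}
    for u, v in measured:
        links[find(u)].add(find(v))
        links[find(v)].add(find(u))
    return groups, links


def assign_processors(groups, links):
    """Two-color the leveling graph so every measured edge crosses."""
    side = {}
    for start in groups:
        if start in side:
            continue
        side[start] = 1
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for other in links[node]:
                if other not in side:
                    side[other] = 3 - side[node]  # flip between 1 and 2
                    queue.append(other)
                elif side[other] == side[node]:
                    raise ValueError("measured edges cannot all cross")
    return {a: side[r] for r, members in groups.items() for a in members}


stream_edges = [("A", "B"), ("B", "C"), ("B", "D"),
                ("C", "E"), ("D", "E"), ("E", "F")]
groups, links = leveling_graph(stream_edges, {("A", "B"), ("E", "F")})
placement = assign_processors(groups, links)
```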
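Measuring the Communication Cost
[Figure: for the blue-colored edges, the stream graph with actors A to F is placed on Processor 1 and Processor 2 according to the colored leveling graph with nodes A, {B, C, D, E}, and F]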
Related Works
[1] Static Scheduling of SDF Programs for DSP [Lee '87]
[2] StreamIt: A language for streaming applications [Thies '02]
[3] Phased Scheduling of Stream Programs [Thies '03]
[4] Exploiting Coarse Grained Task, Data, and Pipeline Parallelism in Stream Programs [Thies '06]
[5] Orchestrating the Execution of Stream Programs on Cell [Scott '08]
[6] Software Pipelined Execution of Stream Programs on GPUs [Udupa '09]
[7] Synergistic Execution of Stream Programs on Multicores with Accelerators [Udupa '09]
[8] Orchestration by approximation [Farhad '11]