Orchestration by Approximation: Mapping Stream Programs onto Multicore Architectures
S. M. Farhad (University of Sydney)
Joint work with Yousun Ko, Bernd Burgstaller, and Bernhard Scholz
Outline
• Motivation
• Research question
• Contributions
• Summary
Cores are the New Gates (Shekhar Borkar, Intel)
[Figure, courtesy Scott '08: cores per chip vs. year (1975–2010), from unicore designs (4004, 8086, Pentium, Itanium) through homogeneous multicores (Core2Quad, Niagara, RAW, Larrabee) to heterogeneous multicores (Cell, Xbox 360, CISCO CSR-1, NVIDIA G80, PicoChip, AMBRIC) approaching 512 cores per chip; alongside, the growing family of stream-programming languages (StreamIt, CUDA, X10, Peakstream, Fortress, Accelerator, Ct, RapidMind, RStream) next to C/C++/Java.]
Stream Programming Paradigm
• Programs are expressed as stream graphs
• Streams: sequences of data elements
• Actors (filters): functions applied to streams
StreamIt Language
• Basic construct (filter) and hierarchical constructs (pipeline, splitjoin, feedback loop)
• Each construct has a single input stream and a single output stream
• Any component of a hierarchical construct may itself be any StreamIt construct
• splitjoin: parallel computation between a splitter and a joiner
• feedback loop: a splitter and a joiner wired in a cycle
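These constructs compose hierarchically. A minimal Python sketch (illustrative, not StreamIt syntax; the class names and work estimates are assumptions) of the running example used on the following slides:

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class Filter:
    """Basic construct: a function applied to one input stream."""
    name: str
    work: int            # assumed steady-state work estimate

@dataclass
class Pipeline:
    """Sequential composition; children may be any construct."""
    children: List[Any] = field(default_factory=list)

@dataclass
class SplitJoin:
    """Parallel branches between a splitter and a joiner."""
    branches: List[Any] = field(default_factory=list)

@dataclass
class FeedbackLoop:
    """A body and a loop stream wired in a cycle."""
    body: Any = None
    loop: Any = None

# The running example from the next slides: A -> (B || C) -> D
graph = Pipeline([Filter("A", 5),
                  SplitJoin([Filter("B", 60), Filter("C", 60)]),
                  Filter("D", 5)])
```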
Outline
• Motivation
• Research question
• Contributions
• Summary
How to Orchestrate a Stream Graph?
• Map actors to cores
• Eliminate bottlenecks (a.k.a. hot actors)
Mapping Actors
[Figure: stream graph A (load 5) → splitjoin of B (60) and C (60) → D (5), mapped onto three cores: B on core 1 (load 60), C on core 2 (load 60), A and D on core 3 (load 10).]
• Makespan = 60, speedup = 130/60 ≈ 2.17
• Ideal speedup = 3
• Actors B and C are the bottlenecks
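A tiny sketch of the slide's arithmetic, assuming the per-actor loads shown above; the `makespan` function and the mapping dictionary are illustrative names, not the paper's API:

```python
def makespan(mapping):
    """Makespan = load of the most heavily loaded core."""
    return max(sum(loads) for loads in mapping.values())

work = {"A": 5, "B": 60, "C": 60, "D": 5}
mapping = {1: [work["B"]],                 # core 1: B        -> load 60
           2: [work["C"]],                 # core 2: C        -> load 60
           3: [work["A"], work["D"]]}      # core 3: A and D  -> load 10

total = sum(work.values())                 # 130
print(total / makespan(mapping))           # speedup = 130/60 ~= 2.17
```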
Bottleneck Elimination
[Figure: hot actors B and C duplicated three ways (B_1..B_3 and C_1..C_3, load 60 each) between splitter/joiner actors (loads 5–6); A and D now fire three times per steady state (load 15 each). Core loads: 141, 138, 135.]
• Hot-actor duplication
• Makespan = 141, speedup = (130 × 3)/141 ≈ 2.77
• 28% increased speedup
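A hedged sketch of hot-actor duplication under the slide's load convention (one steady state now processes k items); the function name, replication factor, and splitter/joiner costs are assumptions for illustration:

```python
def fission(work, hot, k, s_cost, j_cost):
    """Duplicate each hot actor into k data-parallel copies.
    Loads are per k steady-state iterations: ordinary actors
    fire k times; each copy handles one of the k items."""
    new = {}
    for a, w in work.items():
        if a in hot:
            for i in range(1, k + 1):
                new[f"{a}_{i}"] = w                    # one of k items each
            new[f"s_{a}"], new[f"j_{a}"] = s_cost, j_cost
        else:
            new[a] = w * k                             # fires k times
    return new

work = {"A": 5, "B": 60, "C": 60, "D": 5}
print(fission(work, hot={"B", "C"}, k=3, s_cost=6, j_cost=6))
```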
Orchestrating Stream Programs (cont'd)
• Current state of the art:
  • Integer linear programming: intractable
  • Heuristics: unknown solution quality
• How to find a fast and good solution?
• Approximation algorithms that have:
  • Polynomial runtime
  • A quality bound on the solution
Outline
• Motivation
• Research question
• Contributions
• Summary
Data Transfer Model
[Figure: a source with arrival rate z feeding actors A, B, C with per-edge data rates.]
• The arrival rate (which we maximize) depends on the data rates of the actors
• The data transfer model forms a system of simultaneous functional linear equations
• We compute a closed form of the output data rate
• We also consider a processor utilization function for each actor
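A minimal sketch of how such a closed form might be composed, assuming each actor's output rate is a linear function (its push/pop ratio) of its input rate and splitters duplicate their input; this is an illustrative model, not the paper's exact formulation:

```python
def closed_form(node):
    """Return the factor f with output_rate = f * input_rate(z),
    computed by composing the per-actor linear rate functions."""
    kind = node[0]
    if kind == "filter":                 # pushes `push` items per `pop` consumed
        _, push, pop = node
        return push / pop
    if kind == "pipeline":               # composition of the children's functions
        f = 1.0
        for child in node[1]:
            f *= closed_form(child)
        return f
    if kind == "splitjoin":              # duplicate splitter: each branch sees the
        return sum(closed_form(b)        # full rate; the joiner merges the outputs
                   for b in node[1])
    raise ValueError(kind)

g = ("pipeline", [("filter", 1, 1),
                  ("splitjoin", [("filter", 1, 1), ("filter", 1, 1)]),
                  ("filter", 1, 2)])
print(closed_form(g))                    # output data rate as a multiple of z
```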
Bottleneck Analysis
• The arrival rate is limited by:
  • the processor capacity of the cores
  • the memory bandwidth
• A quantitative analysis determines:
  • an upper bound on the arrival rate imposed by each actor
  • an upper bound on the arrival rate imposed by the parallel system
• Hot actor: upper bound (actor) < upper bound (system)
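A sketch of this quantitative analysis, assuming a utilization function that is linear in the arrival rate z; the cost values reuse the running example, normalized so the whole graph consumes one core at z = 1:

```python
def bounds(cost, cores):
    """Assumed linear utilization model: actor a consumes cost[a] * z of
    one core at arrival rate z.  ub(a): rate at which a single copy of a
    saturates its core; ub(system): rate at which all cores saturate."""
    ub_actor = {a: 1.0 / c for a, c in cost.items()}
    ub_system = cores / sum(cost.values())
    hot = [a for a, ub in ub_actor.items() if ub < ub_system]
    return ub_actor, ub_system, hot

cost = {"A": 5/130, "B": 60/130, "C": 60/130, "D": 5/130}
print(bounds(cost, cores=3))   # B and C are hot: ub ~= 2.17 < 3.0
```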
Approximation of the Actor Allocation Problem
• The actor allocation problem (AAP) is NP-hard
• For a fixed arrival rate (via the closed form), the AAP reduces to the standard bin-packing problem
• Approximation algorithms for bin packing exist with:
  • polynomial running time
  • a bounded solution quality
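For illustration, a textbook first-fit-decreasing bin packer of the kind such approximations build on; this exact routine is an assumption, not the paper's algorithm:

```python
def first_fit_decreasing(utils, capacity=1.0):
    """Place each item, largest first, into the first core with enough
    spare capacity, opening a new core when none fits."""
    bins = []
    for u in sorted(utils, reverse=True):
        for b in bins:
            if sum(b) + u <= capacity:
                b.append(u)
                break
        else:                       # no existing bin fits: open a new one
            bins.append([u])
    return bins

print(len(first_fit_decreasing([0.6, 0.6, 0.3, 0.3, 0.2])))  # cores needed
```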
Summary (ASPLOS 2011)
• A novel data transfer model
• A simple quantitative analysis to detect and eliminate bottlenecks
• A novel 2-approximation algorithm for deploying stream graphs on multicore platforms
• Results are within 5% of the optimal solution
• A geometric mean speedup of 6.95x on 8 processors over single-processor execution
Related Work
[1] Static Scheduling of SDF Programs for DSP [IEEE '87]
[2] StreamIt: A Language for Streaming Applications [Springer '02]
[3] Phased Scheduling of Stream Programs [LCTES '03]
[4] Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs [ASPLOS '06]
[5] Orchestrating the Execution of Stream Programs on Cell [PLDI '08]
[6] Software Pipelined Execution of Stream Programs on GPUs [IEEE '09]
[7] Synergistic Execution of Stream Programs on Multicores with Accelerators [IEEE '09]
[8] An Empirical Characterization of Stream Programs and Its Implications for Language and Compiler Design [PACT '10]
Focus of Our Work
[Diagram: StreamIt compiler phases, with our components attached.]
• Stream graph scheduling — linear functional equation solver
• Stream graph partitioning — bottleneck resolver
• Layout on target architecture — actor allocation on processors
• Communication scheduling
Actor Allocation Constraint
[Figure: n actors with their utilizations packed onto cores; each core offers 100% utilization.]
Binary Search
[Figure: the solution space of arrival rates between 0 and ub(z), normalized to [0, 1.0]; binary search maintains left/mid/right pointers and asks "allocation possible?" at each midpoint.]
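A sketch of the binary search over the normalized arrival-rate interval; `feasible` stands in for the bin-packing check from the previous sketch, and `eps` is an assumed tolerance:

```python
def max_rate(feasible, ub, eps=1e-3):
    """Binary search on [0, ub] for the largest arrival rate z at
    which an allocation of actors to cores is still possible."""
    left, right = 0.0, ub
    while right - left > eps:
        mid = (left + right) / 2
        if feasible(mid):
            left = mid               # allocation possible: search right half
        else:
            right = mid              # infeasible: search left half
    return left

# e.g., with the cost model and packer from the earlier sketches:
# z = max_rate(lambda z: len(first_fit_decreasing(
#         [c * z for c in cost.values()])) <= 3, ub=3.0)
```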
Actor Allocation of a Bottleneck-Free Program
[Figure: after fission, copies B_1..B_3 and C_1..C_3 (load 20 each), A and D (load 5 each), and splitter/joiner actors (load 2 each) are mapped onto three cores with loads 45, 44, 45.]
• Makespan = 45, speedup = 130/45 ≈ 2.89
• Efficient bottleneck resolving
Experiments
• Our method is implemented as an extension of the StreamIt compiler
• We compare against an ILP-based method [Scott '08], solved with CPLEX
• Hardware setup:
  • 2.33 GHz dual quad-core Intel Xeon processors, 16 GB memory, Linux kernel 2.6.23
  • The profiler uses the x86-64 hardware cycle counters
Experiments (cont'd)
• Experimental process:
  • Profiling
  • Computing the closed form
  • Resolving bottlenecks
  • Computing the mapping
  • Computing the layout scheduling
  • Invoking the StreamIt back end
  • Measuring the performance
Experimental Results for 2–4 Processors
[Figure: measured results for 2–4 processors.]
• Our method's runtime: < 1 s
Summary
• An approximation algorithm for solving the actor allocation problem
• A data rate transfer model that resolves bottlenecks
• Bottleneck elimination is separated from actor allocation
• We implemented our approach and compared it with an optimal approach:
  • the optimal approach has unpredictable runtime
  • our approach takes negligible time on all benchmarks
  • our solutions are at most 5% off the optimum
• For up to 8 processors we achieve a geometric mean speedup of 6.95x over single-processor execution