EECS 583 – Class 20
Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling
University of Michigan, November 30, 2011
Guest Speaker Today: Daya Khudia
Announcements & Reading Material • This class • “Orchestrating the Execution of Stream Programs on Multicore Platforms,” M. Kudlur and S. Mahlke, Proc. ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, Jun. 2008. • Next class – GPU compilation • “Program optimization space pruning for a multithreaded GPU,” S. Ryoo, C. Rodrigues, S. Stone, S. Baghsorkhi, S. Ueng, J. Straton, and W. Hwu, Proc. Intl. Sym. on Code Generation and Optimization, Mar. 2008.
Stream Graph Modulo Scheduling (SGMS)
[Figure: Cell processor block diagram – the PPE (PowerPC) and SPE0–SPE7, each SPE containing an SPU, a 256 KB local store (LS), and an MFC (DMA engine), all connected by the EIB to DRAM]
• Coarse grain software pipelining
  • Equal work distribution
  • Communication/computation overlap
  • Synchronization costs
• Target: Cell processor
  • Cores with disjoint address spaces
  • Explicit copy to access remote data
  • DMA engine independent of PEs
• Filters = operations, cores = function units
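For context on what “explicit copy” means on this target: each SPE reaches remote memory only through its MFC. A hedged sketch using the standard spu_mfcio.h intrinsics (the buffer name, tag, and sizes are illustrative; this compiles only with an SPU toolchain such as spu-gcc):

#include <spu_mfcio.h>  /* SPE-side MFC (DMA) intrinsics */

#define TAG 3  /* hypothetical DMA tag group */

static char local_buf[16384] __attribute__((aligned(128)));  /* lives in the 256 KB LS */

/* Pull `size` bytes from effective address `ea` (e.g., another core's
   buffer or main memory) into this SPE's local store, then block until
   the transfer completes. */
void fetch_block(unsigned long long ea, unsigned int size)
{
    mfc_get(local_buf, ea, size, TAG, 0, 0);  /* enqueue DMA on the MFC */
    mfc_write_tag_mask(1 << TAG);             /* select our tag group */
    mfc_read_tag_status_all();                /* wait for completion */
}

Because the MFC runs independently of the SPU, a schedule can issue the get, compute another stage, and only then wait — exactly the communication/computation overlap SGMS exploits.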
Preliminaries
• Synchronous Data Flow (SDF) [Lee ’87]
• StreamIt [Thies ’02]
• Filters push and pop items from input/output FIFOs

Stateless filter (the weights are a read-only parameter):

int->int filter FIR(int N, int wgts[N]) {
  work pop 1 push 1 {
    int i, sum = 0;
    for (i = 0; i < N; i++)
      sum += peek(i) * wgts[i];
    push(sum);
    pop();
  }
}

Stateful variant (the filter carries its weights as mutable state and updates them between firings):

int wgts[N];
...
wgts = adapt(wgts);
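For readers new to StreamIt, the work function above relies on three FIFO primitives. A minimal C sketch of their semantics (my own illustration, not StreamIt runtime code; the ring-buffer capacity is arbitrary):

#include <stdio.h>

#define CAP 64  /* arbitrary ring-buffer capacity */

typedef struct { int buf[CAP]; int head, tail; } Fifo;

int  peek(Fifo *f, int i) { return f->buf[(f->head + i) % CAP]; }  /* read ahead, don't consume */
int  pop(Fifo *f)  { int v = f->buf[f->head]; f->head = (f->head + 1) % CAP; return v; }
void push(Fifo *f, int v) { f->buf[f->tail] = v; f->tail = (f->tail + 1) % CAP; }

/* One firing of the FIR work function: peek a window of N, push 1, pop 1. */
void fir_work(Fifo *in, Fifo *out, int N, const int wgts[])
{
    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += peek(in, i) * wgts[i];
    push(out, sum);
    pop(in);
}

int main(void)
{
    Fifo in = {{0}, 0, 0}, out = {{0}, 0, 0};
    int wgts[4] = {1, 2, 3, 4};
    for (int i = 0; i < 8; i++) push(&in, i);  /* prime the input stream */
    for (int i = 0; i < 4; i++) fir_work(&in, &out, 4, wgts);
    printf("first output: %d\n", pop(&out));   /* 0*1 + 1*2 + 2*3 + 3*4 = 20 */
    return 0;
}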
SGMS Overview
[Figure: software-pipelined schedule on PE0–PE3 – a prologue fills the pipeline, steady-state iterations overlap filter execution with DMA transfers on every PE, and an epilogue drains it; the single-processor time T1 is roughly 4× the four-processor time T4]
SGMS Phases
• Fission + processor assignment – load balance
• Stage assignment – causality, DMA overlap
• Code generation
Processor Assignment: Maximizing Throughput
• Assigns each filter to a processor
• Formulated as an optimization: minimize II, over all filters i = 1, …, N and all PEs j = 1, …, P
[Figure: example – filters A–F with workloads (W) A: 20, B: 20, C: 20, D: 30, E: 50, F: 30 mapped onto four processing elements; on a single PE one iteration takes T1 = 170, while a balanced four-PE assignment achieves the minimum II of 50 (T2 = 50) – balanced workload, maximum throughput, T1/T2 = 170/50 = 3.4]
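This step can be written as a small integer linear program. A sketch in my own notation (the paper's full formulation also handles fission; here $w_i$ is the work of filter $i$ and $a_{ij} \in \{0,1\}$ says filter $i$ runs on PE $j$):

\begin{aligned}
\text{minimize}\quad & II \\
\text{subject to}\quad & \sum_{j=1}^{P} a_{ij} = 1 && \text{for all filters } i = 1, \dots, N \\
& \sum_{i=1}^{N} w_i \, a_{ij} \le II && \text{for all PEs } j = 1, \dots, P
\end{aligned}

In the example above, whichever PE holds E carries at least 50 units of work, and $170/4 = 42.5$ rounds up to 50, so II = 50 is optimal.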
Need More Than Just Processor Assignment
• Assign filters to processors
  • Goal: equal work distribution
• Graph partitioning?
• Bin packing?
[Figure: original stream program A(5) → B(40) → C(10) → D(5) on two PEs – PE0 gets B alone, so speedup = 60/40 = 1.5; modified stream program splits B into B1 and B2 between a split (S) and join (J), giving speedup = 60/32 ≈ 2]
Filter Fission Choices
[Figure: alternative ways to fission the filters across PE0–PE3 – can we get speedup ≈ 4?]
Integrated Fission + PE Assignment
• Exact solution based on Integer Linear Programming (ILP)
• Objective function: minimize the maximal load on any PE
• Split/join overhead is factored in
• Result
  • Number of times to “split” each filter
  • Filter → processor mapping
Step 2: Forming the Software Pipeline
• To achieve speedup
  • All chunks should execute concurrently
  • Communication should be overlapped
• Processor assignment alone is insufficient information
[Figure: with A on PE0 and B on PE1, a naive schedule stalls Bi waiting on the A→B transfer each iteration; software pipelining instead overlaps Ai+2 with Bi, hiding the DMA latency]
Stage Assignment
• Preserve causality (producer-consumer dependence): if filter j consumes from filter i on the same PE, then Sj ≥ Si
• Communication-computation overlap: if i and j sit on different PEs, the DMA gets its own stage with SDMA > Si, and the consumer starts at Sj = SDMA + 1
• Data flow traversal of the stream graph
• Assign stages using the above two rules
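These two rules lend themselves to a single pass over the graph in data-flow order. A minimal C sketch (my own illustration; the graph representation and the choice of exactly one DMA stage per cross-PE edge are assumptions):

#define MAX_ACTORS 64

/* pe[i]: processor of actor i; pred[j][k]: k-th producer feeding actor j.
   Actors 0..n-1 must already be in topological (data-flow) order.
   Returns the largest stage number assigned. */
int compute_stages(int n, const int pe[], const int npred[],
                   const int pred[][MAX_ACTORS], int stage[])
{
    int max_stage = 0;
    for (int j = 0; j < n; j++) {
        stage[j] = 0;
        for (int k = 0; k < npred[j]; k++) {
            int i = pred[j][k];
            if (pe[i] == pe[j]) {
                /* Same PE: causality only, S_j >= S_i. */
                if (stage[i] > stage[j]) stage[j] = stage[i];
            } else {
                /* Cross-PE edge: the DMA occupies stage S_i + 1 and the
                   consumer starts at S_j = S_DMA + 1 = S_i + 2, so the
                   transfer overlaps computation. */
                if (stage[i] + 2 > stage[j]) stage[j] = stage[i] + 2;
            }
        }
        if (stage[j] > max_stage) max_stage = stage[j];
    }
    return max_stage;
}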
Stage Assignment Example
[Figure: the fissed example graph scheduled across PE0 and PE1 – stage 0: A, S, B1; stage 1: DMA transfers; stage 2: B2, C, J; stage 3: DMA transfers; stage 4: D]
Step 3: Code Generation for Cell • Target the Synergistic Processing Elements (SPEs) • PS3 – up to 6 SPEs • QS20 – up to 16 SPEs • One thread / SPE • Challenge • Making a collection of independent threads implement a software pipeline • Adapt kernel-only code schema of a modulo schedule
Complete Example
void spe1_work()
{
    char stage[5] = {0};   /* one predicate per pipeline stage */
    stage[0] = 1;          /* only stage 0 is live in the first iteration */
    for (int i = 0; i < MAX; i++) {
        if (stage[0]) { A(); S(); B1(); }   /* stage 0: this SPE's filters */
        if (stage[1]) { }                   /* stage 1: nothing here (DMA in flight) */
        if (stage[2]) { JtoD(); CtoD(); }   /* stage 2: start the J→D and C→D transfers */
        if (stage[3]) { }                   /* stage 3: nothing here (DMA in flight) */
        if (stage[4]) { D(); }              /* stage 4: final filter */
        /* The predicate-advance code is elided on the slide: each iteration
           shifts the stage predicates, so later stages switch on during the
           prologue and off again during the epilogue. */
        barrier();   /* keep all SPEs aligned at stage boundaries */
    }
}
[Figure: the matching timeline across SPE1, DMA1, SPE2, and DMA2 – A, S, B1 run on SPE1 while DMA1 streams AtoC, StoB2, and B1toJ to SPE2; B2, C, and J run on SPE2; DMA2 returns JtoD and CtoD; D then runs in the final stage]
SGMS (ILP) vs. Greedy (MIT method, ASPLOS ’06)
• Solver time < 30 seconds for 16 processors
SGMS Conclusions
• Streamroller
  • Efficient mapping of stream programs to multicore
  • Coarse grain software pipelining
• Performance summary
  • 14.7x speedup on 16 cores
  • Up to 35% better than greedy solution (11% on average)
• Scheduling framework
  • Tradeoff: memory space vs. load balance
  • Memory constrained (embedded) systems
  • Cache based systems
Discussion Points
• Is it possible to convert stateful filters into stateless ones?
• What if the application does not behave as you expect?
  • Filters change execution time?
  • Memory faster/slower than expected?
• Could this be adapted for a more conventional multiprocessor with caches?
• Can C code be automatically streamized?
• Now you have seen 3 forms of software pipelining:
  1) Instruction level modulo scheduling
  2) Decoupled software pipelining
  3) Stream graph modulo scheduling
• Where else can it be used?
“Flextream: Adaptive Compilation of Streaming Applications for Heterogeneous Architectures,” A. Hormati, Y. Choi, M. Kudlur, R. Rabbah, T. Mudge, and S. Mahlke, Proc. Intl. Conf. on Parallel Architectures and Compilation Techniques, Sep. 2009
Static Versus Dynamic Scheduling
• Graph modulo scheduling is performed on the stream graph statically.
• What happens in case of dynamic resource changes?
[Figure: a stream graph (A → Splitter → B1–B4 → Joiner → C → Splitter → D1, D2 → Joiner → E → F) statically mapped onto cores with local memories; at runtime one core becomes unavailable – which core takes its work?]
Overview of Flextream
Goal: perform adaptive stream graph modulo scheduling.
• Static phases – find an optimal schedule for a virtualized member of a family of processors:
  • Prepass Replication – adjusts the amount of parallelism for the target system by replicating actors
  • Work Partitioning – finds an optimal modulo schedule for the virtualized processor
• Dynamic phases – perform light-weight adaptation of the schedule to the current configuration of the target hardware:
  • Partition Refinement – tunes the actor-to-processor mapping to the real configuration of the underlying hardware (load balance)
  • Stage Assignment – specifies how actors execute in time in the new actor-to-processor mapping
  • Buffer Allocation – tries to efficiently allocate the storage requirements of the new schedule into the available memory units
• Output: MSL commands
Overall Execution Flow
[Figure: runtime timeline – every application may see multiple iterations of: a resource-change request, the resource change being granted, and the schedule being adapted by the dynamic phases]
Prepass Replication [static]
[Figure: example graph A(10) → B(86) → C(246) → D(326) → E(566) → F with splitters (S0–S2) and joiners (J0–J2) inserted – the heavy actors are replicated (C into C0–C3, D into D0–D1, E into E0–E3) so the per-processor loads P0–P7 move from wildly unbalanced (10 up to 566) to roughly balanced (about 140–185 each)]
Partition Refinement [dynamic 1]
• Available resources at runtime can be more limited than the resources in the static target architecture.
• Partition refinement tunes the actor-to-processor mapping to the active configuration.
• A greedy iterative algorithm is used to achieve this goal.
Partition Refinement Example
• Pick the processors with the most actors
• Sort their actors by work
• Find the processor with maximum work
• Assign the minimum-work actors to other processors until the load falls below a threshold
[Figure: the eight-processor static mapping is refined down to the smaller runtime configuration – actors from the dropped processors (splitters, joiners, and replicas such as C0, C1, E2) are reassigned step by step, and the per-processor loads converge toward balance]
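A minimal C sketch of such a greedy rebalancing loop (my own reconstruction of the steps above; the data structures, the 64-PE cap, and the stopping threshold are assumptions):

#define MAX_PES 64

typedef struct { int work; int pe; } Actor;

/* Greedy rebalance: repeatedly take the smallest actor off the most
   loaded processor and hand it to the least loaded one, until the peak
   load drops below `threshold` (or we give up after n moves). */
void refine(Actor a[], int n, int num_pe, int threshold)
{
    int load[MAX_PES] = {0};
    for (int i = 0; i < n; i++) load[a[i].pe] += a[i].work;

    for (int moves = 0; moves < n; moves++) {
        int maxp = 0, minp = 0;
        for (int p = 1; p < num_pe; p++) {
            if (load[p] > load[maxp]) maxp = p;
            if (load[p] < load[minp]) minp = p;
        }
        if (load[maxp] <= threshold || maxp == minp) break;

        int victim = -1;  /* smallest actor on the overloaded PE */
        for (int i = 0; i < n; i++)
            if (a[i].pe == maxp && (victim < 0 || a[i].work < a[victim].work))
                victim = i;
        if (victim < 0) break;

        load[maxp] -= a[victim].work;  /* move it to the least loaded PE */
        load[minp] += a[victim].work;
        a[victim].pe = minp;
    }
}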
Stage Assignment [dynamic 2]
• Processor assignment only specifies how actors are overlapped across processors.
• Stage assignment finds how actors are overlapped in time.
• Relative start times of the actors are based on stage numbers.
• DMA operations get a separate stage.
Stage Assignment Example
[Figure: the refined mapping laid out in time as stages 0–18 – A starts at stage 0 and B at stage 2, the replicated C, D, and E actors with their splitters and joiners occupy the middle stages, and F finishes at stage 18; DMA operations sit in their own stages between producers and consumers]
Flextream Conclusions
• Static scheduling approaches are promising, but not enough.
• Dynamic adaptation is necessary for future systems.
• Flextream provides a hybrid static/dynamic approach to improve efficiency.