EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

EECS 583 – Class 17Research Topic 1Decoupled Software Pipelining University of Michigan November 9, 2011

Announcements + Reading Material • 2nd paper review due today • Should have submitted to andrew.eecs.umich.edu:/y/submit • Next Monday – Midterm exam in class • Today’s class reading • “Automatic Thread Extraction with Decoupled Software Pipelining,” G. Ottoni, R. Rangan, A. Stoler, and D. I. August, Proceedings of the 38th IEEE/ACM International Symposium on Microarchitecture, Nov. 2005. • Next class reading (Wednes Nov 16) • “Spice: Speculative Parallel Iteration Chunk Execution,” E. Raman, N. Vachharajani, R. Rangan, and D. I. August, Proc 2008 Intl. Symposium on Code Generation and Optimization, April 2008.

Midterm Exam • When: Monday, Nov 14, 2011, 10:40-12:30 • Where • 1005 EECS • Uniquenames starting with A-H go here • 3150 Dow (our classroom) • Uniquenames starting with I-Z go here • What to expect • Open book/notes, no laptops • Apply techniques we discussed in class on examples • Reason about solving compiler problems – why things are done • A couple of thinking problems • No LLVM code • Reasonably long but you should finish • Last 2 years exams are posted on the course website • Note – Past exams may not accurately predict future exams!!

Midterm Exam • Office hours between now and Monday if you have questions • Daya: Thurs and Fri 3-5pm • Scott: Wednes 4:30-5:30, Fri 4:30-5:30 • Studying • Yes, you should study even though its open notes • Lots of material that you have likely forgotten • Refresh your memories • No memorization required, but you need to be familiar with the material to finish the exam • Go through lecture notes, especially the examples! • If you are confused on a topic, go through the reading • If still confused, come talk to me or Daya • Go through the practice exams as the final step

Exam Topics • Control flow analysis • Control flow graphs, Dom/pdom, Loop detection • Trace selection, superblocks • Predicated execution • Control dependence analysis, if-conversion, hyperblocks • Can ignore control height reduction • Dataflow analysis • Liveness, reaching defs, DU/UD chains, available defs/exprs • Static single assignment • Optimizations • Classical: Dead code elim, constant/copy prop, CSE, LICM, induction variable strength reduction • ILP optimizations - unrolling, renaming, tree height reduction, induction/accumulator expansion • Speculative optimization – like HW 1

Exam Topics - Continued • Acyclic scheduling • Dependence graphs, Estart/Lstart/Slack, list scheduling • Code motion across branches, speculation, exceptions • Software pipelining • DSA form, ResMII, RecMII, modulo scheduling • Make sure you can modulo schedule a loop! • Execution control with LC, ESC • Register allocation • Live ranges, graph coloring • Research topics • Can ignore these

Last Class • Scientific codes – Successful parallelization • KAP, SUIF, Parascope, gcc w/ Graphite • Affine array dependence analysis • DOALL parallelization • C programs • Not dominated by array accesses – classic parallelization fails • Speculative parallelization – Hydra, Stampede, Speculative multithreading • Profiling to identify statistical DOALL loops • But not all loops DOALL, outer loops typically not!! • This class – Parallelizing loops with dependences

What About Non-Scientific Codes??? 1 5 3 2 0 4 2 4 5 1 0 3 Core 1 Core 2 C:1 C:2 X:1 X:2 C:4 C:3 X:4 X:3 C:5 C:6 X:5 X:6 Scientific Codes (FORTRAN-like) General-purpose Codes (legacy C/C++) for(i=1; i<=N; i++) // C a[i] = a[i] + 1; // X while(ptr = ptr->next) // LD ptr->val = ptr->val + 1; // X Core 1 Core 2 Independent Multithreading (IMT) Example: DOALL parallelization Cyclic Multithreading (CMT) Example: DOACROSS [Cytron, ICPP 86] LD:1 X:1 LD:2 X:2 LD:3 LD:4 X:3 X:4 LD:5 X:5 LD:6

Alternative Parallelization Approaches 0 1 2 4 5 3 4 3 2 1 0 5 Core 1 Core 2 Core 1 Core 2 LD:1 LD:1 LD:2 X:1 X:1 LD:2 X:2 X:2 LD:3 LD:3 X:3 LD:4 LD:4 X:3 X:4 X:4 LD:5 LD:5 LD:6 X:5 X:5 LD:6 while(ptr = ptr->next) // LD ptr->val = ptr->val + 1; // X Pipelined Multithreading (PMT) Example: DSWP [PACT 2004] Cyclic Multithreading (CMT)

Comparison: IMT, PMT, CMT 0 4 0 1 3 4 5 5 2 1 2 1 0 3 5 4 3 2 CMT PMT Core 1 Core 2 LD:1 Core 1 Core 2 Core 1 Core 2 C:1 C:2 LD:1 LD:2 X:1 X:1 X:2 X:1 LD:2 X:2 LD:3 X:2 C:4 C:3 LD:3 X:3 LD:4 X:4 LD:4 X:3 X:3 X:4 LD:5 X:4 C:5 C:6 LD:5 LD:6 X:5 X:5 X:6 X:5 LD:6 IMT

Comparison: IMT, PMT, CMT 4 3 2 0 5 3 5 4 1 0 1 2 3 4 2 5 1 0 IMT CMT PMT Core 1 Core 2 Core 1 Core 2 Core 1 Core 2 C:1 C:2 LD:1 LD:1 X:1 X:2 LD:2 X:1 X:1 LD:2 X:2 X:2 C:4 C:3 LD:3 LD:3 X:4 X:3 LD:4 X:3 LD:4 X:3 X:4 X:4 C:5 C:6 LD:5 LD:5 1 iter/cycle 1 iter/cycle X:5 X:6 LD:6 X:5 X:5 LD:6 1 iter/cycle lat(comm) = 1:

Comparison: IMT, PMT, CMT 0 0 1 3 4 5 5 4 2 2 1 0 5 4 3 3 2 1 IMT CMT PMT Core 1 Core 2 Core 1 Core 2 LD:1 LD:1 Core 1 Core 2 C:1 C:2 X:1 LD:2 X:1 X:2 LD:2 X:1 LD:3 C:4 C:3 X:2 X:2 LD:4 X:4 X:3 X:3 LD:3 LD:5 C:5 C:6 X:4 X:3 LD:6 X:5 X:6 1 iter/cycle 1 iter/cycle lat(comm) = 1: 1 iter/cycle 1 iter/cycle 1 iter/cycle lat(comm) = 2: 0.5 iter/cycle

Comparison: IMT, PMT, CMT 5 4 3 2 1 0 1 5 0 3 4 5 4 3 2 1 0 2 C:1 C:2 X:1 X:2 C:4 C:3 X:4 X:3 C:5 C:6 X:5 X:6 Thread-local Recurrences  Fast Execution IMT PMT CMT Core 1 Core 2 Core 1 Core 2 Core 1 Core 2 LD:1 LD:1 LD:2 X:1 X:1 LD:2 LD:3 X:2 X:2 LD:4 X:3 LD:5 LD:3 X:4 LD:6 X:3 Cross-thread Dependences  Wide Applicability

Our Objective: Automatic Extraction of Pipeline Parallelism using DSWP 197.parser Find English Sentences Parse Sentences (95%) Emit Results Decoupled Software Pipelining PS-DSWP (Spec DOALL Middle Stage)

Decoupled Software Pipelining

[MICRO 2005] Decoupled Software Pipelining (DSWP) register control intra-iteration loop-carried communication queue A: while(node) B: ncost = doit(node); C: cost += ncost; D: node = node->next; Dependence Graph Thread 1 Thread 2 A D DAGSCC D A B B A D B C C C Inter-thread communication latency is a one-time cost

register control memory L1: intra-iteration loop-carried Aux: Implementing DSWP DFG

register control memory intra-iteration loop-carried Optimization: Node SplittingTo Eliminate Cross Thread Control L1 L2

register control memory intra-iteration loop-carried Optimization: Node Splitting To Reduce Communication L1 L2

register control memory intra-iteration loop-carried Constraint: Strongly Connected Components Consider: Solution: DAGSCC Eliminates pipelined/decoupled property

2 Extensions to the Basic Transformation • Speculation • Break statistically unlikely dependences • Form better-balanced pipelines • Parallel Stages • Execute multiple copies of certain “large” stages • Stages that contain inner loops perfect candidates

Why Speculation? register control intra-iteration loop-carried communication queue A: while(node) B: ncost = doit(node); C: cost += ncost; D: node = node->next; A D Dependence Graph DAGSCC D B B A C C

Why Speculation? register control intra-iteration loop-carried communication queue A: while(cost < T && node) B: ncost = doit(node); C: cost += ncost; D: node = node->next; A D Dependence Graph DAGSCC D B A B C D B A C C Predictable Dependences

Why Speculation? register control intra-iteration loop-carried communication queue A: while(cost < T && node) B: ncost = doit(node); C: cost += ncost; D: node = node->next; D Dependence Graph DAGSCC D B B A C C A Predictable Dependences

Execution Paradigm register control intra-iteration loop-carried communication queue D DAGSCC Misspeculation detected B C Misspeculation Recovery Rerun Iteration 4 A

Understanding PMT Performance 4 3 5 0 2 1 1 2 3 4 5 0 Core 1 Core 2 Core 1 Core 2 A:1 B:1 A:1 A:2 B:1 A:2 B:2 C:1 B:2 A:3 Rate tiis at least as large as the longest dependence recurrence. NP-hard to find longest recurrence. Large loops make problem difficult in practice. Idle Time B:3 A:4 A:3 B:3 C:3 B:4 A:5 A:6 B:5 Slowest thread: 1 cycle/iter 2 cycle/iter Iteration Rate: 1 iter/cycle 0.5 iter/cycle

Selecting Dependences To Speculate register control intra-iteration loop-carried communication queue A: while(cost < T && node) B: ncost = doit(node); C: cost += ncost; D: node = node->next; D Dependence Graph DAGSCC Thread 1 D B Thread 2 B A C Thread 3 C A Thread 4

Detecting Misspeculation A1: while(consume(4)) D: node = node->next produce({0,1},node); Thread 1 A2: while(consume(5)) B: ncost = doit(node); produce(2,ncost); D2: node = consume(0); D DAGSCC Thread 2 B A3: while(consume(6)) B3: ncost = consume(2); C: cost += ncost; produce(3,cost); Thread 3 C A: while(cost < T && node) B4: cost = consume(3); C4: node = consume(1); produce({4,5,6},cost < T && node); A Thread 4

Detecting Misspeculation A1: while(TRUE) D: node = node->next produce({0,1},node); Thread 1 D DAGSCC A2: while(TRUE) B: ncost = doit(node); produce(2,ncost); D2: node = consume(0); Thread 2 B A3: while(TRUE) B3: ncost = consume(2); C: cost += ncost; produce(3,cost); Thread 3 C A: while(cost < T && node) B4: cost = consume(3); C4: node = consume(1); produce({4,5,6},cost < T && node); A Thread 4

Detecting Misspeculation A1: while(TRUE) D: node = node->next produce({0,1},node); Thread 1 D DAGSCC A2: while(TRUE) B: ncost = doit(node); produce(2,ncost); D2: node = consume(0); Thread 2 B A3: while(TRUE) B3: ncost = consume(2); C: cost += ncost; produce(3,cost); Thread 3 C A: while(cost < T && node) B4: cost = consume(3); C4: node = consume(1); if(!(cost < T && node)) FLAG_MISSPEC(); A Thread 4

Breaking False Memory Dependences register control intra-iteration loop-carried communication queue Oldest Version Committed by Recovery Thread Dependence Graph D B A Memory Version 3 C false memory

Adding Parallel Stages to DSWP 5 3 2 4 6 Core 1 Core 2 Core 3 0 LD:1 1 LD:2 while(ptr = ptr->next) // LD ptr->val = ptr->val + 1; // X LD:3 X:1 LD:4 LD = 1 cycle X = 2 cycles X:2 LD:5 X:3 Comm. Latency = 2 cycles LD:6 X:4 Throughput DSWP: 1/2 iteration/cycle DOACROSS: 1/2 iteration/cycle PS-DSWP: 1 iteration/cycle LD:7 X:5 7 LD:8

Thread Partitioning p = list; sum = 0; A: while (p != NULL) { B: id = p->id; E: q = p->inner_list; C: if (!visited[id]) { D: visited[id] = true; F: while (foo(q))‏ G: q = q->next; H: if (q != NULL)‏ I: sum += p->value; } J: p = p->next; } 10 10 10 10 5 5 50 50 5 Reduction 3

Thread Partitioning: DAGSCC

Thread Partitioning 20 • Merging Invariants • No cycles • No loop-carried dependence inside a doall node 10 15 5 100 5 3

Thread Partitioning 20 Treated as sequential 10 15 113

Thread Partitioning 45 113 Modified MTCG[Ottoni, MICRO’05] to generate code from partition

Discussion Point 1 – Speculation • How do you decide what dependences to speculate? • Look solely at profile data? • What about code structure? • How do you manage speculation in a pipeline? • Traditional definition of a transaction is broken • Transaction execution spread out across multiple cores

Discussion Point 2 – Pipeline Structure • When is a pipeline a good/bad choice for parallelization? • Is pipelining good or bad for cache performance? • Is DOALL better/worse for cache? • Can a pipeline be adjusted when the number of available cores increases/decreases?

EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining