CS433: Computer System Organization. Luddy Harrison. Static Exploitation of Instruction Level Parallelism. Dependences. From an earlier statement to a later statement (in program execution order). This is a time ordering, not a visual or textual ordering. Three types: flow, anti, and output.
CS433: Computer System Organization Luddy Harrison Static Exploitation of Instruction Level Parallelism
Dependences • From an earlier statement to a later statement (in program execution order). • This is a time ordering, not a visual or textual ordering. • Three types: • Flow: A = (write), then = A (read) • Anti: = A (read), then A = (write) • Output: A = (write), then A = (write)
Dependences • Flow: • Earlier statement writes same storage location that later statement reads • Storage location may be register, memory, anything that holds state. • Anti: • Earlier statement reads same storage location that later statement writes • Output: • Earlier statement writes same storage location that later statement writes
Elements Required for Dependence • Statement instances • Two instructions / statements or two instances of a single instruction / statement • Dynamic instances thereof – statements and instructions may be executed many times • Time • earlier statement and later statement • ordered by program execution order • Location • the two statements must access the same location • this may be a register or memory or any other kind of state • Access (Read and/or Write) • one of the accesses must be a write (modification)
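The four elements above can be sketched as a small classifier. This is only an illustration: the names `dep_kind` and `classify`, and the use of an integer to stand for a storage location, are assumptions made for the example and do not come from the slides.

```c
#include <stdbool.h>

/* Kind of dependence from an earlier access to a later access. */
typedef enum { DEP_NONE, DEP_FLOW, DEP_ANTI, DEP_OUTPUT } dep_kind;

/*
 * Classify the dependence between two accesses.
 * The first access must precede the second in program execution order.
 * A location is modelled as an int (hypothetical: a register number or
 * a memory address); only equality of locations matters here.
 */
dep_kind classify(int earlier_loc, bool earlier_writes,
                  int later_loc, bool later_writes)
{
    if (earlier_loc != later_loc)
        return DEP_NONE;                  /* different storage locations */
    if (earlier_writes && !later_writes)
        return DEP_FLOW;                  /* write, then read  */
    if (!earlier_writes && later_writes)
        return DEP_ANTI;                  /* read, then write  */
    if (earlier_writes && later_writes)
        return DEP_OUTPUT;                /* write, then write */
    return DEP_NONE;                      /* two reads never conflict */
}
```

For instance, `classify(3, true, 3, false)` reports a flow dependence, while two reads of the same location report none, because at least one of the accesses must be a write.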
Static Exploitation of ILP • Static exploitation requires either • A programmer who writes assembly language, or • A compiler to organize the instructions of the program such that there is sufficient parallelism between them • We will consider compiler techniques for the most part • I will mention when an assembly programmer might do better or differently
Multiple-Issue versus VLIW • What’s the difference? • Instructions in a VLIW “line” must not depend on one another • It is permitted, however, to use as a destination a register that another instruction in the same line is using as a source. The source is read before it is overwritten as a destination • This is a very important source of additional parallelism in VLIW machines! • In a multiple-issue-capable CPU (but not a VLIW), if adjacent instructions are dependent, a stall occurs between the instructions, but the program still works.
Multiple Issue versus VLIW
    add R0, R1, R2
    add R3, R4, R5
    add R1, R0, R6
    add R4, R3, R7
What does this program do if run on a multiple-issue processor that can run up to four instructions in parallel? What does it do if run as a single instruction line on a 4-wide VLIW?
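One way to explore the question is a tiny register-file simulation. It is a sketch under two assumptions: a multiple-issue CPU preserves sequential semantics (it stalls as needed), and a VLIW line reads all of its sources before writing any destination. The helper `final_reg` and the initial register values (Rn = n) are invented for the example.

```c
enum { NUM_REGS = 8, NUM_INSNS = 4 };

/* add dst, src1, src2  means  R[dst] = R[src1] + R[src2] */
typedef struct { int dst, src1, src2; } add_insn;

static const add_insn program[NUM_INSNS] = {
    {0, 1, 2},   /* add R0, R1, R2 */
    {3, 4, 5},   /* add R3, R4, R5 */
    {1, 0, 6},   /* add R1, R0, R6 */
    {4, 3, 7},   /* add R4, R3, R7 */
};

/* Sequential semantics: each instruction sees the results of earlier ones. */
void run_sequential(int regs[NUM_REGS])
{
    for (int i = 0; i < NUM_INSNS; ++i)
        regs[program[i].dst] = regs[program[i].src1] + regs[program[i].src2];
}

/* VLIW line semantics: read every source first, then write every result. */
void run_vliw_line(int regs[NUM_REGS])
{
    int results[NUM_INSNS];
    for (int i = 0; i < NUM_INSNS; ++i)
        results[i] = regs[program[i].src1] + regs[program[i].src2];
    for (int i = 0; i < NUM_INSNS; ++i)
        regs[program[i].dst] = results[i];
}

/* Run with the assumed initial state Rn = n; return the final value of R[r]. */
int final_reg(int use_vliw, int r)
{
    int regs[NUM_REGS];
    for (int i = 0; i < NUM_REGS; ++i)
        regs[i] = i;
    if (use_vliw)
        run_vliw_line(regs);
    else
        run_sequential(regs);
    return regs[r];
}
```

With these initial values the sequential run leaves R1 = 9 and R4 = 16, because the third and fourth adds see the new R0 and R3. The single VLIW line leaves R1 = 6 and R4 = 10, because it reads the old R0 and R3.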
Basic Pipeline Scheduling
Original:
    load R3, [R1 + 12]
    add R3, R3, 4
    load R0, [R1 + 16]
    add R0, R0, 4
Reordered:
    load R3, [R1 + 12]
    load R0, [R1 + 16]
    add R3, R3, 4
    add R0, R0, 4
Each add depends on the load of the same register (the slide marks each dependent pair with a color). Assume there is a one-cycle stall between a load and an instruction that uses the load’s result. We reorder the instructions to eliminate these stalls. Here we don’t need to change the instructions, but that is unusual. Usually we are not so lucky.
Basic Pipeline Scheduling
Original:
    load R3, [R1 + 12]
    add R0, R0, R3
    load R3, [R1 + 16]
    add R0, R0, R3
After rename:
    load R3, [R1 + 12]
    add R0, R0, R3
    load R4, [R1 + 16]
    add R0, R0, R4
After reorder:
    load R3, [R1 + 12]
    load R4, [R1 + 16]
    add R0, R0, R3
    add R0, R0, R4
We would like to do the same transformation as before, but R3 is used as the destination for both loads. We rename R3 to R4 in the second two instructions. This breaks the anti-dependence from the first add (which reads R3) to the second load (which writes R3), as well as the output dependence between the two loads.
Loop Unrolling
Original:
    move R2, 0
    move R3, 100
L:  load R0, [R1 + 16]
    add R2, R2, R0
    add R1, R1, 24
    sub R3, R3, 1
    bge L
After copying the loop body (copy):
    move R2, 0
    move R3, 100
L:  load R0, [R1 + 16]
    add R2, R2, R0
    add R1, R1, 24
    sub R3, R3, 1
    load R0, [R1 + 16]
    add R2, R2, R0
    add R1, R1, 24
    sub R3, R3, 1
    bge L
Is this transformation always legal? What are the required conditions?
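One condition worth checking is the trip count: copying the body twice without further changes is only safe when the number of iterations is a multiple of two, as it is for the count of 100 used here. The following C sketch is not the slide's code; it sums a plain `int` array rather than a field of 24-byte records, and the function names and the `demo_sum` helper are invented. It shows the usual remedy of a cleanup step for the leftover iteration.

```c
#include <stddef.h>

/* Reference version: one element per iteration. */
long sum_rolled(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}

/* Unrolled by two, with a cleanup step so that odd n is still correct. */
long sum_unrolled2(const int *a, size_t n)
{
    long sum = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        int x0 = a[i];        /* two independent loads ...            */
        int x1 = a[i + 1];
        sum += x0;            /* ... feeding the same accumulator     */
        sum += x1;
    }
    if (i < n)                /* leftover iteration when n is odd     */
        sum += a[i];
    return sum;
}

/* Demo helper: sum 1..n (n capped at 101) with either version. */
long demo_sum(int unrolled, size_t n)
{
    int data[101];
    if (n > 101)
        n = 101;
    for (size_t k = 0; k < n; ++k)
        data[k] = (int)k + 1;
    return unrolled ? sum_unrolled2(data, n) : sum_rolled(data, n);
}
```

Both versions agree for even and odd lengths; without the final `if`, the unrolled loop would silently drop the last element whenever `n` is odd.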
Loop Unrolling: Register Renaming And Normalization of Induction Variables
Before:
    move R2, 0
    move R3, 100
L:  load R0, [R1 + 16]
    add R2, R2, R0
    add R1, R1, 24
    sub R3, R3, 1
    load R0, [R1 + 16]
    add R2, R2, R0
    add R1, R1, 24
    sub R3, R3, 1
    bge L
After:
    move R2, 0
    move R3, 100
L:  load R0, [R1 + 16]
    add R2, R2, R0
    add R5, R1, 24
    sub R3, R3, 1
    load R4, [R1 + 40]
    add R2, R2, R4
    add R1, R5, 24
    sub R3, R3, 1
    bge L
Loop Unrolling: Scheduling After Register Renaming
Before:
    move R2, 0
    move R3, 100
L:  load R0, [R1 + 16]
    add R2, R2, R0
    add R5, R1, 24
    sub R3, R3, 1
    load R4, [R1 + 40]
    add R2, R2, R4
    add R1, R5, 24
    sub R3, R3, 1
    bge L
After:
    move R2, 0
    move R3, 100
L:  load R0, [R1 + 16]
    load R4, [R1 + 40]
    add R2, R2, R0
    add R2, R2, R4
    add R5, R1, 24
    add R1, R5, 24
    sub R3, R3, 1
    sub R3, R3, 1
    bge L
Loop Unrolling: Cleanup
Before:
    move R2, 0
    move R3, 100
L:  load R0, [R1 + 16]
    load R4, [R1 + 40]
    add R2, R2, R0
    add R2, R2, R4
    add R1, R1, 24
    add R1, R1, 24
    sub R3, R3, 1
    sub R3, R3, 1
    bge L
After:
    move R2, 0
    move R3, 100
L:  load R0, [R1 + 16]
    load R4, [R1 + 40]
    add R2, R2, R0
    add R2, R2, R4
    add R1, R1, 48
    sub R3, R3, 2
    bge L
Register Allocation and ILP • Can you imagine a world in which assigning seats on an airplane and scheduling people on flights were considered two different activities? • “First we schedule everybody who wants to fly to Memphis on March 19th on flight 459 at 10AM. Then, we assign them seats.” • Or, “First we assign everybody a seat. Next we look to see what day and time they want to fly.” • This is what we do with registers and instruction schedules (roughly speaking)
So-Called “Phase Ordering” Problem: Schedule First, Using Virtual Registers
Original:
    load V0
    add V1, V0
    load V2
    add V1, V2
    load V3
    add V1, V3
    load V4
    add V1, V4
    load V5
    add V1, V5
    load V6
    add V1, V6
After schedule:
    load V0
    load V2
    load V3
    load V4
    load V5
    load V6
    add V1, V0
    add V1, V2
    add V1, V3
    add V1, V4
    add V1, V5
    add V1, V6
What if we only have 4 registers (R0, R1, R2, R3)? Our 7 virtual registers are simultaneously alive. We must spill in order to put these 7 virtual registers into 4 physical registers.
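The register pressure of the two orderings can be checked with a small backward liveness scan. This is a sketch, not a real allocator: the instruction encoding, the function names, and the assumption that the accumulator V1 is live on entry and on exit are all made up for the example.

```c
#include <stdbool.h>

enum { NUM_VREGS = 7, ACC = 1, NUM_OPERANDS = 6 };

/*
 * Two instruction shapes only:
 *   is_load == true  : "load Vreg"     defines Vreg
 *   is_load == false : "add V1, Vreg"  uses V1 and Vreg, defines V1
 */
typedef struct { bool is_load; int reg; } insn;

/* Largest number of virtual registers live at any point in the code. */
int max_live(const insn *code, int n)
{
    bool live[NUM_VREGS] = { false };
    int max = 1;
    live[ACC] = true;                    /* assumption: V1 is needed afterwards */
    for (int i = n - 1; i >= 0; --i) {   /* walk backwards through the code */
        if (code[i].is_load) {
            live[code[i].reg] = false;   /* the value is born at the load */
        } else {
            live[ACC] = true;            /* add reads (and rewrites) V1 */
            live[code[i].reg] = true;    /* add reads its operand */
        }
        int count = 0;
        for (int r = 0; r < NUM_VREGS; ++r)
            count += live[r] ? 1 : 0;
        if (count > max)
            max = count;
    }
    return max;
}

static const int operands[NUM_OPERANDS] = { 0, 2, 3, 4, 5, 6 };

/* load, add, load, add, ... */
int pressure_interleaved(void)
{
    insn code[2 * NUM_OPERANDS];
    for (int k = 0; k < NUM_OPERANDS; ++k) {
        code[2 * k]     = (insn){ true,  operands[k] };
        code[2 * k + 1] = (insn){ false, operands[k] };
    }
    return max_live(code, 2 * NUM_OPERANDS);
}

/* all loads first, then all adds */
int pressure_loads_first(void)
{
    insn code[2 * NUM_OPERANDS];
    for (int k = 0; k < NUM_OPERANDS; ++k) {
        code[k]                = (insn){ true,  operands[k] };
        code[NUM_OPERANDS + k] = (insn){ false, operands[k] };
    }
    return max_live(code, 2 * NUM_OPERANDS);
}
```

Under these assumptions the interleaved order never has more than 2 values live, while the loads-first order reaches 7, which exceeds the 4 physical registers and forces spills.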
So-Called “Phase Ordering” Problem: Allocate First
Original:
    load V0
    add V1, V0
    load V2
    add V1, V2
    load V3
    add V1, V3
    load V4
    add V1, V4
    load V5
    add V1, V5
    load V6
    add V1, V6
After allocate:
    load R0
    add R1, R0
    load R0
    add R1, R0
    load R0
    add R1, R0
    load R0
    add R1, R0
    load R0
    add R1, R0
    load R0
    add R1, R0
But now we can’t move the position of even one instruction in the resulting code: every load writes R0 and every add reads it, so reusing R0 has introduced anti and output dependences between all neighboring instructions.
Question • Why do I say that this is a “so-called” phase ordering problem? • What might we do to solve this problem?
Loops and Dependence Distance • The number of loop iterations between the execution of the earlier statement and the execution of the later statement MUST BE >= 0 (why?)
Examples of Dependence
    for (i=0; i<100; ++i) { x = 0; }
Executed statement instances:
    x = 0;
    x = 0;
    x = 0;
    ...
There is a sequence of output dependences involving the instances of the statement “x = 0”.
Examples of Dependence
    for (i=0; i<100; ++i) x = x + 1;
Executed statement instances:
    x = x + 1;
    x = x + 1;
    x = x + 1;
    x = x + 1;
    ...
Here there are flow, anti, and output dependences. Note that we are only concerned with pairs of accesses that are ordered in time: earlier_access → later_access
Examples of Dependence for (i=0; i<100; ++i) a[i] = a[i - 2] + 1;
Examples of Dependence for (i=0; i<100; ++i) a[i] = a[i + 2] + 1;
Examples of Dependence for (i=0; i<100; ++i) a[i] = a[i] + 1;
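The three array examples can be compared by brute force. The sketch below walks the accesses of `a[i] = a[i + k] + 1` in execution order and reports the distance, in iterations, of the first flow and anti dependence it meets. The names `dep_summary` and `analyze` and the padding constant are assumptions of the example; the padding merely stands in for the elements just outside `a[0..99]` that the loops with k = -2 and k = +2 touch.

```c
#include <assert.h>

enum { N = 100, PAD = 8 };   /* PAD keeps index i + k in range for |k| <= PAD */

typedef struct {
    int flow_distance;   /* iterations from a write to a later read, or -1 if none */
    int anti_distance;   /* iterations from a read to a later write, or -1 if none */
} dep_summary;

/* Analyze: for (i = 0; i < N; ++i) a[i] = a[i + k] + 1; */
dep_summary analyze(int k)
{
    int last_write[N + 2 * PAD];   /* iteration that last wrote each element */
    int last_read[N + 2 * PAD];    /* iteration that last read each element  */
    dep_summary s = { -1, -1 };

    assert(k >= -PAD && k <= PAD);
    for (int e = 0; e < N + 2 * PAD; ++e)
        last_write[e] = last_read[e] = -1;

    for (int i = 0; i < N; ++i) {
        int r = i + k + PAD;       /* element read in iteration i    */
        int w = i + PAD;           /* element written in iteration i */

        /* Within the statement, the read happens before the write. */
        if (last_write[r] >= 0 && s.flow_distance < 0)
            s.flow_distance = i - last_write[r];
        last_read[r] = i;

        if (last_read[w] >= 0 && s.anti_distance < 0)
            s.anti_distance = i - last_read[w];
        last_write[w] = i;
    }
    return s;
}
```

Under this model, `analyze(-2)` finds a flow dependence of distance 2, `analyze(2)` finds an anti dependence of distance 2, and `analyze(0)` finds an anti dependence of distance 0 (the read and the write of `a[i]` fall in the same iteration). None of the three loops has an output dependence, since each element is written once.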
What if we write it this way:
    for (i=0; i<100; ++i) {
        t = a[i];
        a[i] = t + 1;
    }
Dependences are actually between accesses and not between statements or instructions. When we bundle multiple accesses into single statements or instructions, there may be internal flow within the instructions or statements that is in fact a dependence in disguise.
Software Pipelining
Before (each column is one iteration):
    A A A A A A
    B B B B B B
    C C C C C C
After (B and C are skewed to later columns):
    A A A A A A
      B B B B B B
        C C C C C C
If A, B and C within a single iteration have dependences, but across iterations do not, then this transformation makes the three statements in a loop body independent.
Anti Dependences and VLIW / Superscalar
    a = 0
    b = a + 1
    c = b + 2
Flow dependences make this program fragment sequential.
Anti Dependences and VLIW / Superscalar
Reversed order:
    c = b + 2
    b = a + 1
    a = 0
As a single VLIW line:
    c = b + 2 || b = a + 1 || a = 0
Register anti-dependences can be made inconsequential in many cases.
Software Pipelining and Loop Reversal
    Before:  ABC  ABC  ABC  ABC
    After:   A  BA  CBA  CBA  CB  C
Loop Reversal for (i=0; i<100; ++i) { A B C }
Loop Reversal
    A
    B A
    for (i=1; i<99; ++i) { C B A }
    C B
    C
SW Pipelining and Flow and Anti Dependences
    for (i=0; i<100; ++i) {
        t = A[i];
        s = t + 1;
        A[i] = s;
    }
The flow dependences (blue arrows on the slide) run from t = A[i] to s = t + 1 through t, and from s = t + 1 to A[i] = s through s.
SW Pipelining Changes Flow Dependences into Anti Dependences
    t = A[0];        // iteration 0
    s = t + 1;       // iteration 0
    t = A[1];        // iteration 1
    for (i=1; i<99; ++i) {
        A[i - 1] = s;
        s = t + 1;
        t = A[i + 1];
    }
    A[98] = s;       // iteration 98
    s = t + 1;       // iteration 99
    A[99] = s;       // iteration 99
The anti dependences (red arrows on the slide) are inside the loop body: A[i - 1] = s reads s before s = t + 1 overwrites it, and s = t + 1 reads t before t = A[i + 1] overwrites it.
Flow and Anti Dependences [Figure: the A/B/C iteration grid before and after skewing.] After pipelining, flow dependences are cross-iteration and anti dependences are intra-iteration.
Prologue and Epilogue [Figure: the skewed A/B/C schedule; the incomplete columns at the start form the prologue and the incomplete columns at the end form the epilogue.]
Prologue and Epilogue
    t = A[0];        // A
    s = t + 1;       // B
    t = A[1];        // A
    for (i=1; i<99; ++i) {
        A[i - 1] = s;    // C
        s = t + 1;       // B
        t = A[i + 1];    // A
    }
    A[98] = s;       // C
    s = t + 1;       // B
    A[99] = s;       // C
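A quick way to gain confidence in a prologue and epilogue is to run the pipelined loop next to the original and compare the results. The sketch below does that for the increment loop; the function names, the array contents, and the `mismatches` helper are invented for the example, and the pipelined version assumes N is at least 2.

```c
enum { N = 100 };

/* Original loop: each iteration runs A, B, C in order. */
void increment_original(int A[N])
{
    for (int i = 0; i < N; ++i) {
        int t = A[i];       /* A */
        int s = t + 1;      /* B */
        A[i] = s;           /* C */
    }
}

/* Software-pipelined loop with explicit prologue and epilogue (needs N >= 2). */
void increment_pipelined(int A[N])
{
    int t, s;

    t = A[0];               /* prologue: A of iteration 0 */
    s = t + 1;              /*           B of iteration 0 */
    t = A[1];               /*           A of iteration 1 */

    for (int i = 1; i < N - 1; ++i) {
        A[i - 1] = s;       /* C of iteration i - 1 */
        s = t + 1;          /* B of iteration i     */
        t = A[i + 1];       /* A of iteration i + 1 */
    }

    A[N - 2] = s;           /* epilogue: C of iteration N - 2 */
    s = t + 1;              /*           B of iteration N - 1 */
    A[N - 1] = s;           /*           C of iteration N - 1 */
}

/* Count elements where the two versions disagree or the result is wrong. */
int mismatches(void)
{
    int x[N], y[N];
    int bad = 0;
    for (int i = 0; i < N; ++i)
        x[i] = y[i] = 3 * i;
    increment_original(x);
    increment_pipelined(y);
    for (int i = 0; i < N; ++i)
        bad += (x[i] != y[i]) || (x[i] != 3 * i + 1);
    return bad;
}
```

The store inside the loop writes `A[i - 1]`, not `A[i]`: the value in `s` at the top of the body was computed for the previous iteration.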
Dependence Constraints [Figure: the A/B/C iteration grid before and after skewing, with prologue and epilogue.]
Dependence Constraints for Software Pipelining • Dependence Distance >= Number of Pipeline Stages • For any two statement instances that we will reorder by the transformation.
Software Pipelining
Original loop:
    move 100, R7
L:  load R0, [R1 ++]
    load R2, [R3 ++]
    add R4, R0, R2
    store R4, [R5 ++]
    bge (-- R7) L
Software-pipelined loop:
    load R0, [R1 ++]
    load R2, [R3 ++]
    add R4, R0, R2
    load R0, [R1 ++]
    load R2, [R3 ++]
L:  store R4, [R5 ++] || add R4, R0, R2 || load R0, [R1 ++] || load R2, [R3 ++]
    bge (-- R7) L
In the original loop, the two loads can be done in parallel, but nothing else within an iteration.