Software Exploits for ILP
• We have already looked at compiler scheduling to support ILP
• Altering code to reduce stalls
• Loop unrolling and scheduling
• Compiler-based scheduling for superscalars
• VLIW
• Here, we examine Appendix H to see additional compiler-based ideas and the hardware that helps support some of them
• Few architectures have focused on static approaches beyond minimal support for compiler scheduling
• However, the EPIC architecture relied heavily on them; here we view EPIC and consider to what extent it succeeds (or fails) relative to dynamic approaches
Loop Dependencies
• To support loop unrolling, the compiler must detect any dependencies that exist both within and between loop iterations
• Within loop iterations are the typical RAW, WAW and WAR hazards
• Between loop iterations, RAW, WAW and WAR hazards may be hard to identify because the array indices involved may not match exactly
• Consider the following two loop bodies, both iterating over i from 1 to 100
    x[i] = x[i] + s;
    x[i] = x[i-1] + s;
• In the first, the RAW hazard stays within one iteration: each iteration reads x[i] before writing it, so unrolling does not complicate it
• In the second, the RAW hazard exists across loop iterations (iteration i reads what iteration i-1 wrote), so an unrolled loop could lead to problems
Example Examined
• Consider this code:
    x[0] = …;
    for(i=1;i<=100;i++) x[i] = x[i-1] + s;
• Assume x is an FP array so that the additions take 4 cycles
• Let's unroll this loop to contain 4 iterations per loop; this gives us the following four assignments in the first unrolled loop iteration:
    x[1] = x[0] + s;
    x[2] = x[1] + s;
    x[3] = x[2] + s;
    x[4] = x[3] + s;
• If we schedule the above code, we would first attempt to L.D x[0], x[1], x[2], x[3], then do the ADD.Ds and finally the S.Ds
• But the result of each S.D is needed before the next ADD.D
• If the compiler doesn't detect this dependence, the code will be incorrect!
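The hazard described above can be reproduced in C. The following is a minimal sketch, not compiler output: the function names are made up for illustration, and the "bad" version simply hoists all four loads above the adds and stores, as a scheduler unaware of the loop carried dependence might.

```c
#include <stddef.h>

/* Reference loop: x[i] = x[i-1] + s for i = 1..n (x has n + 1 elements). */
void recurrence_reference(double *x, size_t n, double s) {
    for (size_t i = 1; i <= n; i++)
        x[i] = x[i - 1] + s;
}

/* Hypothetical unsafe schedule: unrolled by 4 with every load hoisted
 * above every store, ignoring the loop carried RAW dependence.
 * Assumes n is a multiple of 4. */
void recurrence_bad_schedule(double *x, size_t n, double s) {
    for (size_t i = 1; i + 3 <= n; i += 4) {
        /* all four loads first: three of them read stale values */
        double t0 = x[i - 1], t1 = x[i], t2 = x[i + 1], t3 = x[i + 2];
        /* then the four adds */
        t0 += s; t1 += s; t2 += s; t3 += s;
        /* then the four stores */
        x[i] = t0; x[i + 1] = t1; x[i + 2] = t2; x[i + 3] = t3;
    }
}
```

With x initialized to zeros, n = 4 and s = 1, the reference loop leaves x[4] equal to 4 while the badly scheduled version leaves it equal to 1.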
Forms of Dependencies
• There are 3 forms of dependencies
• True or data dependencies: these are the same as RAW hazards as found in pipelining
  • we have to make sure that the value is written before we can subsequently read/use it
• Name dependencies: these arise because the same named entity is referenced, but the data differs
  • for instance, we put a result in R1 and use it in a later instruction, but yet another instruction places a completely unrelated datum in R1
• There are two forms of name dependencies
  • output dependencies: these arise when two instructions write two independent results to a named location without an intervening read; these are WAW hazards
  • antidependencies: these arise when a read and write must occur in the proper order so that the read takes place before the write; these are WAR hazards
Example
• Find the dependencies
  • both within a single loop iteration and across loop iterations
  • identifying each type
  • is the loop parallelizable (unrollable)?
• Original loop:
    for(i=1;i<=100;i=i+1) {
      y[i]=x[i]/c;   /* S1 */
      x[i]=x[i]+c;   /* S2 */
      z[i]=y[i]+c;   /* S3 */
      y[i]=c-y[i];   /* S4 */
    }
• Dependencies in the original loop
  • True: from S1 to S3 (y), from S1 to S4 (y)
  • Anti: from S1 to S2 (x), from S3 to S4 (y)
  • Output: from S1 to S4 (y)
• As is, the loop is not parallelizable, but if we use renaming on x (S2) and y (S1, S3 and S4), we can unroll and schedule the code
• Renamed loop:
    for(i=1;i<=100;i=i+1) {
      t[i]=x[i]/c;
      x1[i]=x[i]+c;
      z[i]=t[i]+c;
      y[i]=c-t[i];
    }
• Notice that we have renamed x to x1 for this to work; later code would have to reference x1
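As a rough check that renaming preserves the results, the two loop bodies can be written as C functions and compared. This is a sketch under assumptions: the function names are hypothetical, and the arrays are indexed from 0 to n-1 rather than 1 to 100 to keep the C idiomatic.

```c
/* Original loop body: S2 overwrites x, S4 overwrites y. */
void loop_original(double *x, double *y, double *z, double c, int n) {
    for (int i = 0; i < n; i++) {
        y[i] = x[i] / c;   /* S1 */
        x[i] = x[i] + c;   /* S2 */
        z[i] = y[i] + c;   /* S3 */
        y[i] = c - y[i];   /* S4 */
    }
}

/* Renamed loop body: t replaces the temporary use of y, x1 replaces
 * the updated x, so only true dependencies (through t) remain. */
void loop_renamed(const double *x, double *x1, double *t,
                  double *y, double *z, double c, int n) {
    for (int i = 0; i < n; i++) {
        t[i]  = x[i] / c;
        x1[i] = x[i] + c;
        z[i]  = t[i] + c;
        y[i]  = c - t[i];
    }
}
```

After both run on the same input, y and z should match, and x1 from the renamed loop should hold what x holds after the original loop.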
Example
• The previous example had no loop carried dependencies
• These can be tricky to find
  • just because an array is indexed by something other than i does not mean that there is a loop carried dependence
  • we will examine how to test for a loop carried dependence in a couple of slides
• Consider this loop:
    for(i=1;i<=100;i++) {
      a[i+1] = a[i] + c[i];      // S1
      b[i+1] = b[i] + a[i+1];    // S2
    }
• This code has the following true dependencies
  • a from S1 to S2
  • a from S1 to S1 (loop carried)
  • b from S2 to S2 (loop carried)
• The loop carried dependencies, at least in this case, prevent the loop from being parallelizable
Example
• Not all loop carried dependencies prevent a loop from being parallelizable; consider this example
    for(i=1;i<=100;i++) {
      a[i] = a[i] + b[i];      // S1
      b[i+1] = c[i] + d[i];    // S2
    }
  • here, we have a loop carried dependence on b from S2 to S1 (the dependence from a to a in S1 is not loop carried)
• To parallelize this loop, we must eliminate the loop carried dependence
  • this change requires adding an initial S1 before the loop and a final S2 after the loop
  • the loop itself then runs one fewer iteration (to 99), and the final S2 uses the last index of the original loop (100)
    a[1] = a[1] + b[1];              // initial S1
    for(i=1;i<=99;i++) {
      b[i+1] = c[i] + d[i];          // S2
      a[i+1] = a[i+1] + b[i+1];      // S1
    }
    b[101] = c[100] + d[100];        // final S2
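One way to convince yourself that the restructured loop is equivalent is to run both versions on the same data. The sketch below assumes the loop body of the transformed version runs from 1 to N-1 and that the final S2 uses index N; the function names and the N macro are illustrative only.

```c
#define N 100

/* Arrays are indexed 1..N+1, so each needs at least N + 2 elements. */
void original_loop(double *a, double *b, const double *c, const double *d) {
    for (int i = 1; i <= N; i++) {
        a[i] = a[i] + b[i];        /* S1: reads b[i] written by S2 in iteration i-1 */
        b[i + 1] = c[i] + d[i];    /* S2 */
    }
}

void transformed_loop(double *a, double *b, const double *c, const double *d) {
    a[1] = a[1] + b[1];                    /* initial S1 */
    for (int i = 1; i <= N - 1; i++) {
        b[i + 1] = c[i] + d[i];            /* S2 */
        a[i + 1] = a[i + 1] + b[i + 1];    /* S1, now in the same iteration as its producer */
    }
    b[N + 1] = c[N] + d[N];                /* final S2 */
}
```

Inside the transformed loop the dependence on b runs from S2 to S1 within a single iteration, so iterations no longer depend on one another.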
Recurrences
• The key to identifying if a loop carries a dependence across iterations is to find if a recurrence of loop indices arises
• A recurrence is when an array element referenced in one iteration is referenced again in another iteration
• With a[i], a[i+1], the recurrence is easy to detect, but consider these two statements:
    a[i] = b[2*i] + c;        // S1
    b[2*i+1] = d[i] * e;      // S2
• There is no recurrence of b between S1 and S2 because the index in S1 is always even and the index in S2 is always odd
• Identifying a recurrence can be computationally challenging; there are a number of tests we can apply
• The tests are conservative: a test may prove that no dependence exists, but when it cannot, that does not confirm a dependence, and another test may still be applicable
GCD Test
• One easy test applies when array indices are affine
• An affine index is one that can be written in the form a*i + b, where i is the loop index and a and b are integer constants
• Suppose the loop stores into an array element indexed by a*j + b and later fetches from the same array at c*k + d, where j and k are iteration indices within the bounds of the loop
• The GCD test says that if a loop carried dependence exists, then d - b must be evenly divisible by the greatest common divisor of c and a
  • if GCD(c, a) does not divide d - b, no dependence is possible
  • if it does divide d - b, a dependence may exist, but the test alone does not prove it because it ignores the loop bounds
• Examples:
  • x[2*i+3] = x[2*i]: GCD(2, 2) = 2 does not divide 0 - 3 = -3, so there is no dependence
  • y[5*i - 4] = y[15*i + 6]: GCD(15, 5) = 5 divides 6 - (-4) = 10, so a dependence is possible; here one does arise, for instance between i = 8 (5*8 - 4 = 36) and i = 2 (15*2 + 6 = 36)
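The divisibility check is small enough to sketch in C. The function names below are hypothetical; the function returns false when the GCD does not divide d - b (no dependence is possible) and true when a dependence cannot be ruled out by this test alone.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Greatest common divisor of |m| and |n| (Euclid's algorithm). */
static long gcd_long(long m, long n) {
    m = labs(m);
    n = labs(n);
    while (n != 0) {
        long r = m % n;
        m = n;
        n = r;
    }
    return m;
}

/* Store index is a*j + b, load index is c*k + d.
 * Returns false when the GCD test rules a dependence out,
 * true when a dependence may exist (loop bounds are ignored). */
bool gcd_test_may_depend(long a, long b, long c, long d) {
    long g = gcd_long(a, c);
    if (g == 0)               /* both indices are constants */
        return b == d;
    return (d - b) % g == 0;
}
```

For the two examples on the slide, gcd_test_may_depend(2, 3, 2, 0) is false and gcd_test_may_depend(5, -4, 15, 6) is true.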
Dependence Challenges
• There are a number of challenges that complicate loop dependence analysis
• When storage is referenced by pointer rather than array index
• When array indexing is indirect through another array
• When dependencies exist for a subset of inputs but do not arise under other sets of inputs
• Although pointers pose a very difficult problem for static analysis (since pointers take on their values at run time), some forms of analysis are available; no dependence can exist between two pointers
  • if the two pointers cannot point to the same type
  • if the object referenced by one pointer is only allocated under conditions that differ from those of the other pointer
  • if one pointer can only point to a local referent while the other points only to a global
Eliminating Computation
• Aside from loop unrolling/scheduling, another useful pursuit for the compiler is to eliminate redundant computation, for example by combining operations or by keeping the first result of a common computation in a register
• This can take on multiple forms
• Consecutive additions of constants can be combined:
    DADDI R1, R2, #4
    DADDI R1, R1, #4
  becomes
    DADDI R1, R2, #8
• We can take advantage of associativity; the code
    DADD R1, R2, R3
    DADD R4, R1, R6
    DADD R8, R4, R7
  has two RAW hazards that might cause stalls, so we replace it with
    DADD R1, R2, R3
    DADD R4, R6, R7
    DADD R8, R1, R4
  in which the first two instructions are independent of each other
• Additionally, if a particular computation, say c + d, is used in several locations, place c + d into a register and replace all further uses of the computation with the register
Software Pipelining
• As we saw earlier, a compiler can rearrange code in a loop to remove loop carried dependencies
• The compiler can also rearrange the code to hide true dependencies found within an iteration through a technique called symbolic loop unrolling
• The idea is to build each iteration of the new loop from instructions taken from different iterations of the original loop
• For instance, if one iteration performs an FP add which takes multiple cycles, the store for that add can be moved to the next iteration
  • to avoid having to use multiple groups of registers, we place the store before the add of that iteration, since we are storing the sum from the previous iteration
  • this may require adding "startup" and "cleanup" code
Continued
• Pictorially, the concept works as follows:
• In iteration 0 we select the last instruction in the loop that has the dependence
• In iteration 1 we select the second to last instruction, etc
• In the last iteration we select the first instruction in the loop that has the dependence
• We add startup code so that the instructions preceding the last instruction from iteration 0 are still available
• We add cleanup code so that the instructions from the last iteration that follow the first instruction are performed
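Example
• Original loop:
    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    F4,0(R1)
          DSUBI  R1,R1,#8
          BNE    R1,R2,Loop
• Symbolically unrolled, three consecutive iterations each contain
    Iteration i:    L.D F0,0(R1)   ADD.D F4,F0,F2   S.D F4,0(R1)
    Iteration i+1:  L.D F0,0(R1)   ADD.D F4,F0,F2   S.D F4,0(R1)
    Iteration i+2:  L.D F0,0(R1)   ADD.D F4,F0,F2   S.D F4,0(R1)
• The instructions selected for the new loop body (bold-faced on the original slide) are the S.D from iteration i, the ADD.D from iteration i+1 and the L.D from iteration i+2
• Startup code:
          L.D    F0, 16(R1)
          L.D    F6, 8(R1)
          ADD.D  F4, F6, F2
• Software pipelined loop:
    Loop: S.D    F4,16(R1)
          ADD.D  F4,F0,F2
          L.D    F0,0(R1)
          DSUBI  R1,R1,#8
          BNE    R1,R2,Loop
• Cleanup code:
          ADD.D  F8, F0, F2
          S.D    F4, 8(R1)
          S.D    F8, 0(R1)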
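The same idea can be sketched at the C level, with two scalar variables standing in for F0 and F4. This is an illustrative sketch rather than a translation of the assembly above: the function names are made up, and the loop counts upward over n elements instead of decrementing a pointer.

```c
#include <stddef.h>

/* Reference loop: x[i] = x[i] + s. */
void add_scalar(double *x, size_t n, double s) {
    for (size_t i = 0; i < n; i++)
        x[i] += s;
}

/* Software pipelined version: each steady-state iteration performs the
 * store for element i, the add for element i+1 and the load for element i+2. */
void add_scalar_pipelined(double *x, size_t n, double s) {
    if (n < 2) {                 /* too short to pipeline */
        add_scalar(x, n, s);
        return;
    }
    /* startup: load and add for element 0, load for element 1 */
    double loaded = x[0];
    double sum = loaded + s;
    loaded = x[1];

    size_t i;
    for (i = 0; i + 2 < n; i++) {
        x[i] = sum;              /* store belonging to element i */
        sum = loaded + s;        /* add belonging to element i+1 */
        loaded = x[i + 2];       /* load belonging to element i+2 */
    }
    /* cleanup: finish the two elements still in flight */
    x[i] = sum;
    x[i + 1] = loaded + s;
}
```

Within the steady-state loop, no instruction uses a value produced earlier in the same iteration by a long-latency operation, which is the point of the transformation.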
Code Scheduling
• So far, our code scheduling has been limited to
  • Moving code within a basic block to fill stalls
  • Loop unrolling and scheduling
• What about moving code across conditional branches?
  • With branch history, we can make predictions on whether a branch will be taken or not
  • Are the benefits of moving code to avoid branch delays worth the risk of guessing wrong?
• Branch speculation requires several supporting mechanisms
  • A buffer to consult that provides the branch prediction and branch target location
  • A mechanism for "killing off" the speculated operation(s) if the prediction is wrong
  • A mechanism to ensure that speculated code does not raise an exception unless/until the speculation is proved correct
Example
• Consider the skeleton of an if-else statement in which B(i) is the if clause, X is the else clause and C(i) follows the statement; we have some options for code scheduling
• Move B(i) before the condition
  • only useful if the condition is usually true and executing B(i) will not impact X or C(i)
• Move X before the condition
  • only useful if the condition is usually false and executing X will not impact B(i) or C(i)
• Move C(i) before the condition, or into one of the clauses
  • only useful if executing C(i) does not impact the condition, B(i) or X
Variants
• Move B(i) up before the condition
• In X, reset B(i)
  • that is, if the condition turns out to be false, wipe out the B(i) assignment statement (reset it)
• Move C(i) up before the condition
• Doable if we can ensure that neither the condition, B(i) nor X would be impacted
  • let's assume that X would be impacted; if we predict the else clause is rarely taken, we could reset C(i) before doing X in the else clause
• The question comes down to
  • Is the benefit from a correct prediction more than the cost when incorrect?
• Again, the compiler has to ensure that the movement does not cause an incorrect condition result or incorrect values from the if clause and else clause
Example
• Let's use the code
    if(x > y) x++; else y--;
• Further, let's assume that x > y is true 90% of the time
• Our original code is
          SGT    R3, R1, R2
          BEQZ   R3, else
          DADDI  R1, R1, #1
          J      next
    else: DSUBI  R2, R2, #1
    next: …
  • if true, 4 instructions are executed and if false, 3 instructions are executed
  • let's assume no stalls, so each instruction has a CPI of 1
  • average instructions executed (and cycles) = 4 * .9 + 3 * .1 = 3.9
• Given the prediction (x > y), the compiler generates the code
          SGT    R3, R1, R2
          DADDI  R1, R1, #1
          BNEZ   R3, next
          DSUBI  R1, R1, #1
          DSUBI  R2, R2, #1
    next: …
  • if true, 3 instructions are executed and if false, 5
  • average instructions executed (and cycles) = 3 * .9 + 5 * .1 = 3.2
  • a speedup of 3.9 / 3.2 = 1.22 (22%)
Trace Scheduling
• In the previous example, we selected the "critical path", the most common path through the selection statement
• Typically, such a conditional is found inside of a loop
• In trace scheduling, we combine selecting the critical path with loop unrolling so that we move the critical path out of the selection in multiple iterations
• In order to handle a misprediction, we have exits out of the unrolled code and entrances to re-enter after handling the misprediction
Superblocks
• The numerous entries and exits in our previous figure indicate a major drawback of trace scheduling
• First, it requires that the compiler build mechanisms for recovering from mispredictions into the unrolled code
  • for instance, imagine that we unroll a loop 4 times; the compiler then has to build into the code what to do and how to re-enter if the misprediction occurs in the first iteration, what to do and how to re-enter if it occurs in the second iteration, etc
• Second, it increases the amount of code required and complicates the code
• The superblock uses the same idea except that, upon exiting, you enter a different block which forgoes speculation
• When the loop terminates, the superblock is re-entered
• This is done using a technique called tail duplication
Example
• Assume our code is:
    for(i=0;i<n;i++) if(a[i]>0) x++; else x--;
• In most cases, a[i] is positive
• We choose to move x++ out of the selection statement and replace the selection with if(a[i]<=0) x=x-2;
  • that is, we add 1 to x automatically and if we mis-speculate, we subtract 2 from x
• We then unroll the loop, giving us the following (in C rather than assembly, and assuming n is a multiple of 4)
    for(i=0;i<n;i+=4) {
      x+=4;
      if(a[i]<=0) {…}      // code in the { } requires
      if(a[i+1]<=0) {…}    // subtracting from x, and then
      if(a[i+2]<=0) {…}    // completing the remaining
      if(a[i+3]<=0) {…}    // loop iterations using the
    }                      // original code
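One possible way to fill in the elided { } bodies is sketched below. It is an assumption about what they contain, based on the comments in the slide: on the first non-positive element, the increments speculated for the not-yet-confirmed elements are undone and the rest of the unrolled group is completed by the original, non-speculative code. The function names are hypothetical, and a remainder loop handles n that is not a multiple of 4.

```c
/* Original, non-speculative code applied to a[from..to-1]. */
static int run_original(const int *a, int from, int to, int x) {
    for (int i = from; i < to; i++) {
        if (a[i] > 0) x++; else x--;
    }
    return x;
}

int count_reference(const int *a, int n) {
    return run_original(a, 0, n, 0);
}

/* Unrolled by 4 with x++ speculated for every element in the group. */
int count_superblock(const int *a, int n) {
    int x = 0, i = 0;
    for (; i + 3 < n; i += 4) {
        x += 4;   /* speculate four increments */
        /* On a mis-speculation at element i+k, keep the k confirmed
         * increments, undo the other 4-k, and finish the group with
         * the original code. */
        if (a[i] <= 0)     { x = run_original(a, i,     i + 4, x - 4); continue; }
        if (a[i + 1] <= 0) { x = run_original(a, i + 1, i + 4, x - 3); continue; }
        if (a[i + 2] <= 0) { x = run_original(a, i + 2, i + 4, x - 2); continue; }
        if (a[i + 3] <= 0) { x = run_original(a, i + 3, i + 4, x - 1); continue; }
    }
    return run_original(a, i, n, x);   /* leftover elements */
}
```

When every element in a group is positive, the group costs one addition and four compares that all fall through, which is the common case the speculation is meant to speed up.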
Predicated Instructions
• We have seen that with every loop and every selection statement, there are branches
• These can result in branch delays, or in speculation that leads to stalls when mis-speculated
• If the condition and action are simple enough, can we do them without a branch?
• The answer is yes, if we use predicated (or conditional) instructions
• The idea is that the condition and the action are both performed, but if the condition is determined to be false, the register write is canceled
• In most cases, predicated instructions can
  • only use a simple condition: value = 0 or value != 0
  • only have a single, simple action such as x = y
• Here, we consider two:
  • CMOVZ: conditional move
  • LWC: conditional load
Examples
• The code if(A==0) {S=T;} can be implemented in MIPS as
       BNEZ R1, L
       ADD  R2, R3, R0
    L: …
• Or with the MOVZ instruction as
       MOVZ R2, R3, R1
  • move R3 to R2 if R1 = 0; if R1 != 0, cancel the move before it is finalized
  • assuming the MIPS pipeline, we reduce from 2 instructions to 1 and remove the branch penalty, for a potential savings of 2 clock cycles
• In MIPS, conditional moves such as MOVZ (and its counterpart MOVN) are the only predicated instructions, but other architectures offer others such as LWC
  • load if the condition is true
  • LWC R2, 0(R3), R1: load the word at 0(R3) into R2 if R1 = 0
  • the instruction computes the address 0+R3 and tests R1 = 0 in the EX stage, and then either loads from that address in MEM and writes R2 in WB, or does nothing in MEM/WB, depending on the result of the condition
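At the source level, the effect of a conditional move can be imitated with a branch-free select. The sketch below is only an analogy: the function names are made up, and whether a compiler turns either version into an actual conditional-move instruction depends on the compiler and target.

```c
#include <stdint.h>

/* Branching form of: if (a == 0) s = t; */
int32_t select_branch(int32_t a, int32_t s, int32_t t) {
    if (a == 0)
        s = t;
    return s;
}

/* Branch-free form: both values are available and a mask derived from
 * the condition decides which one is kept, much as MOVZ either
 * completes or cancels its register write. */
int32_t select_predicated(int32_t a, int32_t s, int32_t t) {
    uint32_t mask = (uint32_t)0 - (uint32_t)(a == 0);   /* all ones when a == 0 */
    return (int32_t)(((uint32_t)t & mask) | ((uint32_t)s & ~mask));
}
```

Both functions return t when a is 0 and s otherwise; the second does so without a control-flow decision in the source.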
Handling Exceptions
• Whether we use predicated instructions or compiler scheduling (e.g., trace scheduling)
  • an instruction that should not have executed because of mis-speculation should not cause an exception
  • recall that exceptions may invoke an exception handler, which is very time consuming, or may cause program termination
  • we need a way to recover from a mis-speculated exception situation
  • in the former case, we can invoke the exception handler and cancel it later if we determine the instruction was mis-speculated; this wastes some time but preserves proper behavior
  • in the latter case, we need a mechanism to ensure that the exception is either not raised or ignored until we know whether the speculation was correct or not
Four Approaches
• Hardware and OS work cooperatively to ignore exceptions of speculative instructions
  • this only works for correct programs
• Speculative instructions are not permitted to raise exceptions
  • speculative instructions are annotated as such
  • for instance, a speculative load might be sLW
  • we disallow the instruction from raising an exception
• Poison bits are attached to registers to indicate if their value was the result of speculation
  • we add a bit to every register; a speculated instruction that writes to the register sets the bit, and a register computed from a register whose poison bit is set also has its bit set
  • exceptions are disallowed for any instruction with a set poison bit until the outcome of the instruction's speculation is known
• Buffers are used to store results of speculated instructions
  • as with a reorder buffer, we only permit results to move beyond the buffer once the outcome of the speculation is known; until then, exceptions are buffered
Example
• Imagine we have the following statement:
    if (A==0) A = B; else A = A + 4;
• The compiler uses speculation to generate better code than the original
• Original code:
        LW    R1, 0(R3)
        BNEZ  R1, L1
        LW    R1, 0(R2)
        J     L2
    L1: DADDI R1, R1, #4
    L2: SW    R1, 0(R3)
• Speculated code:
        LW    R1, 0(R3)
        LW    R14, 0(R2)
        BEQZ  R1, L3
        ADDI  R14, R1, #4
    L3: SW    R14, 0(R3)
• If the speculation is correct 90% of the time, the code goes from .90 * 5 + .10 * 4 = 4.9 to .90 * 4 + .10 * 5 = 4.1 cycles (4.9 / 4.1 ≈ 1.20, a speedup of about 20%)
• The speculated code adds register R14 so that the value in R1 is not destroyed if we have a mis-speculation
• Additionally, we do not want the SW to take place until the outcome of the speculation is known
Continued
• While the previous example ensured the proper value was stored to A, it did nothing to prevent an exception from a mis-speculation
• Specifically, we do not want to load B (0(R2)) if A is not 0
  • imagine that A is not 0 and 0(R2) causes a memory violation; this would cause an exception that should never arise
• We indicate that the load for B is speculative and add a new instruction called SPECCK (speculative check); this ensures that an exception only arises if the speculated instruction should have executed AND it caused an exception
        LD     R1,0(R3)
        sLD    R14,0(R2)
        BNEZ   R1,L1
        SPECCK 0(R2)
        J      L2
    L1: DADDI  R14,R1,#4
    L2: SD     R14,0(R3)
Limitations on Speculation
• Instructions that are annulled (turned into no-ops) still take execution time
• Conditional instructions are most useful when the condition can be evaluated early
  • such as during the ID stage of our pipeline
• Conditional instructions may cause a slowdown compared to unconditional instructions, requiring either a slower clock rate or a greater number of cycles
• The use of conditional instructions can be limited when the control flow involves more than a simple alternative sequence
  • for example, moving an instruction across multiple branches requires making it conditional on both branches, which requires two conditions to be specified or additional instructions to compute the controlling predicate
  • if such capabilities are not present, the overhead of if-conversion will be larger, reducing its advantage
Intel IA-64/EPIC
• This chapter introduced a number of compiler-based strategies to promote ILP in support of a superscalar processor
• To date, very few processors have attempted to aggressively schedule parallel instructions through the compiler, instead relying on hardware scheduling
• The IA-64 is one of the few; here we look at a few highlights of the instruction set and see how instructions are bundled together to issue in a VLIW-like way
  • 128 65-bit registers (1 poison bit included)
  • 128 82-bit FP registers
  • 64 1-bit predicate registers
  • 8 64-bit branch registers (for indirect branching)
  • register stack for parameter passing (rather than memory)
Instruction Format
• The compiler uses a number of strategies to provide ILP
  • Loop unrolling
  • Speculation
  • Scheduling, etc
• The compiler selects up to 3 consecutive instructions to place into a "bundle"
• A bundle is 128 bits wide, consisting of
  • a 5-bit template field
  • up to three instructions which are 41 bits each (or no-ops as necessary), for 5 + 3 * 41 = 128 bits
  • the 5-bit template describes the type of each instruction, and each type has its own formatting, so some of the instruction information is encoded in the template
  • the template also indicates whether a stop should exist; stops denote the need for stalls
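The arithmetic of the bundle (5 + 3 * 41 = 128 bits) can be made concrete by packing the fields into two 64-bit words. The sketch below is illustrative only: the type and function names are made up, and the bit positions chosen here (template in the lowest 5 bits, followed by the three slots in order) are an assumption for the example rather than something stated on the slide.

```c
#include <stdint.h>

/* A 128-bit bundle held as two 64-bit halves (lo = bits 0..63). */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} Bundle;

#define SLOT_MASK ((UINT64_C(1) << 41) - 1)   /* 41-bit instruction slot */

/* Assumed layout: template in bits 0..4, slot 0 in bits 5..45,
 * slot 1 in bits 46..86, slot 2 in bits 87..127. */
Bundle pack_bundle(unsigned tmpl, uint64_t s0, uint64_t s1, uint64_t s2) {
    Bundle b;
    s0 &= SLOT_MASK;
    s1 &= SLOT_MASK;
    s2 &= SLOT_MASK;
    b.lo = (uint64_t)(tmpl & 0x1Fu) | (s0 << 5) | (s1 << 46);   /* low 18 bits of slot 1 */
    b.hi = (s1 >> 18) | (s2 << 23);                             /* high 23 bits of slot 1 */
    return b;
}

unsigned bundle_template(Bundle b) { return (unsigned)(b.lo & 0x1Fu); }
uint64_t bundle_slot0(Bundle b)    { return (b.lo >> 5) & SLOT_MASK; }
uint64_t bundle_slot1(Bundle b)    { return ((b.lo >> 46) | (b.hi << 18)) & SLOT_MASK; }
uint64_t bundle_slot2(Bundle b)    { return b.hi >> 23; }
```

Slot 1 straddles the two halves, which is why it is split into an 18-bit and a 23-bit piece when packed and reassembled when read back.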
Bundle Components
• All instructions break into one of 5 types:
  • I: integer ALU, non-ALU integer
  • M: memory (int & FP), integer ALU
  • F: floating point
  • B: branches and conditional instructions
  • L+X: instructions with extended immediate data, stops, and no-ops
• See figure H.7 for full table
Example
• Unroll the x[i]=x[i]+s; loop seven times and schedule the instructions in IA-64 bundles
  • First to minimize bundles
  • Second to minimize cycles
• The unrolled loop:
    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)
          L.D    F14, -24(R1)
          L.D    F18, -32(R1)
          L.D    F22, -40(R1)
          L.D    F26, -48(R1)
          ADD.D  F4, F0, F2
          ADD.D  F8, F6, F2
          ADD.D  F12, F10, F2
          ADD.D  F16, F14, F2
          ADD.D  F20, F18, F2
          ADD.D  F24, F22, F2
          ADD.D  F28, F26, F2
          S.D    F4, 0(R1)
          S.D    F8, -8(R1)
          S.D    F12, -16(R1)
          S.D    F16, -24(R1)
          S.D    F20, -32(R1)
          S.D    F24, -40(R1)
          S.D    F28, -48(R1)
          DADDI  R1, R1, #-56
          BNE    R1, R2, Loop
Speculation Support
• Nearly every instruction can be predicated
• This is done by specifying one of the predicate registers
  • a conditional branch can become an unconditional branch with a predicate register
• Predicate registers are set using a compare or test instruction
  • hardware supports predication by controlling when exceptions are handled and by using registers with poison bits; for a predicated instruction, an exception can only be handled once the predicate's result is known
  • the compiler is tasked with generating recovery code for exceptions that arise because of mis-speculation
• Speculated loads use a special table so that if the load is mis-speculated, it does not wipe out a current register value
Itanium 2 Performance
• The IA-64/EPIC instruction set was implemented in the Itanium 2 processor with a 1.5 GHz clock
• The graph below compares its performance on int and FP benchmarks to those of Pentium IV (3.8 GHz), AMD Athlon and Power5
  • the Itanium 2 compares favorably to Pentium IV & Athlon for FP benchmarks but is slower on int benchmarks