Chapter 2-3: Basic Program Transformations
Optimizing your code
Transforming the code into something different from what the programmer wrote, while still doing the same thing, can have a huge impact on performance.
• Is this a compiler subject or a computer architecture subject?
• Yes!
• Many architectural details are driven by, or have an effect on, code optimization
Major Types of Optimization
See Chapter 2, Fig 2-19, P93
• High level (at or near source level)
  • Procedure integration
• Local (within a basic block)
  • Common subexpression elimination
  • Constant propagation
  • Stack height reduction
• Global (across a branch)
  • Copy propagation
  • Code motion
  • Induction variable elimination
• Machine-dependent
  • Strength reduction
  • Pipeline scheduling (more about this later…)
Strength Reduction
Substitute a simpler operation when it is equivalent
• Replacing multiplies with shifts and adds is a popular case
• Y = X ** 2; replace with Y = X * X;
• J = K * 2; replace with J = K + K;
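A minimal C sketch of the substitutions on this slide. The function names are hypothetical, and the multiply-by-10 decomposition is an added illustration of "shifts and adds", not something taken from the slides. Unsigned operands are assumed so the shifts are well defined.

```c
#include <stdint.h>

/* Y = X ** 2 rewritten as a single multiply. */
uint32_t sr_square(uint32_t x) { return x * x; }

/* J = K * 2 rewritten as an add. */
uint32_t sr_times_two(uint32_t k) { return k + k; }

/* K * 10 as shifts and adds: 10*k = 8*k + 2*k. */
uint32_t sr_times_ten(uint32_t k) { return (k << 3) + (k << 1); }
```

Whether the shift-and-add form is actually cheaper depends on the target machine, which is why the slides list strength reduction as machine-dependent.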
Variable Renaming
Use distinct names for each unrelated use of the same variable to simplify later optimizations
    X = Y * Z;
    Q = R + X + X;
    X = A + B;      (second use of X is unrelated)
• Replace the last statement with X1 = A + B;
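The renaming can be sketched in C as follows. The function names and the out-parameter q are assumptions made so the two versions can be compared; only the three assignments come from the slide.

```c
/* Before renaming: x carries two unrelated values. */
int rename_before(int y, int z, int r, int a, int b, int *q) {
    int x = y * z;
    *q = r + x + x;
    x = a + b;          /* unrelated second use of x */
    return x;
}

/* After renaming: the second value gets its own name, x1. */
int rename_after(int y, int z, int r, int a, int b, int *q) {
    int x = y * z;
    int x1 = a + b;     /* no longer has to wait for the last read of x */
    *q = r + x + x;
    return x1;
}
```

With distinct names, the statement computing x1 is free to move above the statement that reads x, which is the kind of freedom a later scheduling pass can use.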
Common Subexpression Elimination
Avoid recalculating the same expression
• In this code, you would hope the compiler computes the address of a[ j ][ k ] only once for both statements…
    a[ j ][ k ] = b[ j ][ k ] + x * b[ j ][ j-1 ];
    sum = length[ j ] * a[ j ][ k ];
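Written out by hand in C, the shared address computation might look like the sketch below. The array size N, the element type, and the function name are assumptions; the two statements are the ones on the slide, and j is assumed to be at least 1 so that b[j][j-1] is in bounds.

```c
#define N 8   /* assumed array dimension */

/* Hand-applied CSE: the address of a[j][k] is formed once and reused. */
double cse_update(double a[N][N], double b[N][N],
                  const double length[N], double x, int j, int k) {
    double *ajk = &a[j][k];               /* the common subexpression */
    *ajk = b[j][k] + x * b[j][j - 1];
    return length[j] * *ajk;              /* the slide's "sum" */
}
```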
Loop Invariant Code Motion
• Avoid operations in loops that are the same in each iteration
• Original
    for (j = 0; j < max; j++) {
        a[ j ] = b[ j ] + c * d;
        e = g[ k ];
    }
• Revised
    tmp = c * d;
    for (j = 0; j < max; j++)
        a[ j ] = b[ j ] + tmp;
    e = g[ k ];
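The original and revised loops can be wrapped as two C functions to check that they agree. The function names, parameter types, and the choice to return e are assumptions for this sketch.

```c
/* Original: c * d and g[k] are recomputed on every iteration. */
double licm_original(double *a, const double *b, const double *g,
                     double c, double d, int k, int max) {
    double e = 0.0;
    for (int j = 0; j < max; j++) {
        a[j] = b[j] + c * d;
        e = g[k];
    }
    return e;
}

/* Revised: the invariant work is hoisted out of the loop. */
double licm_revised(double *a, const double *b, const double *g,
                    double c, double d, int k, int max) {
    double tmp = c * d;
    for (int j = 0; j < max; j++)
        a[j] = b[j] + tmp;
    return g[k];
}
```

The two versions match only when the loop runs at least once and a does not overlap g. A compiler has to prove or guard both conditions before moving the code.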
Copy Propagation
Propagate the original instead of the copy
• In this example, y is still copied to x, but all subsequent uses of x are replaced with y
• Original
    x = y; z = 2 * x; q = x + 15;
• Revised
    x = y; z = 2 * y; q = y + 15;
• We may find that x is never used again, in which case the copy itself can be removed
Constant Folding
If the value of a variable is really a constant that can be determined at compile time, replace it with the constant
    int j = 0; int k = 1; m = j + k;
• Here j and k are known at compile time, so m = j + k can be replaced with m = 1
Dead Code Removal
• Eliminate instructions whose results are never used
    update () {
        int j, k;
        j = k = 1;
        j += 1;
        k += 2;
        printf(" J is %d\n", j);
    }
• k is never read, so every operation on k can be removed
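A sketch of what update() could look like once the dead code is gone. The name update_clean and the return value are additions for this sketch so the result can be checked; the slide's function returns nothing.

```c
#include <stdio.h>

/* update() after dead code removal: every statement involving k is gone. */
int update_clean(void) {
    int j = 1;
    j += 1;
    printf(" J is %d\n", j);
    return j;   /* added for this sketch only */
}
```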
Branch Delay Slots
• Some machines (like DLX) always execute the instructions in the branch delay slot(s)
• The challenge is for the compiler to find code to put in those slots (see Fig 3.28, P 169)
• Three places to find such code
  • An independent instruction from before the branch (best choice)
  • From the branch target (risky: the instruction may need to be copied, and it must not cause a problem when the branch is not taken)
  • From the fall-through code (risky: same problems as above, here when the branch is taken)
• The compiler can hide ~70% of branch hazards on DLX running Spec92 codes
Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f are in memory.
Slow code:
    LW  Rb,b
    LW  Rc,c
    ADD Ra,Rb,Rc
    SW  a,Ra
    LW  Re,e
    LW  Rf,f
    SUB Rd,Re,Rf
    SW  d,Rd
Fast code:
    LW  Rb,b
    LW  Rc,c
    LW  Re,e
    ADD Ra,Rb,Rc
    LW  Rf,f
    SW  a,Ra
    SUB Rd,Re,Rf
    SW  d,Rd
Instruction Level Parallelism (ILP)
• Pipelining supports a limited form of ILP
  • E.g. overlapped instructions, hazard issues, forwarding logic, etc.
• Remember: Pipeline CPI = Ideal CPI + Structural Stalls + Data Stalls + Control Stalls
• So, let’s try to be more aggressive about reducing the stalls to improve the CPI…
Software Techniques
• Loop unrolling
  • Bigger basic blocks
  • Attempt to reduce control stalls
• Basic pipeline scheduling
  • Reduce RAW stalls
• Lots of other hardware techniques to talk about later…
ILP Within a Basic Block
• Basic block definition
  • Straight-line code, no branches out
  • Single entry point at the top
• Real code is a bunch of basic blocks connected by branches
• Notice:
  • Branch frequency is approx 15% of the total mix (for integer programs)
  • This implies a basic block size of between 6 and 7 instructions (1 / 0.15 ≈ 6.7)
  • Machine instructions don’t do much
  • So, there’s probably little in the way of ILP available
• Easiest target is the loop
  • Already exploited by vector processors, but using different mechanisms
Loop Level Parallelism
• Consider adding two 1000-element arrays
    for (I = 1; I <= 1000; I = I + 1)
        x[I] = x[I] + y[I];
• Sure it’s trivial, but it illustrates the point
  • There is no dependence between data values produced in any iteration j and those needed in iteration j+n, for any j and n
  • Truly independent, hence could be 1000-way parallel
  • Independence means no stalls due to data hazards
  • Problem is that we have to use that pesky branch instruction
• Vector processor model
  • Load vectors X and Y (up to some machine-dependent max)
  • Then do result-vec = xvec + yvec in a single instruction
Assumptions About Timing • Default DLX pipeline timings for this chapter
Loop Unrolling
• Consider adding a scalar s to a vector (assume the lowest array element is in location 0)
    for (I = 1; I <= 1000; I++)
        x[I] = x[I] + s;
    Loop: LD   F0, 0(R1)       ; R1 is the array pointer
          ADDD F4, F0, F2      ; add scalar in F2
          SD   0(R1), F4       ; store result
          SUBI R1, R1, #8      ; decrement pointer by 8 bytes
          BNEZ R1, Loop        ; branch if R1 != 0
• How does it run without scheduling?
  • 9 cycles per iteration
  • LD, LD stall, ADDD, 2 RAW stalls, SD, SUBI, BNEZ, branch delay control stall
Loop Without and With Scheduling
Without scheduling (9 cycles):
    Loop: LD   F0, 0(R1)
          stall
          ADDD F4, F0, F2
          stall
          stall
          SD   0(R1), F4
          SUBI R1, R1, #8
          BNEZ R1, Loop
          stall
With scheduling (6 cycles):
    Loop: LD   F0, 0(R1)
          stall
          ADDD F4, F0, F2
          SUBI R1, R1, #8
          BNEZ R1, Loop
          SD   8(R1), F4
• Note that this is non-trivial, and many compilers don’t even try
• Move SD to the branch delay slot
  • But SUBI changes a register that SD needs!
  • Since SD moved past the SUBI, its offset must be adjusted from 0 to 8
• Down to 6 cycles per iteration, but 3 of those are still loop overhead and stall (SUBI, BNEZ, LD stall)
Loop Unrolling
• Basic idea: take n loop bodies and concatenate them into one basic block
  • Will need to adjust the termination code
• Let’s say n is 4
  • Then change the R1 pointer by 4x what it was before => 32
• Savings: 4 BNEZs + 4 SUBIs => just one of each in the new unrolled loop
  • Hence 75% savings in loop overhead instructions
• Problem: still have 4 load stalls per loop
Unrolled Loop Example
    Loop: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   0(R1), F4       ; drop SUBI and BNEZ
          LD   F6, -8(R1)
          ADDD F8, F6, F2
          SD   -8(R1), F8      ; drop SUBI and BNEZ
          LD   F10, -16(R1)
          ADDD F12, F10, F2
          SD   -16(R1), F12    ; drop SUBI and BNEZ
          LD   F14, -24(R1)
          ADDD F16, F14, F2
          SD   -24(R1), F16
          SUBI R1, R1, #32
          BNEZ R1, Loop
Unrolling With Scheduling
• Don’t concatenate the unrolled segments; shuffle them instead
  • 4 LDs, then 4 ADDDs, then 4 SDs
• No more stalls, since each LD -> ADDD dependent path now has 3 instructions in it…
• Result is 14 cycles for 4 elements => 3.5 cycles/element
  • Compare with 9 cycles/element with no scheduling
  • or 6 cycles/element with scheduling but no unrolling
Loop Unrolling With Scheduling
    Loop: LD   F0, 0(R1)
          LD   F6, -8(R1)
          LD   F10, -16(R1)
          LD   F14, -24(R1)
          ADDD F4, F0, F2
          ADDD F8, F6, F2
          ADDD F12, F10, F2
          ADDD F16, F14, F2
          SD   0(R1), F4
          SD   -8(R1), F8
          SD   -16(R1), F12
          SUBI R1, R1, #32
          BNEZ R1, Loop
          SD   8(R1), F16      ; note 8-32 = -24
Things to Notice
• We had 8 more unused register pairs
  • We could have gone to an 8 block unroll without register conflict
  • No problem, since the 1000-element array would still have broken cleanly (1000/8 = 125)
• What if it had not? Suppose the division has a remainder R?
  • Just put R blocks (shuffled of course) in front of the loop, then start for real
• Even if you run out of registers, you can still cycle names and remove stalls
• Most compilers unroll early to expose code for later optimizations
• This one had a tricky one => the SD/SUBI swap
• Key was the independent nature of each loop body
  • What if they’re not independent?
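The "R blocks in front of the loop" idea can be sketched at the C level. This assumes an unroll factor of 4, zero-based indexing, and n >= 0; the function name is hypothetical, and a compiler would do this on the DLX code rather than on the source.

```c
/* x[i] += s for i in [0, n), unrolled by 4.
   The n % 4 leftover elements run first, then the unrolled loop. */
void add_scalar_unrolled(double *x, int n, double s) {
    int r = n % 4;
    int i = 0;
    for (; i < r; i++)          /* R leftover bodies, ahead of the loop */
        x[i] += s;
    for (; i < n; i += 4) {     /* four bodies, one test and one increment */
        x[i]     += s;
        x[i + 1] += s;
        x[i + 2] += s;
        x[i + 3] += s;
    }
}
```

After the leftover pass, the remaining element count is a multiple of 4, so the unrolled loop never runs past the end of the array.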
Data Dependency Analysis
• Three types: data, name, and control
• Instruction i is data dependent on instruction j if:
  • i uses a result produced by j
  • Or, i uses a result produced by k, and k is data dependent on j
• Dependence indicates a possible RAW hazard
  • Does it induce a stall? That depends on pipeline structure and forwarding capability
• Compiler dataflow analysis
  • Creates a graph that makes these dependencies explicit as directed paths
Data Dependency
    Loop: LD   F0, 0(R1)
          ADDD F4, F0, F2      ; uses F0 from LD
          SD   0(R1), F4       ; uses F4 from ADDD

          SUBI R1, R1, #8
          BNEZ R1, Loop        ; uses R1 from SUBI
Name Dependence
• Occurs when two instructions use the same register name without a data dependence between them
  • E.g. an unrolled loop without changing register names
• Let i precede j in program order
• There is an antidependence between i and j when j writes a register that i reads
  • Essentially the same as a WAR hazard
  • Hence ordering must be preserved to avoid the hazard
• There is an output dependence between i and j if they both write the same register
  • Essentially a WAW hazard
  • So we have to avoid that too
• Otherwise, no real data dependence, just name
  • So registers can be renamed statically by the compiler or dynamically by the hardware
Control Dependence • Since branches are conditional • Some instructions will be executed, and others will not • Must maintain order due to branches • Two obvious constraints to maintain control dep’s • Instructions controlled by branch can’t be moved before the branch (or they would become unconditional) • Instructions not controlled by the branch can’t be moved after the branch (or they would become conditional) • Simple pipelines preserve this so it’s not a big deal.
Loop-Carried Dependence
• Consider the following code:
    for (I = 1; I <= 1000; I++) {
        A[I+1] = A[I] + C[I];       /* S1 */
        B[I+1] = B[I] + A[I+1];     /* S2 */
    }
• S1 uses an S1 value produced in the previous iteration
• S2 uses an S2 value produced in the previous iteration
• S2 uses an S1 value produced in the same iteration
• So S1 has a loop-carried dependence on itself, and so does S2
• If non-loop-carried dependencies were the only ones, the loop bodies could execute in parallel
Another Loop-Carried Dependence
    for (I = 1; I <= 100; I++) {
        A[I] = A[I] + B[I];         /* S1 */
        B[I+1] = C[I] + D[I];       /* S2 */
    }
• S1 uses the value S2 produced in the previous iteration
• However, the dependence is not circular, since neither statement depends on itself
• And there is no "S1 depends on S2 depends on S1" circularity either
• So there is no cycle in the dependencies, and the loop can be parallelized and unrolled (provided the statements are kept in order)
    A[1] = A[1] + B[1];
    for (I = 1; I <= 99; I++) {
        B[I+1] = C[I] + D[I];
        A[I+1] = A[I+1] + B[I+1];
    }
    B[101] = C[100] + D[100];
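One way to convince yourself that the transformed loop computes the same values is to run both versions on the same data. The sketch below wraps each version in a function; the function names and the double element type are assumptions, and the arrays must have at least 102 elements because indices 1 through 101 are touched.

```c
/* Original loop: S1 reads the B value that S2 wrote one iteration earlier. */
void lcd_original(double *A, double *B, const double *C, const double *D) {
    for (int i = 1; i <= 100; i++) {
        A[i] = A[i] + B[i];         /* S1 */
        B[i + 1] = C[i] + D[i];     /* S2 */
    }
}

/* Transformed loop: the dependence now stays inside a single iteration. */
void lcd_transformed(double *A, double *B, const double *C, const double *D) {
    A[1] = A[1] + B[1];
    for (int i = 1; i <= 99; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[101] = C[100] + D[100];
}
```

In the transformed version no iteration reads a value written by another iteration, so the iterations can be overlapped or unrolled freely.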
Our Infrastructure for Lab 1
In the /home/cs/handin/cs5810/bin directory on CADE
• lcc - DLX C compiler; use with the -S switch to get assembly code in a .s file
• dlxasm - Assembler that converts .s files into .dlx object files that can run on our simulator
• bin2a - A binary to ASCII converter that lets you look at object files if you like
• dlxsim - A simulator for the DLX processor
  • Type h or ? at the prompt for a brief listing of commands
  • Only gives executed instruction counts at the moment
  • You’ll extend it later…
Data Infrastructure
In the /home/cs/handin/cs5810/ directory
• New directory for each lab
  • I.e. /home/cs/handin/cs5810/lab1
• Also a src directory with benchmarks (small toy examples) in C