
Chapter 2-3: Basic Program Transformations


fagan

Presentation Transcript


  1. Chapter 2-3: Basic Program Transformations

  2. Optimizing your code Transforming the code into something different from what the programmer wrote, but that still does the same thing, can have a huge impact on performance • Is this a compiler subject or a computer architecture subject? • Yes! • Many, many architectural details are driven by, or have an effect on, code optimization

  3. Major Types of Optimization (see Chapter 2, Fig 2-19, P93)
      • High level (at or near source level)
          • Procedure integration
      • Local (within a basic block)
          • Common subexpression elimination
          • Constant propagation
          • Stack height reduction
      • Global (across a branch)
          • Copy propagation
          • Code motion
          • Induction variable elimination
      • Machine-dependent
          • Strength reduction
          • Pipeline scheduling (more about this later…)

  4. Strength Reduction Substitute a simpler operation when equivalent • Multiply => shifts and adds is a popular area • Y = X ** 2; replace with Y = X * X; • J = K * 2; replace with J = K + K;
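The "multiply => shifts and adds" bullet can be sketched in C as follows. This is only an illustration under stated assumptions: the constant 10, the function names, and the `check_times10` helper are hypothetical and do not come from the slides, and a real compiler decides whether to do this based on the target's instruction costs.

```c
#include <assert.h>
#include <stdint.h>

/* Straightforward version: one multiply. */
uint32_t times10_mul(uint32_t x) {
    return x * 10u;
}

/* Strength-reduced version: 10*x = 8*x + 2*x, i.e. two shifts and an add.
 * Unsigned arithmetic keeps both forms well defined on overflow
 * (both wrap modulo 2^32). */
uint32_t times10_shift(uint32_t x) {
    return (x << 3) + (x << 1);
}

/* Hypothetical helper: asserts that the two forms agree for 0 <= x < limit. */
void check_times10(uint32_t limit) {
    for (uint32_t x = 0; x < limit; x++) {
        assert(times10_mul(x) == times10_shift(x));
    }
}
```

Using an unsigned type is a deliberate choice here: shifting signed negative values is not portable C, so the equivalence is only claimed for `uint32_t`.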

  5. Variable Renaming Use distinct names for each unrelated use of the same variable to simplify later optimizations • Original: X = Y * Z; Q = R + X + X; X = A + B; • The second assignment to X is unrelated to the first use • Replace it with X1 = A + B; (later reads of that value use X1)
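A small C sketch of the renaming idea, assuming integer variables and wrapping the slide's three statements in functions so the effect can be compared. The function names and the `q` output parameter are hypothetical, introduced only for illustration.

```c
#include <assert.h>

/* One name, two unrelated uses: the second assignment to x must stay
 * after the statement that reads the first x. */
int reused_name(int y, int z, int r, int a, int b, int *q) {
    assert(q != 0);
    int x;
    x = y * z;
    *q = r + x + x;
    x = a + b;          /* unrelated value stored under the same name */
    return x;
}

/* After renaming: x1 no longer conflicts with x, so a + b can be
 * computed before, after, or in parallel with the computation of *q. */
int renamed(int y, int z, int r, int a, int b, int *q) {
    assert(q != 0);
    int x  = y * z;
    int x1 = a + b;     /* moved up freely; the ordering constraint is gone */
    *q = r + x + x;
    return x1;
}
```

Both versions compute the same values; the renamed one simply gives a scheduler more freedom in ordering the statements.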

  6. Common Subexpression Elimination Avoid recalculating the same expression • In this code, you would hope the compiler would compute the address of a[ j ][ k ] only once for both statements… • a[ j ][ k ] = b[ j ][ k ] + x * b[ j ][ j-1 ] ; sum = length[ j ] * a[ j ][ k ];
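What the compiler is hoped to do can be written out by hand in C. This is a sketch under assumptions: the array dimensions `ROWS` and `COLS`, the element type `double`, and both function names are hypothetical, chosen only to make the slide's two statements compilable.

```c
#include <assert.h>

#define ROWS 4
#define COLS 4

/* As written on the slide: a[j][k] is named twice, b[j] is indexed twice. */
double as_written(double a[ROWS][COLS], double b[ROWS][COLS],
                  const double length[ROWS], double x, int j, int k) {
    assert(j >= 1 && j < ROWS && k >= 0 && k < COLS);
    a[j][k] = b[j][k] + x * b[j][j - 1];
    return length[j] * a[j][k];              /* sum */
}

/* After common subexpression elimination by hand: each row address is
 * computed once, and the value stored to a[j][k] is reused from t. */
double after_cse(double a[ROWS][COLS], double b[ROWS][COLS],
                 const double length[ROWS], double x, int j, int k) {
    assert(j >= 1 && j < ROWS && k >= 0 && k < COLS);
    double *a_row = a[j];                    /* address of row j of a */
    const double *b_row = b[j];              /* address of row j of b */
    double t = b_row[k] + x * b_row[j - 1];
    a_row[k] = t;
    return length[j] * t;                    /* no reload of a[j][k] */
}
```

An optimizing compiler would normally do this itself; writing it out only makes the shared address computations visible.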

  7. Loop Invariant Code Motion
      • Avoid operations in loops that are the same in each iteration
      • Original
          for (j = 0; j < max; j++) {
              a[ j ] = b[ j ] + c * d;
              e = g[ k ];
          }
      • Revised
          tmp = c * d;
          for (j = 0; j < max; j++)
              a[ j ] = b[ j ] + tmp;
          e = g[ k ];
      • Moving e = g[ k ] out of the loop assumes the loop runs at least once (max > 0); otherwise the original never assigns e

  8. Copy Propagation Propagate the original instead of the copy • In this example, y is still copied to x, but all subsequent uses of x are replaced with y • Original: x = y; z = 2 * x; q = x + 15; • Revised: x = y; z = 2 * y; q = y + 15; • We may find that x is never used again…

  9. Constant Folding If the value of a variable is really a constant that can be determined at compile time, replace it with the constant • int j = 0; int k = 1; m = j + k; • Here j and k are known at compile time, so m = j + k; can become m = 1;

  10. Dead Code Removal • Eliminate instructions whose results are never used • update() { int j, k; j = k = 1; j += 1; k += 2; printf(" J is %d\n", j); } • Here k is never read, so k += 2; and the initialization of k can be removed

  11. Branch Delay Slots • Some machines (like DLX) always execute instructions in the Branch Delay Slot(s) • Challenge is for the compiler to find code to put in those slots (See Fig 3.28, P 169) • Three places to find such code • An independent instruction from before the branch (Best choice) • From the branch target (Risky: may need to copy the instruction, and it must not cause a problem when the branch is not taken) • From the fall-through code (Risky: same problem when the branch is taken) • Compiler can hide ~70% of branch hazards on DLX running SPEC92 codes.

  12. Chapter 4: Pipeline Scheduling and ILP

  13. Software Scheduling to Avoid Load Hazards
      Try producing fast code for a = b + c; d = e - f; assuming a, b, c, d, e, and f are in memory.
      Slow code:
          LW  Rb,b
          LW  Rc,c
          ADD Ra,Rb,Rc
          SW  a,Ra
          LW  Re,e
          LW  Rf,f
          SUB Rd,Re,Rf
          SW  d,Rd
      Fast code:
          LW  Rb,b
          LW  Rc,c
          LW  Re,e
          ADD Ra,Rb,Rc
          LW  Rf,f
          SW  a,Ra
          SUB Rd,Re,Rf
          SW  d,Rd
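The difference between the two sequences can be checked with a toy stall counter in C. It rests on a simplifying assumption that is not stated on this slide: a load's result is unavailable to the very next instruction (a one-cycle load delay), and every other result is forwarded with no stall. The `Instr` type, the register enum, and `count_load_stalls` are hypothetical names for this sketch.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { OP_LW, OP_SW, OP_ALU } Op;

typedef struct {
    Op  op;
    int dest;   /* register written, or -1 if none (SW) */
    int src1;   /* registers read, or -1 if unused */
    int src2;
} Instr;

enum { Ra, Rb, Rc, Rd, Re, Rf };

/* Counts one stall each time an instruction reads the register that the
 * immediately preceding LW is loading (assumed one-cycle load delay). */
int count_load_stalls(const Instr *code, size_t n) {
    assert(code != NULL || n == 0);
    int stalls = 0;
    for (size_t i = 1; i < n; i++) {
        const Instr *prev = &code[i - 1];
        if (prev->op == OP_LW &&
            (code[i].src1 == prev->dest || code[i].src2 == prev->dest)) {
            stalls++;
        }
    }
    return stalls;
}

/* The slide's "slow code", in program order. */
const Instr slow_code[] = {
    {OP_LW,  Rb, -1, -1}, {OP_LW,  Rc, -1, -1}, {OP_ALU, Ra, Rb, Rc},
    {OP_SW,  -1, Ra, -1}, {OP_LW,  Re, -1, -1}, {OP_LW,  Rf, -1, -1},
    {OP_ALU, Rd, Re, Rf}, {OP_SW,  -1, Rd, -1},
};

/* The slide's "fast code": loads hoisted away from their consumers. */
const Instr fast_code[] = {
    {OP_LW,  Rb, -1, -1}, {OP_LW,  Rc, -1, -1}, {OP_LW,  Re, -1, -1},
    {OP_ALU, Ra, Rb, Rc}, {OP_LW,  Rf, -1, -1}, {OP_SW,  -1, Ra, -1},
    {OP_ALU, Rd, Re, Rf}, {OP_SW,  -1, Rd, -1},
};
```

Under this model, `count_load_stalls(slow_code, 8)` returns 2 (the ADD right after `LW Rc` and the SUB right after `LW Rf`), while `count_load_stalls(fast_code, 8)` returns 0.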

  14. Instruction Level Parallelism (ILP) • Pipelining supports a limited sense of ILP • E.g. overlapped instructions, hazard issues, forwarding logic, etc. • Remember: Pipeline CPI = Ideal CPI + Structural Stalls + Data Stalls + Control Stalls • So, let’s try to be more aggressive about reducing the stalls to improve the CPI…

  15. Software Techniques • Loop unrolling • Bigger basic blocks • Attempt to reduce control stalls • Basic pipeline scheduling • Reduce RAW stalls • Lots of other hardware techniques to talk about later…

  16. ILP Within a Basic Block • Basic Block definition • Straight line code, no branches out • Single entry point at the top • Real code is a bunch of basic blocks connected by branches • Notice: • Branch frequency is approx 15% of total mix (for integer programs) • This implies that basic block size is between 6 and 7 instructions • Machine instructions don’t do much • So, there’s probably little in the way of ILP available • Easiest target is the loop • Already exploited by vector processors, but using different mechanisms

  17. Loop Level Parallelism • Consider adding two 1000 element arrays: for (I=1; I<=1000; I=I+1) x[I] = x[I] + y[I]; • Sure it’s trivial, but it illustrates the point • There is no dependence between data values produced in any iteration j and those needed in j+n for any j and n • Truly independent, hence could be 1000-way parallel • Independence means no stalls due to data hazards • Problem is that we have to use that pesky branch instruction • Vector processor model • Load vectors X and Y (up to some machine-dependent max) • Then do result-vec = xvec + yvec in a single instruction

  18. Assumptions About Timing • Default DLX pipeline timings for this chapter

  19. Loop Unrolling
      • Consider adding a scalar s to a vector (assume lowest array element is in location 0)
          for (I = 1; I <= 1000; I++)
              x[I] = x[I] + s;
          Loop: LD   F0, 0(R1)    ; R1 is the array pointer
                ADDD F4, F0, F2   ; add scalar in F2
                SD   0(R1), F4    ; store result
                SUBI R1, R1, #8   ; decrement pointer by 8 bytes
                BNEZ R1, Loop     ; branch if R1 != 0
      • How does it run without scheduling?
          • 9 cycles per iteration
          • LD, LD stall, ADDD, 2 RAW stalls, SD, SUBI, BNEZ, branch delay control stall

  20. Loop Without and With Scheduling
      • Without scheduling
          Loop: LD   F0, 0(R1)
                stall
                ADDD F4, F0, F2
                stall
                stall
                SD   0(R1), F4
                SUBI R1, R1, #8
                BNEZ R1, Loop
                stall
      • With scheduling
          Loop: LD   F0, 0(R1)
                stall
                ADDD F4, F0, F2
                SUBI R1, R1, #8
                BNEZ R1, Loop
                SD   8(R1), F4    ; in the branch delay slot
      • Note that this is non-trivial, and many compilers don’t even try
      • Move SD to branch delay slot
      • But, SUBI changes a register that SD needs!
      • Since we moved it past the SUBI, need to adjust offset
      • Down to 6 cycles/loop, but still has 3 cycle loop+stall overhead

  21. Loop Unrolling • Basic Idea – take n loop bodies and concatenate them into one basic block • Will need to adjust termination code • Let’s say n was 4 • Then modify the R1 pointer in the example by 4x of what it was before => 32 • Savings – 4 BNEZ’s + 4 SUBI’s => just one of each in new unrolled loop • Hence 75% savings • Problem: Still have 4 load stalls per loop

  22. Unrolled Loop Example
          Loop: LD   F0, 0(R1)
                ADDD F4, F0, F2
                SD   0(R1), F4       ; drop SUBI and BNEZ
                LD   F6, -8(R1)
                ADDD F8, F6, F2
                SD   -8(R1), F8      ; drop SUBI and BNEZ
                LD   F10, -16(R1)
                ADDD F12, F10, F2
                SD   -16(R1), F12    ; drop SUBI and BNEZ
                LD   F14, -24(R1)
                ADDD F16, F14, F2
                SD   -24(R1), F16
                SUBI R1, R1, #32
                BNEZ R1, Loop

  23. Unrolling With Scheduling • Don’t concatenate the unrolled segments, shuffle them instead • 4 LDs then 4 ADDDs then 4 SDs • No more stalls since LD -> ADDD dependent path now has 3 instructions in it… • Result is 14 cycles for 4 elements => 3.5 cycles/element • Compare with 9 cycles/element with no scheduling • or 6 cycles/element with scheduling but no unrolling

  24. Loop Unrolling With Scheduling
          Loop: LD   F0, 0(R1)
                LD   F6, -8(R1)
                LD   F10, -16(R1)
                LD   F14, -24(R1)
                ADDD F4, F0, F2
                ADDD F8, F6, F2
                ADDD F12, F10, F2
                ADDD F16, F14, F2
                SD   0(R1), F4
                SD   -8(R1), F8
                SD   -16(R1), F12
                SUBI R1, R1, #32
                BNEZ R1, Loop
                SD   8(R1), F16      ; note 8-32 = -24

  25. Things to Notice • We had 8 more unused register pairs • We could have gone to an 8 block unroll without register conflict • No problem since the 1000-element array would still have broken cleanly (1000/8 = 125) • What if it had not? Suppose the division has a remainder R? • Just put R blocks (shuffled of course) in front of the loop, then start for real • Even if you run out of registers, you can still cycle names and remove stalls. • Most compilers unroll early to expose code for later optimizations • This one had a tricky one => SD/SUBI swap • Key was independent nature of each loop body • What if they’re not independent?
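The "put R blocks in front of the loop" advice can be sketched at the source level in C. This is an illustration under assumptions: it uses a 0-based ascending index instead of the slides' descending pointer, an unroll factor of 4, and hypothetical function names.

```c
#include <assert.h>
#include <stddef.h>

/* Reference loop: x[i] = x[i] + s for every element. */
void add_scalar(double *x, size_t n, double s) {
    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        x[i] += s;
    }
}

/* Unrolled by 4. The R = n % 4 leftover iterations run first, in front
 * of the main loop, so the main loop always handles whole groups of 4. */
void add_scalar_unrolled4(double *x, size_t n, double s) {
    assert(x != NULL || n == 0);
    size_t i = 0;
    size_t r = n % 4;
    for (; i < r; i++) {
        x[i] += s;                       /* remainder iterations */
    }
    for (; i < n; i += 4) {
        /* Shuffled body: 4 loads, then 4 adds, then 4 stores, each copy
         * using its own temporary (the analogue of distinct registers). */
        double t0 = x[i], t1 = x[i + 1], t2 = x[i + 2], t3 = x[i + 3];
        t0 += s; t1 += s; t2 += s; t3 += s;
        x[i] = t0; x[i + 1] = t1; x[i + 2] = t2; x[i + 3] = t3;
    }
}
```

With n = 10, for example, two remainder iterations run first and the unrolled loop then runs twice; one loop-control update now covers four elements.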

  26. Data Dependency Analysis • Three types: Data, Name, and Control • Instruction i is data dependent on instruction j if: • i uses a result produced by j • Or, i uses a result produced by k, and k depends on j • Dependence indicates a possible RAW hazard • Does it induce a stall? Depends on pipeline structure and forwarding capability • Compiler dataflow analysis • Creates a graph that makes these dependencies explicit as directed paths

  27. Data Dependency
          Loop: LD   F0, 0(R1)
                ADDD F4, F0, F2      ; depends on LD through F0
                SD   0(R1), F4       ; depends on ADDD through F4
                SUBI R1, R1, #8
                BNEZ R1, Loop        ; depends on SUBI through R1

  28. Name Dependence • Occurs when second instruction uses same register name without a data dependence • E.g. unrolled loop without changing register names • Let i precede j in program order • An antidependence exists between i and j when j writes a register that i reads • Essentially the same as a WAR hazard • Hence ordering must be preserved to avoid the hazard • An output dependence exists between i and j when they both write the same register • Essentially a WAW hazard • So we have to avoid that too • Otherwise, no real data dependence, just name • So registers can be renamed statically by the compiler or dynamically by the hardware

  29. Control Dependence • Since branches are conditional • Some instructions will be executed, and others will not • Must maintain order due to branches • Two obvious constraints to maintain control dep’s • Instructions controlled by branch can’t be moved before the branch (or they would become unconditional) • Instructions not controlled by the branch can’t be moved after the branch (or they would become conditional) • Simple pipelines preserve this so it’s not a big deal.

  30. Loop-Carried Dependence
      • Consider the following code:
          for (I=1; I<=1000; I++) {
              A[I+1] = A[I] + C[I];      /* S1 */
              B[I+1] = B[I] + A[I+1];    /* S2 */
          }
      • S1 uses an S1 value produced in the previous iteration
      • S2 uses an S2 value produced in the previous iteration
      • S2 uses an S1 value produced in the same iteration
      • So, S1 has a loop-carried dependence on itself, and so does S2
      • If the non-loop-carried dependence (S2 on S1) were the only one, the loop bodies could execute in parallel

  31. Another Loop Carried Dependence
          for (I=1; I<=100; I++) {
              A[I] = A[I] + B[I];        /* S1 */
              B[I+1] = C[I] + D[I];      /* S2 */
          }
      • S1 uses the value S2 produced in the previous iteration
      • However, dependence is not circular since neither statement depends on itself
      • And there is no "S1 depends on S2 depends on S1" circularity either
      • So, no cycle in dependencies, loop can be parallelized and unrolled (provided statements are kept in order)
          A[1] = A[1] + B[1];
          for (I=1; I<=99; I++) {
              B[I+1] = C[I] + D[I];
              A[I+1] = A[I+1] + B[I+1];
          }
          B[101] = C[100] + D[100];
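That the rewritten loop computes the same values as the original can be checked with a short C sketch. The element type `double`, the array size `N + 2` (to allow the slide's 1-based indices up to 101), the input values, and the function names are all assumptions made for this illustration.

```c
#include <assert.h>

#define N 100

/* Original loop from the slide (1-based indices, so arrays hold N + 2). */
void original_loop(double A[N + 2], double B[N + 2],
                   const double C[N + 2], const double D[N + 2]) {
    for (int i = 1; i <= N; i++) {
        A[i] = A[i] + B[i];              /* S1 */
        B[i + 1] = C[i] + D[i];          /* S2 */
    }
}

/* Transformed loop from the slide: the first S1 and the last S2 are
 * peeled off, so the dependence now stays inside one iteration. */
void transformed_loop(double A[N + 2], double B[N + 2],
                      const double C[N + 2], const double D[N + 2]) {
    A[1] = A[1] + B[1];
    for (int i = 1; i <= N - 1; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[N + 1] = C[N] + D[N];
}

/* Hypothetical self-check: run both versions on identical inputs and
 * assert that every element of A and B matches. */
void check_equivalent(void) {
    double A1[N + 2], B1[N + 2], A2[N + 2], B2[N + 2], C[N + 2], D[N + 2];
    for (int i = 0; i < N + 2; i++) {
        A1[i] = A2[i] = i;
        B1[i] = B2[i] = 2 * i;
        C[i] = 3 * i;
        D[i] = i + 1;
    }
    original_loop(A1, B1, C, D);
    transformed_loop(A2, B2, C, D);
    for (int i = 0; i < N + 2; i++) {
        assert(A1[i] == A2[i]);
        assert(B1[i] == B2[i]);
    }
}
```

In the transformed body each iteration reads only the `B[i + 1]` it has just written, which is why different iterations no longer depend on one another.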

  32. Our Infrastructure for Lab 1 In the /home/cs/handin/cs5810/bin directory on CADE • lcc - DLX C compiler - use with -S switch to get assembly code in a .s file • dlxasm - Assembler that converts .s files into .dlx object files that can run on our simulator • bin2a - A binary to ASCII converter that lets you look at object files if you like • dlxsim - A simulator for the DLX processor • Type h or ? at the prompt for a brief listing of commands • Only gives executed instruction counts at the moment • You’ll extend it later…

  33. Data Infrastructure In the /home/cs/handin/cs5810/ directory • New directory for each lab • I.e. /home/cs/handin/cs5810/lab1 • Also a src directory with benchmarks (small toy examples) in C
