
ECE 454 Computer Systems Programming Compiler and Optimization (II)






Presentation Transcript


  1. ECE 454 Computer Systems Programming: Compiler and Optimization (II) Ding Yuan, ECE Dept., University of Toronto, http://www.eecg.toronto.edu/~yuan

  2. Content • Compiler Basics (last lec.) • Understanding Compiler Optimization (last lec.) • Manual Optimization • Advanced Optimizations • Parallel Unrolling • Profile-Directed Feedback Ding Yuan, ECE454

  3. Manual Optimization Example: Vector Sum Function Ding Yuan, ECE454

  4. Vector Procedures [figure: vec_ptr points to a length field followed by data elements 0, 1, 2, …, length–1] • Procedures: • vec_ptr new_vec(int len) • Create vector of specified length • int get_vec_element(vec_ptr v, int index, int *dest) • Retrieve vector element, store at *dest • Return 0 if out of bounds, 1 if successful • int *get_vec_start(vec_ptr v) • Return pointer to start of vector data Ding Yuan, ECE454
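A minimal sketch of a vector representation consistent with this interface; the slide only gives the three procedure signatures, so the struct layout and field names below are assumptions:

    /* Hypothetical layout of the abstract vector type */
    typedef struct {
        int len;      /* number of elements */
        int *data;    /* start of element storage */
    } vec_rec, *vec_ptr;

    vec_ptr new_vec(int len);                              /* create vector of given length */
    int get_vec_element(vec_ptr v, int index, int *dest);  /* 0 if out of bounds, 1 on success */
    int *get_vec_start(vec_ptr v);                         /* pointer to start of data */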

  5. Vector Sum Function: Original Code void vsum(vec_ptr v, int *dest) { int i; *dest = 0; for (i=0; i < vec_length(v); i++) { int val; get_vec_element(v, i, &val); *dest += val; } } • What it does: • Add all the vector elements together • Store the resulting sum at destination location *dest • Performance Metric: CPU Cycles per Element (CPE) • CPE = 42.06 (Compiled -g) Ding Yuan, ECE454

  6. After Compiler Optimization // same as vsum void vsum1(vec_ptr v, int *dest) { int i; *dest = 0; for (i=0; i<vec_length(v); i++) { int val; get_vec_element(v, i, &val); *dest += val; } } void vsum(vec_ptr v, int *dest) { int i; *dest = 0; for (i=0; i<vec_length(v); i++) { int val; get_vec_element(v, i, &val); *dest += val; } } • Impact of compiler optimization • vsum: • CPE = 42.06 (Compiled -g) • vsum1: (a C-code view of the resulting assembly instructions) • CPE = 31.25 (Compiled -O2, will compile -O2 from now on) • Result: better scheduling & regalloc, lots of missed opportunities Ding Yuan, ECE454

  7. Optimization Blocker: Procedure Calls • Why didn't the compiler move vec_length() out of the loop? • Procedure may have side effects • Function may not return the same value for given arguments • Depends on other parts of global state • Why doesn't the compiler look at the code for vec_length()? • Linker may override it with a different version • Unless declared static • Interprocedural optimization is not used extensively due to cost • Warning: • Compiler treats a procedure call as a black box • Weak optimizations in and around them Ding Yuan, ECE454
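To see why the call is a black box, consider a hypothetical implementation of vec_length whose result and behaviour depend on mutable global state; hoisting the call out of the loop would then change what the program does (this definition is purely illustrative, not from the course code, and assumes the struct sketched above):

    int calls_so_far = 0;            /* global state */

    int vec_length(vec_ptr v) {
        calls_so_far++;              /* side effect visible to the rest of the program */
        return v->len;
    }

Without interprocedural analysis the compiler cannot prove vec_length is free of such effects, so it must re-evaluate the call on every iteration.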

  8. Manual LICM void vsum2(vec_ptr v, int *dest) { int i; int length = vec_length(v); *dest = 0; for (i = 0; i < length; i++) { int val; get_vec_element(v, i, &val); *dest += val; } } void vsum1(vec_ptr v, int *dest) { int i; *dest = 0; for (i=0; i<vec_length(v); i++) { int val; get_vec_element(v, i, &val); *dest += val; } } • CPE = 20.66 • Next Opportunity: Inlining get_vec_element()? • compiler may not have thought it worth it • at -O2 the compiler weighs the code-size increase against the expected performance gain Ding Yuan, ECE454

  9. Manual Inlining void vsum3i(vec_ptr v, int *dest) { int i; int length = vec_length(v); *dest = 0; for (i = 0; i < length; i++) { int val; int *data = get_vec_start(v); val = data[i]; *dest += val; } } void vsum2(vec_ptr v, int *dest) { int i; int length = vec_length(v); *dest = 0; for (i = 0; i < length; i++) { int val; get_vec_element(v, i, &val); *dest += val; } } • Inlining: replace a function call with its body • shouldn’t normally have to do this manually • could also convince the compiler to do it via directives • Inlining enables further optimizations Ding Yuan, ECE454
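Rather than copying the body by hand, the call can often be inlined by the compiler itself. A minimal sketch using GCC/Clang-style directives (the always_inline attribute is compiler-specific, and the struct fields are the hypothetical ones from the earlier sketch):

    /* Definition must be visible in the translation unit where it is called. */
    static inline __attribute__((always_inline))
    int get_vec_element(vec_ptr v, int index, int *dest) {
        if (index < 0 || index >= v->len)
            return 0;                /* out of bounds */
        *dest = v->data[index];
        return 1;                    /* success */
    }

With the definition visible and marked inline, -O2 can substitute the body at the call site and then apply the same follow-on optimizations shown on the next slides.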

  10. Further LICM, val unnecessary void vsum3(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); *dest = 0; for (i = 0; i < length; i++) { *dest += data[i]; } } void vsum3i(vec_ptr v, int *dest) { int i; int length = vec_length(v); *dest = 0; for (i = 0; i < length; i++) { int val; int *data = get_vec_start(v); val = data[i]; *dest += val; } } CPE = 6.00 • Compiler may not always inline when it should • It may not know how important a certain loop is • It may be blocked by missing function definition Ding Yuan, ECE454

  11. Reduce Unnecessary Memory Refs void vsum4(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int sum = 0; for (i = 0; i < length; i++) sum += data[i]; *dest = sum; } void vsum3(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); *dest = 0; for (i = 0; i < length; i++) { *dest += data[i]; } } • Don't need to store in destination until end • Can use local variable sum instead • It is more likely to be allocated to a register! • Avoids 1 memory read, 1 memory write per iteration • CPE = 2.00 • Difficult for the compiler to promote pointer de-refs to registers Ding Yuan, ECE454
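Why can't the compiler keep *dest in a register on its own? Because dest might alias the vector's own storage, in which case writing the partial sum back on every iteration is observable. A small self-contained illustration (hypothetical code, not from the slides):

    #include <stdio.h>

    /* Behaves like vsum3: the running sum is written through dest each iteration. */
    static void sum_into(int *data, int n, int *dest) {
        *dest = 0;
        for (int i = 0; i < n; i++)
            *dest += data[i];        /* one load and one store per iteration */
    }

    int main(void) {
        int a[3] = {1, 2, 3};
        sum_into(a, 3, &a[0]);       /* dest aliases data[0] */
        printf("%d\n", a[0]);        /* prints 5; a register-accumulating version would store 6 */
        return 0;
    }

Because the two versions can differ when dest points into data, the compiler must assume the worst and keep the per-iteration store.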

  12. Why are pointers so hard for compilers? • Pointer Aliasing • Two different pointers might point to a single location • Example: int x = 5, y = 10; int *p = &x; int *q = &y; if (one_in_a_million) { q = p; } // compiler must assume q == p == &x from here on, even though it is unlikely Ding Yuan, ECE454

  13. Minimizing the Impact of Pointers • Easy to over-use pointers in C/C++ • Since allowed to do address arithmetic • Direct access to storage structures • Get in the habit of introducing local variables • E.g., when accumulating within loops • Your way of telling the compiler not to worry about aliasing Ding Yuan, ECE454
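A minimal sketch of both habits combined: a local accumulator keeps the running value out of memory, and C99 restrict (optional, and a promise the caller must honour) tells the compiler that data and dest never overlap. The function name and signature are illustrative, not from the course code:

    void sum_noalias(const int * restrict data, int n, int * restrict dest) {
        int sum = 0;                 /* local accumulator: easily kept in a register */
        for (int i = 0; i < n; i++)
            sum += data[i];
        *dest = sum;                 /* single store after the loop */
    }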

  14. Understanding Instruction-Level Parallelism: Unrolling and Software Pipelining Ding Yuan, ECE454

  15. Superscalar CPU Design [block diagram: an instruction-control front end (fetch control, instruction cache, instruction decode, retirement unit, register file, branch prediction) issues operations to an execution stage with functional units (integer/branch, general integer, FP add, FP mult/div, load, store), which exchange addresses and data with the data cache and send results back for register updates] Ding Yuan, ECE454

  16. Assumed CPU Capabilities • Multiple Instructions Can Execute in Parallel • 1 load • 1 store • 2 integer (one may be branch) • 1 FP Addition • 1 FP Multiplication or Division • Some Instructions Take > 1 Cycle, but Can be Pipelined • Load / Store: latency 3, cycles/issue 1 • Integer Multiply: latency 4, cycles/issue 1 • Double/Single FP Multiply: latency 5, cycles/issue 2 • Double/Single FP Add: latency 3, cycles/issue 1 • Integer Divide: latency 36, cycles/issue 36 • Double/Single FP Divide: latency 38, cycles/issue 38 Ding Yuan, ECE454

  17. Intel Core i7 Ding Yuan, ECE454

  18. Zooming into x86 assembly void vsum4(vec_ptr v, int *dest) { int i; // i => %edx int length = vec_length(v); // length => %esi int *data = get_vec_start(v); // data => %eax int sum = 0; // sum => %ecx for (i = 0; i < length; i++) sum += data[i]; *dest = sum; } .L24: # Loop code only: addl (%eax,%edx,4),%ecx # sum += data[i] // data+i*4 incl %edx # i++ cmpl %esi,%edx # compare i and length jl .L24 # if i < length goto L24 Ding Yuan, ECE454

  19. [dataflow diagram of one vsum loop iteration: addl (%eax,%edx,4),%ecx is split into a load (3 cycles) feeding an addl (1 cycle) that updates %ecx (sum); incl updates %edx (i), then cmpl and jl consume it] Note: addl includes a load! Ding Yuan, ECE454

  20. Visualizing the vsum Loop [dataflow diagram: the load feeds addl, which updates %ecx (sum); incl updates %edx (i) and feeds cmpl/jl; time runs downward] .L24: # Loop code only: addl (%eax,%edx,4),%ecx # sum += data[i] // data+i*4 incl %edx # i++ cmpl %esi,%edx # compare i and length jl .L24 # if i < length goto L24 • Operations • Vertical position denotes time at which executed • Cannot begin operation until operands available • Height denotes latency • addl is split into a load and an addl • E.g., x86 converts instructions to micro-ops Ding Yuan, ECE454

  21. 4 Iterations of vsum • Unlimited Resource Analysis • Assume operation can start as soon as operands available • Operations for multiple iterations overlap in time • CPE = 1.0 • But needs to be able to issue 4 integer ops per cycle • Performance on assumed CPU: • CPE = 2.0 • Number of parallel integer ops (2) is the bottleneck 4 integer ops Ding Yuan, ECE454

  22. Loop Unrolling: vsum void vsum5(vec_ptr v, int *dest) { int length = vec_length(v); int limit = length-2; int *data = get_vec_start(v); int sum = 0; int i; for (i = 0; i < limit; i+=3) { sum += data[i]; sum += data[i+1]; sum += data[i+2]; } for ( ; i < length; i++) { sum += data[i]; } *dest = sum; } • Optimization • Combine multiple iterations into single loop body • Amortizes loop overhead across multiple iterations • Finish extras at end • Measured CPE = 1.33 Ding Yuan, ECE454

  23. Visualizing Unrolled Loop: vsum [dataflow diagram: the three loads of one unrolled iteration are independent and can pipeline; the three addl operations into %ecx still form a chain; one set of loop-control operations per iteration; time runs downward] • Loads can pipeline, since they don't have dependences • Only one set of loop control operations Ding Yuan, ECE454

  24. Executing with Loop Unrolling: vsum • Predicted Performance • Can complete iteration in 3 cycles • Should give CPE of 1.0 Ding Yuan, ECE454

  25. Why 3 times? [dataflow diagram residue: several iterations of the loop unrolled 2 times, each with two load/addl pairs plus the incl, cmpl, jl loop-control operations] Unroll 2 times: what is the problem? Ding Yuan, ECE454

  26. Vector multiply function: vprod [dataflow diagram of one vprod iteration: the load feeds imull, which updates %ecx (prod); incl updates %edx (i) and feeds cmpl/jl; time runs downward] .L24: # Loop: imull (%eax,%edx,4),%ecx # prod *= data[i] incl %edx # i++ cmpl %esi,%edx # i:length jl .L24 # if < goto Loop void vprod4(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int prod = 1; for (i = 0; i < length; i++) prod *= data[i]; *dest = prod; } Ding Yuan, ECE454

  27. 3 Iterations of vprod • Unlimited Resource Analysis • Assume operation can start as soon as operands available • Operations for multiple iterations overlap in time • Performance on CPU: • Limiting factor becomes latency of integer multiplier • Data hazard • Gives CPE of 4.0 Ding Yuan, ECE454

  28. Unrolling: Performance Summary • Only helps integer sum for our examples • Other cases constrained by functional unit latencies • Effect is nonlinear with degree of unrolling • Many subtle effects determine exact scheduling of operations • Can we do better for vprod? Ding Yuan, ECE454

  29. Visualizing vprod [figure: a sequential chain of multiplies, 1 * x0 * x1 * x2 * … * x11, each multiply depending on the previous result] • Computation ((((((((((((1 * x0) * x1) * x2) * x3) * x4) * x5) * x6) * x7) * x8) * x9) * x10) * x11) • Performance • N elements, D cycles/operation • N*D cycles = N * 4 Ding Yuan, ECE454

  30. Parallel Unrolling Method1: vprod void vprod5pu1(vec_ptr v, int *dest) { int length = vec_length(v); int limit = length-1; int *data = get_vec_start(v); int x = 1; int i; for (i = 0; i < limit; i+=2) { x = x * (data[i] * data[i+1]); } if (i < length) // fix-up x *= data[i]; *dest = x; } • Code Version • Integer product • Optimization • Multiply pairs of elements together • And then update product • Performance • CPE = 2 Assume unlimited hardware resources from now on Ding Yuan, ECE454

  31. [dataflow diagram for vprod5pu1: in each iteration the two loads and the multiply tmp = data[i] * data[i+1] are independent of earlier iterations; only x = x * tmp stays on the loop-carried chain, alongside the loop-control operations] • Still have the data dependency between multiplications, but it's one dependency every two elements Ding Yuan, ECE454

  32. Order of Operations Can Matter! /* Combine 2 elements at a time */ for (i = 0; i < limit; i+=2) { x = (x * data[i]) * data[i+1]; } • CPE = 4.00 • All multiplies performed in sequence Compiler can do it for you /* Combine 2 elements at a time */ for (i = 0; i < limit; i+=2) { x = x * (data[i] * data[i+1]); } • CPE = 2 • Multiplies overlap Ding Yuan, ECE454

  33. Parallel Unrolling Method2: vprod void vprod6pu2(vec_ptr v, int *dest) { int length = vec_length(v); int limit = length-1; int *data = get_vec_start(v); int x0 = 1; int x1 = 1; int i; for (i = 0; i < limit; i+=2) { x0 *= data[i]; x1 *= data[i+1]; } if (i < length) // fix-up x0 *= data[i]; *dest = x0 * x1; } • Code Version • Integer product • Optimization • Accumulate in two different products • Can be performed simultaneously • Combine at end • Performance • CPE = 2.0 • 2X performance Compiler won't do it for you Ding Yuan, ECE454

  34. Method2: vprod6pu2 visualized for (i = 0; i < limit; i+=2) { x0 *= data[i]; x1 *= data[i+1]; } … *dest = x0 * x1; Ding Yuan, ECE454

  35. Trade-off for loop unrolling • Benefits • Reduces the loop overhead operations (e.g., condition check & loop variable increment) • Reason the integer add (vsum5) can achieve a CPE of 1 • Enhances parallelism (often requires manually rewriting the loop body) • Harms • Increased code size • Register pressure • Lots of registers needed to hold sums/products • IA32: Only 6 usable integer registers, 8 FP registers • When not enough registers, must spill temporaries onto stack • Wipes out any performance gains Ding Yuan, ECE454

  36. Needed for Effective Parallel Unrolling: • Mathematical • Combining operation must be associative & commutative • OK for integer multiplication • Not strictly true for floating point • OK for most applications • Hardware • Pipelined functional units, superscalar execution • Lots of Registers to hold sums/products Ding Yuan, ECE454
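A tiny self-contained counterexample for the floating-point caveat (the values are chosen purely for illustration): reassociating the additions, which is exactly what parallel unrolling does, changes the rounded result.

    #include <stdio.h>

    int main(void) {
        double big = 1e16, small = 1.0;
        double left  = (big + small) + small;   /* each 1.0 is lost to rounding: 1e16 */
        double right = big + (small + small);   /* the 2.0 survives: 1e16 + 2 */
        printf("%.1f\n%.1f\n", left, right);    /* the two orderings print different values */
        return 0;
    }

For most applications this tiny difference is acceptable, which is why the slide says the transformation is usually OK in practice.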

  37. Summary Ding Yuan, ECE454

  38. Summary of Optimizations Ding Yuan, ECE454

  39. Punchlines • Before you start: will optimization be worth the effort? • profile to estimate potential impact • Get the most out of your compiler before going manual • trade-off: manual optimization vs readable/maintainable • Exploit compiler opts for readable/maintainable code • have lots of functions/modularity • debugging/tracing code (enabled by static flags) • Limit use of pointers • pointer-based arrays and pointer arithmetic • function pointers and virtual functions (unfortunately) • For super-performance-critical code: • zoom into assembly to look for optimization opportunities • consider the instruction-parallelism capabilities of CPU Ding Yuan, ECE454

  40. How to know what the compiler did? run: objdump -d a.out • prints a listing of instructions, with instruction address and encoding void main(){ int i = 5; int x = 10; … 08048394 <main>: 8048394: 55 push %ebp 8048395: 89 e5 mov %esp,%ebp 8048397: 83 ec 10 sub $0x10,%esp 804839a: c7 45 f8 05 00 00 00 movl $0x5,-0x8(%ebp) 80483a1: c7 45 fc 0a 00 00 00 movl $0xa,-0x4(%ebp) (annotations in the slide point out the instruction address and instruction encoding columns) Ding Yuan, ECE454
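If the binary was built with -g, objdump -S (a standard objdump flag) additionally interleaves the original C source with the disassembly, which makes it easier to map instructions back to the statements they came from: run objdump -S a.out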

  41. Advanced Topics • Profile-Directed Feedback (PDF) • Basically feeding gprof-like measurements into the compiler • Allows compiler to make smarter choices/optimizations • E.g., handle the "if (one_in_a_million)" case well • Just-in-Time Compilation and Optimization (JIT) • Performs simple versions of many of the basic optimizations • Does them dynamically at run-time, exploits online PDF • E.g., Java JIT • Current hot topics in compilation (more later): • Automatic parallelization (exploiting multicores) Ding Yuan, ECE454
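As a concrete sketch of the PDF workflow (the flags shown are GCC's; other compilers have analogous options, and the file names are placeholders): build once with instrumentation, run on representative input, then rebuild using the recorded profile.

    gcc -O2 -fprofile-generate prog.c -o prog   # instrumented build
    ./prog                                      # run on representative input; writes profile data
    gcc -O2 -fprofile-use prog.c -o prog        # rebuild; optimizer uses the recorded frequencies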
