ECE 454 Computer Systems Programming: Compiler and Optimization (II) Ding Yuan, ECE Dept., University of Toronto http://www.eecg.toronto.edu/~yuan
Content • Compiler Basics (last lec.) • Understanding Compiler Optimization (last lec.) • Manual Optimization • Advanced Optimizations • Parallel Unrolling • Profile-Directed Feedback Ding Yuan, ECE454
Manual Optimization Example: Vector Sum Function Ding Yuan, ECE454
Vector Procedures (diagram: a vec_ptr points to a header holding length plus a data array indexed 0 … length–1) • Procedures: vec_ptr new_vec(int len) • Create vector of specified length int get_vec_element(vec_ptr v, int index, int *dest) • Retrieve vector element, store at *dest • Return 0 if out of bounds, 1 if successful int *get_vec_start(vec_ptr v) • Return pointer to start of vector data Ding Yuan, ECE454
Vector Sum Function: Original Code void vsum(vec_ptr v, int *dest) { int i; *dest = 0; for (i=0; i < vec_length(v); i++) { int val; get_vec_element(v, i, &val); *dest += val; } } • What it does: • Add all the vector elements together • Store the resulting sum at destination location *dest • Performance Metric: CPU Cycles per Element (CPE) • CPE = 42.06 (Compiled -g) Ding Yuan, ECE454
After Compiler Optimization // same as vsum void vsum1(vec_ptr v, int *dest) { int i; *dest = 0; for (i=0; i<vec_length(v); i++) { int val; get_vec_element(v, i, &val); *dest += val; } } void vsum(vec_ptr v, int *dest) { int i; *dest = 0; for (i=0; i<vec_length(v); i++) { int val; get_vec_element(v, i, &val); *dest += val; } } • Impact of compiler optimization • vsum: • CPE = 42.06 (Compiled -g) • vsum1: (a C-code view of the resulting assembly instructions) • CPE = 31.25 (Compiled -O2; will compile -O2 from now on) • Result: better scheduling & regalloc, lots of missed opportunities Ding Yuan, ECE454
Optimization Blocker: Procedure Calls • Why didn't the compiler move vec_length out of the inner loop? • Procedure may have side effects • Function may not return the same value for given arguments • Depends on other parts of global state • Why doesn't the compiler look at the code for vec_length? • Linker may overload with a different version • Unless declared static • Interprocedural optimization is not used extensively due to cost • Warning: • Compiler treats a procedure call as a black box • Weak optimizations in and around them Ding Yuan, ECE454
Manual LICM void vsum2(vec_ptr v, int *dest) { int i; int length = vec_length(v); *dest = 0; for (i = 0; i < length; i++) { int val; get_vec_element(v, i, &val); *dest += val; } } void vsum1(vec_ptr v, int *dest) { int i; *dest = 0; for (i=0; i<vec_length(v); i++) { int val; get_vec_element(v, i, &val); *dest += val; } } • CPE = 20.66 • Next Opportunity: Inlining get_vec_element()? • compiler may not have thought it worth it • -O2 weighs the code-size increase against the expected performance gain Ding Yuan, ECE454
Manual Inlining void vsum3i(vec_ptr v, int *dest) { int i; int length = vec_length(v); *dest = 0; for (i = 0; i < length; i++) { int val; int *data = get_vec_start(v); val = data[i]; *dest += val; } } void vsum2(vec_ptr v, int *dest) { int i; int length = vec_length(v); *dest = 0; for (i = 0; i < length; i++) { int val; get_vec_element(v, i, &val); *dest += val; } } • Inlining: replace a function call with its body • shouldn’t normally have to do this manually • could also convince the compiler to do it via directives • Inlining enables further optimizations Ding Yuan, ECE454
Further LICM, val unnecessary void vsum3(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); *dest = 0; for (i = 0; i < length; i++) { *dest += data[i]; } } void vsum3i(vec_ptr v, int *dest) { int i; int length = vec_length(v); *dest = 0; for (i = 0; i < length; i++) { int val; int *data = get_vec_start(v); val = data[i]; *dest += val; } } CPE = 6.00 • Compiler may not always inline when it should • It may not know how important a certain loop is • It may be blocked by missing function definition Ding Yuan, ECE454
Reduce Unnecessary Memory Refs void vsum4(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int sum = 0; for (i = 0; i < length; i++) sum += data[i]; *dest = sum; } void vsum3(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); *dest = 0; for (i = 0; i < length; i++) { *dest += data[i]; } } • Don't need to store in destination until end • Can use local variable sum instead • It is more likely to be allocated to a register! • Avoids 1 memory read, 1 memory write per iteration • CPE = 2.00 • Difficult for the compiler to promote pointer de-refs to registers Ding Yuan, ECE454
Why are pointers so hard for compilers? • Pointer Aliasing • Two different pointers might point to a single location • Example: int x = 5, y = 10; int *p = &x; int *q = &y; if (one_in_a_million){ q = p; } // compiler must assume q may equal p (i.e., point to x) from here on, however unlikely Ding Yuan, ECE454
Minimizing the impact of Pointers • Easy to over-use pointers in C/C++ • Since allowed to do address arithmetic • Direct access to storage structures • Get in the habit of introducing local variables • E.g., when accumulating within loops • Your way of telling the compiler not to worry about aliasing Ding Yuan, ECE454
Understanding Instruction-Level Parallelism: Unrolling and Software Pipelining Ding Yuan, ECE454
Superscalar CPU Design (block diagram) • Instruction Control: Fetch Control + Instruction Cache feed Instruction Decode; Retirement Unit updates the Register File; branch prediction is checked ("Prediction OK?") • Execution: Functional Units — Integer/Branch, General Integer, FP Add, FP Mult/Div, Load, Store — exchange addresses and data with the Data Cache; operation results feed back as register updates Ding Yuan, ECE454
Assumed CPU Capabilities • Multiple Instructions Can Execute in Parallel • 1 load • 1 store • 2 integer (one may be branch) • 1 FP Addition • 1 FP Multiplication or Division • Some Instructions Take > 1 Cycle, but Can be Pipelined (Latency / Cycles-per-Issue): • Load / Store: 3 / 1 • Integer Multiply: 4 / 1 • Double/Single FP Multiply: 5 / 2 • Double/Single FP Add: 3 / 1 • Integer Divide: 36 / 36 • Double/Single FP Divide: 38 / 38 Ding Yuan, ECE454
Intel Core i7 Ding Yuan, ECE454
Zooming into x86 assembly void vsum4(vec_ptr v, int *dest) { int i; // i => %edx int length = vec_length(v); // length => %esi int *data = get_vec_start(v); // data => %eax int sum = 0; // sum => %ecx for (i = 0; i < length; i++) sum += data[i]; *dest = sum; } .L24: # Loop code only: addl (%eax,%edx,4),%ecx # sum += data[i] // data+i*4 incl %edx # i++ cmpl %esi,%edx # compare i and length jl .L24 # if i < length goto L24 Ding Yuan, ECE454
Dataflow of one vsum iteration • addl (%eax,%edx,4),%ecx is split into a load (3 cycles) feeding an addl into %ecx (1 cycle) — note: the addl includes a load! • incl %edx, cmpl, and jl form the loop-control chain Ding Yuan, ECE454
Visualizing the vsum Loop .L24: # Loop code only: addl (%eax,%edx,4),%ecx # sum += data[i] // data+i*4 incl %edx # i++ cmpl %esi,%edx # compare i and length jl .L24 # if i < length goto L24 • Operations • Vertical position denotes time at which executed • Cannot begin operation until operands available • Height denotes latency • addl is split into a load and addl • E.g., x86 converts to micro-ops Ding Yuan, ECE454
4 Iterations of vsum • Unlimited Resource Analysis • Assume operation can start as soon as operands available • Operations for multiple iterations overlap in time • CPE = 1.0 • But needs to be able to issue 4 integer ops per cycle • Performance on assumed CPU: • CPE = 2.0 • Number of parallel integer ops (2) is the bottleneck Ding Yuan, ECE454
Loop Unrolling: vsum void vsum5(vec_ptr v, int *dest) { int length = vec_length(v); int limit = length-2; int *data = get_vec_start(v); int sum = 0; int i; for (i = 0; i < limit; i+=3) { sum += data[i]; sum += data[i+1]; sum += data[i+2]; } for ( ; i < length; i++) sum += data[i]; *dest = sum; } • Optimization • Combine multiple iterations into single loop body • Amortizes loop overhead across multiple iterations • Finish extras at end • Measured CPE = 1.33 Ding Yuan, ECE454
Visualizing Unrolled Loop: vsum • The three loads can pipeline, since they don't have dependences • Only one set of loop control operations Ding Yuan, ECE454
Executing with Loop Unrolling: vsum • Predicted Performance • Can complete iteration in 3 cycles • Should give CPE of 1.0 Ding Yuan, ECE454
Why 3 times? (dataflow diagram: three consecutive iterations of the unroll-by-2 loop) Unroll 2 times: what is the problem? Ding Yuan, ECE454
Vector multiply function: vprod void vprod4(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int prod = 1; for (i = 0; i < length; i++) prod *= data[i]; *dest = prod; } .L24: # Loop: imull (%eax,%edx,4),%ecx # prod *= data[i] incl %edx # i++ cmpl %esi,%edx # i:length jl .L24 # if < goto Loop Ding Yuan, ECE454
3 Iterations of vprod • Unlimited Resource Analysis • Assume operation can start as soon as operands available • Operations for multiple iterations overlap in time • Performance on CPU: • Limiting factor becomes latency of integer multiplier • Data hazard • Gives CPE of 4.0 Ding Yuan, ECE454
Unrolling: Performance Summary • Only helps integer sum for our examples • Other cases constrained by functional unit latencies • Effect is nonlinear with degree of unrolling • Many subtle effects determine exact scheduling of operations • Can we do better for vprod? Ding Yuan, ECE454
Visualizing vprod • Computation: ((((((((((((1 * x0) * x1) * x2) * x3) * x4) * x5) * x6) * x7) * x8) * x9) * x10) * x11) — one sequential chain of multiplies • Performance • N elements, D cycles/operation • N*D cycles = N * 4 Ding Yuan, ECE454
Parallel Unrolling Method 1: vprod void vprod5pu1(vec_ptr v, int *dest) { int length = vec_length(v); int limit = length-1; int *data = get_vec_start(v); int x = 1; int i; for (i = 0; i < limit; i+=2) { x = x * (data[i] * data[i+1]); } if (i < length) // fix-up x *= data[i]; *dest = x; } • Code Version • Integer product • Optimization • Multiply pairs of elements together • And then update product • Performance • CPE = 2 Assume unlimited hardware resources from now on Ding Yuan, ECE454
Visualizing Method 1 (dataflow) • Each iteration does tmp = data[i] * data[i+1] (first mull), then x = x * tmp (second mull) • Still have the data dependency between multiplications, but it's one dependence every two elements Ding Yuan, ECE454
Order of Operations Can Matter! /* Combine 2 elements at a time */ for (i = 0; i < limit; i+=2) { x = (x * data[i]) * data[i+1]; } • CPE = 4.00 • All multiplies performed in sequence Compiler can do it for you: /* Combine 2 elements at a time */ for (i = 0; i < limit; i+=2) { x = x * (data[i] * data[i+1]); } • CPE = 2 • Multiplies overlap Ding Yuan, ECE454
Parallel Unrolling Method 2: vprod void vprod6pu2(vec_ptr v, int *dest) { int length = vec_length(v); int limit = length-1; int *data = get_vec_start(v); int x0 = 1; int x1 = 1; int i; for (i = 0; i < limit; i+=2) { x0 *= data[i]; x1 *= data[i+1]; } if (i < length) // fix-up x0 *= data[i]; *dest = x0 * x1; } • Code Version • Integer product • Optimization • Accumulate in two different products • Can be performed simultaneously • Combine at end • Performance • CPE = 2.0 • 2X performance Compiler won't do it for you Ding Yuan, ECE454
Method 2: vprod6pu2 visualized for (i = 0; i < limit; i+=2) { x0 *= data[i]; x1 *= data[i+1]; } … *dest = x0 * x1; Ding Yuan, ECE454
Trade-off for loop unrolling • Benefits • Reduces loop overhead (e.g., condition check & loop-variable increment) • Reason the integer add (vsum5) can approach a CPE of 1 • Enhances parallelism (often requires manually rewriting the loop body) • Harm • Increased code size • Register pressure • Lots of registers needed to hold sums/products • IA32: Only 6 usable integer registers, 8 FP registers • When not enough registers, must spill temporaries onto stack • Wipes out any performance gains Ding Yuan, ECE454
Needed for Effective Parallel Unrolling: • Mathematical • Combining operation must be associative & commutative • OK for integer multiplication • Not strictly true for floating point • OK for most applications • Hardware • Pipelined functional units, superscalar execution • Lots of Registers to hold sums/products Ding Yuan, ECE454
Summary Ding Yuan, ECE454
Summary of Optimizations Ding Yuan, ECE454
Punchlines • Before you start: will optimization be worth the effort? • profile to estimate potential impact • Get the most out of your compiler before going manual • trade-off: manual optimization vs readable/maintainable • Exploit compiler opts for readable/maintainable code • have lots of functions/modularity • debugging/tracing code (enabled by static flags) • Limit use of pointers • pointer-based arrays and pointer arithmetic • function pointers and virtual functions (unfortunately) • For super-performance-critical code: • zoom into assembly to look for optimization opportunities • consider the instruction-parallelism capabilities of CPU Ding Yuan, ECE454
How to know what the compiler did? run: objdump –d a.out • prints a listing of instructions, with instruction address and encoding void main(){ int i = 5; int x = 10; … 08048394 <main>: 8048394: 55 push %ebp 8048395: 89 e5 mov %esp,%ebp 8048397: 83 ec 10 sub $0x10,%esp 804839a: c7 45 f8 05 00 00 00 movl $0x5,-0x8(%ebp) 80483a1: c7 45 fc 0a 00 00 00 movl $0xa,-0x4(%ebp) (left column: instruction address; hex bytes: instruction encoding) Ding Yuan, ECE454
Advanced Topics • Profile-Directed Feedback (PDF) • Basically feeding gprof-like measurements into compiler • Allows compiler to make smarter choices/optimizations • Eg., handle the “if (one-in-a-million)” case well • Just-in-Time Compilation and Optimization (JIT) • Performs simple version of many of the basic optimizations • Does them dynamically at run-time, exploits online PDF • Eg., Java JIT • Current hot topics in compilation (more later): • Automatic parallelization (exploiting multicores) Ding Yuan, ECE454