CS 3214 Computer Systems

CS 3214 Computer Systems
Godmar Back
Lecture 9

Announcements
• Stay tuned for Exercise 5
• Project 2 due Sep 30
• Auto-fail rule 2:
  • Need at least Firecracker to blow up to pass class.
CS 3214 Fall 2010

Some of the following slides are taken with permission from Complete Powerpoint Lecture Notes for Computer Systems: A Programmer's Perspective (CS:APP), Randal E. Bryant and David R. O'Hallaron
http://csapp.cs.cmu.edu/public/lectures.html

Part 2: Code Optimization

Roles of Programmer vs Compiler
(Diagram: programmer at the high level, compiler at the low level)
• Programmer:
  • Choice of algorithm, Big-O
  • Manual application of some optimizations
  • Choice of program structure that’s amenable to optimization
  • Avoidance of “optimization blockers”

Roles of Programmer vs Compiler
(Diagram: programmer at the high level, compiler at the low level)
• Optimizing Compiler
  • Applies transformations that preserve semantics, but reduce the amount of, or time spent in, computations
  • Provides efficient mapping of code to machine:
    • Selects and orders code
    • Performs register allocation
  • Usually consists of multiple stages

Eliminating Memory Accesses, Take 1
• Registers are faster than memory

    double sp1(double *x, double *y)
    {
        double sum = *x * *x + *y * *y;
        double diff = *x * *x - *y * *y;
        return sum * diff;
    }

How many memory accesses?

    sp1:
        movsd   (%rdi), %xmm1
        movsd   (%rsi), %xmm2
        mulsd   %xmm1, %xmm1
        mulsd   %xmm2, %xmm2
        movapd  %xmm1, %xmm0
        subsd   %xmm2, %xmm1
        addsd   %xmm2, %xmm0
        mulsd   %xmm1, %xmm0
        ret

• Only two loads, although the source dereferences pointers eight times
• The number of memory accesses is not related to how often pointer dereferences occur in the source code

Eliminating Memory Accesses, Take 2
• Order of accesses matters

    void sp1(double *x, double *y, double *sum, double *prod)
    {
        *sum = *x + *y;
        *prod = *x * *y;
    }

How many memory accesses?

    sp1:
        movsd   (%rdi), %xmm0
        addsd   (%rsi), %xmm0
        movsd   %xmm0, (%rdx)
        movsd   (%rdi), %xmm0
        mulsd   (%rsi), %xmm0
        movsd   %xmm0, (%rcx)
        ret

• Six: four loads and two stores; *x and *y are each loaded again after the store to *sum

Eliminating Memory Accesses, Take 3
• Compiler doesn’t know that sum or prod will never point to the same location as x or y!

    void sp2(double *x, double *y, double *sum, double *prod)
    {
        double xlocal = *x;
        double ylocal = *y;
        *sum = xlocal + ylocal;
        *prod = xlocal * ylocal;
    }

How many memory accesses?

    sp2:
        movsd   (%rdi), %xmm0
        movsd   (%rsi), %xmm2
        movapd  %xmm0, %xmm1
        mulsd   %xmm2, %xmm0
        addsd   %xmm2, %xmm1
        movsd   %xmm1, (%rdx)
        movsd   %xmm0, (%rcx)
        ret

• Four: two loads and two stores; the local copies make the reloads unnecessary
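To see why the compiler must reload *x and *y in the Take 2 version, it can help to call both versions with sum pointing at the same location as x. The sketch below repeats the slide bodies of sp1 and sp2; the helper prod_when_sum_aliases_x and the input values 2.0 and 3.0 are assumptions made up for illustration.

```c
/* Same body as sp1 on the Take 2 slide. */
void sp1(double *x, double *y, double *sum, double *prod)
{
    *sum = *x + *y;
    *prod = *x * *y;    /* reads *x again: the store to *sum may have changed it */
}

/* Same body as sp2 on the Take 3 slide. */
void sp2(double *x, double *y, double *sum, double *prod)
{
    double xlocal = *x;
    double ylocal = *y;
    *sum = xlocal + ylocal;
    *prod = xlocal * ylocal;
}

/* Hypothetical helper: call f with sum aliased to x and return the product. */
double prod_when_sum_aliases_x(void (*f)(double *, double *, double *, double *))
{
    double x = 2.0, y = 3.0, prod;
    f(&x, &y, &x, &prod);   /* third argument aliases the first */
    return prod;
}
```

With these inputs sp1 yields 15.0 (x is overwritten with 5.0 before the multiply) while sp2 yields 6.0. Because the two functions can legitimately differ, the compiler cannot turn sp1 into sp2 on its own.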

Inlining
• Substitute body of called function into the caller
  • *before subsequent optimizations are applied*
• Current compilers do this aggressively
• Almost never a need for doing this manually (e.g., via #define)

Inlining Example

    void sp1(double *x, double *y, double *sum, double *prod)
    {
        *sum = *x + *y;
        *prod = *x * *y;
    }

    double outersp1(double *x, double *y)
    {
        double sum, prod;
        sp1(x, y, &sum, &prod);
        return sum > prod ? sum : prod;
    }

    outersp1:
        movsd   (%rdi), %xmm1
        movsd   (%rsi), %xmm2
        movapd  %xmm1, %xmm0
        mulsd   %xmm2, %xmm1
        addsd   %xmm2, %xmm0
        maxsd   %xmm1, %xmm0
        ret

Case Study: Vector ADT
(Figure: a vector record with a length field and a data array holding elements 0, 1, 2, ..., length-1)
• Procedures
  vec_ptr new_vec(int len)
  • Create vector of specified length
  int get_vec_element(vec_ptr v, int index, int *dest)
  • Retrieve vector element, store at *dest
  • Return 0 if out of bounds, 1 if successful
  int *get_vec_start(vec_ptr v)
  • Return pointer to start of vector data
• Similar to array implementations in Pascal, ML, Java
  • E.g., always do bounds checking
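A minimal sketch of how such a vector ADT could be implemented is shown below. The record layout (length first, then a pointer to the elements), the type name vec_rec, and the zero-initialization in new_vec are assumptions for illustration; only the procedure signatures come from the slides.

```c
#include <stdlib.h>

/* Assumed layout: length field followed by a pointer to the element array. */
typedef struct {
    int len;
    int *data;
} vec_rec, *vec_ptr;

/* Create a vector of the given length, all elements zero; NULL on failure. */
vec_ptr new_vec(int len)
{
    if (len < 0)
        return NULL;
    vec_ptr v = malloc(sizeof(vec_rec));
    if (v == NULL)
        return NULL;
    v->len = len;
    /* Allocate at least one element so a zero-length vector still gets storage. */
    v->data = calloc(len > 0 ? (size_t)len : 1, sizeof(int));
    if (v->data == NULL) {
        free(v);
        return NULL;
    }
    return v;
}

/* Return length of vector. */
int vec_length(vec_ptr v)
{
    return v->len;
}

/* Retrieve element with bounds check: 0 if out of bounds, 1 if successful. */
int get_vec_element(vec_ptr v, int index, int *dest)
{
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

/* Return pointer to start of vector data (no bounds checking by the caller). */
int *get_vec_start(vec_ptr v)
{
    return v->data;
}
```

The bounds check in get_vec_element is what makes this resemble the Pascal, ML, and Java style of array access, and it is also the per-element cost the later combine versions work to remove.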

Optimization Example

    void combine1(vec_ptr v, int *dest)
    {
        int i;
        *dest = 0;
        for (i = 0; i < vec_length(v); i++) {
            int val;
            get_vec_element(v, i, &val);
            *dest += val;
        }
    }

• Procedure
  • Compute sum of all elements of vector
  • Store result at destination location

Time Scales
• Absolute Time
  • Typically use nanoseconds: 10^-9 seconds
  • Time scale of computer instructions
• Clock Cycles
  • Example: rlogin cluster machines: 2 GHz
    • 2 × 10^9 cycles per second
    • Clock period = 0.5 ns
• Most modern architectures provide a way to directly read a cycle counter: the “TSC” (“time stamp counter”)
  • But: can be tricky because it captures OS interaction as well
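The relationship between clock rate, clock period, and elapsed time can be written down as two one-line conversions. The function names below are hypothetical helpers, not part of any library.

```c
/* Clock period in nanoseconds for a clock rate given in GHz.
 * 1 GHz = 10^9 cycles per second, so the period is 1/ghz nanoseconds. */
double clock_period_ns(double ghz)
{
    return 1.0 / ghz;
}

/* Elapsed time in nanoseconds for a cycle count at the given clock rate. */
double cycles_to_ns(double cycles, double ghz)
{
    return cycles * clock_period_ns(ghz);
}
```

For the 2 GHz machines mentioned above, clock_period_ns(2.0) gives the 0.5 ns clock period from the slide.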

Cycles Per Element
• Convenient way to express performance of a program that operates on vectors or lists
• Length = n
• T = CPE*n + Overhead
(Plot: vsum1, slope = 4.0; vsum2, slope = 3.5)
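Since T = CPE*n + Overhead is a straight line, CPE is its slope and can be estimated from two measurements at different lengths. The sketch below uses hypothetical helper names and assumes the cycle counts have already been measured; a real measurement would fit a line through many points.

```c
/* Slope of the line through two (length, cycles) measurements. */
double estimate_cpe(double n1, double t1, double n2, double t2)
{
    return (t2 - t1) / (n2 - n1);
}

/* Overhead is what remains of one measurement after removing CPE*n. */
double estimate_overhead(double n, double t, double cpe)
{
    return t - cpe * n;
}

/* Predicted cycle count for a vector of length n: T = CPE*n + Overhead. */
double predict_cycles(double cpe, double overhead, double n)
{
    return cpe * n + overhead;
}
```

As an illustration with made-up (not measured) numbers: 480 cycles at n = 100 and 880 cycles at n = 200 give a CPE of 4.0 and an overhead of 80 cycles.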

Optimization Example

    void combine1(vec_ptr v, int *dest)
    {
        int i;
        *dest = 0;
        for (i = 0; i < vec_length(v); i++) {
            int val;
            get_vec_element(v, i, &val);
            *dest += val;
        }
    }

• Procedure
  • Compute sum of all elements of integer vector
  • Store result at destination location
  • Vector data structure and operations defined via abstract data type
• Pentium II/III Performance: Clock Cycles / Element
  • 42.06 (Compiled -g)
  • 31.25 (Compiled -O2)

Understanding Loop

    void combine1_goto(vec_ptr v, int *dest)
    {
        int i = 0;
        int val;
        *dest = 0;
        if (i >= vec_length(v))
            goto done;
    loop:                               /* start of 1 iteration */
        get_vec_element(v, i, &val);
        *dest += val;
        i++;
        if (i < vec_length(v))
            goto loop;                  /* end of 1 iteration */
    done:
        ;
    }

• Inefficiency
  • Procedure vec_length called every iteration
  • Even though result always the same

Move vec_length Call Out of Loop

    void combine2(vec_ptr v, int *dest)
    {
        int i;
        int length = vec_length(v);
        *dest = 0;
        for (i = 0; i < length; i++) {
            int val;
            get_vec_element(v, i, &val);
            *dest += val;
        }
    }

• Optimization
  • Move call to vec_length out of inner loop
    • Value does not change from one iteration to next
    • Code motion
  • CPE: 20.66 (Compiled -O2)
    • vec_length requires only constant time, but significant overhead

Code Motion Example #2

    void lower(char *s)
    {
        int i;
        for (i = 0; i < strlen(s); i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }

• Convert string from upper to lower case
• Here: asymptotic complexity becomes O(n^2)!
  • strlen is itself O(n) and is called on each of the n iterations

Lower Case Conversion Performance
• Time quadruples when string length doubles
• Quadratic performance

Performance after Code Motion
• Time doubles when string length doubles
• Linear performance
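The code-motion version behind the linear result is not shown on the slides; a sketch of what it could look like follows. The name lower2 and the use of size_t for the index are assumptions.

```c
#include <string.h>

/* Convert string to lower case, calling strlen once instead of per iteration. */
void lower2(char *s)
{
    size_t len = strlen(s);     /* hoisted: lower-casing never changes the length */
    for (size_t i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
```

Hoisting is safe here because the loop only rewrites letters and never writes a terminating null byte, so the length stays fixed. That is a fact about this loop that the programmer knows and the compiler would have to prove.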

Optimization Blocker: Procedure Calls
• Why couldn’t the compiler move vec_length or strlen out of the inner loop?
  • Procedure may have side effects
    • Alters global state each time called
  • Function may not return same value for given arguments
    • Depends on other parts of global state
    • Procedure lower could interact with strlen
• What if compiler looks at code? Or inlines them?
  • Even then, the compiler may not be able to prove that the same result is obtained, or the possibility of aliasing may require repeating the operation; and the compiler must preserve any side effects
  • Interprocedural optimization is expensive, but compilers are continuously getting better at it
    • For instance, they take into account whether a function reads or writes global memory
• Today’s compilers are different from the compilers 5 years ago and will be different from those 5 years from now
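A small illustration of why a call with side effects cannot be merged or moved: the function f below changes global state on every call, so calling it four times is not the same as calling it once and multiplying. All names here are hypothetical and chosen for illustration.

```c
/* Global state modified by every call to f. */
int counter = 0;

/* Side effect: increments counter; the result depends on earlier calls. */
int f(void)
{
    return counter++;
}

/* Four separate calls: returns 0 + 1 + 2 + 3 when counter starts at 0. */
int func1(void)
{
    return f() + f() + f() + f();
}

/* One call, multiplied: returns 4 * 0 when counter starts at 0. */
int func2(void)
{
    return 4 * f();
}
```

Starting from counter == 0, func1 returns 6 and func2 returns 0, so a compiler that cannot see into f must assume the two forms differ and keep every call.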

Remove Bounds Checking

    void combine3(vec_ptr v, int *dest)
    {
        int i;
        int length = vec_length(v);
        int *data = get_vec_start(v);
        *dest = 0;
        for (i = 0; i < length; i++) {
            *dest += data[i];
        }
    }

• Optimization
  • Avoid procedure call to retrieve each vector element
    • Get pointer to start of array before loop
    • Within loop just do pointer reference
    • Not as clean in terms of data abstraction
  • CPE: 6.00 (Compiled -O2)
    • Procedure calls are expensive!
    • Bounds checking is expensive

Eliminate Unneeded Memory Refs

    void combine4(vec_ptr v, int *dest)
    {
        int i;
        int length = vec_length(v);
        int *data = get_vec_start(v);
        int sum = 0;
        for (i = 0; i < length; i++)
            sum += data[i];
        *dest = sum;
    }

• Optimization
  • Don’t need to store in destination until end
  • Local variable sum held in register
  • Avoids 1 memory read, 1 memory write per iteration
  • CPE: 2.00 (Compiled -O2)
    • Memory references are expensive!
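One reason the compiler does not turn combine3 into combine4 by itself is that dest might point into the vector being summed. The sketch below uses plain arrays instead of the vector ADT to stay short; sum3 and sum4 mirror the loops of combine3 and combine4, and the helper result_when_dest_aliases_first and its input {1, 2, 3} are assumptions for illustration.

```c
/* Loop of combine3: accumulates directly in *dest. */
void sum3(int *data, int length, int *dest)
{
    *dest = 0;
    for (int i = 0; i < length; i++)
        *dest += data[i];
}

/* Loop of combine4: accumulates in a local, stores once at the end. */
void sum4(int *data, int length, int *dest)
{
    int sum = 0;
    for (int i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}

/* Hypothetical helper: run f with dest aliased to the first element. */
int result_when_dest_aliases_first(void (*f)(int *, int, int *))
{
    int data[3] = {1, 2, 3};
    f(data, 3, &data[0]);
    return data[0];
}
```

With dest aliased to data[0], sum3 first zeroes that element and ends with 5, while sum4 reads all three elements before storing and ends with 6. The programmer knows such aliasing is not intended; the compiler has to assume it could happen.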

Detecting Unneeded Memory Refs.

Combine3:

    .L18:
        movl    (%ecx,%edx,4),%eax
        addl    %eax,(%edi)
        incl    %edx
        cmpl    %esi,%edx
        jl      .L18

Combine4:

    .L24:
        addl    (%eax,%edx,4),%ecx
        incl    %edx
        cmpl    %esi,%edx
        jl      .L24

• Performance
  • Combine3
    • 5 instructions in 6 clock cycles
    • addl must read and write memory
  • Combine4
    • 4 instructions in 2 clock cycles

Pointer Code

    void combine4p(vec_ptr v, int *dest)
    {
        int length = vec_length(v);
        int *data = get_vec_start(v);
        int *dend = data + length;
        int sum = 0;
        while (data < dend) {
            sum += *data;
            data++;
        }
        *dest = sum;
    }

Big question: Should you rewrite your array code as pointer code to “help” the compiler?
• Optimization
  • Use pointers rather than array references
  • CPE: 3.00 (Compiled -O2)
    • Oops! Worse than the best array version
Warning: Some compilers do a better job optimizing array code

Pointer vs. Array Code Inner Loops

• Array Code

    .L24:                               # Loop:
        addl    (%eax,%edx,4),%ecx      #   sum += data[i]
        incl    %edx                    #   i++
        cmpl    %esi,%edx               #   i:length
        jl      .L24                    #   if < goto Loop

• Pointer Code

    .L30:                               # Loop:
        addl    (%eax),%ecx             #   sum += *data
        addl    $4,%eax                 #   data++
        cmpl    %edx,%eax               #   data:dend
        jb      .L30                    #   if < goto Loop

• Performance
  • Array Code: 4 instructions in 2 clock cycles
  • Pointer Code: Almost same 4 instructions in 3 clock cycles

Pointer vs. Array Code
• Difficult to predict which would be faster
• Compiler may transform array to pointer form if it deems it useful
• Compiler as a rule optimizes array code as well as or better than it does pointer code
  • Writing as array code allows use of index variable in index-based address modes
• Should prefer array form for readability

Lessons so far (1)
• Does not matter how many local variables or temporaries you introduce
• Does not matter if you use constants, expressions, or const local variables, or write-once local variables
  • So optimize for readability, not the compiler
• Does not matter how many pointer derefs you have in your code (*, [ ], ->) as long as there’s no intervening write/store to memory
  • If there is, compiler must repeat the ‘load’
  • Avoid introducing ‘stores’ by introducing local temporaries that defer the write to memory whenever possible
• Don’t rewrite array code into pointer form

Lessons so far (2)
• Inlining changes the game substantially
  • Compiler will aggressively inline functions whose definitions occur in same compilation unit
    • Does not matter if declared ‘static’ or not; but must be static if included in multiple files to avoid multiple strong symbols
  • Can remove abstraction penalty entirely in many cases
  • No need for manual inlining, using macros
• Inlining can generate better code because it enables optimizations not possible without knowing the caller:
  • Potential for aliasing of pointer arguments may be reduced, allowing for more precise and less conservative points-to analysis
  • May even be able to remove bounds checks (next slide)
• Caveat: inlining is not possible if target of the call is not known to the compiler
  • E.g. non-final, non-private methods in Java, or “virtual” methods in C++; so declare your methods final or private in Java whenever possible

combine1 Example under inlining

    /*
     * Retrieve vector element and store at dest.
     * Return 0 (out of bounds) or 1 (successful)
     */
    int get_vec_element(vec_ptr v, int index, int *dest)
    {
        if (index < 0 || index >= v->len)
            return 0;
        *dest = v->data[index];
        return 1;
    }

    /* Return length of vector */
    int vec_length(vec_ptr v)
    {
        return v->len;
    }

    void combine1(vec_ptr v, int *dest)
    {
        int i;
        *dest = 0;
        for (i = 0; i < vec_length(v); i++) {
            int val;
            get_vec_element(v, i, &val);
            *dest += val;
        }
    }

• Procedure
  • Compute sum of all elements of vector
  • Store result at destination location

    combine1:
        pushl   %ebp
        movl    %esp, %ebp
        movl    12(%ebp), %ecx
        pushl   %esi
        movl    8(%ebp), %esi
        pushl   %ebx
        movl    $0, (%ecx)
        movl    (%esi), %eax
        testl   %eax, %eax
        jle     .L375
        movl    4(%esi), %ebx
        xorl    %edx, %edx
        .p2align 4,,7
    .L374:
        movl    (%ebx,%edx,4), %eax
        addl    $1, %edx
        addl    %eax, (%ecx)
        cmpl    %edx, (%esi)
        jg      .L374
    .L375:
        popl    %ebx
        popl    %esi
        popl    %ebp
        ret

Form after inlining

    void combine1(vec_ptr v, int *dest)
    {
        int i;
        *dest = 0;
        for (i = 0; i < v->len; i++) {
            int val, ret;
            if (i < 0 || i >= v->len)   // becomes redundant!
            {
                ret = 0;
                goto skip;
            }
            val = v->data[i];
            ret = 1;
        skip:
            /* caller ignored return value */
            *dest += val;
        }
    }
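Once the redundant bounds check and the unused return value are dropped, the inlined loop reduces to roughly the following source form, which lines up with the four-instruction loop at .L374. The record layout and the name combine1_simplified are assumptions for illustration; the layout is chosen to match the offsets 0 and 4 used in the assembly on 32-bit x86.

```c
/* Assumed layout: len at offset 0, data pointer at offset 4 on 32-bit x86. */
typedef struct {
    int len;
    int *data;
} vec_rec, *vec_ptr;

/* Roughly what remains of combine1 after inlining and removing the check. */
void combine1_simplified(vec_ptr v, int *dest)
{
    int i;
    *dest = 0;
    /* v->len is re-read each iteration, like cmpl %edx, (%esi):
     * the store through dest could, as far as the compiler knows, change it. */
    for (i = 0; i < v->len; i++)
        *dest += v->data[i];
}
```

The two remaining memory references per iteration (the reload of v->len and the read-modify-write of *dest) are exactly the ones a programmer can remove by copying the length into a local and accumulating in a local variable.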
