
CS 3214 Introduction to Computer Systems



  1. CS 3214 Introduction to Computer Systems Godmar Back Lecture 6

  2. Announcements • Read Chapter 5 • Project 1 due Wed, Sep 16 - tomorrow • Need at least phase_4 defused to pass class. • Exercise 5 due Thursday, Sep 17 • Project 2 due Fri, Sep 25 • Need to pass at least “Firecracker” phase • Midterm: Tuesday Oct 20 • Office hours today 2-5pm (McB 122/124) CS 3214 Fall 2009

  3. Some of the following slides are taken with permission from Complete PowerPoint Lecture Notes for Computer Systems: A Programmer's Perspective (CS:APP), Randal E. Bryant and David R. O'Hallaron, http://csapp.cs.cmu.edu/public/lectures.html Part 1: Code Optimization CS 3214 Fall 2009

  4. Asymptotic Complexity • “Big-O” O(…), Θ(…) • Describes asymptotic behavior of time or space cost function of an algorithm as input size grows • Subject of complexity analysis (CS 3114) • Determine if a problem is tractable or not • Example: • Quicksort – O(n log n) average case • Bubble Sort – O(n^2) average case • Actual cost may be C1 * N^2 + C2 * N + C3 • These constants can matter and optimization can reduce them • Determine how big of a problem a tractable algorithm can handle in a concrete implementation on a given machine CS 3214 Fall 2009

  5. Roles of Programmer vs. Compiler • Programmer: • Choice of algorithm, Big-O • Manual application of some optimizations • Choice of program structure that’s amenable to optimization • Avoidance of “optimization blockers” [diagram: high-level to low-level spectrum, with the programmer at the high-level end and the compiler at the low-level end] CS 3214 Fall 2009

  6. Roles of Programmer vs. Compiler • Optimizing Compiler: • Applies transformations that preserve semantics but reduce the amount of, or time spent in, computation • Provides efficient mapping of code to machine: • Selects and orders code • Performs register allocation • Usually consists of multiple stages [diagram: same high-level to low-level spectrum as the previous slide] CS 3214 Fall 2009

  7. Inside GCC [diagram: compiler pipeline: source (C, C++, Java) → parser / semantic checker → trees → gimplifier → tree optimizations → expander → RTL → RTL passes → instruction selection → target arch] Source: Deters & Cytron, OOPSLA 2006 CS 3214 Fall 2009

  8. Limitations of Optimizing Compilers • Fundamentally, must emit code that implements specified semantics under all conditions • Can’t apply optimizations even if they would only change behavior in corner case a programmer may not think of • Due to memory aliasing • Due to unseen procedure side-effects • Do not see beyond current compilation unit • Intraprocedural analysis typically more extensive (since cheaper) than interprocedural analysis • Usually base decisions on static information CS 3214 Fall 2009

  9. Optimizations • Copy Propagation • Code Motion • Strength Reduction • Common Subexpression Elimination • Eliminating Memory Accesses • Through use of registers • Inlining CS 3214 Fall 2009

  10. Getting the compiler to optimize • -O0 (“O zero”) • This is the default: do not optimize • -O1 • Apply optimizations that can be done quickly • -O2 • Apply more expensive optimizations. This is a reasonable default for production code; a typical speedup ratio between -O2 and -O0 is 5-20. • -O3 • Apply even more expensive optimizations • -Os • Optimize for code size • See ‘info gcc’ for a list of which optimizations are enabled at each level; note that -f switches may enable additional optimizations that are not included in any -O level • Note: the ability to debug code symbolically under gdb decreases with optimization level; usually use -O0 -g or -O1 -g • NB: other compilers use different switches; some have up to -O7 CS 3214 Fall 2009

  11. Copy Propagation (1) • Which produces faster code?

int arith1(int x, int y, int z) {
    int x_plus_y = x + y;
    int x_minus_y = x - y;
    int x_plus_z = x + z;
    int x_minus_z = x - z;
    int y_plus_z = y + z;
    int y_minus_z = y - z;
    int xy_prod = x_plus_y * x_minus_y;
    int xz_prod = x_plus_z * x_minus_z;
    int yz_prod = y_plus_z * y_minus_z;
    return xy_prod + xz_prod + yz_prod;
}

int arith2(int x, int y, int z) {
    return (x + y) * (x - y)
         + (x + z) * (x - z)
         + (y + z) * (y - z);
}

CS 3214 Fall 2009

  12. Copy Propagation (2) • Neither: both compile to the identical instruction sequence

arith1:
    leal   (%rdx,%rdi), %ecx
    movl   %edi, %eax
    subl   %edx, %eax
    imull  %ecx, %eax
    movl   %esi, %ecx
    subl   %edx, %ecx
    addl   %esi, %edx
    imull  %edx, %ecx
    movl   %edi, %edx
    subl   %esi, %edx
    addl   %edi, %esi
    imull  %esi, %edx
    addl   %ecx, %eax
    addl   %edx, %eax
    ret

arith2: (same instruction sequence as arith1)

CS 3214 Fall 2009

  13. Constant Propagation • What does the compiler produce?

#include <stdio.h>

int sum(int a[], int n) {
    int i, s = 0;
    for (i = 0; i < n; i++)
        s += a[i];
    return s;
}

int main() {
    int v[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
    int s = sum(v, 10);
    printf("Sum is %d\n", s);
}

• The compiler computes the sum at compile time and passes the constant 55 directly:

.LC0:
    .string "Sum is %d\n"
main:
    …
    movl   $55, 4(%esp)
    movl   $.LC0, (%esp)
    call   printf

CS 3214 Fall 2009

  14. Code Motion • Do not repeat computations whose result is already known • Usually moved out of loops (“code hoisting”)

Original:

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[n*i + j] = b[j];

After hoisting n*i out of the inner loop:

for (i = 0; i < n; i++) {
    int ni = n*i;
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
}

CS 3214 Fall 2009

  15. Strength Reduction • Substitute a lower-cost operation for a more expensive one • E.g., replace 48*x with (x << 6) - (x << 4) • Often machine dependent; see exercise CS 3214 Fall 2009

  16. Common Subexpression Elimination • Reuse already-computed expressions

3 multiplications (i*n, (i-1)*n, (i+1)*n):

/* Sum neighbors of i,j */
up    = val[(i-1)*n + j];
down  = val[(i+1)*n + j];
left  = val[i*n + j-1];
right = val[i*n + j+1];
sum = up + down + left + right;

1 multiplication (i*n):

int inj = i*n + j;
up    = val[inj - n];
down  = val[inj + n];
left  = val[inj - 1];
right = val[inj + 1];
sum = up + down + left + right;

CS 3214 Fall 2009

  17. Eliminating Memory Accesses, Take 1 • Registers are faster than memory

double sp1(double *x, double *y) {
    double sum  = *x * *x + *y * *y;
    double diff = *x * *x - *y * *y;
    return sum * diff;
}

• How many memory accesses?

sp1:
    movsd  (%rdi), %xmm1
    movsd  (%rsi), %xmm2
    mulsd  %xmm1, %xmm1
    mulsd  %xmm2, %xmm2
    movapd %xmm1, %xmm0
    subsd  %xmm2, %xmm1
    addsd  %xmm2, %xmm0
    mulsd  %xmm1, %xmm0
    ret

• The number of memory accesses is not related to how often pointer dereferences occur in the source code CS 3214 Fall 2009

  18. Eliminating Memory Accesses, Take 2 • Order of accesses matters

void sp1(double *x, double *y, double *sum, double *prod) {
    *sum  = *x + *y;
    *prod = *x * *y;
}

• How many memory accesses?

sp1:
    movsd  (%rdi), %xmm0
    addsd  (%rsi), %xmm0
    movsd  %xmm0, (%rdx)
    movsd  (%rdi), %xmm0
    mulsd  (%rsi), %xmm0
    movsd  %xmm0, (%rcx)
    ret

CS 3214 Fall 2009

  19. Eliminating Memory Accesses, Take 3 • The compiler doesn’t know that sum or prod will never point to the same location as x or y!

void sp2(double *x, double *y, double *sum, double *prod) {
    double xlocal = *x;
    double ylocal = *y;
    *sum  = xlocal + ylocal;
    *prod = xlocal * ylocal;
}

• How many memory accesses?

sp2:
    movsd  (%rdi), %xmm0
    movsd  (%rsi), %xmm2
    movapd %xmm0, %xmm1
    mulsd  %xmm2, %xmm0
    addsd  %xmm2, %xmm1
    movsd  %xmm1, (%rdx)
    movsd  %xmm0, (%rcx)
    ret

CS 3214 Fall 2009

  20. Inlining • Substitute body of called function into the caller • *before subsequent optimizations are applied* • Current compilers do this aggressively • Almost never a need for doing this manually (e.g., via #define) CS 3214 Fall 2009

  21. Inlining Example

void sp1(double *x, double *y, double *sum, double *prod) {
    *sum  = *x + *y;
    *prod = *x * *y;
}

double outersp1(double *x, double *y) {
    double sum, prod;
    sp1(x, y, &sum, &prod);
    return sum > prod ? sum : prod;
}

outersp1:
    movsd  (%rdi), %xmm1
    movsd  (%rsi), %xmm2
    movapd %xmm1, %xmm0
    mulsd  %xmm2, %xmm1
    addsd  %xmm2, %xmm0
    maxsd  %xmm1, %xmm0
    ret

CS 3214 Fall 2009

  22. Case Study: Vector ADT [figure: vector layout with a length field and a data pointer to elements 0, 1, 2, …, length-1] • Procedures • vec_ptr new_vec(int len) • Create vector of specified length • int get_vec_element(vec_ptr v, int index, int *dest) • Retrieve vector element, store at *dest • Return 0 if out of bounds, 1 if successful • int *get_vec_start(vec_ptr v) • Return pointer to start of vector data • Similar to array implementations in Pascal, ML, Java • E.g., always do bounds checking CS 3214 Fall 2009
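The slides show only the interface, not the implementation. A minimal sketch consistent with the stated interface might look like the following; the struct layout and field names are assumptions, not the course's actual code:

```c
#include <stdlib.h>

/* One plausible implementation of the vector ADT from the slide. */
typedef struct {
    int length;
    int *data;
} vec_rec, *vec_ptr;

/* Create a zero-initialized vector of the specified length. */
vec_ptr new_vec(int len) {
    vec_ptr v = malloc(sizeof *v);
    if (v == NULL)
        return NULL;
    v->length = len;
    v->data = calloc(len, sizeof *v->data);
    return v;
}

/* Bounds-checked access: returns 0 if out of bounds, 1 if successful. */
int get_vec_element(vec_ptr v, int index, int *dest) {
    if (index < 0 || index >= v->length)
        return 0;
    *dest = v->data[index];
    return 1;
}

/* Return a pointer to the start of the vector data. */
int *get_vec_start(vec_ptr v) {
    return v->data;
}
```

Note how every element access pays for a procedure call and a bounds check, which is exactly what the following slides optimize away.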

  23. Optimization Example • Procedure • Compute sum of all elements of vector • Store result at destination location

void combine1(vec_ptr v, int *dest) {
    int i;
    *dest = 0;
    for (i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}

CS 3214 Fall 2009

  24. Time Scales • Absolute Time • Typically use nanoseconds: 10^-9 seconds • Time scale of computer instructions • Clock Cycles • Example: rlogin cluster machines: 2 GHz = 2 × 10^9 cycles per second • Clock period = 0.5 ns • Most modern architectures provide a way to directly read the cycle counter: the “TSC” (time stamp counter) • But: can be tricky because it captures OS interaction as well CS 3214 Fall 2009

  25. Cycles Per Element • Convenient way to express performance of a program that operates on vectors or lists • Length = n ⇒ T = CPE*n + Overhead [plot: cycles vs. number of elements; vsum1 slope = 4.0, vsum2 slope = 3.5] CS 3214 Fall 2009

  26. Optimization Example • Procedure • Compute sum of all elements of integer vector • Store result at destination location • Vector data structure and operations defined via abstract data type • Pentium II/III performance: clock cycles / element • 42.06 (compiled -g), 31.25 (compiled -O2)

void combine1(vec_ptr v, int *dest) {
    int i;
    *dest = 0;
    for (i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}

CS 3214 Fall 2009

  27. Understanding Loop

void combine1_goto(vec_ptr v, int *dest) {
    int i = 0;
    int val;
    *dest = 0;
    if (i >= vec_length(v))
        goto done;
loop:
    get_vec_element(v, i, &val);
    *dest += val;
    i++;
    if (i < vec_length(v))
        goto loop;
done:
    ;
}

• Inefficiency • Procedure vec_length called every iteration • Even though the result is always the same CS 3214 Fall 2009

  28. Move vec_length Call Out of Loop

void combine2(vec_ptr v, int *dest) {
    int i;
    int length = vec_length(v);
    *dest = 0;
    for (i = 0; i < length; i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}

• Optimization • Move call to vec_length out of inner loop • Value does not change from one iteration to the next • Code motion • CPE: 20.66 (compiled -O2) • vec_length requires only constant time, but significant overhead remains CS 3214 Fall 2009

  29. Code Motion Example #2 • Convert string from upper to lower case • Here: asymptotic complexity becomes O(n^2)!

void lower(char *s) {
    int i;
    for (i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

CS 3214 Fall 2009

  30. Lower Case Conversion Performance • Time quadruples when the string length doubles • Quadratic performance CS 3214 Fall 2009

  31. Performance after Code Motion • Time doubles when the string length doubles • Linear performance CS 3214 Fall 2009

  32. Optimization Blocker: Procedure Calls • Why couldn’t the compiler move vec_length or strlen out of the inner loop? • Procedure may have side effects • May alter global state each time it is called • Function may not return the same value for given arguments • May depend on other parts of global state • Procedure lower could interact with strlen • What if the compiler looks at the code, or inlines it? • Even then, the compiler may not be able to prove that the same result is obtained, or the possibility of aliasing may require repeating the operation; and the compiler must preserve any side effects • Interprocedural optimization is expensive, but compilers are continuously getting better at it • For instance, taking into account whether a function reads or writes global memory • Today’s compilers are different from the compilers of 5 years ago and will be different from those 5 years from now CS 3214 Fall 2009

  33. Remove Bounds Checking

void combine3(vec_ptr v, int *dest) {
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    *dest = 0;
    for (i = 0; i < length; i++) {
        *dest += data[i];
    }
}

• Optimization • Avoid a procedure call to retrieve each vector element • Get pointer to start of array before the loop • Within the loop, just do a pointer reference • Not as clean in terms of data abstraction • CPE: 6.00 (compiled -O2) • Procedure calls are expensive! • Bounds checking is expensive CS 3214 Fall 2009

  34. Eliminate Unneeded Memory Refs

void combine4(vec_ptr v, int *dest) {
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int sum = 0;
    for (i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}

• Optimization • Don’t need to store to the destination until the end • Local variable sum held in a register • Avoids 1 memory read and 1 memory write per iteration • CPE: 2.00 (compiled -O2) • Memory references are expensive! CS 3214 Fall 2009

  35. Detecting Unneeded Memory Refs.

Combine3:

.L18:
    movl (%ecx,%edx,4),%eax
    addl %eax,(%edi)
    incl %edx
    cmpl %esi,%edx
    jl .L18

Combine4:

.L24:
    addl (%eax,%edx,4),%ecx
    incl %edx
    cmpl %esi,%edx
    jl .L24

• Performance • Combine3 • 5 instructions in 6 clock cycles • addl must read and write memory • Combine4 • 4 instructions in 2 clock cycles CS 3214 Fall 2009
