Explore code optimization techniques to ensure efficient and fast-running programs; learn to overcome optimization blockers and express program performance effectively.
Code Optimization Winter 2013 COMP 2130 Intro Computer Systems Computing Science Thompson Rivers University
Your vision? Seek with all your heart?
Course Objectives
• The better your knowledge of computer systems, the better your programming.
Course Contents
• Introduction to computer systems: B&O 1
• Introduction to C programming: K&R 1 – 4
• Data representations: B&O 2.1 – 2.4
• C: advanced topics: K&R 5.1 – 5.10, 6 – 7
• Introduction to IA32 (Intel Architecture 32): B&O 3.1 – 3.8, 3.13
• Compiling, linking, loading, and executing: B&O 7 (except 7.12)
• Dynamic memory management – Heap: B&O 9.1–2, 9.3–4, 9.9.1–2, 9.9.4–5, 9.11
• Code optimization: B&O 5.1 – 5.6, 5.13
• Memory hierarchy, locality, caching: B&O 5.12, 6.1 – 6.3, 6.4.1 – 6.4.2, 6.5, 6.6.2 – 6.6.3, 6.7
• Virtual memory (if time permits): B&O 9.4 – 9.5
Unit Learning Objectives
• List the two optimization blockers.
• Give examples of the two optimization blockers.
• Use optimization techniques.
Introduction
• The primary objective in writing a program is to make it work correctly under all possible conditions.
• Making a program run fast is also an important consideration.
• [Q] How do you write an efficient program?
  • Choose appropriate algorithms and data structures.
  • Write source code that the compiler can effectively optimize into efficient executable code.
• For the second part, it is important to understand the capabilities and limitations of optimizing compilers.
• However, programmers must make a trade-off between how easy a program is to implement and maintain, and how fast it runs.
• Modern compilers employ sophisticated forms of analysis and optimization.
• Even the best compilers, however, can be thwarted by optimization blockers: aspects of the program's behavior that depend strongly on the execution environment.
• Optimization blockers can confuse programmers as well and lead to logic errors.
• Programmers must assist the compiler by writing code that can be optimized readily.
5.1 Limitations of Optimizing Compilers
• Higher optimization levels of gcc can improve program performance.
• But they may increase program size, and they make the program more difficult to debug with standard debugging tools.
• Compilers must be careful to apply only safe optimizations to a program.
• Example: Memory Aliasing

    void twiddle1(int *xp, int *yp) {
        *xp += *yp;
        *xp += *yp;
    }

  [Q] Can twiddle1 be replaced by twiddle2?

    void twiddle2(int *xp, int *yp) {
        *xp += 2 * *yp;
    }

  [Q] What if xp == yp, i.e., both pointers refer to the same location?
  • In twiddle1, *xp is quadrupled (it is doubled twice), but
  • In twiddle2, *xp is only tripled.
  [Q] Is it a good programming style to pass pointers and manipulate them? How to improve twiddle1()?
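The aliasing case can be tried directly. The sketch below repeats twiddle1 and twiddle2 from the slide; run_aliased and run_distinct are hypothetical helpers added only for illustration.

```c
/* twiddle1 and twiddle2 as given on the slide */
void twiddle1(int *xp, int *yp) { *xp += *yp; *xp += *yp; }
void twiddle2(int *xp, int *yp) { *xp += 2 * *yp; }

/* Hypothetical helper: call fn with both arguments pointing at the same int */
int run_aliased(void (*fn)(int *, int *), int start) {
    int x = start;
    fn(&x, &x);
    return x;
}

/* Hypothetical helper: call fn with two distinct variables */
int run_distinct(void (*fn)(int *, int *), int x, int y) {
    fn(&x, &y);
    return x;
}
```

With distinct variables both versions agree, e.g. run_distinct(twiddle1, 5, 7) and run_distinct(twiddle2, 5, 7) both give 19. With aliased arguments they differ: run_aliased(twiddle1, 5) gives 20 while run_aliased(twiddle2, 5) gives 15, which is why the compiler may not substitute one for the other.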
• Example

    x = 1000; y = 3000;
    *q = y;
    *p = x;
    t1 = *q;

• [Q] What value will t1 have?
  • 1000 if p and q point to the same location, 3000 otherwise.
  -> It is not easy even for us to understand the above code.
  -> Definitely not a good programming style.
• Compilers cannot replace the code with t1 = y;.
• Optimization blockers
  • Memory aliasing around pointers
  • …
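As a small sketch, the fragment above can be wrapped in a function so that both outcomes are observable. The name read_after_writes is hypothetical; the body is the slide's code with t1 returned.

```c
/* Hypothetical wrapper around the slide's fragment; returns t1 */
int read_after_writes(int *p, int *q) {
    int x = 1000, y = 3000;
    *q = y;
    *p = x;      /* overwrites *q only when p and q alias */
    return *q;   /* t1 */
}
```

Called with two different variables it returns 3000; called with the same address for p and q it returns 1000.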
• Example: Side Effect

    int f();
    int func1() { return f() + f() + f() + f(); }
    int func2() { return 4 * f(); }

• [Q] Can you see any problem?
• [Q] What if f() is defined as follows?

    int counter = 0;
    int f() { return counter++; }

• [Q] What will func1() and func2() return?
• [Q] Good programming style? How to improve?
• Optimization blockers
  • Memory aliasing around pointers
  • Functions with a side effect
  • …
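To see the side effect at work, the sketch below uses the counter-based f() from the slide (with the variable named counter throughout). call_fresh is a hypothetical helper that resets the counter before each call so the two results are comparable.

```c
/* f() with a side effect: returns the counter, then increments it */
int counter = 0;
int f(void) { return counter++; }

int func1(void) { return f() + f() + f() + f(); }
int func2(void) { return 4 * f(); }

/* Hypothetical helper: start from counter == 0, then call fn once */
int call_fresh(int (*fn)(void)) {
    counter = 0;
    return fn();
}
```

Starting from counter == 0, func1() returns 0 + 1 + 2 + 3 = 6 and leaves counter at 4, whereas func2() returns 4 * 0 = 0 and leaves counter at 1. The two functions are therefore not interchangeable.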
5.2 Expressing Program Performance
• Cycles Per Element (CPE)
  • Measures how many clock cycles (not how many lines of C) are needed per element processed, rather than how fast the clock runs.
• Example: loop unrolling

    void psum1(float a[], float p[], long int n) {
        long int i;
        p[0] = a[0];
        for (i = 1; i < n; i++)
            p[i] = p[i-1] + a[i];
    }

    void psum2(float a[], float p[], long int n) {
        long int i;
        float mid_val;
        p[0] = a[0];
        for (i = 1; i < n-1; i += 2) {
            mid_val = p[i-1] + a[i];
            p[i] = mid_val;
            p[i+1] = mid_val + a[i+1];
        }
        if (i < n)
            p[i] = p[i-1] + a[i];
    }

• [Q] Which one do you think runs faster?
• [Q] Can you simply count the number of operations that access main memory?
  • psum1: 3(n-1)
  • psum2: 5(n-1)/2
• Loop unrolling
  • Possibly reduces the number of memory accesses.
  • Possibly lets the processor execute several independent statements in parallel (instruction-level parallelism).
  • In the previous example ???
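Before comparing speed, it is worth checking that the unrolled psum2 still computes the same prefix sums as psum1, including the clean-up step for both even and odd n. The sketch below repeats the two functions from the example; psum_agree and its 16-element bound are assumptions made for illustration.

```c
void psum1(float a[], float p[], long int n) {
    long int i;
    p[0] = a[0];
    for (i = 1; i < n; i++)
        p[i] = p[i-1] + a[i];
}

void psum2(float a[], float p[], long int n) {
    long int i;
    float mid_val;
    p[0] = a[0];
    for (i = 1; i < n-1; i += 2) {      /* two elements per iteration */
        mid_val = p[i-1] + a[i];
        p[i] = mid_val;
        p[i+1] = mid_val + a[i+1];
    }
    if (i < n)                           /* leftover element, if any */
        p[i] = p[i-1] + a[i];
}

/* Hypothetical check: 1 if both versions agree on a = 1, 2, ..., n (1 <= n <= 16) */
int psum_agree(long int n) {
    float a[16], p1[16], p2[16];
    long int i;
    if (n < 1 || n > 16)
        return 0;
    for (i = 0; i < n; i++)
        a[i] = (float)(i + 1);           /* small integers, exactly representable */
    psum1(a, p1, n);
    psum2(a, p2, n);
    for (i = 0; i < n; i++)
        if (p1[i] != p2[i])
            return 0;
    return 1;
}
```

Calling psum_agree with both an even and an odd n exercises the main loop and the final if statement of psum2.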
5.3 Program Example

    typedef struct {            // vector abstract data type
        long int len;
        data_t *data;           // vector values
    } vec_rec, *vec_ptr;

    #define IDENT 0
    #define OP +

    void combine1(vec_ptr v, data_t *dest) {
        long int i;
        *dest = IDENT;
        for (i = 0; i < vec_length(v); i++) {   // it is good to hide len.
            data_t val;
            get_vec_element(v, i, &val);        // it changes val.
            *dest = *dest OP val;               // it makes sum.
        }
    }

• [Q] Can you write vec_length() and get_vec_element()?
• [Q] Compilers can optimize the above code well. Can you optimize?
5.4 Eliminating Loop Inefficiencies
• Code motion
  • Identify a computation that is performed multiple times (e.g., within a loop) but whose result does not change, and move it so that it is computed only once.
• Example:

    void combine1(vec_ptr v, data_t *dest) {
        long int i;
        *dest = IDENT;
        for (i = 0; i < vec_length(v); i++) {
            data_t val;
            get_vec_element(v, i, &val);    // it changes val.
            *dest = *dest OP val;           // it makes sum.
        }
    }

• Does vec_length() have a side effect? Or is the length of the vector changed in the loop?
  • No.
  • Then?
• From the previous example:

    void combine2(vec_ptr v, data_t *dest) {
        long int i;
        long int length = vec_length(v);
        *dest = IDENT;
        for (i = 0; i < length; i++) {
            data_t val;
            get_vec_element(v, i, &val);    // it changes val.
            *dest = *dest OP val;           // it makes sum.
        }
    }

  Can we remove &val?
• Example: Any problem? How can you improve?

    void lower1(char *s) {
        int i;
        for (i = 0; i < strlen(s); i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }
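One possible improvement to lower1, sketched below, applies the same code motion as combine2: strlen() walks the whole string on every call, and the loop never changes the string's length, so the call can be made once before the loop. The name lower2 is an assumption for illustration.

```c
#include <string.h>

/* Hypothetical lower2: same result as lower1, but strlen() is called only once */
void lower2(char *s) {
    size_t i;
    size_t len = strlen(s);     /* code motion: the length is loop-invariant */
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
```

For a string of length n, lower1 performs on the order of n calls to strlen(), each scanning up to n characters, while lower2 scans the string once for its length and once for the conversion.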
• Example:

    void set_row(double *a, double *b, long i, long n) {
        long j;
        for (j = 0; j < n; j++)
            a[n*i+j] = b[j];
    }

• How to improve?
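A possible answer, as a sketch: the product n*i does not change inside the loop, so it can be computed once. The name set_row2 and the local pointer row are assumptions for illustration.

```c
/* Hypothetical set_row2: hoists the loop-invariant n*i out of the loop */
void set_row2(double *a, double *b, long i, long n) {
    long j;
    double *row = a + n * i;    /* start of row i, computed once */
    for (j = 0; j < n; j++)
        row[j] = b[j];
}
```

An optimizing compiler will often perform this transformation itself, since no function call is involved; writing it explicitly mainly makes the intent visible.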
5.5 Reducing Procedure Calls
• From the previous example: Any problem?

    typedef struct {            // vector abstract data type
        long int len;
        data_t *data;           // vector values
    } vec_rec, *vec_ptr;

    int get_vec_element(vec_ptr v, long int index, data_t *dest) {
        if (index < 0 || index >= v->len)
            return 0;
        *dest = v->data[index];
        return 1;
    }

    void combine2(vec_ptr v, data_t *dest) {
        long int i;
        long int length = vec_length(v);
        *dest = IDENT;
        for (i = 0; i < length; i++) {
            data_t val;
            get_vec_element(v, i, &val);    // it changes val.
            *dest = *dest OP val;           // it makes sum.
        }
    }
• From the previous example:

    typedef struct {            // vector abstract data type
        long int len;
        data_t *data;           // vector values
    } vec_rec, *vec_ptr;

    int get_vec_element(vec_ptr v, long int index, data_t *dest) {
        if (index < 0 || index >= v->len)
            return 0;
        *dest = v->data[index];
        return 1;
    }

    void combine3(vec_ptr v, data_t *dest) {
        long int i;
        long int length = vec_length(v);
        *dest = IDENT;
        data_t *data = get_vec_start(v);    // v->data
        for (i = 0; i < length; i++)
            *dest = *dest OP data[i];       // it makes sum.
    }

• Can you write get_vec_start()?
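One possible way to write the two accessors that combine3 relies on is sketched below. The slides leave data_t open, so data_t = int is an assumption here, as are the accessor bodies; combine3 itself is repeated from the slide so the fragment compiles on its own.

```c
typedef int data_t;             /* assumption: the slides do not fix data_t */

typedef struct {                /* vector abstract data type */
    long int len;
    data_t *data;               /* vector values */
} vec_rec, *vec_ptr;

#define IDENT 0
#define OP +

/* Sketch: return the number of elements */
long int vec_length(vec_ptr v) { return v->len; }

/* Sketch: return a pointer to the first element, bypassing bounds checks */
data_t *get_vec_start(vec_ptr v) { return v->data; }

void combine3(vec_ptr v, data_t *dest) {
    long int i;
    long int length = vec_length(v);
    data_t *data = get_vec_start(v);
    *dest = IDENT;
    for (i = 0; i < length; i++)
        *dest = *dest OP data[i];
}
```

The trade-off is that get_vec_start() exposes the representation of the vector: the loop no longer pays for a call and a bounds check per element, but the abstraction is weaker.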
5.6 Eliminating Unneeded Memory References
• From the previous example: Any problem?

    void combine3(vec_ptr v, data_t *dest) {
        long int i;
        long int length = vec_length(v);
        data_t *data = get_vec_start(v);    // v->data
        *dest = IDENT;
        for (i = 0; i < length; i++)
            *dest = *dest OP data[i];       // it makes sum.
    }

    // the statement in for loop
    // data_t = int; OP = *; i in %edx, data in %ecx, dest in %ebx
    movl  (%ebx), %eax
    imull (%ecx, %edx, 4), %eax
    movl  %eax, (%ebx)
• From the previous example:

    void combine4(vec_ptr v, data_t *dest) {
        long int i;
        long int length = vec_length(v);
        data_t *data = get_vec_start(v);    // v->data
        data_t acc = IDENT;                 // can be implemented in a register
        for (i = 0; i < length; i++)
            acc = acc OP data[i];           // it makes sum.
        *dest = acc;
    }

    // the statement in for loop
    // data_t = int; OP = *; i in %edx, data in %ecx, acc in %eax
    imull (%ecx, %edx, 4), %eax
5.13 Performance Improvement Techniques
• High-level design
  • Appropriate algorithms and data structures
• Basic coding principles
  • Elimination of loop inefficiency
  • Elimination of excessive function calls
  • Elimination of unnecessary memory references
    • Introduce temporary variables to hold intermediate results.
  • Elimination of pointers if possible
  • …
• Low-level optimizations
  • Unroll loops to reduce overhead and to enable further optimizations.
  • Find ways to increase instruction-level parallelism.
• Unroll loops to reduce overhead and to enable further optimizations.
• Find ways to increase instruction-level parallelism.

    for (i = 0; i < length; i++)
        acc = acc OP data[i];               // it makes sum.
    *dest = acc;

    //-------------------------

    limit = length - 1;
    for (i = 0; i < limit; i += 2) {        // combine two elements
        acc0 = acc0 OP data[i];             // two statements at a time
        acc1 = acc1 OP data[i+1];
    }
    for (; i < length; i++)                 // finish any remaining elements
        acc1 = acc1 OP data[i];
    *dest = acc0 OP acc1;
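The fragment above omits declarations and the initial values of the two accumulators. A self-contained sketch might look as follows; the function name combine_2x2, the choice data_t = long, and the choice OP = * with IDENT = 1 are all assumptions made for illustration.

```c
typedef long data_t;    /* assumption: integer data, so regrouping is exact */
#define IDENT 1         /* identity element for multiplication */
#define OP *

/* Hypothetical combine_2x2: unroll by two, with two independent accumulators */
data_t combine_2x2(const data_t *data, long length) {
    long i;
    long limit = length - 1;
    data_t acc0 = IDENT;            /* collects elements at even indices */
    data_t acc1 = IDENT;            /* collects elements at odd indices */
    for (i = 0; i < limit; i += 2) {
        acc0 = acc0 OP data[i];
        acc1 = acc1 OP data[i+1];
    }
    for (; i < length; i++)         /* finish any remaining element */
        acc1 = acc1 OP data[i];
    return acc0 OP acc1;
}
```

Both accumulators must start at IDENT, otherwise the final acc0 OP acc1 is wrong. Because the two chains do not depend on each other, the processor can overlap them. For integer data the regrouping does not change the result (barring overflow); for floating-point data it can change rounding, which is one reason compilers do not apply this transformation unasked.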
• Example: Convert the following code to use 4-way loop unrolling:

    for (i = 0; i < length; i++)
        sum = sum + udata[i] * vdata[i];
    *dest = sum;
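One possible answer to the exercise is sketched below. The slide gives only the loop, so the wrapper function inner4, its parameter list, and the element type double are assumptions.

```c
/* Hypothetical inner4: inner product with 4-way loop unrolling, one accumulator */
void inner4(const double *udata, const double *vdata, long length, double *dest) {
    long i;
    long limit = length - 3;        /* last index at which four elements remain */
    double sum = 0.0;
    for (i = 0; i < limit; i += 4) {
        sum = sum + udata[i]   * vdata[i];
        sum = sum + udata[i+1] * vdata[i+1];
        sum = sum + udata[i+2] * vdata[i+2];
        sum = sum + udata[i+3] * vdata[i+3];
    }
    for (; i < length; i++)         /* finish the remaining 0 to 3 elements */
        sum = sum + udata[i] * vdata[i];
    *dest = sum;
}
```

This version keeps a single accumulator and adds the products in the original order, so it reduces loop overhead without changing the result. Splitting sum into several accumulators would expose more instruction-level parallelism at the cost of a different floating-point summation order.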
• Example: Improve the following code by using a word of data type unsigned long to pack four copies of c:

    void *basic_memset(void *s, int c, int n) {
        int cnt = 0;
        unsigned char *schar = s;
        while (cnt < n) {
            *schar = (unsigned char) c;
            schar++;
            cnt++;
        }
        return s;       // the function is declared to return void *
    }
    void *memset(void *s, int c, int n) {
        int cnt = 0;
        int length = n / 4;                 // number of whole 4-byte words
        unsigned char *schar = s;
        unsigned int *si = s;               // assumes s is word-aligned and unsigned int is 4 bytes
        unsigned int uc = c & 0xff;
        // parentheses are required: + binds more tightly than <<
        unsigned int ic = (uc << 24) | (uc << 16) | (uc << 8) | uc;
        while (cnt < length) {              // fill four bytes at a time
            *si = ic;
            si++;
            cnt++;
        }
        cnt = length * 4;
        schar += length * 4;
        while (cnt < n) {                   // fill any remaining bytes
            *schar = (unsigned char) c;
            schar++;
            cnt++;
        }
        return s;
    }
Carnegie Mellon
Reduction in Strength
• Replace a costly operation with a simpler one
• Shift, add instead of multiply or divide

    16*x  -->  x << 4

  • Utility is machine dependent
  • Depends on cost of multiply or divide instruction
  • On Intel Nehalem, integer multiply requires 3 CPU cycles
• Recognize sequence of products

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];

  becomes

    int ni = 0;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
        ni += n;
    }
Share Common Subexpressions
• Reuse portions of expressions
• Compilers often not very sophisticated in exploiting arithmetic properties

Before: 3 multiplications: i*n, (i-1)*n, (i+1)*n

    /* Sum neighbors of i,j */
    up    = val[(i-1)*n + j  ];
    down  = val[(i+1)*n + j  ];
    left  = val[i*n     + j-1];
    right = val[i*n     + j+1];
    sum = up + down + left + right;

    leaq   1(%rsi), %rax    # i+1
    leaq   -1(%rsi), %r8    # i-1
    imulq  %rcx, %rsi       # i*n
    imulq  %rcx, %rax       # (i+1)*n
    imulq  %rcx, %r8        # (i-1)*n
    addq   %rdx, %rsi       # i*n+j
    addq   %rdx, %rax       # (i+1)*n+j
    addq   %rdx, %r8        # (i-1)*n+j

After: 1 multiplication: i*n

    long inj = i*n + j;
    up    = val[inj - n];
    down  = val[inj + n];
    left  = val[inj - 1];
    right = val[inj + 1];
    sum = up + down + left + right;

    imulq  %rcx, %rsi           # i*n
    addq   %rdx, %rsi           # i*n+j
    movq   %rsi, %rax           # i*n+j
    subq   %rcx, %rax           # i*n+j-n
    leaq   (%rsi,%rcx), %rcx    # i*n+j+n