Optimization

Understand how modern compilers operate under restrictions, such as limited problem understanding and the need for quick compilation. Explore optimizations for memory aliasing and function calls. Learn about metrics like CPE for evaluating program performance.

tillieb

Presentation Transcript


  1. Optimization

  2. Compilers: Modern compilers operate under several restrictions: 1. They must not alter correct program behavior. 2. They have limited understanding of the problem. 3. They need to complete the compilation task quickly. Because the compiler optimizes only small sections of code at a time, it often cannot prove that a transformation is safe. To ensure that program behavior is not altered, it then simply leaves the code unoptimized. Complications include: 1. Memory aliasing: the compiler does not know whether two pointers refer to the same memory location. 2. Function calls: the compiler cannot determine whether a call has side effects. * Note that these could largely be resolved if the compiler's scope were the entire program and not just small segments of it at a time.

  3. Compilers: Memory Aliasing (case #1: xp and yp point to different variables holding different values):

          #include <stdio.h>

          void twiddle1(int *xp, int *yp) {
              *xp += *yp;               // *xp = 2 + 3
              *xp += *yp;               // *xp = 5 + 3
              printf("%d\n", *xp);      // prints 8
          }

          void twiddle2(int *xp, int *yp) {
              *xp += 2 * (*yp);         // *xp = 2 + (2 * 3)
              printf("%d\n", *xp);      // prints 8
          }

          int main(void) {
              int xp = 2, yp = 3;
              twiddle1(&xp, &yp);       // Note: call one or the other,
              twiddle2(&xp, &yp);       // so that each starts from xp = 2
              return 0;
          }

  4. Compilers: Memory Aliasing (case #2: xp and yp point to different variables holding the same value):

          #include <stdio.h>

          void twiddle1(int *xp, int *yp) {
              *xp += *yp;               // *xp = 2 + 2
              *xp += *yp;               // *xp = 4 + 2
              printf("%d\n", *xp);      // prints 6
          }

          void twiddle2(int *xp, int *yp) {
              *xp += 2 * (*yp);         // *xp = 2 + (2 * 2)
              printf("%d\n", *xp);      // prints 6
          }

          int main(void) {
              int xp = 2, yp = 2;
              twiddle1(&xp, &yp);       // Note: call one or the other,
              twiddle2(&xp, &yp);       // so that each starts from xp = 2
              return 0;
          }

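  5. Compilers: Memory Aliasing (case #3: xp and yp point to the same variable):

          #include <stdio.h>

          void twiddle1(int *xp, int *yp) {
              *xp += *yp;               // *xp = 2 + 2
              *xp += *yp;               // *xp = 4 + 4
              printf("%d\n", *xp);      // prints 8
          }

          void twiddle2(int *xp, int *yp) {
              *xp += 2 * (*yp);         // *xp = 2 + (2 * 2)
              printf("%d\n", *xp);      // prints 6
          }

          int main(void) {
              int xp = 2;
              twiddle1(&xp, &xp);       // Note: call one or the other,
              twiddle2(&xp, &xp);       // so that each starts from xp = 2
              return 0;
          }

      * Note: we now get different results, so the compiler cannot replace twiddle1 with twiddle2 unless it can prove that xp and yp never alias.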
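The three aliasing cases can be checked side by side with a small sketch. The names `twiddle1_value` and `twiddle2_value` are hypothetical: they are the slide functions with the `printf` replaced by a return value so the results can be compared. `twiddle2_restrict` is an added illustration that assumes a C99 compiler; `restrict` is the programmer's promise that the two pointers never alias, which is the information the compiler is missing in the slides.

```c
/* Slide's twiddle1, returning the result instead of printing it. */
int twiddle1_value(int *xp, int *yp) {
    *xp += *yp;
    *xp += *yp;   /* re-reads *yp, which has changed if xp == yp */
    return *xp;
}

/* Slide's twiddle2: reads *yp only once. */
int twiddle2_value(int *xp, int *yp) {
    *xp += 2 * (*yp);
    return *xp;
}

/* Assumption (not in the slides): C99 restrict promises no aliasing.
   Calling this with xp == yp is undefined behavior. */
int twiddle2_restrict(int *restrict xp, const int *restrict yp) {
    *xp += 2 * (*yp);
    return *xp;
}
```

Called on two distinct variables holding 2 and 3, both `twiddle1_value` and `twiddle2_value` return 8. Called with the same variable (holding 2) for both arguments, they return 8 and 6 respectively, which is case #3.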

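  6. Compilers: Function calls:

          int counter = 4;

          int f(int x) {
              return counter--;                     // Here we have a "side effect"
          }

          int function1(int x) {
              return 4 * f(x);                      // returns 16 (counter = 4 on entry)
          }

          int function2(int x) {
              return f(x) + f(x) + f(x) + f(x);     // returns 4 + 3 + 2 + 1 = 10 (counter = 4 on entry)
          }

      Side effects are one reason global variables are considered bad practice: because f modifies counter, the compiler cannot replace the four calls in function2 with one call multiplied by four.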
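A sketch contrasting the slide's function with a side-effect-free one. `f_impure` mirrors the slide's `f`; `f_pure`, `reset_counter`, and the four wrapper names are hypothetical and exist only for this illustration. With the pure version the two forms agree, so the rewrite from four calls to one multiplication would be safe.

```c
static int counter = 4;

/* Mirrors the slide's f: every call changes global state. */
int f_impure(int x) {
    (void)x;              /* x is unused, as in the slide */
    return counter--;
}

/* Hypothetical side-effect-free variant: result depends only on x. */
int f_pure(int x) {
    return x + 4;
}

/* Hypothetical helper so each experiment starts from counter = 4. */
void reset_counter(void) {
    counter = 4;
}

int times4_impure(int x) { return 4 * f_impure(x); }
int sum4_impure(int x) {
    return f_impure(x) + f_impure(x) + f_impure(x) + f_impure(x);
}

int times4_pure(int x) { return 4 * f_pure(x); }
int sum4_pure(int x) {
    return f_pure(x) + f_pure(x) + f_pure(x) + f_pure(x);
}
```

After `reset_counter()`, `times4_impure` gives 16 while `sum4_impure` gives 10, whereas `times4_pure(x)` and `sum4_pure(x)` always match.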

  7. Metrics

  8. Metrics: One (of many) metrics that can be used to determine a program's "performance" is CPE (cycles per element). CPE is determined by: 1. Executing the program with a very small data set and recording the run-time (seconds). 2. Executing the program with a larger data set and recording the run-time (seconds). 3. Repeating for several larger and larger data sets. 4. Multiplying the recorded times by the processor speed (cycles per second) to obtain cycle counts, and plotting cycles against the number of elements. 5. Determining the slope of the resulting line. This slope is (approximates) CPE. * The advantage of the CPE metric is that it is expressed in cycles rather than seconds, so it does not depend on the processor's clock speed.

  9. Metrics: Example (using a 2.4 GHz CPU):

          Array size   Run-time   Cycles          Cycles / element
          10           1.8 sec    4.32 x 10^9     4.32 x 10^8
          100          19 sec     45.6 x 10^9     4.56 x 10^8
          200          38 sec     91.2 x 10^9     4.56 x 10^8
          300          63 sec     151.2 x 10^9    5.04 x 10^8

      We could use linear regression to find a best-fitting line through the (array size, cycles) data, but we can see from the table above that CPE is roughly 4.5 x 10^8 to 5 x 10^8.
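The regression step can be sketched as an ordinary least-squares slope of cycles against element count. The function name `cpe_slope` and its parameters are hypothetical; the only inputs are the measured run times, the data-set sizes, and the clock rate.

```c
#include <stddef.h>

/* Least-squares slope of cycles versus element count.
   seconds[i] is the measured run time for sizes[i] elements and
   clock_hz is the processor clock rate. Returns cycles per element.
   The caller must supply at least two distinct sizes. */
double cpe_slope(const double *sizes, const double *seconds,
                 size_t n, double clock_hz) {
    double mean_x = 0.0, mean_y = 0.0;
    for (size_t i = 0; i < n; i++) {
        mean_x += sizes[i];
        mean_y += seconds[i] * clock_hz;   /* seconds -> cycles */
    }
    mean_x /= (double)n;
    mean_y /= (double)n;

    double sxy = 0.0, sxx = 0.0;
    for (size_t i = 0; i < n; i++) {
        double dx = sizes[i] - mean_x;
        double dy = seconds[i] * clock_hz - mean_y;
        sxy += dx * dy;
        sxx += dx * dx;
    }
    return sxy / sxx;
}
```

Applied to the four measurements in the example (sizes 10, 100, 200, 300; times 1.8, 19, 38, 63 seconds; 2.4 GHz), the slope comes out at about 5.0 x 10^8 cycles per element, in line with the per-row ratios.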

  10. Optimization -01

  11. Loops: Many programs spend much of their execution time in loops. Therefore, it is especially important to be able to write loop code effectively. There are 3 basic practices: 1. Simplifying the loop construct. 2. Removing function calls from within a loop. 3. Removing excessive memory accesses. Note that one of the fundamental theories regarding loops (progress, boundedness and invariance) is that they should not modify themselves while executing. Therefore, the prior 3 practices should not create side effects.

  12. Loops: Given a for loop statement: for (start; stop; increment) { ... } it is always a good idea to remove from the loop construct any function calls or other operations whose results do not change as the loop executes. For example:

  13. Loops: Example #1 (original code):

          int f(int x) { return x; }

          int main(void) {
              int i, j = 0;
              for (i = 0; i < f(20); i++) {   /* f is called on every iteration */
                  j += i;
              }
              return 0;
          }

      Assembly for the loop:

          .L3:
              subl    $12, %esp
              pushl   $20
              call    f
              addl    $16, %esp
              cmpl    %eax, -4(%ebp)
              jl      .L6
              jmp     .L4
          .L6:
              movl    -4(%ebp), %eax
              leal    -8(%ebp), %edx
              addl    %eax, (%edx)
              leal    -4(%ebp), %eax
              incl    (%eax)
              jmp     .L3
          .L4:

  14. Loops: Example #1 (modified code):

          int f(int x) { return x; }

          int main(void) {
              int i, j = 0, s = f(20);        /* f is called once, before the loop */
              for (i = 0; i < s; i++) {
                  j += i;
              }
              return 0;
          }

      Assembly for the loop:

              call    f
              addl    $16, %esp
              movl    %eax, -12(%ebp)
              movl    $0, -4(%ebp)
          .L3:
              movl    -4(%ebp), %eax
              cmpl    -12(%ebp), %eax
              jl      .L6
              jmp     .L4
          .L6:
              movl    -4(%ebp), %eax
              leal    -8(%ebp), %edx
              addl    %eax, (%edx)
              leal    -4(%ebp), %eax
              incl    (%eax)
              jmp     .L3
          .L4:

  15. Loops: Example #2 (original code):

          int main(void) {
              int i, j = 0, s = 20;
              for (i = 0; i < (s + 2); i++) {   /* s + 2 is recomputed on every iteration */
                  j += i;
              }
              return 0;
          }

      Assembly for the loop:

          .L2:
              movl    -12(%ebp), %eax
              addl    $2, %eax
              cmpl    %eax, -4(%ebp)
              jl      .L5
              jmp     .L3
          .L5:
              movl    -4(%ebp), %eax
              leal    -8(%ebp), %edx
              addl    %eax, (%edx)
              leal    -4(%ebp), %eax
              incl    (%eax)
              jmp     .L2
          .L3:

  16. Loops: Example #2 (modified code):

          int main(void) {
              int i, j = 0, s = (20 + 2);       /* bound computed once */
              for (i = 0; i < s; i++) {
                  j += i;
              }
              return 0;
          }

      Assembly for the loop:

              movl    $22, -12(%ebp)
              movl    $0, -4(%ebp)
          .L2:
              movl    -4(%ebp), %eax
              cmpl    -12(%ebp), %eax
              jl      .L5
              jmp     .L3
          .L5:
              movl    -4(%ebp), %eax
              leal    -8(%ebp), %edx
              addl    %eax, (%edx)
              leal    -4(%ebp), %eax
              incl    (%eax)
              jmp     .L2
          .L3:

  17. Loops: Example #3 (original code):

          int sum = 0;

          int main(void) {
              int i;
              for (i = 0; i < 20; i++) {
                  sum += i;                 /* updates the global on every iteration */
              }
              return 0;
          }

      Assembly for the loop:

          .L2:
              cmpl    $19, -4(%ebp)
              jle     .L5
              jmp     .L3
          .L5:
              movl    -4(%ebp), %eax
              addl    %eax, sum
              leal    -4(%ebp), %eax
              incl    (%eax)
              jmp     .L2
          .L3:

  18. Loops: Example #3 (modified code):

          int sum = 0;

          int main(void) {
              int i, temp = 0;
              for (i = 0; i < 20; i++) {
                  temp += i;                /* accumulate in a local variable */
              }
              sum = temp;                   /* write the global once */
              return 0;
          }

      Assembly for the loop:

          .L2:
              cmpl    $19, -4(%ebp)
              jle     .L5
              jmp     .L3
          .L5:
              movl    -4(%ebp), %eax
              leal    -8(%ebp), %edx
              addl    %eax, (%edx)
              leal    -4(%ebp), %eax
              incl    (%eax)
              jmp     .L2
          .L3:
              movl    -8(%ebp), %eax
              movl    %eax, sum
              movl    $0, %eax

      The accumulation now takes 2 instructions (leal and addl on a stack slot) where it took 1 before (addl directly to sum), so the question is "are these two cheaper than the previous one?"

  19. CPU design: The design of the processor has a tremendous impact on the performance that is obtainable: 1. The number and type of execution units determine the parallelism possible (see the next slide). 2. The performance of the functional units may force delays. For example, floating-point operations typically take more cycles than integer operations, so code that combines integer and floating-point operations tends to be limited by the cost of the floating-point operations.

  20. CPU design: [Figure: block diagram of a processor. Instruction control: fetch control, branch prediction, instruction cache, instruction decode, retirement unit, register file. Execution: functional units (integer/branch, integer, floating-point add, floating-point multiply/divide, load, store), operation results, and data cache.]

  21. Optimization -02

  22. Loop unrolling: Many programs spend much of their execution time in loops, yet the assembly code of a loop is loaded with extra code (for managing the loop construct). If we could get rid of this extra code (we can't) or reduce its occurrence in relation to the data processing code (we can), the resulting code would be more efficient (with respect to data processing). The method is called loop unrolling and can be done by hand, or automatically by some compilers. Essentially, the idea is to do as much data processing per loop iteration as possible.

  23. Loop unrolling: Example (original code):

          int main(void) {
              int i, j = 0;
              for (i = 0; i < 100; i++) {
                  j += i;
              }
              return 0;
          }

      Assembly:

          main:
              movl    $0, -8(%ebp)
              movl    $0, -4(%ebp)
          .L2:
              cmpl    $99, -4(%ebp)
              jle     .L5
              jmp     .L3
          .L5:
              movl    -4(%ebp), %eax      # data processing code
              leal    -8(%ebp), %edx      # data processing code
              addl    %eax, (%edx)        # data processing code
              leal    -4(%ebp), %eax
              incl    (%eax)
              jmp     .L2
          .L3:

  24. Loop unrolling: Example (unrolled code):

          int main(void) {
              int i, j = 0;
              for (i = 0; i < 100; i += 2) {
                  j += i;
                  j += (i + 1);
              }
              return 0;
          }

      Assembly:

          main:
              movl    $0, -8(%ebp)
              movl    $0, -4(%ebp)
          .L2:
              cmpl    $99, -4(%ebp)
              jle     .L5
              jmp     .L3
          .L5:
              movl    -4(%ebp), %edx      # data processing code
              leal    -8(%ebp), %eax      # data processing code
              addl    %edx, (%eax)        # data processing code
              movl    -4(%ebp), %eax      # data processing code
              addl    -8(%ebp), %eax      # data processing code
              incl    %eax                # data processing code
              movl    %eax, -8(%ebp)      # data processing code
              leal    -4(%ebp), %eax
              addl    $2, (%eax)
              jmp     .L2
          .L3:
              movl    $0, %eax

  25. Loop unrolling: Caveats: 1. Loop unrolling will make the code larger. 2. Loop unrolling favors larger loops (with small loops the ratio of data processing to loop processing is not as large, hence, not as much gain is realized). 3. Loop unrolling is very architecture dependent. If you only have 1 floating point unit and that is what the code in the loop processes, loop unrolling will not provide much improvement.
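One practical detail when unrolling by hand: the slide example works because the trip count (100) is a multiple of the unroll factor. A minimal sketch for an arbitrary count follows, assuming an array sum as the loop body; the function name `sum_unrolled2` is hypothetical.

```c
#include <stddef.h>

/* Sum of a[0..n-1], unrolled by two. The second loop handles the
   leftover element when n is odd. */
long sum_unrolled2(const int *a, size_t n) {
    long sum = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {   /* two elements per loop test */
        sum += a[i];
        sum += a[i + 1];
    }
    for (; i < n; i++) {          /* cleanup: runs at most once */
        sum += a[i];
    }
    return sum;
}
```

The cleanup loop adds to the code size mentioned in caveat 1, which is part of the trade-off.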

  26. Pointers: In code such as:

          for (j = 0; j < Height; j++) {
              for (i = 0; i < Width; i++) {
                  process array[j][i];
              }
          }

      Since the array is stored as a 1D block of memory, the offset of each array element must be computed as: j * Width + i. If we plan to access the entire array, then this must be calculated (Height * Width) times for a total cost of: (Height * Width) * (1 multiplication + 1 addition). If we assign Height = 480 and Width = 640 and assume a Pentium 3 (taking 4 cycles for the multiplication and 1 for the addition), the data access cost becomes: (Height * Width) * (4 + 1) = 1,536,000 cycles.

  27. Pointers: If we change the code to:

          int *ptr = &array[0][0];
          for (j = 0; j < Height; j++) {
              for (i = 0; i < Width; i++) {
                  process *ptr;
                  ptr++;
              }
          }

      We now have a total cost of: (Height * Width) * (1 increment). If we assign Height = 480 and Width = 640 and assume a Pentium 3, the data access cost becomes (less than): (Height * Width) * (1) = 307,200 cycles, which is one fifth of the indexed cost (80% less).
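The two access patterns and the cost arithmetic can be put together in a short sketch. Everything here is illustrative: `HEIGHT` and `WIDTH` are small stand-ins for 480 and 640, the function names are hypothetical, and `access_cost` only encodes the cycle-count model used in the slides, not a measurement.

```c
enum { HEIGHT = 4, WIDTH = 6 };   /* small stand-ins for 480 x 640 */

/* Index version: computes j * WIDTH + i for every element. */
void fill_indexed(int *base, int value) {
    for (int j = 0; j < HEIGHT; j++) {
        for (int i = 0; i < WIDTH; i++) {
            base[j * WIDTH + i] = value;
        }
    }
}

/* Pointer version: one increment per element. */
void fill_pointer(int *base, int value) {
    int *ptr = base;
    for (int j = 0; j < HEIGHT; j++) {
        for (int i = 0; i < WIDTH; i++) {
            *ptr = value;
            ptr++;
        }
    }
}

/* Cost model from the slides: accesses times cycles per access. */
long access_cost(long height, long width, long cycles_per_access) {
    return height * width * cycles_per_access;
}
```

Both fill functions touch the same elements in the same order. `access_cost(480, 640, 5)` gives 1,536,000 and `access_cost(480, 640, 1)` gives 307,200, a factor of five.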

  28. Pointers: Example (original code):

          int main(void) {
              int j, i;
              int data[100][10];
              for (j = 0; j < 100; j++) {
                  for (i = 0; i < 10; i++) {
                      data[j][i] = 0;
                  }
              }
              return 0;
          }

      Assembly for the loops:

          .L2:
              cmpl    $99, -12(%ebp)
              jle     .L5
              jmp     .L3
          .L5:
              movl    $0, -16(%ebp)               # outer loop body: 1 instruction
          .L6:
              cmpl    $9, -16(%ebp)
              jle     .L9
              jmp     .L4
          .L9:                                    # inner loop body: 10 instructions
              movl    -12(%ebp), %edx
              movl    %edx, %eax
              sall    $2, %eax
              addl    %edx, %eax
              sall    $1, %eax
              addl    -16(%ebp), %eax
              movl    $0, -4024(%ebp,%eax,4)
              leal    -16(%ebp), %eax
              incl    (%eax)
              jmp     .L6
          .L4:
              leal    -12(%ebp), %eax
              incl    (%eax)
              jmp     .L2
          .L3:

  29. Pointers: Example (pointer code #1):

          int main(void) {
              int j, i;
              int *ptr;
              int data[100][10];
              for (j = 0; j < 100; j++) {
                  ptr = data[0];
                  for (i = 0; i < 10; i++) {
                      *ptr = 0;
                      ptr++;
                  }
              }
              return 0;
          }

      Assembly for the loops:

          .L2:
              cmpl    $99, -12(%ebp)
              jle     .L5
              jmp     .L3
          .L5:                                    # outer loop body: 3 instructions
              leal    -4024(%ebp), %eax
              movl    %eax, -20(%ebp)
              movl    $0, -16(%ebp)
          .L6:
              cmpl    $9, -16(%ebp)
              jle     .L9
              jmp     .L4
          .L9:                                    # inner loop body: 7 instructions
              movl    -20(%ebp), %eax
              movl    $0, (%eax)
              leal    -20(%ebp), %eax
              addl    $4, (%eax)
              leal    -16(%ebp), %eax
              incl    (%eax)
              jmp     .L6
          .L4:
              leal    -12(%ebp), %eax
              incl    (%eax)
              jmp     .L2
          .L3:

      Note: as written, ptr is reset to data[0] on every outer iteration, so only the first row is cleared. Clearing every row this way requires ptr = data[j], which adds address arithmetic to the outer loop.

  30. Pointers: Example (pointer code #2):

          int main(void) {
              int j, i;
              int *ptr;
              int data[100][10];
              ptr = &data[0][0];
              for (j = 0; j < 100; j++) {
                  for (i = 0; i < 10; i++) {
                      *ptr = 0;
                      ptr++;
                  }
              }
              return 0;
          }

      Assembly for the loops:

          .L2:
              cmpl    $99, -12(%ebp)
              jle     .L5
              jmp     .L3
          .L5:
              movl    $0, -16(%ebp)               # outer loop body: 1 instruction
          .L6:
              cmpl    $9, -16(%ebp)
              jle     .L9
              jmp     .L4
          .L9:                                    # inner loop body: 7 instructions
              movl    -20(%ebp), %eax
              movl    $0, (%eax)
              leal    -20(%ebp), %eax
              addl    $4, (%eax)
              leal    -16(%ebp), %eax
              incl    (%eax)
              jmp     .L6
          .L4:
              leal    -12(%ebp), %eax
              incl    (%eax)
              jmp     .L2
          .L3:

  31. Pointers: Caveats: 1. Use of pointers makes the code difficult to read. 2. Use of pointers limits the data access method to being sequential.

  32. Parallelism: Even with loop unrolling and pointers, our code will still not take full advantage of the processor's architecture, since the code is inherently serial. To take advantage of the parallelism possible with pipelining we need to further modify the code by splitting any loops up into several independent streams of work (loop splitting; compilers rarely do this).

  33. Parallelism: Example (original code):

          int main(void) {
              int i, j = 0;
              for (i = 0; i < 100; i++) {
                  j += i;
              }
              return 0;
          }

      Assembly for the loop:

          .L2:
              cmpl    $99, -8(%ebp)
              jle     .L5
              jmp     .L3
          .L5:
              movl    -8(%ebp), %eax
              leal    -4(%ebp), %edx
              addl    %eax, (%edx)
              leal    -8(%ebp), %eax
              incl    (%eax)
              jmp     .L2
          .L3:

  34. Parallelism: Example (split code):

          int main(void) {
              int i, j;
              int j0 = 0, j1 = 0;           /* different variables! */
              for (i = 0; i < 100; i += 2) {
                  j0 += i;
                  j1 += (i + 1);
              }
              j = (j0 + j1);
              return 0;
          }

      Assembly for the loop:

          .L2:
              cmpl    $99, -8(%ebp)
              jle     .L5
              jmp     .L3
          .L5:
              movl    -8(%ebp), %edx
              leal    -12(%ebp), %eax
              addl    %edx, (%eax)
              movl    -8(%ebp), %eax
              addl    -16(%ebp), %eax
              incl    %eax
              movl    %eax, -16(%ebp)
              leal    -8(%ebp), %eax
              addl    $2, (%eax)
              jmp     .L2
          .L3:
              movl    -16(%ebp), %eax
              addl    -12(%ebp), %eax

      The instruction count does not appear to provide any improvement, but the two accumulations no longer depend on each other, so we can't forget about the parallelism provided by pipelining.

  35. Parallelism: Caveats: 1. Loop splitting may not improve performance of integer-only code. 2. Loop splitting may introduce round-off or truncation differences, because the operations are performed in a different order (poorly designed code is especially exposed to this). 3. If we push loop splitting too far we will force the CPU to store results (that would normally be stored in registers) on the stack. This severely degrades performance.
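A sketch of loop splitting on floating-point data, where caveat 2 applies. The function names are hypothetical. `sum_split2` keeps two independent running sums, so the additions within one iteration do not wait on each other; it also adds the terms in a different order than `sum_serial`, which is why the two results can differ in the last bits for general data.

```c
#include <stddef.h>

/* Single accumulator: every addition depends on the previous one. */
double sum_serial(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        s += a[i];
    }
    return s;
}

/* Two accumulators: the two additions in an iteration are independent. */
double sum_split2(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n) {
        s0 += a[i];   /* leftover element when n is odd */
    }
    return s0 + s1;
}
```

For small whole-number inputs both functions return exactly the same value; the difference only shows up when intermediate sums need rounding.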

  36. Optimization -03

  37. Review: Basic strategies for performance:
      • 1. High-level design.
      • 2. Basic coding principles:
          • Eliminate excessive function calls.
          • Move operations that do not depend on the loop out of the loop.
          • Consider reducing program modularity to gain efficiency.
          • Eliminate excessive memory references (use local temporary variables).
      • 3. Low-level optimizations:
          • Consider pointer vs array code.
          • Unroll loops.
          • Consider iteration splitting (to make use of pipeline parallelism).
      • Finally, TEST the optimized code, as it is very easy to introduce errors when optimizing code (optimizing reduces code readability).

  38. Optimization -04

  39. Tools: GCC/G++ compiler optimization settings. GCC/G++ profiler (gprof; use it to measure the time spent in each part of the code). Profiling itself does not provide any optimizations, but it does tell you where the program is spending time. This suggests where you should concentrate your optimization efforts. See: http://www.network-theory.co.uk/docs/gccintro/gccintro_49.html

  40. GCC/G++: Optimizations (-O or -O1): turns on the most common optimizations that do not require any speed-space tradeoffs. Specific flags include:
      • -fdefer-pop (see -fno-defer-pop): Lets arguments accumulate on the stack and pops them all at once.
      • -fthread-jumps: Checks whether a jump branches to a location where another comparison subsumed by the first is found. If so, the first branch is redirected to either the destination of the second branch or a point immediately following it.
      • -fdelayed-branch: Attempts to reorder instructions to exploit instruction slots available after delayed branch instructions.
      • -fomit-frame-pointer: Don't keep the frame pointer in a register for functions that don't need one.
      • -fguess-branch-probability (see -fno-guess-branch-probability): Guesses branch probabilities when no profile data is available. The -fno- form disables the guessing. (In a hard real-time system, people don't want different runs of the compiler to produce code that has different behavior.)
      • -fcprop-registers (see -fno-cprop-registers): Performs a copy-propagation pass to try to reduce scheduling dependencies.

  41. GCC/G++: Optimizations (-O2): turns on further optimizations. These additional optimizations include instruction scheduling. Only optimizations that do not require any speed-space tradeoffs are used, so the executable should not increase in size. The compiler will take longer to compile programs and require more memory than with -O1. This option is generally the best choice for deployment of a program, because it provides maximum optimization without increasing the executable size. It is the default optimization level for releases of GNU packages. Specific flags include:
      • -foptimize-sibling-calls
      • -fcse-follow-jumps: Scans through jump instructions when the target of the jump is not reached by any other path.
      • -fcse-skip-blocks: Similar to -fcse-follow-jumps, but follows jumps which conditionally skip over blocks.
      • -fgcse: Performs a global common subexpression elimination pass. This pass also performs global constant and copy propagation.
      • -fexpensive-optimizations
      • -fstrength-reduce: Loop strength reduction and elimination of iteration variables.
      • -frerun-cse-after-loop: Re-runs common subexpression elimination (see -fgcse above) after loop optimizations have been performed.

  42. GCC/G++: Optimizations (-O2, continued):
      • -frerun-loop-opt: Runs the loop optimizer twice.
      • -fcaller-saves: Enables values to be allocated in registers that will be clobbered by function calls, by emitting extra instructions to save and restore the registers around such calls.
      • -fforce-mem
      • -fpeephole2 (see -fno-peephole2): Enables any machine-specific peephole optimizations.
      • -fschedule-insns: Attempts to reorder instructions to eliminate execution stalls due to required data being unavailable.
      • -fregmove: Attempts to reassign register numbers in move instructions and as operands of other simple instructions in order to maximize the amount of register tying.
      • -fstrict-aliasing: Allows the compiler to assume the strictest aliasing rules applicable to the language being compiled. In particular, an object of one type is assumed never to reside at the same address as an object of a different type, unless the types are almost the same.
      • -fdelete-null-pointer-checks: Uses global dataflow analysis to identify and eliminate useless checks for null pointers. The compiler assumes that dereferencing a null pointer would have halted the program. If a pointer is checked after it has already been dereferenced, it cannot be null.
      • -freorder-blocks

  43. GCC/G++: Optimizations (-O3): This option turns on more expensive optimizations, such as function inlining. Specific flags include:
      • -finline-functions: This option needs a large amount of memory, takes more time to compile, and makes the binary bigger. Sometimes it pays off, and sometimes it does not.
      • -frename-registers: Attempts to avoid false dependencies in scheduled code by making use of registers left over after register allocation. This optimization will most benefit processors with lots of registers.
      • Note: A higher -O level does not always mean improved performance. -O3 increases the code size, may introduce cache penalties, and can end up slower than -O2. However, -O2 is almost always faster than -O.

  44. GCC/G++: Optimizations (-funroll-loops): This option turns on loop unrolling, and is independent of the other optimization options. It will increase the size of an executable. Whether or not this option produces a beneficial result has to be examined on a case-by-case basis. Optimizations (-Os): This option selects optimizations which reduce the size of an executable. The aim of this option is to produce the smallest possible executable, for systems constrained by memory or disk space. In some cases a smaller executable will also run faster, due to better cache usage.

  45. GCC/G++: Optimizations (-march and -mcpu): With GCC 3, you can specify the type of processor you're using with -march or -mcpu. Although they seem the same, they're not, since one specifies the architecture and the other the CPU. The available options are:

          i386    pentium        pentium3    k6-3            athlon-xp
          i486    pentium-mmx    pentium4    athlon          athlon-mp
          i586    pentiumpro     k6          athlon-tbird
          i686    pentium2       k6-2        athlon-4

      • -mcpu generates code tuned for the specified CPU, but it does not alter the ABI and the set of available instructions, so you can still run the resulting binary on other CPUs.
      • -march generates code for the specified machine type, and the available instructions will be used (it turns on flags like mmx/3dnow, etc.), which means that you probably cannot run the binary on other machine types.
      • Note: -march implies -mcpu.

  46. GCC/G++: Profiler:

          $ gcc hw4.c -o hw4 -pg      # compile with profile option set (-pg)
          $ ./hw4                     # execute code and generate profile data
          $ gprof ./hw4               # review profile data

            %   cumulative    self               self     total
           time    seconds   seconds    calls   s/call   s/call  name
          96.96       6.47      6.47        2     3.23     3.23  SEARCH3(char*, long)
           1.76       6.59      0.12        1     0.12     0.12  SEARCH1(char*, long)
           1.14       6.66      0.08        1     0.08     0.08  READFILE3(char*)
           0.12       6.67      0.01        2     0.00     0.00  SEARCH2(char*, long)
           0.03       6.67      0.00        1     0.00     0.00  READFILE2(char*)
           0.00       6.67      0.00        3     0.00     0.00  REMOVEFILE(char*)
           0.00       6.67      0.00        3     0.00     0.00  GETFILE(char*)
           0.00       6.67      0.00        1     0.00     0.00  READFILE1(char*)

      Columns: % time is the percentage of the program's total running time used by this function. self seconds is the number of seconds accounted for by this function. calls is the number of times this function was invoked. self s/call is the average time spent in this function per call (in seconds here, as the column header shows).

  47. Optimization -05

  48. WHY: Why spend all of the time optimizing a program? Why not just buy a faster computer?

  49. WHY: Case studies: Computational chemistry & molecular modeling: One of the biggest problems here is figuring out how molecules fit together (only certain elements will stick to others). If we can efficiently determine which molecules will fit with others, we might find a cure for AIDS, cancer and many other diseases. A second problem is finding/identifying proteins. It is very difficult to find/identify disease-causing proteins (Mad Cow) or poisonous proteins (ricin) because we are made up of proteins and they all "look alike." Finally, as you might expect, there are trillions of possible molecular structures that are of interest.

  50. WHY: Case studies: Atmospheric modeling: One of the biggest problems here is the mass of data, and at what level of detail we need to acquire and incorporate that data (chaos). It has been theorized that the combination of millions of single butterflies' wing beats alters the weather. Another problem is the modeling of long-term weather (global warming). Scientists take today's weather patterns and try to extrapolate what will happen if we keep pollution levels at the same, greater, or lesser rates. Computational physics: There are many problems in this field that require massive computing power. Computational astronomy models the "life" of stellar objects. Computational high-energy physics models the theoretical possibility of exotic small particles. Computational fusion research delves into the possibility of fusion (for energy). The virtual telescope collects data from the many telescopes around the world and archives the data in a database for others to search. Terabytes of data are collected daily.
