Learn how to optimize program performance by using pointers instead of array indices, doubles instead of floats, optimizing inner loops, and more.
Computer Systems Optimizing program performance Computer Systems – optimizing program performance
Performance can make the difference
• Use pointers instead of array indices
• Use doubles instead of floats
• Optimize inner loops
(Recommendations by Patrick van der Smagt, 1991, for neural net implementations)
Performance gain
• A factor of 10 can easily be gained
• We now know how programs are executed:
  • Load/use hazards (20% of load instructions → 1 bubble)
  • Mispredicted branches (40% of jump instructions → 2 bubbles)
  • Returns from procedure calls (100% of ret instructions → 3 bubbles)
• This gives directions for optimizing procedures and loops
• The gain has to be measured
Amdahl's Law
When we speed up one part of a program, the effect on overall performance is limited by how significant that part is.
• If a part of the system initially consumed a fraction α of the execution time, and that part is sped up by a factor k, the overall speedup is S = 1 / ((1 − α) + α/k), which is much less than k unless α is close to 1
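A minimal sketch of Amdahl's Law as a calculation, assuming the usual form S = 1 / ((1 − α) + α/k). The function names `amdahl_speedup` and `amdahl_limit` are hypothetical and do not come from the slides.

```c
#include <assert.h>

/* Overall speedup when a fraction alpha of the run time
 * is accelerated by a factor k (hypothetical helper). */
double amdahl_speedup(double alpha, double k)
{
    assert(alpha >= 0.0 && alpha <= 1.0 && k > 0.0);
    return 1.0 / ((1.0 - alpha) + alpha / k);
}

/* Upper bound on the speedup as k grows without limit:
 * only the untouched fraction (1 - alpha) remains. */
double amdahl_limit(double alpha)
{
    assert(alpha >= 0.0 && alpha < 1.0);
    return 1.0 / (1.0 - alpha);
}
```

With illustrative numbers (not taken from the slides), speeding up 60% of the run time by a factor of 3 gives 1 / (0.4 + 0.2), roughly 1.67 overall, and no value of k can push it beyond 2.5.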
Recipe for optimizing
• Use a profiler to find the most-used procedure
• Optimize the inner loop of that procedure

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];
Optimizing Compilers
• Provide efficient mapping to machine
  • register allocation
  • code selection and ordering
  • eliminating minor inefficiencies
• Have difficulty with “optimization blockers”
  • potential memory aliasing
  • potential procedure side-effects
Manual solution
• Code movement

Original:

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];

With n*i moved out of the inner loop:

    for (i = 0; i < n; i++) {
        int ni = n*i;
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
    }

• Most compilers do a good job with array code and simple loop structures
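Since any gain has to be measured, the two loop versions can be timed side by side. The following is a rough sketch only: the fixed dimension `N`, the static arrays, and the helper `cpu_seconds` are assumptions made for illustration, and `clock()` is coarse, so the work usually has to be repeated many times to get a meaningful reading.

```c
#include <time.h>

#define N 256   /* hypothetical dimension, chosen only for illustration */

static int a[N * N];
static int b[N];

/* Original version: N*i is written inside the inner loop. */
void copy_rows_indexed(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[N * i + j] = b[j];
}

/* Code motion by hand: N*i is computed once per outer iteration. */
void copy_rows_hoisted(void)
{
    for (int i = 0; i < N; i++) {
        int ni = N * i;
        for (int j = 0; j < N; j++)
            a[ni + j] = b[j];
    }
}

/* CPU time in seconds spent running fn() 'repeats' times. */
double cpu_seconds(void (*fn)(void), int repeats)
{
    clock_t start = clock();
    for (int r = 0; r < repeats; r++)
        fn();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Comparing `cpu_seconds(copy_rows_indexed, 1000)` with `cpu_seconds(copy_rows_hoisted, 1000)` at different optimization levels shows whether the manual change matters; with optimization enabled the compiler may already perform the same code motion, or even remove repeated work, so the numbers should be read with care.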
Compiler’s solution
• As long as no optimization blockers are present, compilers can’t be beaten

Source code:

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];

What the compiler effectively generates:

    for (i = 0; i < n; i++) {
        int ni = n*i;
        int *p = a+ni;
        for (j = 0; j < n; j++)
            *p++ = b[j];
    }

Generated assembly:

        imull %ebx,%eax            # i*n
        movl 8(%ebp),%edi          # a
        leal (%edi,%eax,4),%edx    # p = a+i*n (scaled by 4)
    # Inner Loop
    .L40:
        movl 12(%ebp),%edi         # b
        movl (%edi,%ecx,4),%eax    # b+j (scaled by 4)
        movl %eax,(%edx)           # *p = b[j]
        addl $4,%edx               # p++ (scaled by 4)
        incl %ecx                  # j++
        jl .L40                    # loop if j<n
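Memory Aliasing

    void twiddle1(int *xp, int *yp)
    {
        *xp += *yp;
        *xp += *yp;
    }

    void twiddle2(int *xp, int *yp)
    {
        *xp += 2 * *yp;
    }

• Consider a call in which both arguments point to the same location, e.g. twiddle1(&x, &x)
• twiddle1: the value ends up 4 times its original value
• twiddle2: the value ends up 3 times its original value
• The two functions are therefore not equivalent, and the compiler may not replace one with the other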
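The aliasing effect can be checked directly. This is a small sketch; the wrapper functions `twiddle1_aliased`, `twiddle2_aliased`, `twiddle1_distinct`, and `twiddle2_distinct` are hypothetical helpers added only to make the comparison easy to call.

```c
/* Adds *yp to *xp twice, re-reading *yp each time. */
void twiddle1(int *xp, int *yp)
{
    *xp += *yp;
    *xp += *yp;
}

/* Adds twice *yp to *xp in a single step. */
void twiddle2(int *xp, int *yp)
{
    *xp += 2 * *yp;
}

/* Both pointers refer to the same variable (aliasing). */
int twiddle1_aliased(int x) { twiddle1(&x, &x); return x; }
int twiddle2_aliased(int x) { twiddle2(&x, &x); return x; }

/* The pointers refer to different variables (no aliasing). */
int twiddle1_distinct(int x, int y) { twiddle1(&x, &y); return x; }
int twiddle2_distinct(int x, int y) { twiddle2(&x, &y); return x; }
```

Starting from 5, the aliased calls yield 20 and 15 respectively, while with distinct arguments 5 and 3 both functions yield 11. The difference appears only when the arguments alias, which is exactly the case the compiler cannot rule out.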
Side effects

    int func1(int x)
    {
        return f(x) + f(x) + f(x) + f(x);
    }

    int func2(int x)
    {
        return 4 * f(x);
    }

• Suppose f(x) { return counter++; } with counter initially 0, and call func(0)
• func1 = 0 + 1 + 2 + 3 = 6
• func2 = 4 * 0 = 0
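A compilable version of this example, as a sketch: the global `counter` and the helper `reset_counter` are assumptions needed to make the slide's fragment self-contained.

```c
static int counter = 0;   /* global state modified by f (assumed to start at 0) */

/* Returns the current counter value and increments it: a side effect. */
int f(int x)
{
    (void)x;              /* the argument is deliberately unused */
    return counter++;
}

/* Calls f four times, so the counter advances four times. */
int func1(int x)
{
    return f(x) + f(x) + f(x) + f(x);
}

/* Calls f once, so the counter advances only once. */
int func2(int x)
{
    return 4 * f(x);
}

/* Hypothetical helper to restart the experiment. */
void reset_counter(void)
{
    counter = 0;
}
```

After `reset_counter()`, `func1(0)` returns 6 and `func2(0)` returns 0. The order in which the four calls in `func1` are evaluated is unspecified in C, but their sum is 6 either way. Because the results differ, a compiler that cannot see into `f` must not rewrite `func1` as `func2`.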
Limitations for Compilers
• Operate under a fundamental constraint
  • Must not cause any change in program behavior under any possible condition
  • Often prevents optimizations that would only affect behavior under pathological conditions
• Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
  • e.g., data ranges may be more limited than variable types suggest
• Most analysis is performed only within procedures
  • whole-program analysis is too expensive in most cases
• Most analysis is based only on static information
  • compiler has difficulty anticipating run-time inputs
• When in doubt, the compiler must be conservative
Machine-independent versus Machine-dependent optimizations
• Optimizations you should do regardless of processor / compiler
  • Code motion (out of the loop)
  • Reducing procedure calls
  • Eliminating unneeded memory references
  • Sharing common sub-expressions
• Machine-dependent optimizations
  • Pointer code
  • Unrolling
  • Enabling instruction-level parallelism
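As a rough illustration of the machine-dependent items, the sketch below unrolls a summation loop by a factor of two and uses two independent accumulators so that the additions can overlap. The function names and the unroll factor are assumptions chosen for illustration; whether this helps depends on the processor and compiler, so it has to be measured.

```c
#include <stddef.h>

/* Straightforward loop: one addition per iteration. */
long sum_simple(const int *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

/* Unrolled by 2 with two accumulators: fewer loop-overhead
 * instructions per element and two independent addition chains. */
long sum_unrolled2(const int *data, size_t n)
{
    long acc0 = 0, acc1 = 0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        acc0 += data[i];
        acc1 += data[i + 1];
    }
    for (; i < n; i++)      /* leftover element when n is odd */
        acc0 += data[i];

    return acc0 + acc1;
}
```

For integer addition both functions return the same result. For floating-point data the reordering could change rounding, which is one reason a compiler does not always apply this transformation on its own.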
Optimization Example

    void combine1(vec_ptr v, data_t *dest)
    {
        int i;
        *dest = IDENT;
        for (i = 0; i < vec_length(v); i++) {
            int val;
            get_vec_element(v, i, &val);
            *dest = *dest OPER val;
        }
    }

• Procedure
  • Compute aggregate OPER of all elements of vector
  • Store result at destination location
• Integer addition, clock cycles per element (CPE):
  • 42.06 (compiled -g)
  • 31.25 (compiled -O2)
Move Call Out of Loop

    void combine2(vec_ptr v, data_t *dest)
    {
        int i;
        int length = vec_length(v);
        *dest = IDENT;
        for (i = 0; i < length; i++) {
            int val;
            get_vec_element(v, i, &val);
            *dest = *dest OPER val;
        }
    }

    int vec_length(vec_ptr v)
    {
        return v->len;
    }

• Optimization
  • Move call to vec_length out of inner loop
  • Value does not change from one iteration to next
  • Function calls are expensive
• CPE: 20.66 (compiled -O2)
  • vec_length() requires 10 clock cycles
Bypass data-abstraction

    void combine3(vec_ptr v, data_t *dest)
    {
        int i;
        int length = vec_length(v);
        int *data = get_vec_start(v);
        *dest = IDENT;
        for (i = 0; i < length; i++)
            *dest = *dest OPER data[i];
    }

    int get_vec_element(vec_ptr v, int index, int *dest)
    {
        if (index < 0 || index >= v->len)
            return 0;
        *dest = v->data[index];
        return 1;
    }

• Optimization
  • Avoid procedure call to retrieve each vector element
  • Get pointer to start of array before loop
  • Within loop just do pointer reference
  • Not as clean in terms of data abstraction
• CPE: 6.00 (compiled -O2)
  • get_vec_element() requires 14 clock cycles
  • Bounds checking is expensive
Eliminate Unneeded Memory Refs

    void combine4(vec_ptr v, int *dest)
    {
        int i;
        int length = vec_length(v);
        int *data = get_vec_start(v);
        int sum = IDENT;
        for (i = 0; i < length; i++)
            sum = sum OPER data[i];
        *dest = sum;
    }

• Optimization
  • Don’t need to store in destination until end
  • Local variable sum held in register
  • Avoids 1 memory read and 1 memory write per iteration
• CPE: 2.00 (compiled -O2)
  • Memory references are expensive!
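The combine functions rely on a vector type that the slides do not show. The following is a minimal sketch of one possible definition, so that the first and last versions can be compiled and compared. The struct layout, `new_vec`, `free_vec`, and the choices `data_t = int`, `IDENT = 0`, `OPER = +` are assumptions for illustration, not the original course code.

```c
#include <stdlib.h>

typedef int data_t;      /* assumed element type */
#define IDENT 0          /* identity element for + */
#define OPER  +

typedef struct {
    int len;
    data_t *data;
} vec_rec, *vec_ptr;

/* Allocate a zero-filled vector of the given length (hypothetical helper). */
vec_ptr new_vec(int len)
{
    vec_ptr v = malloc(sizeof(vec_rec));
    if (!v)
        return NULL;
    v->len = len;
    v->data = len > 0 ? calloc((size_t)len, sizeof(data_t)) : NULL;
    if (len > 0 && !v->data) {
        free(v);
        return NULL;
    }
    return v;
}

void free_vec(vec_ptr v)
{
    if (v) {
        free(v->data);
        free(v);
    }
}

int vec_length(vec_ptr v)        { return v->len; }
data_t *get_vec_start(vec_ptr v) { return v->data; }

/* Bounds-checked element access: returns 0 on a bad index, 1 on success. */
int get_vec_element(vec_ptr v, int index, data_t *dest)
{
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

/* Unoptimized version: two calls and two memory accesses to *dest per element. */
void combine1(vec_ptr v, data_t *dest)
{
    *dest = IDENT;
    for (int i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OPER val;
    }
}

/* Optimized version: direct array access, accumulation in a local variable. */
void combine4(vec_ptr v, data_t *dest)
{
    int length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t sum = IDENT;
    for (int i = 0; i < length; i++)
        sum = sum OPER data[i];
    *dest = sum;
}
```

As long as `dest` does not point into the vector itself, both versions produce the same result; only the cost per element differs.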
Why didn’t the compiler do that?
• combine3 and combine4 behave differently under memory aliasing
• Example: combine(v, get_vec_start(v)+2) with OPER * (and IDENT 1), so dest points at the last element of v
• combine3: [2,3,5] → [2,3,1] → [2,3,2] → [2,3,6] → [2,3,36]
• combine4: [2,3,5] → [2,3,5] → [2,3,5] → [2,3,5] → [2,3,30]
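The two traces can be reproduced with a short program. This is a sketch under stated assumptions: the vector struct, `data_t = int`, and the wrapper functions `aliased_combine3` and `aliased_combine4` are hypothetical and exist only to set up the aliased call.

```c
typedef int data_t;
#define IDENT 1          /* identity element for * */
#define OPER  *

typedef struct {
    int len;
    data_t *data;
} vec_rec, *vec_ptr;

/* Accumulates directly in *dest, so every step is visible in memory. */
void combine3(vec_ptr v, data_t *dest)
{
    int length = v->len;
    data_t *data = v->data;
    *dest = IDENT;
    for (int i = 0; i < length; i++)
        *dest = *dest OPER data[i];
}

/* Accumulates in a local variable and stores once at the end. */
void combine4(vec_ptr v, data_t *dest)
{
    int length = v->len;
    data_t *data = v->data;
    data_t sum = IDENT;
    for (int i = 0; i < length; i++)
        sum = sum OPER data[i];
    *dest = sum;
}

/* dest aliases the last element of the vector [2, 3, 5]. */
int aliased_combine3(void)
{
    data_t d[3] = {2, 3, 5};
    vec_rec v = {3, d};
    combine3(&v, d + 2);
    return d[2];
}

int aliased_combine4(void)
{
    data_t d[3] = {2, 3, 5};
    vec_rec v = {3, d};
    combine4(&v, d + 2);
    return d[2];
}
```

`aliased_combine3()` returns 36 because the write of IDENT overwrites the 5 before it is read, whereas `aliased_combine4()` returns 30. Since the compiler cannot prove that `dest` never points into the vector, it must keep the memory accesses of combine3.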
Machine Independent
• Code motion
  • Reduce the frequency with which a computation is performed
  • Valid if it will always produce the same result
  • Especially moving expensive code out of a loop
Conclusion
How should I write my programs, given that I have a good, optimizing compiler?
• Don’t: smash code into oblivion
  • Hard to read, maintain, & assure correctness
• Do:
  • Select best algorithm & data representation
  • Write code that’s readable & maintainable
    • Procedures, recursion, without built-in constant limits
    • Even though these factors can slow down code
  • Focus on inner loops
  • Detailed optimization means detailed measurement
Assignment
• Practice Problems
  • Practice Problem 5.1: “What effect does the call swap(&xp, &xp) have?”
  • Practice Problem 5.3: “Indicate the number of function calls in the 3 fragments”